AWS Certified Solutions Architect – Associate (SAA-C02) — Question 597
A company stores millions of objects in Amazon S3. The data is in JSON format and Apache Parquet format. The data is partitioned, and new objects are added daily. A solutions architect needs to create a solution so that employees can use SQL to perform one-time queries against all the data. The solution must avoid code changes and must minimize operational overhead.
Which solution will meet these requirements?
Answer options
- A. Use S3 Select to perform queries against all the S3 objects.
- B. Create an AWS Glue table and an AWS Glue crawler. Schedule the crawler to run daily. Perform queries with Amazon Athena.
- C. Create an Amazon EMR cluster. Set up EMR File System (EMRFS) to access the S3 bucket. Perform queries with Apache Spark.
- D. Create an Amazon Redshift cluster. Schedule an AWS Lambda function to perform the COPY command on the Redshift cluster to load the S3 data. Perform queries on the Redshift cluster.
Correct answer: B
Explanation
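Amazon Athena is a serverless query service that runs standard SQL directly against data in Amazon S3, including JSON and Parquet, so there is no infrastructure to manage and no data to load. An AWS Glue crawler scheduled to run daily keeps the AWS Glue Data Catalog current by registering new partitions as objects arrive, so Athena can query the new data. Together they meet the requirements for minimal operational overhead and no code changes. S3 Select (option A) operates on one object per request, so it cannot run a single query across millions of objects. Amazon EMR (option C) and Amazon Redshift (option D) both require provisioning and managing a cluster, and the Redshift option adds a scheduled Lambda function and COPY load to maintain, so each carries more operational overhead than the serverless Athena approach.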
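As a rough illustration of option B, the sketch below builds the request parameters for a Glue crawler on a daily schedule and for a one-time Athena query. It is a minimal sketch, not a reference implementation: the function names are hypothetical helpers, and any crawler name, IAM role, database, table, or bucket passed to them is an assumed placeholder rather than a value from the question. The client is passed in as an argument, so the block itself does not require AWS credentials.

```python
# Sketch only: the helper names below are hypothetical, and callers supply
# their own crawler name, role ARN, database, and S3 paths.

def daily_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API with a daily schedule."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Glue uses a six-field cron expression; this one fires at 02:00 UTC every day.
        "Schedule": "cron(0 2 * * ? *)",
    }


def athena_query_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_ad_hoc_query(athena_client, sql, database, output_location):
    """Submit a query and return its execution ID; the query runs asynchronously."""
    response = athena_client.start_query_execution(
        **athena_query_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

With boto3 installed and credentials configured, the crawler parameters could be passed to `boto3.client("glue").create_crawler(**...)`, and `run_ad_hoc_query` could be called with `boto3.client("athena")`. Because the query runs asynchronously, a caller would then check its status with `get_query_execution` before reading the results.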