AWS Certified Data Engineer – Associate (DEA-C01) — Question 220
A data engineer is optimizing query performance in Amazon Athena notebooks that use Apache Spark to analyze large datasets that are stored in Amazon S3. The data is partitioned.
An AWS Glue crawler updates the partitions.
The data engineer wants to minimize the amount of data that is scanned to improve the efficiency of Athena queries.
Which solution will meet these requirements?
Answer options
- A. Apply partition filters in the queries.
- B. Increase the frequency of AWS Glue crawler invocations to update the data catalog more often.
- C. Organize the data that is in Amazon S3 by using a nested directory structure.
- D. Configure Spark to use in-memory caching for frequently accessed data.
Correct answer: A
Explanation
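Applying partition filters in the queries is the best approach because it lets the query engine prune partitions: only the S3 prefixes whose partition values match the filter are read, which directly reduces the amount of data scanned. Invoking the AWS Glue crawler more often keeps the partition metadata in the Data Catalog current, but it does not change how much data a query reads. The data is already partitioned, so reorganizing it into a nested directory structure does not reduce scanning by itself; the benefit appears only when queries filter on the partition columns. Spark in-memory caching can speed up repeated access to the same data, but the data must still be scanned to populate the cache, so it does not minimize the amount of data scanned.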
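To make the effect of partition filters concrete, the following is a minimal pure-Python sketch of partition pruning, not Athena or Spark code. It assumes Hive-style `column=value` prefixes; the `sales/` keys, the `year` and `month` partition columns, and the object sizes are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical S3 listing: (object key, size in bytes). All values are made up.
OBJECTS: List[Tuple[str, int]] = [
    ("sales/year=2023/month=12/part-0000.parquet", 400),
    ("sales/year=2024/month=01/part-0000.parquet", 300),
    ("sales/year=2024/month=01/part-0001.parquet", 200),
    ("sales/year=2024/month=02/part-0000.parquet", 100),
]


def partition_values(key: str) -> Dict[str, str]:
    """Extract Hive-style column=value segments from the key's prefix."""
    values: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values


def bytes_scanned(objects: List[Tuple[str, int]], filters: Dict[str, str]) -> int:
    """Sum the sizes of objects whose partition values satisfy every filter."""
    total = 0
    for key, size in objects:
        values = partition_values(key)
        if all(values.get(col) == val for col, val in filters.items()):
            total += size
    return total


# No partition filter: every object under the table prefix is read.
full_scan = bytes_scanned(OBJECTS, {})

# Filter on both partition columns: only the matching prefix is read.
pruned = bytes_scanned(OBJECTS, {"year": "2024", "month": "01"})

print(f"full scan: {full_scan} bytes, with partition filter: {pruned} bytes")
```

In a Spark notebook, the corresponding step would be a predicate on the partition columns, for example `spark.sql("SELECT * FROM sales WHERE year = '2024' AND month = '01'")`, where the `sales` table and its columns are again assumptions rather than part of the question.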