AWS Certified Data Engineer – Associate (DEA-C01) — Question 6
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column.
Which solution will MOST speed up the Athena query performance?
Answer options
- A. Change the data format from .csv to JSON format. Apply Snappy compression.
- B. Compress the .csv files by using Snappy compression.
- C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
- D. Compress the .csv files by using gzip compression.
Correct answer: C
Explanation
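The correct answer is C. Apache Parquet is a columnar storage format, so when a query selects a specific column, Athena reads only that column's data instead of scanning every full row. Snappy compression shrinks the data further and is fast to decompress, so less data is read from Amazon S3 with little added CPU cost. Option A is incorrect because JSON is a row-based text format: Athena must still read and parse each entire record to extract one column. Option B is incorrect because compressing the .csv files reduces the bytes read but leaves the row-based layout unchanged, so Athena still has to decompress and parse every column of every row. Option D is incorrect for the same reason: the files are still row-based .csv, and gzip, although it typically compresses more tightly than Snappy, is slower to decompress and produces files that cannot be split for parallel reads.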
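The difference between a row-based and a columnar layout can be illustrated with a small, self-contained Python sketch. This is a toy model, not how Athena or Parquet is implemented: the sample data, the function names, and the per-column chunk layout are all assumptions made for illustration, and zlib from the standard library stands in for Snappy, which is not part of the standard library.

```python
import csv
import io
import zlib


def make_rows(n):
    """Build n synthetic rows with four columns (hypothetical sample data)."""
    regions = ["us-east-1", "eu-west-1", "ap-south-1"]
    return [
        {
            "order_id": str(i),
            "customer": f"customer-{i % 50}",
            "region": regions[i % 3],
            "amount": f"{(i * 37) % 1000}.00",
        }
        for i in range(n)
    ]


def row_layout_bytes(rows):
    """Serialize as CSV. A single-column query must still read all of it."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def column_layout_chunks(rows, compress=False):
    """Store each column as its own chunk, optionally compressed.

    zlib is used here only as a stand-in for Snappy.
    """
    chunks = {}
    for name in rows[0]:
        data = "\n".join(r[name] for r in rows).encode("utf-8")
        chunks[name] = zlib.compress(data) if compress else data
    return chunks


rows = make_rows(10_000)
csv_bytes = row_layout_bytes(rows)
plain = column_layout_chunks(rows)
packed = column_layout_chunks(rows, compress=True)

# Bytes that must be read to answer a query on the "region" column only.
scanned_csv = len(csv_bytes)                  # row layout: the whole file
scanned_columnar = len(plain["region"])       # columnar: one chunk
scanned_compressed = len(packed["region"])    # columnar + compression

print(f"row-based CSV:           {scanned_csv} bytes")
print(f"columnar, one column:    {scanned_columnar} bytes")
print(f"columnar + compression:  {scanned_compressed} bytes")
```

In this model the single-column query touches only the `region` chunk, which is a fraction of the full CSV, and compressing that chunk reduces it further. That mirrors the reasoning behind option C: changing the layout is what lets the engine skip unneeded columns, and compression is an additional saving on top.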