AWS Certified Data Engineer – Associate (DEA-C01) — Question 245
A company stores Apache Parquet files in an Amazon S3 data lake. The data lake receives thousands of files from multiple sources every hour. The files range in size from 50 KB to 100 KB.
The company is evaluating the implementation of Apache Iceberg tables for the data lake. The company is using AWS Glue Data Catalog as part of the evaluation. The company needs a solution to optimize query performance in Iceberg. The solution must ensure that Iceberg table performance does not degrade when more files are added over time.
Which solution will meet these requirements?
Answer options
- A. Use an AWS Glue job to compact the files into a standard size of 512 MB at the end of each day. Run an AWS Glue crawler to update the Data Catalog.
- B. Configure the Data Catalog to automatically compact the files every minute.
- C. Configure Iceberg table properties to enable automatic compaction based on thresholds for file size and the number of files.
- D. Implement a partition strategy in Amazon S3. Run an AWS Glue crawler to update the Data Catalog every 5 minutes.
Correct answer: C
Explanation
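Option C is correct. Files of 50 KB to 100 KB are far smaller than an efficient Parquet file size, and thousands of them arrive every hour, so every query has to plan over and open a steadily growing number of data files. Automatic compaction driven by thresholds on file size and file count merges the small files into larger ones whenever those limits are crossed, which keeps table performance stable as ingestion continues. Option A compacts only once a day, so a full day of small files accumulates between runs, and the crawler step does nothing to reduce the number of files. Option B compacts on a fixed one-minute schedule whether or not compaction is needed, which adds unnecessary overhead. Option D changes how the data is partitioned but never merges the small files, so the file count keeps growing and performance still degrades.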
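
The threshold idea behind Option C can be illustrated with a small, self-contained Python sketch. It is not Iceberg or AWS Glue code: the constants `TARGET_FILE_SIZE`, `SMALL_FILE_RATIO`, and `MIN_SMALL_FILES` and the function names are hypothetical values chosen for illustration, not real table properties or service defaults. The sketch only shows how a file-count threshold can trigger compaction and how small files might be grouped toward a target size.

```python
# Hypothetical thresholds for illustration only; these are NOT actual
# Iceberg table properties or AWS Glue defaults.
TARGET_FILE_SIZE = 512 * 1024 * 1024  # desired size of a compacted file, in bytes
SMALL_FILE_RATIO = 0.75               # files below 75% of the target count as "small"
MIN_SMALL_FILES = 100                 # compact once at least this many small files exist


def small_files(sizes):
    """Return the sizes (in bytes) of files that are below the small-file cutoff."""
    cutoff = TARGET_FILE_SIZE * SMALL_FILE_RATIO
    return [size for size in sizes if size < cutoff]


def should_compact(sizes):
    """Trigger compaction only when the number of small files crosses the threshold."""
    return len(small_files(sizes)) >= MIN_SMALL_FILES


def plan_compaction(sizes):
    """Greedily group small files so each group totals at most TARGET_FILE_SIZE bytes."""
    groups, current, current_bytes = [], [], 0
    for size in sorted(small_files(sizes), reverse=True):
        # Close the current group when adding this file would exceed the target.
        if current and current_bytes + size > TARGET_FILE_SIZE:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Ten thousand files of 100 KB each, as might accumulate over a few hours.
    incoming = [100 * 1024] * 10_000
    if should_compact(incoming):
        for index, group in enumerate(plan_compaction(incoming)):
            print(f"group {index}: {len(group)} files, {sum(group)} bytes")
```

Under these assumed thresholds, 10,000 files of 100 KB each are planned into two output groups, while a table holding only a few dozen small files is left alone. That contrast is the point of threshold-based compaction: work is done only when the small-file backlog is large enough to matter, rather than on a fixed daily or per-minute schedule.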