AWS Certified Data Engineer – Associate (DEA-C01) — Question 211
A company wants to build a dimension table in an Amazon S3 bucket. The bucket contains historical data that includes 10 million records. The historical data is 1 TB in size.
A data engineer needs a solution to update changes for up to 10,000 records in the base table every day.
Which solution will meet this requirement with the LOWEST runtime?
Answer options
- A. Develop an Apache Spark job in Amazon EMR to read the historical data and the new changes into two Spark DataFrames. Use the Spark update method to update the base table.
- B. Develop an AWS Glue Python job to read the historical data and new changes into two Pandas DataFrames. Use the Pandas update method to update the base table.
- C. Develop an AWS Glue Apache Spark job to read the historical data and new changes into two Spark DataFrames. Use the Spark update method to update the base table.
- D. Develop an Amazon EMR job to read new changes into Apache Spark DataFrames. Use the Apache Hudi framework to create the base table in Amazon S3. Use the Spark update method to update the base table.
Correct answer: D
Explanation
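The correct answer is D. The daily change set is small (up to 10,000 of 10 million records, or at most 0.1%), so the fastest solution is the one that avoids reading and rewriting the full 1 TB dataset. Apache Hudi on Amazon EMR supports record-level upserts on tables stored in Amazon S3: it uses an index to locate the files that contain the affected record keys and rewrites only those files (or, for merge-on-read tables, appends the changes to log files). The job therefore reads only the new changes, and its runtime scales with the size of the change set instead of the size of the base table.
Options A and C load the full historical data into a Spark DataFrame on every run. Spark DataFrames are immutable and have no in-place update method, so applying the changes means joining the two DataFrames and rewriting the base table. Option B is the least suitable: Pandas processes data in memory on a single node, so it cannot practically handle a 1 TB dataset.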
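To make option D more concrete, here is a minimal sketch of what a Hudi upsert from a Spark job might look like. It is an illustration under stated assumptions, not the exam's reference implementation: the helper names (`hudi_upsert_options`, `upsert_changes`), the table name, the column names, and the S3 path are all hypothetical, and the cluster is assumed to have the Hudi Spark bundle available. The `hoodie.*` keys are standard Hudi write options, but the exact set a table needs can vary by Hudi version (for example, non-partitioned tables may need extra key generator settings).

```python
def hudi_upsert_options(table_name, record_key, precombine_field,
                        partition_field=None):
    """Build Hudi write options for a record-level upsert.

    record_key identifies a row in the dimension table; precombine_field
    (often an update timestamp) picks the winner when the same key appears
    more than once in a batch.
    """
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    if partition_field is not None:
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
    return options


def upsert_changes(changes_df, base_path, options):
    """Write only the changed records; Hudi locates and rewrites the
    affected files instead of the whole base table.

    changes_df is assumed to be a Spark DataFrame holding the daily changes.
    """
    (changes_df.write
        .format("hudi")
        .options(**options)
        .mode("append")  # append mode is what Hudi expects for upserts
        .save(base_path))
```

In a job, this might be called as `upsert_changes(changes_df, "s3://example-bucket/dim_customer/", hudi_upsert_options("dim_customer", "customer_id", "updated_at"))`, where the bucket, table, and column names are placeholders. Note that the job never loads the 1 TB of historical data into a DataFrame; only the daily changes are read.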