A company wants to build a dimension table in an Amazon S3 bucket. The bucket contains hi…

Question

A company wants to build a dimension table in an Amazon S3 bucket. The bucket contains historical data that includes 10 million records. The historical data is 1 TB in size. A data engineer needs a solution to update changes for up to 10,000 records in the base table every day. Which solution will meet this requirement with the LOWEST runtime?

Accepted Answer

Correct answer: D. Develop an Amazon EMR job to read new changes into Apache Spark DataFrames. Use the Apache Hudi framework to create the base table in Amazon S3. Use the Spark update method to update the base table.

Why D: the daily change set is at most 10,000 of 10 million records, or 0.1% of the table. Apache Hudi on Amazon EMR supports record-level upserts. It uses an index to locate the files that hold the affected record keys and rewrites only those files, instead of reprocessing the full 1 TB base table, which keeps the runtime low. The other options also use Spark or Pandas, but they do not provide the same efficiency for updating a base table that holds this much historical data.
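The update step in option D could look roughly like the sketch below. It is a minimal illustration, not the exam's reference solution: in practice the "update" is usually expressed as a Hudi write with the upsert operation. The table name, column names, and S3 path are hypothetical, and the option keys are the Hudi Spark datasource keys as commonly documented, so check them against the Hudi version bundled with your EMR release.

```python
from typing import Dict


def hudi_upsert_options(table_name: str, record_key: str,
                        precombine_field: str,
                        partition_field: str) -> Dict[str, str]:
    """Build Hudi write options for a record-level upsert.

    All argument values are illustrative assumptions; none come from
    the original question.
    """
    return {
        "hoodie.table.name": table_name,
        # Column that uniquely identifies a record in the dimension table.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two rows share a key, the row with the larger value wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Upsert: update rows whose key exists, insert the rest.
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    }


def upsert_changes(changes_df, base_path: str,
                   options: Dict[str, str]) -> None:
    """Apply the daily change set to the Hudi base table.

    changes_df is expected to be a Spark DataFrame holding only the
    changed records (up to 10,000 per day in the scenario).
    """
    (changes_df.write.format("hudi")
        .options(**options)
        .mode("append")  # append mode plus the upsert operation
        .save(base_path))


if __name__ == "__main__":
    # Hypothetical names, shown only to illustrate the call shape.
    opts = hudi_upsert_options("dim_customer", "customer_id",
                               "updated_at", "country")
    for key in sorted(opts):
        print(f"{key}={opts[key]}")
    # On an EMR cluster with Hudi available, something like:
    # upsert_changes(changes_df, "s3://example-bucket/dim_customer/", opts)
```

Only the daily change set is read into the DataFrame. Hudi then decides which base files to rewrite, which is where the runtime saving over a full 1 TB rewrite comes from.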

AWS Certified Data Engineer – Associate (DEA-C01) — Question 211
