An ML engineer needs to merge and transform data from two sources to retrain an existing…

Question

An ML engineer needs to merge and transform data from two sources to retrain an existing ML model. One data source consists of .csv files that are stored in an Amazon S3 bucket. Each .csv file consists of millions of records. The other data source is an Amazon Aurora DB cluster. The result of the merge process must be written to a second S3 bucket. The ML engineer needs to perform this merge-and-transform task every week. Which solution will meet these requirements with the LEAST operational overhead?

Accepted Answer

Correct answer: B. Create a weekly AWS Glue job that uses the Apache Spark engine. Use DynamicFrame native operations to merge and transform the data.

Option B has the least operational overhead because AWS Glue is a serverless, fully managed data integration service. There is no cluster to provision or maintain, a Glue Spark job can read both the .csv files in S3 and the Aurora tables (through a JDBC connection), and a scheduled trigger can run the job every week. Option A is functional but requires manual setup of an Amazon EMR cluster each week, which adds complexity. Option C relies on AWS Lambda, which is poorly suited to .csv files with millions of records because each invocation is limited to 15 minutes of run time and 10 GB of memory. Option D uses AWS Batch, which adds unnecessary complexity compared with AWS Glue for this task.
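To make the accepted answer more concrete, here is a minimal sketch of what such a Glue Spark job script might look like. It is an illustration under stated assumptions, not a reference implementation: the bucket paths, the Data Catalog database and table, the join keys, and the column names are all hypothetical, and it assumes the Aurora table has already been registered in the Glue Data Catalog through a JDBC connection.

```python
import sys

# Hypothetical names: replace with your own buckets, catalog entries, and keys.
SOURCE_CSV_PATH = "s3://example-source-bucket/csv/"
TARGET_PATH = "s3://example-target-bucket/merged/"
CATALOG_DATABASE = "example_db"          # assumed Data Catalog database
AURORA_TABLE = "example_aurora_table"    # assumed catalog table backed by Aurora
CSV_JOIN_KEY = "csv_record_id"           # key name on the CSV side after mapping
AURORA_JOIN_KEY = "record_id"            # key name on the Aurora side

# ApplyMapping tuples: (source field, source type, target field, target type).
# Fields not listed here are dropped. The key is renamed so that the two
# frames do not both carry a column called "record_id" after the join.
COLUMN_MAPPINGS = [
    ("record_id", "string", CSV_JOIN_KEY, "string"),
    ("amount", "string", "amount", "double"),
]


def main(argv):
    # Glue libraries are imported here because they exist only in the Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping, Join
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source 1: the .csv files in the first S3 bucket.
    csv_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [SOURCE_CSV_PATH], "recurse": True},
        format="csv",
        format_options={"withHeader": True},
    )
    # Transform: rename the key and cast the amount column.
    csv_frame = ApplyMapping.apply(frame=csv_frame, mappings=COLUMN_MAPPINGS)

    # Source 2: the Aurora table, read through the Data Catalog.
    aurora_frame = glue_context.create_dynamic_frame.from_catalog(
        database=CATALOG_DATABASE, table_name=AURORA_TABLE
    )

    # Merge: join the two DynamicFrames, then drop the redundant key column.
    merged = Join.apply(csv_frame, aurora_frame, CSV_JOIN_KEY, AURORA_JOIN_KEY)
    merged = merged.drop_fields([CSV_JOIN_KEY])

    # Sink: write the merged result to the second S3 bucket.
    glue_context.write_dynamic_frame.from_options(
        frame=merged,
        connection_type="s3",
        connection_options={"path": TARGET_PATH},
        format="parquet",
    )
    job.commit()


# Glue passes --JOB_NAME on the command line; skip the job anywhere else.
if __name__ == "__main__" and "--JOB_NAME" in sys.argv:
    main(sys.argv)
```

The weekly run would then come from a scheduled Glue trigger attached to this job, for example with a cron expression along the lines of `cron(0 3 ? * MON *)` (the day and time are arbitrary here). No cluster or scheduler has to be managed outside Glue, which is the point of the "least operational overhead" wording in the question.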

AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 108
