AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 108
An ML engineer needs to merge and transform data from two sources to retrain an existing ML model. One data source consists of .csv files that are stored in an Amazon S3 bucket. Each .csv file consists of millions of records. The other data source is an Amazon Aurora DB cluster.
The result of the merge process must be written to a second S3 bucket. The ML engineer needs to perform this merge-and-transform task every week.
Which solution will meet these requirements with the LEAST operational overhead?
Answer options
- A. Create a transient Amazon EMR cluster every week. Use the cluster to run an Apache Spark job to merge and transform the data.
- B. Create a weekly AWS Glue job that uses the Apache Spark engine. Use DynamicFrame native operations to merge and transform the data.
- C. Create an AWS Lambda function that runs Apache Spark code every week to merge and transform the data. Configure the Lambda function to connect to the initial S3 bucket and the DB cluster.
- D. Create an AWS Batch job that runs Apache Spark code on Amazon EC2 instances every week. Configure the Spark code to save the data from the EC2 instances to the second S3 bucket.
Correct answer: B
Explanation
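Option B is the best choice because AWS Glue is a serverless, fully managed data integration service. There is no cluster to provision or patch, a Glue job can read from both Amazon S3 and an Aurora DB cluster (through a JDBC connection), and a scheduled trigger can run the job every week. DynamicFrame operations such as Join handle the merge, and the Spark engine scales to .csv files with millions of records. Option A works, but the ML engineer must define, launch, size, and monitor an EMR cluster for every weekly run, even if cluster creation is automated. Option C is not practical because Lambda has a 15-minute execution limit and a 10 GB memory ceiling and provides no managed Spark runtime, so it is poorly suited to merging files of this size. Option D requires the engineer to build and maintain a Spark container image and an AWS Batch compute environment on EC2, which is more to operate than a Glue job that does the same work.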
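The weekly schedule in option B could be set up roughly as follows. This is a minimal sketch, not part of the question: the helper name `weekly_glue_trigger_kwargs`, the job and trigger names, and the Monday 03:00 UTC run time are all hypothetical. The helper only builds the keyword arguments for the boto3 Glue `create_trigger` call, so it can be checked without AWS credentials.

```python
# Sketch: build the arguments for a scheduled AWS Glue trigger that
# starts an existing Glue job once a week. All names are hypothetical.

VALID_DAYS = {"SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"}


def weekly_glue_trigger_kwargs(job_name, trigger_name,
                               day_of_week="MON", hour_utc=3):
    """Return kwargs suitable for boto3's glue.create_trigger()."""
    if day_of_week not in VALID_DAYS:
        raise ValueError(f"unknown day of week: {day_of_week!r}")
    if not 0 <= hour_utc <= 23:
        raise ValueError(f"hour must be 0-23, got {hour_utc!r}")

    # AWS cron fields: minutes hours day-of-month month day-of-week year.
    # "?" in day-of-month is required when day-of-week is specified.
    schedule = f"cron(0 {hour_utc} ? * {day_of_week} *)"

    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": schedule,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Usage (requires boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   kwargs = weekly_glue_trigger_kwargs("merge-csv-aurora", "weekly-merge")
#   boto3.client("glue").create_trigger(**kwargs)
```

The Glue job itself (assumed here to exist already under the hypothetical name `merge-csv-aurora`) would contain the Spark script that reads the .csv files and the Aurora tables, joins them, and writes the result to the second S3 bucket.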