AWS Certified Machine Learning – Specialty — Question 193

A data engineer needs to provide a team of data scientists with the appropriate dataset to run machine learning training jobs. The data will be stored in Amazon S3. The data engineer is obtaining the data from an Amazon Redshift database and is using join queries to extract a single tabular dataset. A portion of the schema is as follows:

TransactionTimestamp (Timestamp)
CardName (Varchar)
CardNo (Varchar)

The data engineer must provide the data so that any row with a CardNo value of NULL is removed. Also, the TransactionTimestamp column must be separated into a TransactionDate column and a TransactionTime column. Finally, the CardName column must be renamed to NameOnCard.

The data will be extracted on a monthly basis and will be loaded into an S3 bucket. The solution must minimize the effort that is needed to set up infrastructure for the ingestion and transformation. The solution also must be automated and must minimize the load on the Amazon Redshift cluster.

Which solution meets these requirements?

Answer options

Correct answer: C

Explanation

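Option C is correct because AWS Glue is a serverless ETL service: there is no cluster or instance to provision, and its built-in transforms cover all three requirements (removing rows with a NULL CardNo, splitting TransactionTimestamp into TransactionDate and TransactionTime, and renaming CardName to NameOnCard). A Glue job can also run on a schedule, which satisfies the monthly automation requirement. Options A and B need more manual setup and maintenance: option A requires an Amazon EMR cluster, which adds infrastructure complexity, and option B requires an Amazon EC2 instance and manual file handling. Option D uses Amazon Redshift Spectrum, which does not offer the same level of transformation capability as Glue for this scenario.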
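
To make the three required transformations concrete, here is a minimal sketch in plain Python. It is not AWS Glue code and is not taken from the answer options; it only illustrates the row-level logic a Glue job would have to apply. The function name transform_rows and the sample rows are hypothetical, and rows are assumed to arrive as dictionaries with a datetime value in TransactionTimestamp.

```python
from datetime import datetime


def transform_rows(rows):
    """Apply the three transformations from the question to dict rows.

    Hypothetical helper for illustration only; a real AWS Glue job would
    express the same steps with Glue or Spark transforms instead.
    """
    cleaned = []
    for row in rows:
        # 1. Remove any row whose CardNo is NULL (None in Python).
        if row.get("CardNo") is None:
            continue
        ts = row["TransactionTimestamp"]
        cleaned.append({
            # 2. Split the timestamp into separate date and time columns.
            "TransactionDate": ts.date().isoformat(),
            "TransactionTime": ts.time().isoformat(),
            # 3. Rename CardName to NameOnCard.
            "NameOnCard": row["CardName"],
            "CardNo": row["CardNo"],
        })
    return cleaned


# Made-up sample data: the second row has a NULL CardNo and is dropped.
sample = [
    {"TransactionTimestamp": datetime(2023, 1, 15, 9, 30, 0),
     "CardName": "A. Example", "CardNo": "1234"},
    {"TransactionTimestamp": datetime(2023, 1, 16, 18, 5, 42),
     "CardName": "B. Example", "CardNo": None},
]
result = transform_rows(sample)
print(result)
```

In a Glue job the same three steps would typically be a filter, a per-record mapping, and a field rename, with the output written to the S3 bucket on a monthly schedule.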