AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 173
A company has significantly increased the amount of data that is stored as .csv files in an Amazon S3 bucket. Data transformation scripts and queries are now taking much longer than they used to take.
An ML engineer must implement a solution to optimize the data for query performance.
Which solution will meet this requirement with the LEAST operational overhead?
Answer options
- A. Configure an AWS Lambda function to split the .csv files into smaller objects in the S3 bucket.
- B. Configure an AWS Glue job to drop columns that have string type values and to save the results to the S3 bucket.
- C. Configure an AWS Glue extract, transform, and load (ETL) job to convert the .csv files to Apache Parquet format.
- D. Configure an Amazon EMR cluster to process the data that is in the S3 bucket.
Correct answer: C
Explanation
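The correct answer is C. Apache Parquet is a compressed, columnar format, so query engines can read only the columns a query references instead of parsing every field of every row, as they must with .csv files. Less data is scanned, and queries and transformations run faster. AWS Glue is serverless, so the conversion job needs no infrastructure to manage. Option A only changes object sizes: the data stays in a row-based text format, and the Lambda function is custom code to maintain. Option B discards data that queries may need rather than optimizing how the data is stored. Option D could process or convert the data, but provisioning and managing an EMR cluster adds more operational overhead than a Glue job.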
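
To make the columnar argument concrete, here is a minimal sketch in plain Python. It is not Parquet and not an AWS Glue job: it is a toy illustration, using only the standard library, of why a column-oriented layout touches less data when a query needs a single column. The sample data, the column names, and the function names are all hypothetical.

```python
import csv
import io

# Hypothetical sample data standing in for a .csv object in the S3 bucket.
CSV_TEXT = """order_id,region,amount
1,us-east-1,10.5
2,eu-west-1,20.0
3,us-east-1,4.5
"""


def sum_amount_row_oriented(csv_text):
    """Row layout: every field of every row is parsed to read one column."""
    fields_touched = 0
    total = 0.0
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields_touched += len(row)
        total += float(row["amount"])
    return total, fields_touched


def to_columnar(csv_text):
    """One-time conversion: store each column's values together."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def sum_amount_columnar(columns):
    """Column layout: only the 'amount' values are read."""
    values = columns["amount"]
    return sum(float(v) for v in values), len(values)


row_total, row_fields = sum_amount_row_oriented(CSV_TEXT)
col_total, col_fields = sum_amount_columnar(to_columnar(CSV_TEXT))
print("row-oriented:", row_total, "fields touched:", row_fields)
print("columnar:    ", col_total, "fields touched:", col_fields)
```

Both approaches return the same total, but the row-oriented scan touches all 9 fields while the columnar read touches only the 3 values in the one column it needs. A real Parquet file adds compression and per-column metadata on top of this layout, which this sketch does not model.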