AWS Certified Machine Learning – Specialty — Question 62

A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only.
How should the Machine Learning Specialist transform the dataset to minimize query runtime?

Answer options

Correct answer: A

Explanation

The correct answer is A: convert the records to Apache Parquet. Parquet is a columnar storage format, so Athena can read only the 5 to 10 columns a query references and skip the remaining 190 or more, which cuts the amount of data scanned and therefore the query runtime. JSON and XML are row-oriented text formats, so a query still has to read each record in full to extract a few fields. GZIP-compressed CSV shrinks the files, but Athena must still decompress and parse all 200 columns of every row, so it does not reduce query time as effectively as Parquet.
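
To put rough numbers on that reasoning, here is a back-of-the-envelope sketch in Python. It uses only the figures given in the question (800,000 records, about 1.5 MB each, 200 columns). The assumptions that every column is roughly the same size and that compression is ignored are simplifications added for illustration, and the function name scanned_gb is hypothetical. Real Parquet files are usually compressed as well, so actual scan sizes would likely be smaller still.

```python
# Rough estimate of data scanned per full-table query.
# Assumptions (not from the question): all 200 columns are about the same
# size, and compression is ignored.
RECORDS = 800_000
RECORD_SIZE_MB = 1.5
TOTAL_COLUMNS = 200


def scanned_gb(columns_read: int, columnar: bool) -> float:
    """Approximate data scanned in GB (1 GB = 1000 MB)."""
    total_mb = RECORDS * RECORD_SIZE_MB
    if columnar:
        # A columnar layout lets the engine read only the requested columns.
        total_mb = total_mb * columns_read / TOTAL_COLUMNS
    # A row-oriented layout (CSV, JSON) reads every record in full.
    return total_mb / 1000


for cols in (5, 10):
    print(
        f"{cols} columns: row-oriented ~{scanned_gb(cols, False):.0f} GB, "
        f"columnar ~{scanned_gb(cols, True):.0f} GB"
    )
```

Under these assumptions, a row-oriented scan touches the whole dataset of about 1,200 GB, while a columnar scan of 5 to 10 columns touches roughly 30 to 60 GB, or 2.5% to 5% of the data.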