AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 41
An ML engineer needs to use data with Amazon SageMaker Canvas to train an ML model. The data is stored in Amazon S3 and is complex in structure. The ML engineer must use a file format that minimizes processing time for the data.
Which file format will meet these requirements?
Answer options
- A. CSV files compressed with Snappy
- B. JSON objects in JSONL format
- C. JSON files compressed with gzip
- D. Apache Parquet files
Correct answer: D
Explanation
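Apache Parquet is a binary, columnar format that stores its schema with the data, supports nested types, and compresses each column separately. A reader can therefore load only the columns it needs and skip text parsing, which cuts processing time for large or complex datasets. The other options are all row-oriented text: CSV needs every row parsed even when compressed with Snappy and has no native way to represent nested structure, JSONL requires each record to be parsed in full, and gzip-compressed JSON adds decompression on top of that parsing.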
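The column-oriented idea behind that answer can be illustrated with a small sketch. This is a toy model that uses only the Python standard library, not Parquet itself: the records, field names, and helper functions are hypothetical, and JSON strings stand in for Parquet's binary column chunks. It only shows why reading one field from a columnar layout touches less data than reading it from row-oriented text.

```python
import json

# Hypothetical records with a nested field, standing in for "complex" data.
records = [
    {"id": i, "label": i % 2, "features": {"age": 20 + i, "tags": ["a", "b"]}}
    for i in range(5)
]

# Row-oriented text layout (JSONL): one serialized record per line.
jsonl_text = "\n".join(json.dumps(r) for r in records)


def labels_from_jsonl(text):
    """Reading one field still means parsing every full record."""
    return [json.loads(line)["label"] for line in text.splitlines()]


# Column-oriented layout: each field's values are stored together,
# which is the basic idea behind a columnar file format.
columnar = {
    "id": json.dumps([r["id"] for r in records]),
    "label": json.dumps([r["label"] for r in records]),
    "features": json.dumps([r["features"] for r in records]),
}


def labels_from_columnar(store):
    """Only the requested column is decoded; the others are never touched."""
    return json.loads(store["label"])


row_labels = labels_from_jsonl(jsonl_text)
col_labels = labels_from_columnar(columnar)

# Characters that had to be parsed to recover the same "label" values.
chars_parsed_rows = len(jsonl_text)
chars_parsed_cols = len(columnar["label"])
```

Both functions return the same labels, but the columnar version parses only the "label" column. A real Parquet reader applies the same principle with binary encoding and per-column compression.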