An ML engineer is building an ML pipeline. The pipeline must process a dataset in two way…

Question

An ML engineer is building an ML pipeline. The pipeline must process a dataset in two ways by using Amazon Athena. The pipeline must use batch processing to perform large-scale data transformations and for model training. The pipeline must also use near real-time processing to perform low-latency queries for inference and analytics. Which file format will provide the LEAST latency for both types of processing?

Accepted Answer

Correct answer: B. B. Apache Parquet — Apache Parquet is optimized for both batch and real-time processing, providing columnar storage that minimizes latency during queries. In contrast, CSV and JSON formats may lead to higher latency due to their row-based structure and lack of efficient data compression, making them less suitable for high-performance analytics.

AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 155

Answer options

Correct answer: B

Explanation