AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 172
An ML engineer is using AWS Glue to transform proprietary data from a third-party vendor to a format that the ML engineer intends to use with the Amazon SageMaker DeepAR forecasting algorithm. The data includes several similar time series data files that the ML engineer must convert to the appropriate format. The ML engineer must compress the files to optimize storage costs.
Which solution will meet these requirements?
Answer options
- A. Use Snappy to convert the files to RecordIO-Protobuf and to compress the files.
- B. Use XZ to convert the files to RecordIO-Protobuf and to compress the files.
- C. Use XZ to convert the files to Apache Parquet format and to compress the files.
- D. Use gzip to convert the files to Apache Parquet and to compress the files.
Correct answer: D
Explanation
Option D is correct. DeepAR accepts training data as JSON Lines (optionally gzip-compressed) or Apache Parquet, and gzip is a standard Parquet compression codec that AWS Glue can write, so gzip-compressed Parquet satisfies both the format requirement and the storage-cost requirement. Options A and B are incorrect because DeepAR does not accept RecordIO-Protobuf input. Option C is incorrect because XZ is not a standard Parquet compression codec. The options are loosely worded: gzip, Snappy, and XZ are compression codecs and do not convert file formats themselves. The Glue job performs the conversion to Parquet, and the codec compresses the output.
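The conversion described in option D can be sketched as follows. This is a minimal illustration, not the vendor's or AWS's pipeline: the input layout (series ID, timestamp, value), the sample store data, and the helper names to_deepar_records and write_parquet_gzip are all assumptions made for the example. Only the "start" and "target" field names come from the DeepAR input format, and the Parquet write assumes the pyarrow library is installed.

```python
from collections import defaultdict


def to_deepar_records(rows):
    """Group (series_id, timestamp, value) rows into one record per series.

    Assumes evenly spaced timestamps in sortable 'YYYY-MM-DD HH:MM:SS' form.
    Each record uses the DeepAR field names 'start' and 'target'.
    """
    grouped = defaultdict(list)
    for series_id, timestamp, value in rows:
        grouped[series_id].append((timestamp, float(value)))

    records = []
    for series_id in sorted(grouped):
        points = sorted(grouped[series_id])  # order each series by timestamp
        records.append({
            "start": points[0][0],
            "target": [value for _, value in points],
        })
    return records


def write_parquet_gzip(records, path):
    """Write the records as a gzip-compressed Parquet file (requires pyarrow)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pylist(records)
    pq.write_table(table, path, compression="gzip")


# Hypothetical sample data standing in for the vendor's files.
rows = [
    ("store_a", "2024-01-01 00:00:00", 12),
    ("store_a", "2024-01-02 00:00:00", 15),
    ("store_b", "2024-01-01 00:00:00", 7),
    ("store_b", "2024-01-02 00:00:00", 9),
]
records = to_deepar_records(rows)
```

Calling write_parquet_gzip(records, "train.parquet") would then produce the compressed file. In a Spark-based Glue job, the equivalent step is typically a DataFrame write such as df.write.option("compression", "gzip").parquet(path), where path is an Amazon S3 location chosen for the training channel.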