AWS Certified Machine Learning – Specialty — Question 307
A data scientist receives a new dataset in .csv format and stores the dataset in Amazon S3. The data scientist will use the dataset to train a machine learning (ML) model.
The data scientist first needs to identify any potential data quality issues in the dataset. The data scientist must identify values that are missing or values that are not valid. The data scientist must also identify the number of outliers in the dataset.
Which solution will meet these requirements with the LEAST operational effort?
Answer options
- A. Create an AWS Glue job to transform the data from .csv format to Apache Parquet format. Use an AWS Glue crawler and Amazon Athena with appropriate SQL queries to retrieve the required information.
- B. Leave the dataset in .csv format. Use an AWS Glue crawler and Amazon Athena with appropriate SQL queries to retrieve the required information.
- C. Create an AWS Glue job to transform the data from .csv format to Apache Parquet format. Import the data into Amazon SageMaker Data Wrangler. Use the Data Quality and Insights Report to retrieve the required information.
- D. Leave the dataset in .csv format. Import the data into Amazon SageMaker Data Wrangler. Use the Data Quality and Insights Report to retrieve the required information.
Correct answer: D
Explanation
The Data Quality and Insights Report in Amazon SageMaker Data Wrangler automatically reports missing values, invalid values, and outliers, so no custom SQL is needed. This rules out Options A and B, which require an AWS Glue crawler plus hand-written Amazon Athena queries for each check. Data Wrangler can import .csv files directly from Amazon S3, so the AWS Glue conversion job in Option C adds operational overhead without any benefit. Option D therefore meets the requirements with the least operational effort.
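To make concrete the kind of per-column logic that a built-in report saves you from writing, here is a minimal sketch in plain Python. It is not how Data Wrangler computes its report. The column names, the sample data, the helper names `profile_column` and `profile_csv`, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration.

```python
import csv
import io
import statistics


def profile_column(values):
    """Count missing, invalid (non-numeric), and outlier entries in one column."""
    missing = 0
    invalid = 0
    numbers = []
    for raw in values:
        text = (raw or "").strip()
        if text == "":
            missing += 1          # empty cell counts as missing
            continue
        try:
            numbers.append(float(text))
        except ValueError:
            invalid += 1          # present but not parseable as a number

    outliers = 0
    if len(numbers) >= 4:
        # Assumed rule: flag values beyond 1.5 * IQR from the quartiles.
        q1, _, q3 = statistics.quantiles(numbers, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = sum(1 for x in numbers if x < low or x > high)

    return {"missing": missing, "invalid": invalid, "outliers": outliers}


def profile_csv(text, numeric_columns):
    """Profile the named numeric columns of CSV text that has a header row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        col: profile_column([row.get(col) for row in rows])
        for col in numeric_columns
    }


# Hypothetical sample data: one missing age, one invalid income, one extreme income.
SAMPLE = """age,income
34,52000
29,48000
,51000
41,abc
38,50000
36,49000
33,990000
35,53000
"""

report = profile_csv(SAMPLE, ["age", "income"])
for column, counts in report.items():
    print(column, counts)
```

Even this small version has to decide what counts as missing, what counts as invalid, and which outlier rule to use for every column. Expressing the same checks as Athena SQL (Options A and B) carries the same burden, which is the effort the Data Quality and Insights Report removes.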