Question

A data scientist receives a new dataset in .csv format and stores the dataset in Amazon S3. The data scientist will use the dataset to train a machine learning (ML) model. The data scientist first needs to identify any potential data quality issues in the dataset. The data scientist must identify values that are missing or values that are not valid. The data scientist must also identify the number of outliers in the dataset. Which solution will meet these requirements with the LEAST operational effort?

Accepted Answer

Correct answer: D. Leave the dataset in .csv format. Import the data into Amazon SageMaker Data Wrangler. Use the Data Quality and Insights Report to retrieve the required information.

The Data Quality and Insights Report in Amazon SageMaker Data Wrangler is a built-in, low-effort way to automatically detect missing values, invalid values, and outliers, with no custom SQL queries to write. Keeping the dataset in its original .csv format also avoids the operational overhead of creating and running an AWS Glue job to convert the data to Apache Parquet. Together, these make Option D the solution with the least operational effort.
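For intuition about what the report saves you from writing, the three checks in the question can be approximated by hand. The sketch below is not Data Wrangler's implementation or API. The function name `profile_numeric_column`, the sample data, and the choice of Tukey's 1.5 × IQR rule for outliers are all assumptions made for illustration, and the report may use different detection methods.

```python
import csv
import io
import math
import statistics


def profile_numeric_column(csv_text, column):
    """Count missing, invalid (non-numeric), and outlier values in one column.

    Hypothetical helper for illustration only; not part of any AWS SDK.
    """
    missing = 0
    invalid = 0
    values = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = (row.get(column) or "").strip()
        if raw == "":
            missing += 1
            continue
        try:
            number = float(raw)
        except ValueError:
            invalid += 1
            continue
        # float() accepts "nan" and "inf"; treat those as invalid here.
        if math.isfinite(number):
            values.append(number)
        else:
            invalid += 1

    # Assumed outlier rule: Tukey's fences at 1.5 * IQR beyond the quartiles.
    outliers = 0
    if len(values) >= 4:
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = sum(1 for v in values if v < low or v > high)

    return {"missing": missing, "invalid": invalid, "outliers": outliers}


# Made-up sample: one empty cell, one non-numeric cell, one extreme value.
sample = "id,age\n1,34\n2,29\n3,\n4,abc\n5,31\n6,30\n7,33\n8,32\n9,250\n"
print(profile_numeric_column(sample, "age"))
# prints {'missing': 1, 'invalid': 1, 'outliers': 1}
```

Even this small version has to be written, tested, and repeated for every column and data type, which is the effort the built-in report removes.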

AWS Certified Machine Learning – Specialty — Question 307
