AWS Certified Machine Learning – Specialty — Question 332

A data scientist needs to create a model for predictive maintenance. The model will be based on historical data to identify rare anomalies in the data.

The historical data is stored in an Amazon S3 bucket. The data scientist needs to use Amazon SageMaker Data Wrangler to ingest the data. The data scientist also needs to perform exploratory data analysis (EDA) to understand the statistical properties of the data.

Which solution will meet these requirements with the LEAST amount of compute resources?

Answer options

Correct answer: C

Explanation

The 'First K' sampling option in Amazon SageMaker Data Wrangler uses the fewest compute resources because it loads only the first K rows of the dataset, whereas the Randomized and Stratified options must scan the full dataset to draw their samples. The 'None' option imports the entire dataset and therefore consumes the most compute resources. Choosing K from domain knowledge helps keep the sample large enough for exploratory data analysis without consuming more resources than necessary.
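
The cost difference described in the explanation can be illustrated with a small, self-contained Python sketch. This is not Data Wrangler's implementation and does not call any AWS API. The `CountingReader`, `first_k`, and `random_k` names are hypothetical helpers invented for this illustration, and reservoir sampling stands in for "a random sample that requires a full pass". The sketch only counts how many rows each strategy has to read.

```python
import itertools
import random


class CountingReader:
    """Hypothetical helper: wraps an iterable of rows and counts rows consumed."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self.rows_read = 0

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self._rows)  # StopIteration propagates when rows run out
        self.rows_read += 1
        return row


def first_k(rows, k):
    """Take the first k rows and stop; later rows are never read."""
    return list(itertools.islice(rows, k))


def random_k(rows, k, seed=0):
    """Uniform random sample of k rows via reservoir sampling.

    Every row must be visited once, so the whole dataset is scanned.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Assumed sizes, chosen only for illustration.
total_rows, k = 100_000, 1_000

reader_first = CountingReader(range(total_rows))
sample_first = first_k(reader_first, k)

reader_random = CountingReader(range(total_rows))
sample_random = random_k(reader_random, k)

print("First K rows read:   ", reader_first.rows_read)
print("Randomized rows read:", reader_random.rows_read)
```

In this sketch the first-K strategy reads exactly K rows, while the random strategy reads every row even though both return K rows. The trade-off is that the first K rows reflect only the start of the file, which is why the choice of K (and knowledge of how the data is ordered) matters for EDA.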