AWS Certified Machine Learning – Specialty — Question 332
A data scientist needs to create a model for predictive maintenance. The model will be based on historical data to identify rare anomalies in the data.
The historical data is stored in an Amazon S3 bucket. The data scientist needs to use Amazon SageMaker Data Wrangler to ingest the data. The data scientist also needs to perform exploratory data analysis (EDA) to understand the statistical properties of the data.
Which solution will meet these requirements with the LEAST amount of compute resources?
Answer options
- A. Import the data by using the None option.
- B. Import the data by using the Stratified option.
- C. Import the data by using the First K option. Infer the value of K from domain knowledge.
- D. Import the data by using the Randomized option. Infer the random size from domain knowledge.
Correct answer: C
Explanation
The 'First K' sampling option in Amazon SageMaker Data Wrangler uses the least compute because it loads only the first K rows of the dataset, avoiding the full dataset scan that the Randomized and Stratified options require. The 'None' option imports the entire dataset, so it consumes the most compute resources. Choosing K from domain knowledge helps keep the sample large enough for exploratory data analysis without over-consuming resources.
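The cost difference can be illustrated with a small sketch in plain Python. This is not Data Wrangler's implementation or API. It assumes a generic row stream and uses reservoir sampling as a stand-in for a uniform random sample. The names `CountingSource`, `first_k`, and `random_sample` are hypothetical helpers introduced only for this illustration.

```python
import itertools
import random


class CountingSource:
    """Wraps an iterable and counts how many rows were actually read."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self.rows_read = 0

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self._rows)  # StopIteration propagates when exhausted
        self.rows_read += 1
        return row


def first_k(source, k):
    """Take the first k rows and stop reading immediately afterwards."""
    return list(itertools.islice(source, k))


def random_sample(source, k, seed=0):
    """Uniform sample of k rows via reservoir sampling (assumed stand-in).

    Every row must be visited, because any row could end up in the sample.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(source):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


TOTAL_ROWS = 100_000  # illustrative dataset size
K = 1_000             # illustrative sample size

first_source = CountingSource(range(TOTAL_ROWS))
first_sample = first_k(first_source, K)

random_source = CountingSource(range(TOTAL_ROWS))
rand_sample = random_sample(random_source, K)

print("rows read by first-K sample:", first_source.rows_read)
print("rows read by random sample: ", random_source.rows_read)
```

In this sketch the first-K sample stops after reading K rows, while the random sample reads every row to return a sample of the same size. The trade-off is that the first K rows reflect only the start of the data, which is why the sample size should come from domain knowledge.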