AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 210
A company is using Amazon SageMaker AI to create a classification model to categorize the company’s sales performance for each month of the previous 20 years on a scale from 1 to 5. The dataset includes fields for month, sales region, regional aggregate sales, and the number of stores in each sales region. The company notices that during two months of every year, the aggregate sales values are unexpectedly high. The company performs one-hot encoding on all non-numerical features in the training and validation datasets. The company uses the training dataset to train the classification model. When the company evaluates the model against the validation dataset, the results are less accurate than expected.
The company must improve the model’s accuracy on the validation dataset.
Which solution will meet this requirement?
Answer options
- A. Remove records that include outliers across all features.
- B. Use a stratified split on the month and sales region features.
- C. Perform normalization on the aggregate sales feature.
- D. Perform normalization on the aggregate sales feature for each sales region.
Correct answer: B
Explanation
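A stratified split on the month and sales region features keeps the proportion of each month-and-region combination roughly the same in the training and validation datasets. This matters here because two months of every year have unexpectedly high aggregate sales: a plain random split can leave those months over- or under-represented in one of the datasets, so the model is validated on a distribution that differs from the one it was trained on. Removing outliers (A) would discard the high-sales months, which recur every year and are therefore a pattern the model should learn rather than noise. Normalizing the aggregate sales feature (C and D) rescales the values but does not change which records land in each dataset, so it does not correct an imbalanced representation of months and sales regions.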
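As a minimal sketch of what a stratified split on month and sales region could look like, the Python below builds a small synthetic dataset and splits it with scikit-learn's train_test_split, using a combined month-and-region key as the stratification label. The column names, region names, sales values, and the 80/20 split ratio are all assumptions made for illustration; none of them come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the company's data (hypothetical columns and values):
# 20 years x 12 months x 4 regions, with two months inflated each year.
rng = np.random.default_rng(0)
regions = ["north", "south", "east", "west"]
rows = [
    {
        "year": year,
        "month": month,
        "sales_region": region,
        "aggregate_sales": float(rng.normal(100, 10)) * (1.8 if month in (11, 12) else 1.0),
        "num_stores": int(rng.integers(5, 50)),
    }
    for year in range(2005, 2025)
    for month in range(1, 13)
    for region in regions
]
df = pd.DataFrame(rows)

# train_test_split accepts a single label array for stratification, so the
# two features are combined into one key per (month, region) pair.
strata = df["month"].astype(str) + "_" + df["sales_region"]

train_df, val_df = train_test_split(
    df,
    test_size=0.2,
    stratify=strata,
    random_state=42,
)

# Share of records per month in each dataset; with stratification these
# should match closely, including for the two high-sales months.
train_share = train_df["month"].value_counts(normalize=True).sort_index()
val_share = val_df["month"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "validation": val_share}))
```

Comparing the per-month shares side by side is a quick way to check that the high-sales months are represented equally in both datasets. One-hot encoding of month and sales region would then be applied after the split, as the question describes.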