AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 14
Case study
An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3.
The dataset has a class imbalance that affects the learning of the model's algorithm. Additionally, many of the features have interdependencies. The algorithm is not capturing all the desired underlying patterns in the data.
Before the ML engineer trains the model, the ML engineer must resolve the issue of the imbalanced data.
Which solution will meet this requirement with the LEAST operational effort?
Answer options
- A. Use Amazon Athena to identify patterns that contribute to the imbalance. Adjust the dataset accordingly.
- B. Use Amazon SageMaker Studio Classic built-in algorithms to process the imbalanced dataset.
- C. Use AWS Glue DataBrew built-in features to oversample the minority class.
- D. Use the Amazon SageMaker Data Wrangler balance data operation to oversample the minority class.
Correct answer: D
Explanation
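The correct answer is D. The Amazon SageMaker Data Wrangler balance data operation is a built-in transform that can oversample the minority class (random oversampling or SMOTE) or undersample the majority class, so the ML engineer can fix the imbalance without writing custom code. Option A requires manual work: Amazon Athena can query the data to reveal the imbalance, but the engineer would still have to adjust the dataset by hand. Option B does not meet the requirement, because built-in algorithms train models and do not rebalance the data before training. Option C uses a data preparation service, but AWS Glue DataBrew does not offer a dedicated balance operation comparable to the one in Data Wrangler, so oversampling there would take more effort.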
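To make "oversample the minority class" concrete, here is a minimal sketch of random oversampling in plain Python. It is not Data Wrangler's implementation or API. It only illustrates the general technique, and the `random_oversample` helper, the `is_fraud` label, and the synthetic transactions are all hypothetical names chosen for this example.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (no-op for the majority class).
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical, synthetic data: 95 legitimate rows and 5 fraudulent rows.
transactions = (
    [{"amount": 10.0 + i, "is_fraud": 0} for i in range(95)]
    + [{"amount": 900.0 + i, "is_fraud": 1} for i in range(5)]
)

balanced = random_oversample(transactions, "is_fraud")
counts = Counter(row["is_fraud"] for row in balanced)
# After balancing, both classes contain 95 rows.
```

Random oversampling only repeats existing minority rows. SMOTE, which the Data Wrangler balance operation also offers, instead synthesizes new minority samples by interpolating between neighbors.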