AWS Certified Machine Learning – Specialty — Question 308

An ecommerce company has developed an XGBoost model in Amazon SageMaker to predict whether a customer will return a purchased item. The dataset is imbalanced. Only 5% of customers return items.

A data scientist must find the hyperparameters to capture as many instances of returned items as possible. The company has a small budget for compute.

How should the data scientist meet these requirements MOST cost-effectively?

Answer options

Correct answer: B

Explanation

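Tuning only a few hyperparameters, such as scale_pos_weight and csv_weight, is far more cost-effective than tuning all of them: a smaller search space needs fewer training jobs, which suits the small compute budget. These two hyperparameters also address the imbalance directly, because scale_pos_weight increases the weight of the rare positive class (returned items) and csv_weight lets XGBoost read per-instance weights from the training data. Maximizing validation AUC (the metric that option B is intended to represent) suits imbalanced datasets, whereas accuracy is misleading here: a model that never predicts a return would still be 95% accurate. Minimizing F1 would steer the tuning job toward the worst models rather than the best.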
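
The point about accuracy on imbalanced data can be checked with a little arithmetic. The sketch below is illustrative only: the 1,000-customer sample and the helper name confusion_metrics are assumptions made for the example, not part of the question. It scores a baseline that always predicts "no return" and computes the negatives-to-positives ratio that is commonly used as a starting value for scale_pos_weight.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, f1) from confusion-matrix counts.

    Hypothetical helper for illustration; ratios with a zero
    denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, recall, f1


# Assumed sample: 1,000 customers with the 5% return rate from the question.
n_total = 1000
n_pos = 50               # customers who returned an item
n_neg = n_total - n_pos  # customers who kept the item

# Baseline that always predicts "no return": every positive is missed.
accuracy, recall, f1 = confusion_metrics(tp=0, fp=0, fn=n_pos, tn=n_neg)

# Common starting heuristic for scale_pos_weight: negatives / positives.
scale_pos_weight_start = n_neg / n_pos

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"scale_pos_weight starting point: {scale_pos_weight_start:.1f}")
```

Under these assumptions the baseline scores 95% accuracy while capturing none of the returned items, which is why a tuning job that maximizes accuracy can settle on a useless model. The ratio of 19 is only a starting point around which a narrow tuning range for scale_pos_weight could be centered.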