AWS Certified Machine Learning – Specialty — Question 134

A data scientist has explored and sanitized a dataset in preparation for the modeling phase of a supervised learning task. The statistical dispersion can vary widely between features, sometimes by several orders of magnitude. Before moving on to the modeling phase, the data scientist wants to ensure that the prediction performance on the production data is as accurate as possible.
Which sequence of steps should the data scientist take to meet these requirements?

Answer options

Correct answer: B

Explanation

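The correct answer is B because the scaling is derived from the training data and that same scaling is then applied to the validation and test sets, so all three sets are on the scale the model was trained on. Because the features differ in dispersion by several orders of magnitude, some form of rescaling is needed, and option A fails to address it. Option C applies rescaling before splitting, so statistics from the validation and test data influence the transformation used for training, which is data leakage and makes the evaluation overly optimistic. Option D rescales the validation and test sets independently, which puts them on a different scale from the training data and biases the evaluation.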
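
The sequence described for option B can be sketched in a few lines of Python. This is a minimal illustration, not the exam's reference solution: the toy data, the 60/20/20 split, the helper names fit_standardizer and apply_standardizer, and the choice of z-score standardization as the rescaling method are all assumptions made for the example.

```python
import random
import statistics

def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation from the training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; use 1.0 for a constant feature to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Rescale rows with previously fitted parameters; nothing is re-estimated here."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical toy data: two features whose ranges differ by five orders of magnitude.
rng = random.Random(0)
data = [[rng.uniform(0, 1), rng.uniform(0, 100_000)] for _ in range(100)]

# Step 1: split first (assumed 60/20/20 split).
rng.shuffle(data)
train, validation, test = data[:60], data[60:80], data[80:]

# Step 2: fit the scaling parameters on the training set only.
means, stds = fit_standardizer(train)

# Step 3: apply the same training-derived transformation to every split.
train_scaled = apply_standardizer(train, means, stds)
validation_scaled = apply_standardizer(validation, means, stds)
test_scaled = apply_standardizer(test, means, stds)
```

Calling fit_standardizer on the full dataset before splitting would reproduce the leakage described for option C, and calling it separately on the validation or test rows would reproduce the inconsistency described for option D.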