AWS Certified Machine Learning – Specialty — Question 285

A data scientist obtains a tabular dataset that contains 150 correlated features with different ranges to build a regression model. The data scientist needs to make model training more efficient while minimizing the impact on the model’s performance. The data scientist decides to add a principal component analysis (PCA) preprocessing step that reduces the features to a smaller set of independent features, and then to use the new features in the regression model.

Which preprocessing step will meet these requirements?

Answer options

Correct answer: B

Explanation

PCA finds the directions of greatest variance in the data, so it is highly sensitive to the relative scaling of the input features. Features with larger ranges must be scaled first (e.g., using a Min Max Scaler in SageMaker Data Wrangler) so that they do not disproportionately dominate the principal components. Manually removing correlated features prior to PCA, as suggested in options C and D, is unnecessary and counterproductive: PCA already combines correlated features into uncorrelated components, and dropping features by hand discards information that PCA could have retained. Applying PCA directly without scaling, as in option A, would produce principal components that mostly reflect the features with the largest ranges rather than the underlying structure of the data.
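
The effect of scaling on PCA can be illustrated with a minimal sketch. It uses scikit-learn's MinMaxScaler and PCA as a local stand-in for the Data Wrangler transform named above, which is an assumption made for illustration and not the exam's workflow. The synthetic two-feature dataset and the helper name first_component_loadings are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler


def first_component_loadings(X, scale):
    """Return the absolute loadings of the first principal component.

    Hypothetical helper for illustration: optionally min-max scales X
    to [0, 1] per feature before fitting PCA.
    """
    if scale:
        X = MinMaxScaler().fit_transform(X)
    pca = PCA(n_components=1).fit(X)
    return np.abs(pca.components_[0])


# Synthetic data (assumption): two strongly correlated features
# that live on very different ranges.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
small = latent + 0.1 * rng.normal(size=n)            # a few units wide
large = 1000 * (latent + 0.1 * rng.normal(size=n))   # a few thousand units wide
X = np.column_stack([small, large])

raw_loadings = first_component_loadings(X, scale=False)
scaled_loadings = first_component_loadings(X, scale=True)

print("without scaling:", raw_loadings)
print("with min-max scaling:", scaled_loadings)
```

Without scaling, the first component should load almost entirely on the large-range feature. After min-max scaling, the two correlated features should contribute roughly equally, which is the behavior the explanation relies on.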