Question

A data scientist obtains a tabular dataset that contains 150 correlated features with different ranges to build a regression model. The data scientist needs to achieve more efficient model training by implementing a solution that minimizes impact on the model’s performance. The data scientist decides to perform a principal component analysis (PCA) preprocessing step to reduce the number of features to a smaller set of independent features before the data scientist uses the new features in the regression model. Which preprocessing step will meet these requirements?

Accepted Answer

Correct answer: B. Load the data into Amazon SageMaker Data Wrangler. Scale the data with a Min Max Scaler transformation step. Use the SageMaker built-in algorithm for PCA on the scaled dataset to transform the data.

PCA finds the directions of greatest variance, so it is highly sensitive to the relative scaling of the input features. Features with larger ranges must be scaled first (for example, with a Min Max Scaler in SageMaker Data Wrangler) so that they do not dominate the principal components. Manually removing correlated features before PCA, as options C and D suggest, is unnecessary and counterproductive: PCA already combines correlated features into uncorrelated components, and dropping features by hand discards information that PCA could have kept. Applying PCA without scaling, as in option A, would produce principal components skewed toward the features with the largest ranges.
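The scale-then-PCA order described above can be sketched locally. This is only an illustration under stated assumptions: it uses scikit-learn's MinMaxScaler and PCA as stand-ins for the Data Wrangler transformation step and the SageMaker built-in PCA algorithm, and the synthetic dataset, the helper names (make_correlated_data, reduce_features), and the 95% variance threshold are all hypothetical choices, not part of the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def make_correlated_data(n_samples=500, n_features=150, n_latent=10, seed=0):
    """Hypothetical data: 150 correlated features with very different ranges."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, n_latent))
    mixing = rng.normal(size=(n_latent, n_features))
    noise = 0.05 * rng.normal(size=(n_samples, n_features))
    # Give each column its own scale, spanning several orders of magnitude.
    scales = 10.0 ** rng.uniform(-2, 4, size=n_features)
    return (latent @ mixing + noise) * scales


def reduce_features(X, variance_to_keep=0.95):
    """Scale every feature to [0, 1], then project onto principal components."""
    pipeline = make_pipeline(
        MinMaxScaler(),
        # A float in (0, 1) keeps just enough components to explain
        # that fraction of the variance of the scaled data.
        PCA(n_components=variance_to_keep, svd_solver="full"),
    )
    return pipeline.fit_transform(X), pipeline


X = make_correlated_data()
X_reduced, pipeline = reduce_features(X)
print(X.shape, "->", X_reduced.shape)
```

The reduced columns are uncorrelated with one another on the data they were fitted to, and they can be passed to a regression model in place of the original 150 features. Wrapping both steps in one pipeline keeps the scaling parameters and the PCA projection fitted on the same training data, so new data can be transformed consistently with pipeline.transform.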

AWS Certified Machine Learning – Specialty — Question 285
