AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 194

An ML engineer wants to use Amazon SageMaker Data Wrangler to perform preprocessing on a dataset. The ML engineer wants to use the processed dataset to train a classification model. During preprocessing, the ML engineer notices that a text feature has a range of thousands of values that differ only by spelling errors. The ML engineer needs to apply an encoding method so that after preprocessing is complete, the text feature can be used to train the model.

Which solution will meet these requirements?

Answer options

Correct answer: B

Explanation

The correct answer is B, as similarity encoding is specifically designed to handle cases where categories have slight variations, such as spelling errors. Ordinal encoding (A) is not suitable as it implies a rank order that doesn't exist in the data, one-hot encoding (C) creates too many binary columns for similar values, and target encoding (D) focuses on the target variable rather than the feature's categorical values.