AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 178
An ML engineer is building a logistic regression model to predict customer churn for subscription services. The ML engineer is using a dataset that contains two string variables: location and job_seniority_level. The location variable has 3 distinct values, and the job_seniority_level variable has over 10 distinct values.
The ML engineer must perform preprocessing on the variables.
Which solution will meet this requirement?
Answer options
- A. Apply tokenization to location. Apply ordinal encoding to job_seniority_level.
- B. Apply one-hot encoding to location. Apply ordinal encoding to job_seniority_level.
- C. Apply binning to location. Apply standard scaling to job_seniority_level.
- D. Apply one-hot encoding to location. Apply standard scaling to job_seniority_level.
Correct answer: B
Explanation
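Option B is correct. The location variable is nominal: its 3 distinct values have no inherent order, so one-hot encoding represents it as a small set of binary columns without implying a ranking. The job_seniority_level variable is ordinal: seniority levels follow a natural order, so ordinal encoding maps each level to an integer that preserves that order and, with more than 10 distinct values, avoids adding a separate column for every level. Option A is incorrect because tokenization applies to free text, not to a categorical variable with a fixed set of values. Options C and D are incorrect because binning and standard scaling apply to numeric variables, not to string categories.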
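The two encodings in option B can be illustrated with a minimal Python sketch. The question does not list the actual category values, so the values in LOCATIONS and SENIORITY_ORDER below are hypothetical, as are the helper names one_hot, ordinal, and encode_row.

```python
# Hypothetical category values: the question does not list the real ones.
LOCATIONS = ["urban", "suburban", "rural"]  # nominal, 3 distinct values

# Assumed ranking from lowest to highest seniority (more than 10 levels).
SENIORITY_ORDER = [
    "intern", "junior", "associate", "mid", "senior", "staff",
    "principal", "manager", "director", "vp", "c_level",
]


def one_hot(value, categories):
    """Return one binary indicator per category (no order is implied)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def ordinal(value, ordered_categories):
    """Return the rank of value within an explicitly ordered category list."""
    if value not in ordered_categories:
        raise ValueError(f"unknown category: {value!r}")
    return ordered_categories.index(value)


def encode_row(location, seniority):
    """Build one feature vector: 3 one-hot columns plus 1 ordinal column."""
    return one_hot(location, LOCATIONS) + [ordinal(seniority, SENIORITY_ORDER)]


features = encode_row("suburban", "senior")
print(features)  # [0, 1, 0, 4]
```

In practice, scikit-learn's OneHotEncoder and OrdinalEncoder perform the same transformations. For the ordinal case, the category order should be passed explicitly rather than inferred, because the default alphabetical ordering of strings would not match the seniority ranking.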