A medical device company is building a machine learning (ML) model to predict the likelih…

Question

A medical device company is building a machine learning (ML) model to predict the likelihood of device recall based on customer data that the company collects from a plain text survey. One of the survey questions asks which medications the customer is taking. The data for this field contains the names of medications that customers enter manually. Customers misspell some of the medication names. The column that contains the medication name data gives a categorical feature with high cardinality but redundancy. What is the MOST effective way to encode this categorical feature into a numeric feature?

Accepted Answer

Correct answer: C. C. Use Amazon SageMaker Data Wrangler similarity encoding on the column to create embeddings of vectors of real numbers. — Amazon SageMaker Data Wrangler similarity encoding is ideal for high-cardinality categorical features containing spelling variations and errors, as it converts text strings into numerical vector embeddings based on string similarity. One-hot encoding (Options A and B) would result in a massive, sparse feature space because of the high cardinality and spelling mistakes. Ordinal encoding (Option D) is unsuitable here because it implies an arbitrary ordered relationship between medications and does not resolve spelling redundancies.

AWS Certified Machine Learning – Specialty — Question 272

Answer options

Correct answer: C

Explanation