AWS Certified Machine Learning – Specialty — Question 358

A data scientist is conducting exploratory data analysis (EDA) on a dataset that contains information about product suppliers. The dataset records the country where each product supplier is located as a two-letter text code. For example, the code for New Zealand is "NZ."

The data scientist needs to transform the country codes for model training. The data scientist must choose the solution that will result in the smallest increase in dimensionality. The solution must not result in any information loss.

Which solution will meet these requirements?

Answer options

Correct answer: B

Explanation

Similarity encoding maps high-cardinality string categories to numeric vectors based on string similarity, minimizing dimensionality expansion while retaining all original distinct information. One-hot encoding would create a separate column for each country code, leading to a massive increase in dimensionality. Mapping to continents causes information loss as multiple countries are grouped together, and adding full country names leaves the data in a non-numeric text format unsuitable for direct model training.