Google Cloud Professional Machine Learning Engineer — Question 331
You are creating a deep neural network classification model using a dataset with categorical input values. Certain columns have a cardinality greater than 10,000 unique values. How should you encode these categorical values as input into the model?
Answer options
- A. Convert each categorical value into an integer value.
- B. Convert the categorical string data to one-hot hash buckets.
- C. Map the categorical variables into a vector of boolean values.
- D. Convert each categorical value into a run-length encoded string.
Correct answer: B
Explanation
One-hot hash bucketing is effective for handling high cardinality categorical features as it reduces dimensionality while preserving information. Converting to integer values (A) could lead to ordinal relationships that don't exist, and mapping to boolean vectors (C) may not capture the distinct categories effectively. Run-length encoding (D) is not suitable for categorical data as it is primarily used for compressing sequences.