Databricks Certified Machine Learning Associate — Question 8
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.
Which of the following explanations justifies this suggestion?
Answer options
- A. One-hot encoding is not supported by most machine learning libraries.
- B. One-hot encoding is dependent on the target variable’s values which differ for each application.
- C. One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
- D. One-hot encoding is not a common strategy for representing categorical feature variables numerically.
- E. One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
Correct answer: E
Explanation
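The correct answer is E. A feature repository serves many downstream models, and one-hot encoding is a poor fit for some of them: it can produce a wide, sparse feature space, which tends to hurt tree-based algorithms in particular and can contribute to overfitting when a variable has many categories. Storing the categorical variables in their raw form lets each machine learning pipeline choose the encoding that suits its algorithm. The other options are inaccurate: one-hot encoding is widely supported by machine learning libraries (A), does not depend on the target variable's values (B), is not especially computationally intensive and is not restricted to small samples of the training set (C), and is one of the most common strategies for representing categorical variables numerically (D).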
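To make the dimensionality point concrete, the following is a minimal, library-free sketch of one-hot encoding in Python. The `one_hot_encode` helper and the `cities` sample data are hypothetical and exist only for illustration. In practice a library encoder would normally be used inside each model's own pipeline.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a list of 0/1 indicators."""
    # One output column per distinct category, in sorted order.
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column matching this value
        rows.append(row)
    return categories, rows


# Hypothetical raw categorical feature, as it might be stored in a feature repository.
cities = ["Austin", "Boston", "Chicago", "Austin", "Denver"]

categories, encoded = one_hot_encode(cities)

print(categories)       # ['Austin', 'Boston', 'Chicago', 'Denver']
print(encoded[0])       # [1, 0, 0, 0]
print(len(encoded[0]))  # 4 columns now represent 1 original column
```

A single column with four distinct values becomes four mostly-zero columns. A variable with thousands of categories would expand in the same way, which is why the encoding decision is better left to each model's pipeline than applied once in the shared repository.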