Databricks Certified Machine Learning Associate — Question 13
A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.
Which of the following approaches can they take to include as much information as possible in the feature set?
Answer options
- A. Impute the missing values using each respective feature variable’s mean value instead of the median value
- B. Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them
- C. Remove all feature variables that originally contained missing values from the feature set
- D. Create a binary feature variable for each feature that contained missing values indicating whether each row’s value has been imputed
- E. Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
Correct answer: D
Explanation
The correct answer is D because creating a binary feature to indicate imputation preserves information about which values were originally missing, which may be valuable for the model. Options A and C either change the data distribution or remove potentially useful features entirely. Option B neglects the missing data without any alternative strategy, and option E provides some context but does not directly inform the model about the imputed values.