Databricks Certified Machine Learning Associate — Question 9

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.
Which of the following feature engineering tasks will be the least efficient to distribute?

Answer options

Correct answer: D

Explanation

Imputing missing feature values with the true median is the least efficient task to distribute because an exact median cannot be assembled from independent per-partition results. Finding the middle value requires ordering the entire dataset, which means sorting or shuffling data across workers. In contrast, tasks like one-hot encoding and creating binary indicator features can be applied independently to subsets of the data, making them far better suited to distribution.
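
The point in the explanation above can be illustrated with a small sketch in plain Python. It is not Spark code; the `partitions` list is a hypothetical stand-in for data split across three workers, and the values are made up for illustration. It shows that a mean can be rebuilt exactly from small per-partition summaries, while a median of per-partition medians does not equal the true median.

```python
from statistics import median

# Hypothetical data split across three workers (partitions).
partitions = [[1, 2, 100], [3, 4, 5], [6, 7, 8]]

# Mean: each partition emits a small (sum, count) summary,
# and the summaries combine into the exact global mean.
partials = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partials))
distributed_mean = total / count

# Median: per-partition medians do not combine into the true median.
median_of_medians = median(median(p) for p in partitions)

# The exact median needs every value gathered and ordered together.
all_values = [x for p in partitions for x in p]
true_median = median(all_values)

print(distributed_mean == sum(all_values) / len(all_values))  # mean combines exactly
print(median_of_medians, true_median)  # the two medians differ
```

Here the per-partition medians are 2, 4, and 7, giving a combined estimate of 4, while the true median of all nine values is 5. Closing that gap requires moving the data between workers, which is the cost the question is pointing at.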