Databricks Certified Machine Learning Associate — Question 29

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Answer options

Correct answer: D

Explanation

Spark ML is the correct choice because its feature transformers and estimators operate directly on Spark DataFrames, so the work is distributed across the cluster without writing a UDF or using the pandas Function API. Keras, PyTorch, and Scikit-learn are primarily focused on model training and do not provide built-in distributed feature engineering. Pandas is powerful for data manipulation, but it processes data in memory on a single machine, so it cannot scale feature engineering to large datasets unless it is paired with a distributed system.