Databricks Certified Machine Learning Associate — Question 29
Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?
Answer options
- A. Keras
- B. pandas
- C. PyTorch
- D. Spark ML
- E. Scikit-learn
Correct answer: D
Explanation
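Spark ML is the correct choice because its built-in feature transformers and estimators (for example StringIndexer, OneHotEncoder, VectorAssembler, and StandardScaler) operate directly on Spark DataFrames, so the work is distributed across the cluster without writing a UDF or using the pandas Function API. Keras and PyTorch are deep learning libraries focused on model training rather than distributed feature engineering. Scikit-learn and pandas run on a single machine and process data in memory; applying them to large datasets on Spark requires wrapping them in a UDF or the pandas Function API, which is exactly what the question rules out.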