Databricks Certified Machine Learning Associate — Question 26
A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.
Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
Answer options
- A. They can refactor their notebook to process the data in parallel.
- B. They can refactor their notebook to use the PySpark DataFrame API.
- C. They can refactor their notebook to use the Scala Dataset API.
- D. They can refactor their notebook to use Spark SQL.
- E. They can refactor their notebook to utilize the pandas API on Spark.
Correct answer: E
Explanation
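The correct answer is E because the pandas API on Spark (the `pyspark.pandas` module) mirrors the pandas syntax while executing on Spark, so most of the existing cleaning code can stay as written and the main change is the import. Options B and D require rewriting the logic against a different API (PySpark DataFrames or SQL), and option C additionally requires switching languages from Python to Scala. Option A describes the goal rather than a specific approach: parallelizing pandas code by hand would still mean restructuring the notebook. All four therefore take more refactoring time than E.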
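To illustrate why option E involves so little refactoring, here is a minimal sketch of a cleaning step written against the pandas-style API. The column names, sample data, and the `clean` function are hypothetical and not taken from the question. The sketch runs with plain pandas; the comment marks the one line that would change to run it with the pandas API on Spark, assuming Spark 3.2 or later where `pyspark.pandas` is available.

```python
import pandas as pd
# To scale out, the import above would be swapped for the pandas API on Spark:
#   import pyspark.pandas as pd   # conventionally aliased as ps


def clean(df):
    """Hypothetical cleaning steps using calls that exist in both APIs."""
    df = df.dropna(subset=["age"])   # drop rows with a missing age
    df = df.drop_duplicates()        # remove exact duplicate rows
    return df.assign(age=df["age"].astype(int))  # cast age to integer


# Hypothetical sample data, for illustration only.
raw = pd.DataFrame(
    {
        "name": ["ann", "bob", "bob", "cid"],
        "age": [30.0, 41.0, 41.0, None],
    }
)

cleaned = clean(raw)
print(cleaned)
```

The body of `clean` does not change between the two versions, which is the point of option E. In practice not every pandas method is supported by the pandas API on Spark, so a real notebook may still need a few targeted adjustments.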