Databricks Certified Associate Developer for Apache Spark — Question 206
A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.
Which combination of Apache Spark modules should the data scientist use in this scenario?
Answer options
- A. Spark DataFrames, Structured Streaming, and GraphX
- B. Spark SQL, Pandas API on Spark, and Structured Streaming
- C. Spark Streaming, GraphX, and Pandas API on Spark
- D. Spark DataFrames, Spark SQL, and MLlib
Correct answer: D
Explanation
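The correct answer is D because it is the only option that covers all three requirements in the scenario. Spark DataFrames provide the structured, column-based abstraction for processing large datasets, Spark SQL runs SQL queries against that data, and MLlib is Spark's machine learning library. Options A, B, and C all omit MLlib, so none of them addresses the machine learning requirement. They also include modules the scenario does not call for: Structured Streaming and Spark Streaming target stream processing, and GraphX targets graph computation. Option B does include Spark SQL, but it pairs it with Structured Streaming and the Pandas API on Spark rather than with a machine learning library.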
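To make the division of labor in option D concrete, here is a minimal PySpark sketch that touches each of the three modules once. It is an illustration only and not part of the original question: the function name `run_demo`, the view name `samples`, the column names, and the data values are all made up, and linear regression is just one arbitrary MLlib algorithm chosen for the example. It assumes a local Spark installation with `pyspark` available.

```python
def run_demo():
    """Tiny end-to-end run using Spark DataFrames, Spark SQL, and MLlib."""
    # Imported inside the function so this file can still be loaded
    # on a machine where pyspark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = (
        SparkSession.builder
        .master("local[1]")          # assumption: single-threaded local mode
        .appName("option-d-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        # Spark DataFrames: structured data with named columns.
        # The rows are invented sample values.
        df = spark.createDataFrame(
            [
                (1.0, 2.0, 5.0),
                (2.0, 1.0, 6.0),
                (3.0, 4.0, 11.0),
                (4.0, 3.0, 12.0),
                (5.0, 5.0, 16.0),
            ],
            ["x1", "x2", "label"],
        )

        # Spark SQL: register the DataFrame as a view and query it with SQL.
        df.createOrReplaceTempView("samples")
        filtered = spark.sql(
            "SELECT x1, x2, label FROM samples WHERE x1 >= 2"
        )

        # MLlib: assemble the feature columns into a vector and fit a model.
        assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
        train = assembler.transform(filtered)
        model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

        # Return a few simple facts about the run instead of model values.
        return filtered.count(), train.columns, len(model.coefficients)
    finally:
        spark.stop()


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the DataFrame holds the structured data, the SQL query selects the rows to train on, and MLlib consumes the query result directly, which is why the three modules are usually used together for this kind of workload.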