Databricks Certified Associate Developer for Apache Spark — Question 194

A developer is working with a pandas DataFrame containing user behaviour data from a web application.

Which approach should be used to execute a groupby operation in parallel across all workers in Apache Spark 3.5?

Answer options

Correct answer: A

Explanation

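The correct choice is A because the applyInPandas API is designed for exactly this case: applying a pandas function in parallel to data grouped by key. It is called on a grouped Spark DataFrame, as in df.groupBy(...).applyInPandas(func, schema), so the pandas DataFrame must first be converted to a Spark DataFrame. Spark then passes each group to the function as a pandas DataFrame, runs the groups in parallel across the workers, and combines the results into a new Spark DataFrame with the declared schema. Option B is also valid, but it is less efficient than applyInPandas for a parallel groupby. Option C does not support parallel execution across workers. Option D defines a Pandas UDF but does not apply it in a way that suits a grouped operation in Spark.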
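
As an illustration of the pattern behind option A, here is a minimal sketch, not the exam's own code. The column names (user_id, session_seconds, centered), the sample data, and the function names center_session_length and run_demo are assumptions made up for this example. The PySpark import sits inside run_demo so that the pandas function can be tried on its own without a Spark installation.

```python
import pandas as pd


def center_session_length(pdf: pd.DataFrame) -> pd.DataFrame:
    """Plain pandas function; Spark calls it once per group.

    pdf holds all rows for a single user_id (hypothetical columns).
    """
    mean_seconds = pdf["session_seconds"].mean()
    # Add a column with each session's deviation from the user's mean.
    return pdf.assign(centered=pdf["session_seconds"] - mean_seconds)


def run_demo():
    """Convert a pandas DataFrame to Spark and apply the function per group."""
    # Imported here so the pandas function above stays usable without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("apply-in-pandas-sketch").getOrCreate()

    # Hypothetical user behaviour data held in pandas.
    pandas_df = pd.DataFrame(
        {
            "user_id": [1, 1, 2, 2],
            "session_seconds": [10.0, 30.0, 5.0, 15.0],
        }
    )

    # Step 1: hand the data to Spark so it can be distributed.
    spark_df = spark.createDataFrame(pandas_df)

    # Step 2: group by key and run the pandas function on each group.
    # The schema string must describe the columns the function returns.
    return spark_df.groupBy("user_id").applyInPandas(
        center_session_length,
        schema="user_id long, session_seconds double, centered double",
    )
```

With PySpark installed, run_demo().show() would display the centered values per user. Each group has to fit in the memory of the worker that processes it, since Spark materializes the whole group as one pandas DataFrame before calling the function.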