A developer is working with a pandas DataFrame containing user behaviour data from a web…

Question

A developer is working with a pandas DataFrame containing user behaviour data from a web application. Which approach should be used to execute a groupby operation in parallel across all workers in Apache Spark 3.5?

Accepted Answer

Correct answer: A. Use the applyInPandas API:

import pandas as pd

def mean_func(key, pdf):
    # key is a tuple of the grouping values; pdf is a pandas DataFrame holding one group's rows
    return pd.DataFrame([key + (pdf['value'].mean(),)])

df.groupby('user_id').applyInPandas(mean_func, schema="user_id long, value double").show()

The correct choice is A because the applyInPandas API is designed for applying pandas functions in parallel to data grouped by key in Spark: each group is passed to the function as a pandas DataFrame, and the function runs on the workers. Option B, while also valid, is less efficient than applyInPandas for executing groupby operations in parallel. Option C does not support parallel execution across workers. Option D defines a Pandas UDF but does not apply it in the most effective way for a groupby operation in Spark.
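The accepted answer can be tried end to end with a minimal sketch like the one below. The sample rows, the run_with_spark helper and the local[2] master are assumptions added for illustration and are not part of the original question. The explicit column names on the returned DataFrame are also an addition; they let Spark match the result to the schema by name rather than by position. The Spark part needs a working PySpark installation with PyArrow, so it is kept inside a function that is not called automatically.

```python
import pandas as pd


def mean_func(key, pdf):
    # key is a tuple of the grouping values, e.g. (1,) for user_id == 1.
    # pdf is a pandas DataFrame containing only the rows of that group.
    return pd.DataFrame(
        [key + (pdf["value"].mean(),)],
        columns=["user_id", "value"],  # names match the output schema below
    )


def run_with_spark():
    # Hypothetical helper: imports PySpark lazily so that mean_func can be
    # checked with plain pandas on a machine without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("applyInPandas-sketch")
        .getOrCreate()
    )
    # Made-up sample data standing in for the user behaviour DataFrame.
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("user_id", "value"),
    )
    # Each group is handed to mean_func as a pandas DataFrame on the workers.
    df.groupby("user_id").applyInPandas(
        mean_func, schema="user_id long, value double"
    ).show()
    spark.stop()


# Local check of the per-group logic, without starting Spark.
sample = pd.DataFrame({"user_id": [1, 1], "value": [1.0, 2.0]})
local_result = mean_func((1,), sample)
```

Calling run_with_spark() on a machine with PySpark installed should print one row per user_id with the mean of value for that group. Because each group is loaded into memory as a single pandas DataFrame, very large or heavily skewed groups can put memory pressure on individual workers.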

Databricks Certified Associate Developer for Apache Spark — Question 194
