Databricks Certified Associate Developer for Apache Spark — Question 194
A developer is working with a pandas DataFrame containing user behaviour data from a web application.
Which approach should be used to execute a groupby operation in parallel across all workers in Apache Spark 3.5?
Answer options
- A. Use the applyInPandas API:
      import pandas as pd
      def mean_func(key, pdf):
          return pd.DataFrame([key + (pdf['value'].mean(),)])
      df.groupby('user_id').applyInPandas(mean_func, schema="user_id long, value double").show()
- B. Utilize the mapInPandas API:
      def mean_func(pdf_iter):
          for pdf in pdf_iter:
              yield pdf.groupby('user_id').agg({'value': 'mean'}).reset_index()
      df.mapInPandas(mean_func, schema="user_id long, value double").show()
- C. Use a regular Spark UDF (User-Defined Function):
      from pyspark.sql.functions import mean
      df.groupBy('user_id').agg(mean('value')).show()
- D. Create a Pandas UDF (User-Defined Function):
      from pyspark.sql.functions import pandas_udf
      @pandas_udf("double")
      def mean_func(value: pd.Series) -> float:
          return value.mean()
      df.groupby("user_id").agg(mean_func(df['value'])).show()
Correct answer: A
Explanation
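The correct choice is A. applyInPandas is the API designed for the split-apply-combine pattern: after groupby('user_id'), Spark passes all rows of each group to the function as one pandas DataFrame, and the groups are processed in parallel on the executors. Option B does not perform a true grouped operation: mapInPandas hands the function an iterator of arbitrary batches, so a user_id whose rows span several batches yields several partial means instead of one. Option C is mislabeled: the code calls the built-in mean aggregate rather than a UDF, so although Spark distributes that aggregation, it does not apply a pandas function to each group. Option D defines a Series-to-scalar pandas UDF, which Spark can also evaluate per group inside agg, but it can only return one scalar per group, whereas applyInPandas accepts any pandas function that maps a group's DataFrame to a result DataFrame.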
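The difference between options A and B can be sketched as follows. This is a minimal illustration, not a reference implementation: the sample rows, the names per_batch_means and run_on_spark, and the local[2] master are assumptions made for the example, and the Spark part assumes pyspark and pyarrow are installed.

```python
import pandas as pd


def mean_func(key, pdf):
    # applyInPandas style: `key` is a tuple of the grouping values and
    # `pdf` holds every row belonging to that group.
    return pd.DataFrame([key + (pdf["value"].mean(),)],
                        columns=["user_id", "value"])


def per_batch_means(pdf_iter):
    # mapInPandas style: each `pdf` is an arbitrary batch of rows, not a
    # group, so the means below are per batch rather than per user.
    for pdf in pdf_iter:
        yield pdf.groupby("user_id", as_index=False).agg({"value": "mean"})


def run_on_spark():
    # Hypothetical driver; pyspark is imported lazily so the two pandas
    # functions above can be tried without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("apply-in-pandas-sketch")
             .getOrCreate())
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ["user_id", "value"],
    )
    schema = "user_id long, value double"

    # One output row per user_id, each group handled as a whole.
    df.groupby("user_id").applyInPandas(mean_func, schema=schema).show()

    # May emit several rows for one user_id, depending on how the rows
    # happen to be split into batches.
    df.mapInPandas(per_batch_means, schema=schema).show()

    spark.stop()
```

Calling run_on_spark() in an environment with Spark shows both outputs. The batch effect can also be reproduced with pandas alone: if the five sample rows are fed to per_batch_means as two batches (the first three rows, then the last two), user 2 gets the partial means 3.0 and 7.5, whereas mean_func applied to all of user 2's rows returns 6.0.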