Databricks Certified Associate Developer for Apache Spark — Question 205
A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns 'user_id', 'product_id', and 'purchase_amount' and needs to perform some operations on this data efficiently.
Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?
Answer options
- A. df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")
- B. df.withColumn("discount", df.purchase_amount * 0.1).select("discount")
- C. df.withColumn("purchase_date", current_date()).where("total_purchase > 50")
- D. df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)
Correct answer: A
Explanation
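Option A is the keyed answer: filter is a narrow transformation that runs within each partition, while groupBy("user_id").sum("purchase_amount") is a wide transformation that requires a shuffle to bring rows with the same user_id together. Note that in A the narrow step comes first and the shuffle second, which is the reverse of the order stated in the question. Options B and C contain only narrow transformations (withColumn, select, where), so neither causes a shuffle; C also filters on total_purchase, a column that does not exist in df. In option D both steps shuffle: the groupBy aggregation requires one, and repartition(10) triggers another.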
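
The narrow-versus-wide reasoning can be tabulated by hand. The sketch below is plain Python, not PySpark: WIDE_STEPS, NARROW_STEPS, OPTIONS, classify, and shuffle_pattern are hypothetical names introduced here for illustration, and the classification table covers only the methods that appear in the four options.

```python
# Hypothetical lookup tables for illustration only; not a Spark API and not exhaustive.
# Wide steps move rows between partitions (shuffle); narrow steps do not.
WIDE_STEPS = {"groupBy+sum", "groupBy+agg", "repartition"}
NARROW_STEPS = {"filter", "where", "select", "withColumn"}

# Each answer option reduced to its chain of steps, in call order.
# A grouped aggregation (groupBy followed by sum/agg) is treated as one step.
OPTIONS = {
    "A": ["filter", "groupBy+sum"],
    "B": ["withColumn", "select"],
    "C": ["withColumn", "where"],
    "D": ["groupBy+agg", "repartition"],
}


def classify(step):
    """Return 'wide' if the step needs a shuffle, 'narrow' if it does not."""
    if step in WIDE_STEPS:
        return "wide"
    if step in NARROW_STEPS:
        return "narrow"
    raise ValueError(f"unclassified step: {step}")


def shuffle_pattern(steps):
    """Classify every step of a chain, preserving call order."""
    return [classify(step) for step in steps]


PATTERNS = {label: shuffle_pattern(steps) for label, steps in OPTIONS.items()}

for label, pattern in PATTERNS.items():
    print(label, " -> ".join(pattern))
```

On a real DataFrame, the same check can be made by calling explain() on each expression and looking for Exchange operators in the physical plan, which mark the points where Spark shuffles data.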