Question

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns 'user_id', 'product_id', and 'purchase_amount' and needs to perform some operations on this data efficiently. Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Accepted Answer

Correct answer: A. df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")

Option A is correct because filter is a narrow transformation (each partition is processed on its own, with no data movement), whereas groupBy followed by sum is a wide transformation that requires a shuffle so that all rows for a given user_id end up in the same partition. The other options either use only operations that do not cause a shuffle (B and C) or repartition data without needing to shuffle first (D).
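The difference between the two steps in option A can be sketched with a toy model in plain Python. This is not Spark and not how Spark is implemented: the sample rows, the list-of-lists "partitions", and the helper names narrow_filter, shuffle_by_user, and sum_per_user are all hypothetical, chosen only to show which step can work partition by partition and which one has to move rows between partitions.

```python
import zlib
from collections import defaultdict

# Hypothetical toy data: each inner list stands in for one partition of df.
# Rows are (user_id, product_id, purchase_amount).
partitions = [
    [("u1", "p1", 150.0), ("u2", "p2", 80.0)],
    [("u1", "p3", 200.0), ("u3", "p1", 120.0)],
    [("u2", "p4", 300.0), ("u3", "p2", 50.0)],
]


def narrow_filter(parts, min_amount):
    """Narrow step: every output partition depends on exactly one input partition."""
    return [[row for row in part if row[2] > min_amount] for part in parts]


def shuffle_by_user(parts, num_partitions):
    """Wide step: redistribute rows so each user_id lands in a single partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in parts:
        for row in part:
            # Deterministic hash of the key decides the target partition.
            target = zlib.crc32(row[0].encode()) % num_partitions
            shuffled[target].append(row)
    return shuffled


def sum_per_user(parts):
    """Aggregate inside each partition; valid only once the rows are co-located."""
    totals = {}
    for part in parts:
        local = defaultdict(float)
        for user_id, _product_id, amount in part:
            local[user_id] += amount
        totals.update(local)
    return totals


filtered = narrow_filter(partitions, 100)   # no rows cross partitions
shuffled = shuffle_by_user(filtered, 2)     # rows move between partitions
totals = sum_per_user(shuffled)
print(sorted(totals.items()))  # [('u1', 350.0), ('u2', 300.0), ('u3', 120.0)]
```

In the toy data, user u1 has qualifying rows in two different partitions, so the sum cannot be completed without first bringing those rows together. In PySpark itself, calling explain() on the DataFrame from option A should show an Exchange step in the physical plan for the groupBy aggregation, and none for the filter.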

Databricks Certified Associate Developer for Apache Spark — Question 205
