Databricks Certified Associate Developer for Apache Spark — Question 205

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns ‘user_id’, ‘product_id’, and ‘purchase_amount’ and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Answer options

Correct answer: A

Explanation

Option A is correct because it first filters the data, which is a narrow transformation (each output partition depends on a single input partition), and then applies a groupBy, which necessitates a shuffle. Options B and C perform only operations that do not cause a shuffle, and option D repartitions the data without needing to shuffle first.
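
The difference between a narrow transformation and one that needs a shuffle can be sketched with a toy model in plain Python. This is not Spark code and not how Spark is implemented: the partition layout, the sample rows, and the helper names (narrow_filter, shuffle_by_user, sum_per_user) are all assumptions made up for illustration. The point is only that a filter can run on each partition independently, while grouping by user_id first has to move rows between partitions.

```python
from collections import defaultdict

# Hypothetical data: a "DataFrame" modelled as a list of partitions, each
# holding (user_id, product_id, purchase_amount) rows.
partitions = [
    [("u1", "p1", 120.0), ("u2", "p2", 40.0)],
    [("u1", "p3", 200.0), ("u3", "p1", 150.0)],
]


def narrow_filter(parts, min_amount):
    """Narrow step: each output partition is built from exactly one input partition."""
    return [[row for row in part if row[2] > min_amount] for part in parts]


def shuffle_by_user(parts, num_partitions):
    """Wide step: redistribute rows so every row for a user_id shares a partition.

    Returns the new partitions and the number of rows that changed partition.
    """
    out = [[] for _ in range(num_partitions)]
    moved = 0
    for src, part in enumerate(parts):
        for row in part:
            # Deterministic stand-in for hash partitioning on user_id.
            dest = sum(map(ord, row[0])) % num_partitions
            out[dest].append(row)
            if dest != src:
                moved += 1
    return out, moved


def sum_per_user(parts):
    """After the shuffle, each partition can aggregate its own users locally."""
    totals = {}
    for part in parts:
        local = defaultdict(float)
        for user_id, _product_id, amount in part:
            local[user_id] += amount
        totals.update(local)
    return totals


filtered = narrow_filter(partitions, 100.0)    # no rows cross partitions
shuffled, moved = shuffle_by_user(filtered, 2)  # rows cross partitions here
totals = sum_per_user(shuffled)

print("rows moved during shuffle:", moved)
print("total purchase per user:", totals)
```

In this sketch the filter never looks outside its own partition, whereas the grouping step cannot produce a correct per-user total until rows for the same user_id have been brought together, which is the data movement a shuffle refers to.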