A data engineer is optimizing a Spark application that performs the following operations…

Question

A data engineer is optimizing a Spark application that performs the following operations in sequence:
1. Reads data from distributed storage into a DataFrame.
2. Applies a filter transformation to remove certain records.
3. Performs a join operation with another large DataFrame.
4. Applies a groupBy and agg to aggregate data.
5. Writes the result to storage.

Accepted Answer

Correct answer: A, D

A. The groupBy and agg operations may cause a shuffle, resulting in additional stages.
D. The join operation causes a shuffle, leading to a new stage in the execution plan.

Options A and D are correct because both the join and the groupBy/agg aggregation are wide transformations: each can require a shuffle that redistributes rows across partitions by key, and every shuffle marks a stage boundary in the execution plan.

Option B is incorrect because the application can consist of multiple jobs. Option C is misleading because automatic optimization cannot eliminate every shuffle. Option E is incorrect because the join is a wide transformation, so it cannot run in the same stage as the narrow transformations that precede it, such as the filter.
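The stage-boundary reasoning can be sketched with a toy model in plain Python. This is not Spark code and does not reproduce Spark's scheduler. The operation labels, the WIDE set, and the split_into_stages helper are all hypothetical names chosen for illustration, and the model assumes that every wide operation triggers a shuffle.

```python
# Toy model (assumption: not Spark's real scheduler). Each operation is
# labelled by hand; Spark itself derives shuffles from the physical plan.
WIDE = {"join", "groupBy+agg"}  # operations assumed to require a shuffle


def split_into_stages(ops):
    """Group a linear list of operations into stages, starting a new
    stage whenever a wide (shuffle) operation is encountered."""
    stages = [[]]
    for op in ops:
        # A wide operation closes the current stage, unless it is empty.
        if op in WIDE and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


# The five steps from the question, in order.
pipeline = ["read", "filter", "join", "groupBy+agg", "write"]
stages = split_into_stages(pipeline)

for i, stage in enumerate(stages):
    print(f"stage {i}: {stage}")
```

In this model the filter stays in the same stage as the read, while the join and the aggregation each open a new stage. A real plan has more detail: the second large DataFrame is read in a stage of its own, and if the data is already partitioned on the grouping key after the join, Spark may skip the second shuffle, which is why option A says "may". Calling explain() on the final DataFrame shows the Exchange operators where shuffles actually occur.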

Databricks Certified Associate Developer for Apache Spark — Question 196
