Databricks Certified Associate Developer for Apache Spark — Question 196

A data engineer is optimizing a Spark application that performs the following operations in sequence:
1. Reads data from distributed storage into a DataFrame.
2. Applies a filter transformation to remove certain records.
3. Performs a join operation with another large DataFrame.
4. Applies a groupBy and agg to aggregate data.
5. Writes the result to storage.

Which two statements describe how Apache Spark constructs the execution hierarchy for this application? (Choose two.)

Answer options

Correct answer: A, D

Explanation

Options A and D are correct: the join and the groupBy/agg are both wide transformations that can trigger a shuffle, and each shuffle boundary starts a new stage in the execution plan. Option B is incorrect because a Spark application can consist of multiple jobs. Option C is misleading because automatic optimization cannot eliminate every shuffle. Option E is incorrect because a join between two large DataFrames is a wide transformation, so it cannot run in the same stage as the narrow transformations that precede it, such as the filter.
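
The stage-splitting rule behind this explanation can be sketched in a few lines of plain Python. This is a toy model, not Spark's scheduler: the operation labels, the NARROW and WIDE sets, and the split_into_stages helper are all hypothetical names chosen for illustration. It assumes a linear pipeline, treats the join and the groupBy/agg as always shuffling, and ignores the second input of the join (which in a real plan would typically be read in its own stage) as well as optimizations such as broadcast joins.

```python
# Toy model only: illustrates "a wide transformation starts a new stage".
# It does not call Spark and does not reproduce Spark's actual DAG scheduler.

# Hypothetical labels for the five steps in the question.
NARROW = {"read", "filter", "write"}   # no data movement between partitions
WIDE = {"join", "groupBy+agg"}         # assumed to require a shuffle here


def split_into_stages(ops):
    """Group a linear list of operations into stages.

    A new stage begins whenever a wide (shuffle-inducing) operation is
    reached; narrow operations are appended to the current stage.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE and stages[-1]:
            # Shuffle boundary: close the current stage and open a new one.
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["read", "filter", "join", "groupBy+agg", "write"]
stages = split_into_stages(pipeline)

for i, stage in enumerate(stages):
    print(f"stage {i}: {stage}")
```

In this simplified model the filter stays in the same stage as the read, while the join and the aggregation each open a new stage. On a real cluster, the physical plan (for example, from DataFrame.explain()) or the Spark UI is the authoritative view of where the shuffle boundaries fall.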