Databricks Certified Associate Developer for Apache Spark — Question 196
A data engineer is optimizing a Spark application that performs the following operations in sequence:
1. Reads data from distributed storage into a DataFrame.
2. Applies a filter transformation to remove certain records.
3. Performs a join operation with another large DataFrame.
4. Applies a groupBy and agg to aggregate data.
5. Writes the result to storage.
Which two statements describe how Apache Spark constructs the execution hierarchy for this application? (Choose two.)
Answer options
- A. The groupBy and agg operations may cause a shuffle, resulting in additional stages.
- B. The application will execute as a single job initiated by the write action.
- C. The DataFrame API eliminates all shuffles through automatic optimization.
- D. The join operation causes a shuffle, leading to a new stage in the execution plan.
- E. The filter and join transformations are executed in the same stage because they are narrow transformations.
Correct answer: A, D
Explanation
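Options A and D are correct. Both inputs to the join are described as large, so a broadcast join is unlikely and Spark has to shuffle the data by join key, which starts a new stage. The groupBy and agg step may cause a further shuffle and therefore another stage; the wording "may" matters, because Spark can skip that shuffle when the data is already partitioned on the grouping keys. Option B is incorrect because a single action does not guarantee a single job: the application can consist of multiple jobs. Option C is incorrect because the optimizer can reduce or avoid some shuffles but cannot eliminate all of them. Option E is incorrect because only the filter is a narrow transformation. The join is a wide transformation, so it cannot run in the same stage as the filter that precedes it.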
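The stage-splitting rule behind options A, D, and E can be sketched with a toy model in plain Python. This is not Spark code and does not reproduce Spark's scheduler: the `Op` class, the `wide` flag, and `split_into_stages` are hypothetical names introduced only for illustration. The model assumes that both the join and the aggregation shuffle, and it ignores details such as the second join input having its own stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One step of the pipeline (hypothetical helper, not a Spark class)."""
    name: str
    wide: bool  # True if this step is assumed to need a shuffle


def split_into_stages(ops):
    """Group consecutive ops into stages, opening a new stage at each wide op."""
    stages = [[]]
    for op in ops:
        # A wide op reads shuffled data, so it cannot share a stage
        # with the ops that produced that data.
        if op.wide and stages[-1]:
            stages.append([])
        stages[-1].append(op.name)
    return stages


# The five steps from the question, with the assumed narrow/wide labels.
pipeline = [
    Op("read", wide=False),
    Op("filter", wide=False),
    Op("join", wide=True),
    Op("groupBy/agg", wide=True),
    Op("write", wide=False),
]

stages = split_into_stages(pipeline)
for index, stage in enumerate(stages):
    print(index, stage)
```

In this model the read and filter share the first stage, the join opens a second stage, and the aggregation opens a third that also contains the write. On a real cluster, calling `explain()` on the DataFrame before writing shows the physical plan, where `Exchange` operators mark the actual shuffle boundaries.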