Databricks Certified Associate Developer for Apache Spark — Question 20
A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?
Answer options
- A. Either DataFrame can be broadcasted. Their results will be identical in result and efficiency.
- B. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
- C. DataFrame A should be broadcasted because it is larger and will eliminate the need for the shuffling of DataFrame B.
- D. DataFrame B should be broadcasted because it is smaller and will eliminate the need for the shuffling of DataFrame A.
- E. DataFrame A should be broadcasted because it is smaller and will eliminate the need for the shuffling of itself.
Correct answer: D
Explanation
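The correct choice is D. In a broadcast join, Spark sends a full copy of the broadcast DataFrame to every executor, so the other DataFrame can be joined where its partitions already reside, without being shuffled across the network. Broadcasting the 1 GB DataFrame B therefore avoids shuffling the 128 GB DataFrame A. Option A is incorrect because the choice strongly affects efficiency: broadcasting A would copy 128 GB to every executor. Options C and E both broadcast the larger DataFrame A, and E also wrongly describes A as the smaller one. Option B picks the right DataFrame but gives the wrong reason: the benefit is avoiding the shuffle of the large DataFrame A, not of B itself.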
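The reasoning above can be checked with some rough arithmetic. The sketch below is a deliberately simplified model of network traffic, not how Spark actually estimates join cost: it assumes a broadcast sends one full copy of the broadcast side to each executor, and that a shuffle join moves both sides across the network in the worst case. The executor count of 10 and all function names are hypothetical.

```python
# Simplified traffic model for illustration only; Spark's real planner is more involved.

def broadcast_traffic_gb(broadcast_size_gb: float, num_executors: int) -> float:
    """Every executor receives a full copy of the broadcast side.
    The other side stays in place, so it adds no traffic in this model."""
    return broadcast_size_gb * num_executors


def shuffle_traffic_gb(left_size_gb: float, right_size_gb: float) -> float:
    """Worst case for a shuffle join: both sides are repartitioned by key."""
    return left_size_gb + right_size_gb


size_a_gb = 128.0    # DataFrame A from the question
size_b_gb = 1.0      # DataFrame B from the question
num_executors = 10   # hypothetical cluster size

broadcast_b = broadcast_traffic_gb(size_b_gb, num_executors)
broadcast_a = broadcast_traffic_gb(size_a_gb, num_executors)
shuffle_both = shuffle_traffic_gb(size_a_gb, size_b_gb)

print(f"Broadcast B: {broadcast_b} GB moved")
print(f"Broadcast A: {broadcast_a} GB moved")
print(f"Shuffle join: {shuffle_both} GB moved")
```

Under these assumptions, broadcasting B moves about 10 GB, a shuffle join moves about 129 GB, and broadcasting A moves about 1,280 GB, which is why only the small side is worth broadcasting. In PySpark, the hint itself is typically written as `df_a.join(broadcast(df_b), "key")`, with `broadcast` imported from `pyspark.sql.functions`; the names `df_a`, `df_b`, and `"key"` are placeholders here.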