Databricks Certified Associate Developer for Apache Spark — Question 20

A Spark application has a 128 GB DataFrame A and a 1 GB DataFrame B. If a broadcast join were to be performed on these two DataFrames, which of the following describes which DataFrame should be broadcasted and why?

Answer options

Correct answer: D

Explanation

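The correct choice is D. In a broadcast join, a full copy of the broadcast DataFrame is sent to every executor, so each partition of the other DataFrame can be joined locally without being shuffled across the network. Broadcasting the 1 GB DataFrame B therefore avoids shuffling the 128 GB DataFrame A, and the copy each executor receives is small. Broadcasting A instead would mean sending 128 GB to every executor. Options A and C are incorrect because they overlook the efficiency gained by broadcasting the smaller DataFrame. Option B attributes the benefit to avoiding a shuffle of DataFrame B, but the purpose of the broadcast is to avoid shuffling the larger DataFrame A.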
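The mechanics behind this answer can be illustrated with a minimal pure-Python sketch. It is not Spark code and makes no claim about Spark's internals: the function name `broadcast_hash_join` and the toy data are hypothetical, with lists of rows standing in for partitions. It only shows why the large side stays in place when the small side is copied everywhere.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against a full copy of the small side."""
    # Build the lookup table once from the small side. In a real cluster this
    # is the copy that would be sent to every executor.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined_partitions = []
    for partition in large_partitions:
        # Each large-side partition is joined where it already lives:
        # no large-side row moves to another partition.
        joined = [
            (key, left_value, right_value)
            for key, left_value in partition
            for right_value in lookup.get(key, [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical toy data: A stands in for the 128 GB side, B for the 1 GB side.
A_partitions = [
    [(1, "a1"), (2, "a2")],
    [(2, "a3"), (3, "a4")],
]
B_rows = [(1, "b1"), (2, "b2")]

result = broadcast_hash_join(A_partitions, B_rows)
print(result)
```

The output keeps the same number of partitions as A, which is the point of the sketch: only B was copied. In PySpark itself, the broadcast can be requested with the `broadcast` hint from `pyspark.sql.functions`, for example `A.join(broadcast(B), "key")`, where the column name `"key"` is an assumption for illustration.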