Databricks Certified Associate Developer for Apache Spark — Question 41
The code block below contains a logical error that makes it inefficient. It is intended to perform an efficient broadcast join of DataFrame storesDF and the much larger DataFrame employeesDF on the key column storeId. Identify the logical error.
Code block:
storesDF.join(broadcast(employeesDF), "storeId")
Answer options
- A. The larger DataFrame employeesDF is being broadcasted rather than the smaller DataFrame storesDF.
- B. There is never a need to call the broadcast() operation in Apache Spark 3.
- C. The entire line of code should be wrapped in broadcast() rather than just DataFrame employeesDF.
- D. The broadcast() operation will only perform a broadcast join if the Spark property spark.sql.autoBroadcastJoinThreshold is manually set.
- E. Only one of the DataFrames is being broadcasted rather than both of the DataFrames.
Correct answer: A
Explanation
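The correct answer is A. A broadcast join ships a full copy of the broadcast DataFrame to every executor, so it is only efficient when the smaller DataFrame, storesDF, is the one being broadcast. The code block broadcasts the much larger employeesDF instead. A corrected call would be employeesDF.join(broadcast(storesDF), "storeId"). Option B is incorrect because an explicit broadcast() hint is still useful in Spark 3, for example when Spark's size estimate for a DataFrame exceeds the automatic broadcast threshold. Option C is incorrect because broadcast() marks a single DataFrame as the side to broadcast; wrapping the whole line would mark the join result and would not change how the join itself is executed. Option D is incorrect because an explicit broadcast() hint does not depend on spark.sql.autoBroadcastJoinThreshold; that property only controls when Spark chooses a broadcast join automatically. Option E is incorrect because a broadcast join broadcasts one side only; the problem is which DataFrame is broadcast, not how many.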
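To make the cost difference concrete, here is a minimal pure-Python sketch of a broadcast hash join. It is not Spark code and not how Spark is implemented internally; the helper names (partition, broadcast_hash_join), the toy data, and the choice of four partitions are all assumptions made for illustration. It counts how many rows would be copied to the partitions when each side is broadcast.

```python
from collections import defaultdict

def partition(rows, n):
    """Split rows into n partitions round-robin (hypothetical helper)."""
    return [rows[i::n] for i in range(n)]

def broadcast_hash_join(streamed_partitions, broadcast_rows, key):
    """Inner-join each streamed partition against a full copy of broadcast_rows.

    Returns the joined rows and the number of rows shipped, assuming one
    copy of the broadcast side is sent to every partition.
    """
    # Build the hash table once from the broadcast side.
    lookup = defaultdict(list)
    for row in broadcast_rows:
        lookup[row[key]].append(row)

    joined = []
    rows_shipped = 0
    for part in streamed_partitions:
        # Each partition receives its own copy of the broadcast side.
        rows_shipped += len(broadcast_rows)
        for row in part:
            for match in lookup.get(row[key], []):
                joined.append({**match, **row})
    return joined, rows_shipped

# Toy data: 3 stores, 12 employees (the "much larger" side).
stores = [{"storeId": i, "city": f"city-{i}"} for i in range(3)]
employees = [{"storeId": i % 3, "employeeId": i} for i in range(12)]

# Intended plan: broadcast the small side, stream the large side.
good, shipped_good = broadcast_hash_join(partition(employees, 4), stores, "storeId")

# Plan from the question: broadcast the large side, stream the small side.
bad, shipped_bad = broadcast_hash_join(partition(stores, 4), employees, "storeId")

print(shipped_good, shipped_bad)
```

Both plans produce the same joined rows, but in this toy setup the plan from the question copies four times as many rows, and the gap grows with the size difference between the two DataFrames.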