Databricks Certified Data Engineer Professional — Question 191
Which statement describes the correct use of pyspark.sql.functions.broadcast?
Answer options
- A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
- B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
- C. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
- D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
Correct answer: D
Explanation
The correct answer is D. pyspark.sql.functions.broadcast takes a DataFrame and marks it as small enough to be copied in full to every executor, which lets Spark plan a broadcast join and avoid shuffling the larger side of the join. Options A and B are wrong because the function applies to a DataFrame, not a column; option A also confuses the hint with how distinct values map to partitions. Option C is wrong because the function is a hint for joins in the query that uses the marked DataFrame; it does not cache the table on every node or keep it available for later queries.