Databricks Certified Data Engineer Professional — Question 49
Which statement describes the correct use of pyspark.sql.functions.broadcast?
Answer options
- A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.
- B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.
- C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.
- D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.
- E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
Correct answer: D
Explanation
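D is correct: pyspark.sql.functions.broadcast takes a DataFrame and marks it as small enough to be copied in full to every executor, which hints to the optimizer that it can use a broadcast join and avoid shuffling the larger side of the join. Options A and B are wrong because the function applies to a DataFrame, not a column. Options C and E are wrong because the function does not cache anything: it is a hint attached to a DataFrame within a query plan, so nothing is written to storage volumes, shared across clusters in a workspace, or retained for future queries.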