Databricks Certified Associate Developer for Apache Spark — Question 73
Which of the following describes the difference between DataFrame.repartition(n) and DataFrame.coalesce(n)?
Answer options
- A. DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions.
- B. While the results are similar, DataFrame.repartition(n) will be more efficient than DataFrame.coalesce(n) because it can partition a DataFrame by column.
- C. DataFrame.repartition(n) will split a DataFrame into any number of new partitions while minimizing shuffling. DataFrame.coalesce(n) will split a DataFrame into any number of new partitions utilizing a full shuffle.
- D. While the results are similar, DataFrame.repartition(n) will be less efficient than DataFrame.coalesce(n) because it can partition a DataFrame by column.
- E. DataFrame.repartition(n) will combine the existing partitions of a DataFrame but may result in an uneven distribution of data across the new partitions. DataFrame.coalesce(n) will more slowly split a DataFrame into n number of new partitions with data distributed evenly.
Correct answer: A
Explanation
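The correct answer is A. DataFrame.repartition(n) performs a full shuffle and redistributes the rows across n new partitions of roughly equal size; it can either increase or decrease the number of partitions. DataFrame.coalesce(n) avoids a full shuffle by merging existing partitions, which makes it faster, but it can only reduce the partition count, and the merged partitions inherit whatever imbalance the originals had. Options B and D are wrong because repartition(n) called with only a number does not partition by column, and the ability to partition by column is not what determines the relative efficiency of the two methods. Option C reverses the shuffle behavior: repartition is the method that performs a full shuffle, and coalesce is the one that minimizes shuffling. Option E swaps the two descriptions given in option A.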
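The contrast between the two methods can be sketched with a toy model in plain Python. This is not Spark code and not how Spark implements either method: partitions are modeled as lists of rows, and the helper functions `repartition` and `coalesce` below are hypothetical stand-ins written for illustration. The sketch assumes round-robin redistribution for the shuffle and merging of adjacent partitions for the coalesce; Spark's real coalesce grouping also takes data locality into account.

```python
def repartition(partitions, n):
    """Toy full shuffle: every row is redistributed round-robin into n new partitions."""
    rows = [row for part in partitions for row in part]
    new_parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)
    return new_parts


def coalesce(partitions, n):
    """Toy narrow merge: whole existing partitions are concatenated into at most n groups."""
    if n >= len(partitions):
        # Like Spark's coalesce, this toy version never increases the partition count.
        return [list(part) for part in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each existing partition moves as a unit; its rows are never split up.
        merged[i * n // len(partitions)].extend(part)
    return merged


# Four skewed partitions holding 90, 5, 3, and 2 rows (100 rows in total).
skewed = [list(range(0, 90)), list(range(90, 95)), list(range(95, 98)), list(range(98, 100))]

print([len(p) for p in repartition(skewed, 2)])  # [50, 50]
print([len(p) for p in coalesce(skewed, 2)])     # [95, 5]
print(len(coalesce(skewed, 8)))                  # 4
```

In this model the shuffle touches every row but produces balanced partitions, while the merge moves no individual rows and simply carries the original skew into the result, which is the trade-off option A describes.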