Databricks Certified Associate Developer for Apache Spark — Question 9
Which of the following code blocks will most quickly return an approximation for the number of distinct values in column division in DataFrame storesDF?
Answer options
- A. storesDF.agg(approx_count_distinct(col("division")).alias("divisionDistinct"))
- B. storesDF.agg(approx_count_distinct(col("division"), 0.01).alias("divisionDistinct"))
- C. storesDF.agg(approx_count_distinct(col("division"), 0.15).alias("divisionDistinct"))
- D. storesDF.agg(approx_count_distinct(col("division"), 0.0).alias("divisionDistinct"))
- E. storesDF.agg(approx_count_distinct(col("division"), 0.05).alias("divisionDistinct"))
Correct answer: C
Explanation
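Option C is correct because the second argument to approx_count_distinct is rsd, the maximum relative standard deviation allowed in the estimate, and C sets the largest value (0.15). A looser error bound lets Spark keep a smaller sketch per group, so the estimate generally returns fastest. Option A omits the argument and therefore falls back to the default rsd of 0.05, the same tolerance as option E. Options B (0.01) and E (0.05) ask for tighter bounds than C and so require more work. Option D (0.0) asks for no error at all, which defeats the purpose of an approximation; the Spark documentation notes that for rsd below 0.01 an exact distinct count is more efficient.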
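
To make the speed argument concrete, the sketch below estimates how the size of a HyperLogLog-style sketch grows as rsd shrinks. It is plain Python, not Spark code. The sizing rule (precision bits = ceil(2 * log2(1.106 / rsd))) and the constant 1.106 are assumptions based on the HyperLogLog++ approach that Spark's approx_count_distinct is understood to use; the function names are hypothetical and exist only for illustration.

```python
import math

# Assumption: constant from the HyperLogLog++ sizing rule; illustrative only,
# not a documented Spark API contract.
HLL_ERROR_CONSTANT = 1.106


def precision_bits(rsd: float) -> int:
    """Number of addressing bits needed for a target relative standard deviation."""
    if rsd <= 0:
        # rsd = 0.0 (option D) has no finite sketch size under this rule.
        raise ValueError("rsd must be greater than 0")
    return math.ceil(2.0 * math.log2(HLL_ERROR_CONSTANT / rsd))


def register_count(rsd: float) -> int:
    """Number of registers the sketch would hold: 2 ** precision_bits."""
    return 2 ** precision_bits(rsd)


# Compare the tolerances that appear in the answer options.
for rsd in (0.15, 0.05, 0.01):
    print(f"rsd={rsd}: {precision_bits(rsd)} bits, {register_count(rsd)} registers")
```

Under these assumptions, rsd=0.15 needs 64 registers, rsd=0.05 needs 512, and rsd=0.01 needs 16384, which is why the loosest tolerance in option C is the cheapest to compute and merge.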