Databricks Certified Associate Developer for Apache Spark — Question 207
A developer wants to refactor some older Spark code in order to take advantage of built-in functions introduced in Spark 3.5.0.
The developer comes across the following existing DataFrame code:
import pyspark.sql.functions as F
min_price = 110.50
result_df = prices_df \
    .filter(F.col("spot_price") >= F.lit(min_price)) \
    .agg(F.count("*"))
Which code block should the developer use to refactor the code?
Answer options
- A. result_df = prices_df.withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))
- B. result_df = prices_df.agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))
- C. result_df = prices_df.agg(F.min("spot_price"), F.max("spot_price"))
- D. result_df = prices_df.agg(F.count("spot_price").alias("spot_price")).filter(F.col("spot_price") > F.lit("min_price"))
Correct answer: B
Explanation
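Option B is correct because `count_if`, added to `pyspark.sql.functions` in Spark 3.5.0, returns the number of rows for which a condition is true. It therefore replaces the original `filter` followed by `count("*")` with a single aggregate that counts the rows where `spot_price` is greater than or equal to `min_price`. Option A only adds a 0/1 flag column (and uses `>` rather than `>=`) without aggregating anything. Option C computes the minimum and maximum price instead of a count. Option D counts all non-null `spot_price` values first and then filters that aggregated result, so it never counts the rows that meet the condition.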