Databricks Certified Machine Learning Associate — Question 2
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.
Which of the following code blocks will accomplish this task?
Answer options
- A. spark_df[spark_df["price"] > 0]
- B. spark_df.filter(col("price") > 0)
- C. SELECT * FROM spark_df WHERE price > 0
- D. spark_df.loc[spark_df["price"] > 0,:]
- E. spark_df.loc[:,spark_df["price"] > 0]
Correct answer: B
Explanation
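The correct answer is B. filter is the standard DataFrame method for keeping only the rows that satisfy a condition, and col("price") > 0 builds that condition as a Column expression (col must be imported from pyspark.sql.functions). Option A is pandas-style boolean indexing; PySpark does accept a Column inside the brackets and treats it as a filter, so it is better described as non-idiomatic than as unsupported, but it is not the answer the exam expects. Option C is a SQL statement rather than DataFrame code; it would run only if the DataFrame were registered as a view and the query were passed to spark.sql. Options D and E use loc, a pandas indexer that Spark DataFrames do not provide; E would additionally put the condition in the column position instead of the row position.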