Databricks Certified Machine Learning Associate — Question 41
A data scientist wants to explore summary statistics for the Spark DataFrame spark_df. Specifically, they want to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.
Which of the following lines of code can the data scientist run to accomplish the task?
Answer options
- A. spark_df.summary()
- B. spark_df.stats()
- C. spark_df.describe().head()
- D. spark_df.printSchema()
- E. spark_df.toPandas()
Correct answer: A
Explanation
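The correct answer is A. By default, summary() returns the count, mean, standard deviation, minimum, approximate 25th, 50th, and 75th percentiles, and maximum for each numeric column. The IQR is not printed as its own row, but it is the difference between the 75% and 25% values. Option B fails because Spark DataFrames have no stats() method. Option C falls short because describe() reports only the count, mean, standard deviation, minimum, and maximum, with no quartiles, and head() returns just the first row of that result. Option D prints the schema (column names and types) rather than statistics, and option E only converts the DataFrame to a pandas DataFrame without computing any statistics.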