Databricks Certified Machine Learning Associate — Question 11

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Answer options

Correct answer: A

Explanation

The correct answer is A: it imports the pyspark.pandas module and converts the Spark DataFrame spark_df into a pandas-on-Spark DataFrame, which offers pandas-style syntax while the work still runs on Spark. Option B is incorrect because it attempts to convert the Spark DataFrame to a plain pandas DataFrame, which is a single-node object and not the pandas API on Spark. Option C is not relevant because it deals with SQL operations. Option D passes spark_df to the standard pandas DataFrame constructor, which does not work with Spark DataFrames. Option E likewise tries to convert the Spark DataFrame directly to a pandas DataFrame.
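
As a rough illustration of the conversion described above (the answer options themselves are not reproduced here, so this is a sketch and not necessarily the exact code in option A), the following assumes Spark 3.2 or later, where pyspark.pandas is available. The helper name as_pandas_on_spark is hypothetical and introduced only for this example.

```python
def as_pandas_on_spark(spark_df):
    """Wrap a Spark DataFrame in a pandas-on-Spark DataFrame.

    Hypothetical helper for illustration; assumes Spark 3.2+.
    """
    # Imported inside the function so the snippet can be loaded
    # on a machine without a Spark installation.
    import pyspark.pandas as ps

    # The pyspark.pandas DataFrame constructor accepts a Spark DataFrame.
    # The data stays distributed; only the API surface changes.
    return ps.DataFrame(spark_df)
```

In a notebook this might be used as psdf = as_pandas_on_spark(spark_df), after which pandas-style calls such as psdf.head() or psdf.describe() are available. On Spark 3.2+ the method spark_df.pandas_api() is another way to obtain the same kind of object.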