Databricks Certified Associate Developer for Apache Spark — Question 197

A data engineer has a PySpark DataFrame named sales_data containing millions of rows. They need to calculate the total sales amount grouped by region using Spark SQL for improved readability and maintainability.

Which steps should the data engineer take to achieve this?

Answer options

Correct answer: C

Explanation

The correct answer is C. Calling createOrReplaceTempView on sales_data registers the DataFrame as a temporary view, so the data engineer can aggregate it with an ordinary SQL query passed to spark.sql, which keeps the logic readable and maintainable. Option A is functional but does not use Spark SQL, so it does not meet the stated requirement. Option B is inefficient because it exports and re-imports the data unnecessarily. Option D is unsuitable for large datasets because Pandas processes data in memory on a single machine, which does not scale well to millions of rows.