Question

A data engineer has a PySpark DataFrame named sales_data containing millions of rows. They need to calculate the total sales amount grouped by region using Spark SQL for improved readability and maintainability. Which steps should the data engineer take to achieve this?

Accepted Answer

Correct answer: C. Use the 'createOrReplaceTempView' method to register 'sales_data' as a temporary view and write the query in Spark SQL.

C is correct because 'createOrReplaceTempView' registers the DataFrame under a name that Spark SQL can query directly, which gives the readability and maintainability the question asks for. Option A is functional but does not use Spark SQL, so it does not meet the stated requirement. Option B is inefficient because it exports and re-imports the data unnecessarily. Option D is unsuitable for a dataset of this size because pandas holds all of the data in memory on a single machine.

Databricks Certified Associate Developer for Apache Spark — Question 197
