Databricks Certified Associate Developer for Apache Spark — Question 197
A data engineer has a PySpark DataFrame named sales_data containing millions of rows. They need to calculate the total sales amount grouped by region using Spark SQL for improved readability and maintainability.
Which steps should the data engineer take to achieve this?
Answer options
- A. Write the query directly on the DataFrame using PySpark’s ‘groupBy’ and ‘agg’ methods.
- B. Export the DataFrame as a CSV file, read it into a relational database, and write SQL queries in the database.
- C. Use the ‘createOrReplaceTempView’ method to register ‘sales_data’ as a temporary view and write the query in Spark SQL.
- D. Convert ‘sales_data’ to a Pandas DataFrame and use SQL-like queries with Pandas to achieve the grouping.
Correct answer: C
Explanation
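The correct answer is C. Calling 'createOrReplaceTempView' registers the DataFrame as a temporary view, so the data engineer can refer to it by name in a query passed to the SparkSession's 'sql' method and express the aggregation as an ordinary SQL statement. Option A produces the same result, but it uses the DataFrame API rather than Spark SQL, so it does not meet the stated requirement. Option B is inefficient because it adds an unnecessary export and import and moves the work out of Spark. Option D is unsuitable for millions of rows because converting to a Pandas DataFrame collects all the data onto the driver, which gives up distributed processing and risks running out of memory.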