Databricks Certified Associate Developer for Apache Spark — Question 183
A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.
Which line of Spark code will produce a Parquet table that meets these requirements?
Answer options
- A. final_df \ .sort ("market_time") \ .write \ .format ("parquet") \ .mode ("overwrite") \ .saveAsTable ("output .market_events")
- B. final_df \ .orderBy("market_time") \ .write \ .format ("parquet") \ .mode ("overwrite") \ .saveAsTable ("output .market_events")
- C. final_df \ .sort ("market_time") \ .coalesce(1) \ .write \ .format ("parquet") \ .mode ("overwrite") \ .saveAsTable ("output .market_events")
- D. final_df \ .sortWithinPartitions("market_time") \ .write \ .format ("parquet") \ .mode ("overwrite") \ .saveAsTable ("output .market_events")
Correct answer: C
Explanation
The correct answer is C because it sorts the records by market_time and uses coalesce(1) to ensure all data is written to a single file. Options A and B do not use coalesce(1), which could lead to multiple output files, and option D sorts only within partitions, which does not guarantee the overall sort order required for the final Parquet table.