A data engineer has been asked to produce a Parquet table which is overwritten every day…

Question

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field. Which line of Spark code will produce a Parquet table that meets these requirements?

Accepted Answer

Correct answer: C.

final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Option C is correct because sort("market_time") orders all records globally, and coalesce(1) then merges the result into a single partition, so the table is written as one file with the records in market_time order. mode("overwrite") replaces the table contents on each daily run. Options A and B do not use coalesce(1), which could lead to multiple output files, and option D sorts only within partitions, which does not guarantee the overall sort order required for the final Parquet table.
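The difference between sorting within partitions and a global sort written as one file can be sketched with a toy model in plain Python. This is not Spark code: partitions are modeled as plain lists, and the helper names (sort_within_partitions, global_sort_single_file, read_back, is_sorted) are hypothetical, chosen only for illustration.

```python
from typing import List

Partition = List[int]


def sort_within_partitions(partitions: List[Partition]) -> List[Partition]:
    # Each partition is sorted independently; nothing is ordered across partitions.
    return [sorted(p) for p in partitions]


def global_sort_single_file(partitions: List[Partition]) -> List[Partition]:
    # All values are ordered together and kept in one output partition,
    # loosely mirroring a global sort followed by a merge into a single file.
    return [sorted(v for p in partitions for v in p)]


def read_back(partitions: List[Partition]) -> List[int]:
    # A consumer reading the output partition by partition, in order.
    return [v for p in partitions for v in p]


def is_sorted(values: List[int]) -> bool:
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical market_time values spread over three partitions.
partitions = [[5, 1, 9], [4, 8, 2], [7, 3, 6]]

within = read_back(sort_within_partitions(partitions))
single_file = read_back(global_sort_single_file(partitions))

print(within, is_sorted(within))
print(single_file, is_sorted(single_file))
```

In this model the per-partition sort leaves each chunk ordered but the full sequence unordered, while the global sort written as a single partition reads back fully ordered. That is the property the downstream consumer in the question depends on.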

Databricks Certified Associate Developer for Apache Spark — Question 183
