Databricks Certified Data Engineer Professional — Question 47

A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto Optimize and Auto Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Answer options

Correct answer: A

Explanation

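Option A is correct. spark.sql.files.maxPartitionBytes caps the number of bytes Spark packs into a single partition when reading files, so setting it to 512 MB makes the 1 TB input load as roughly 2,048 partitions. As long as only narrow transformations are applied before the write, that partitioning carries through to the output and each task writes one part file, with no shuffle. The resulting file sizes are approximate: the setting governs input partition size, and compressed Parquet output is typically smaller than the JSON that was read. The other options either introduce a shuffle, as operations like repartition() or a sort would, or do not control partition size at read time; spark.sql.shuffle.partitions, for instance, only takes effect when a shuffle already occurs.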
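The partition arithmetic behind this answer can be sketched in plain Python. This is a simplified estimate, not Spark's actual split logic: it assumes binary units (1 TB = 1024^4 bytes), splittable line-delimited JSON, and ignores file boundaries and per-file open cost. The helper name estimate_read_partitions and the paths in the comments are hypothetical.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def estimate_read_partitions(total_bytes: int, max_partition_bytes: int) -> int:
    """Rough estimate of read partitions: total input size divided by the
    per-partition byte cap, rounded up. Ignores file boundaries and open cost."""
    return math.ceil(total_bytes / max_partition_bytes)


# Value that would be assigned to spark.sql.files.maxPartitionBytes.
max_partition_bytes = 512 * MB

# With only narrow transformations, each read partition becomes one write task,
# so this is also the approximate number of Parquet part files.
num_partitions = estimate_read_partitions(1 * TB, max_partition_bytes)
print(num_partitions)  # 2048

# Hypothetical PySpark usage (placeholder paths, not from the source):
# spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))
# df = spark.read.json("/path/to/input")                  # read in ~512 MB partitions
# df.write.mode("overwrite").parquet("/path/to/output")   # no repartition, no shuffle
```

The count is the same whether or not the output lands at exactly 512 MB per file; actual part-file sizes depend on how well the data compresses in Parquet.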