A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a…

Question

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?

Accepted Answer

Correct answer: A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to Parquet.

Option A is correct because spark.sql.files.maxPartitionBytes caps how many bytes Spark packs into each partition when it reads the source files. Narrow transformations preserve that partitioning, and each partition is normally written as one part-file, so the output lands near the 512 MB target without a shuffle. The other options involve either a shuffle or an unnecessary repartitioning step, which would degrade performance.
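The approach in option A can be sketched in PySpark as follows. This is a minimal illustration, not a definitive implementation: the function names, the `src_path` and `dst_path` arguments, and the choice of `dropna` as the narrow transformation are assumptions made for the example, and `spark` is assumed to be an existing SparkSession passed in by the caller.

```python
# Target part-file size from the question: 512 MB, expressed in bytes.
TARGET_BYTES = 512 * 1024 * 1024


def expected_partitions(total_bytes: int, max_partition_bytes: int) -> int:
    """Rough estimate of how many read partitions (and so part-files) to expect.

    Ceiling division of input size by the per-partition byte cap. Spark's real
    split planning also considers file boundaries, open cost, and parallelism,
    so treat the result as an approximation.
    """
    return (total_bytes + max_partition_bytes - 1) // max_partition_bytes


def ingest_json_to_parquet(spark, src_path: str, dst_path: str) -> None:
    """Hypothetical helper: read JSON, apply a narrow transformation, write Parquet."""
    # Cap the bytes packed into each input partition at read time.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(TARGET_BYTES))

    df = spark.read.json(src_path)

    # Narrow transformation only (a row-level filter), so no shuffle occurs
    # and the read-time partitioning carries through to the write.
    cleaned = df.dropna(how="all")

    # Each partition is written out as its own part-file.
    cleaned.write.parquet(dst_path)


# 1 TB of input (taken here as 1024**4 bytes) split into 512 MB partitions.
print(expected_partitions(1024**4, TARGET_BYTES))
```

The setting bounds the size of the input partitions, not the Parquet files themselves. Because Parquet is columnar and compressed, the written part-files will usually come out smaller than the JSON bytes that fed them, so the 512 MB figure should be read as an approximate target.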

Databricks Certified Data Engineer Professional — Question 135

Explanation