Databricks Certified Data Engineer Professional — Question 25
A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Streaming job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. A recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.
Which of the following likely explains these smaller file sizes?
Answer options
- A. Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations
- B. Z-order indices calculated on the table are preventing file compaction
- C. Bloom filter indices calculated on the table are preventing file compaction
- D. Databricks has autotuned to a smaller target file size based on the overall size of data in the table
- E. Databricks has autotuned to a smaller target file size based on the amount of data in each partition
Correct answer: A
Explanation
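A is correct because a CDC feed is applied with MERGE, and MERGE rewrites every data file that contains a matched row. When a Delta table is the target of frequent MERGE operations, Databricks can detect this and lower the target file size so that each update rewrites less data. Smaller files therefore shorten MERGE duration, which matters for an always-on streaming job. Auto Optimize also targets files far smaller than the 1 GB produced by the initial OPTIMIZE, so the large files are gradually replaced by small ones as rows are updated.
B and C are incorrect because neither Z-ordering nor Bloom filter indexes prevent file compaction. D is incorrect because autotuning based on table size moves in the opposite direction: larger tables get larger target file sizes, so a table over 10 TB would be tuned toward files of about 1 GB, not under 64 MB. E is incorrect because Databricks does not autotune file size based on the amount of data in each partition.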
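To make the reasoning against option D concrete, here is a minimal Python sketch of table-size-based autotuning. It assumes the thresholds described in the Databricks documentation (a 256 MB target below 2.56 TB, a 1 GB target above 10 TB, and roughly linear growth in between). The function name `autotuned_target_file_size` and the exact interpolation are illustrative assumptions, not a Databricks API.

```python
# Illustrative model only: thresholds are assumed from the Databricks docs,
# and the function name is hypothetical (not part of any Databricks library).
TB = 1024 ** 4
MB = 1024 ** 2


def autotuned_target_file_size(table_bytes: int) -> int:
    """Approximate the target file size (in bytes) chosen from table size alone."""
    low_size, low_target = 2.56 * TB, 256 * MB     # small tables: 256 MB files
    high_size, high_target = 10 * TB, 1024 * MB    # large tables: 1 GB files
    if table_bytes <= low_size:
        return low_target
    if table_bytes >= high_size:
        return high_target
    # Assumed linear growth between the two thresholds.
    fraction = (table_bytes - low_size) / (high_size - low_size)
    return int(low_target + fraction * (high_target - low_target))


# A table over 10 TB, as in the question, lands at the 1 GB end of the range.
target_mb = autotuned_target_file_size(12 * TB) // MB
```

Under these assumptions, size-based autotuning alone would push this table toward 1 GB files, so it cannot explain files under 64 MB. The MERGE-driven tuning described in option A is the one that works in the smaller direction.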