Databricks Certified Data Engineer Professional — Question 133
An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?
Answer options
- A. Iterate through an ordered set of changes to the table, applying each in turn to create the current state of the table, (insert, update, delete), timestamp of change, and the values.
- B. Use merge into to insert, update, or delete the most recent entry for each pk_id into a table, then propagate all changes throughout the system.
- C. Deduplicate records in each batch by pk_id and overwrite the target table.
- D. Use Delta Lake’s change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.
Correct answer: D
Explanation
The correct answer is D because Delta Lake’s change data feed efficiently processes CDC data and ensures all changes are propagated to related tables, fulfilling the auditing and analytical requirements. Option A does not maintain a comprehensive history of changes, while option B only updates the most recent entries without retaining historical data. Option C loses historical context by overwriting records instead of preserving past values.