Databricks Certified Data Engineer Professional — Question 102
A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
Answer options
- A. Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
- B. Parse the Delta Lake transaction log to identify all newly written data files.
- C. Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
- D. Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
- E. Use Delta Lake’s change data feed to identify those records that have been updated, inserted, or deleted.
Correct answer: A
Explanation
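The correct answer is A. Every write to a Delta Lake table, including a full overwrite, creates a new table version, and time travel lets a query read any retained version by number or timestamp. Immediately after the nightly job succeeds, the team can query the new version and the previous one and compare them, for example with EXCEPT or a join on the key columns. This works with the current implementation and needs no change to how the table is written.
The other options do not fit. The transaction log (B) records which data files were added and removed, not which rows differ, and after an overwrite every file is newly written. Spark event logs (C) contain job and stage execution details, not row-level changes. DESCRIBE HISTORY (D) returns operation metrics such as the number of files and rows written, but no log of individual records. Change data feed (E) is not enabled by default, and with a full nightly overwrite it would typically report the previous rows as deleted and the new rows as inserted rather than isolating the records that actually changed.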
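As an illustration of the time travel comparison named in option A, here is a minimal sketch in Python that only builds the Spark SQL strings. The helper name version_diff_queries and the version numbers are hypothetical and not part of the question; the VERSION AS OF clause and EXCEPT are standard Delta Lake and Spark SQL syntax.

```python
def version_diff_queries(table: str, new_version: int, old_version: int) -> dict:
    """Build Spark SQL strings comparing two versions of a Delta table.

    Hypothetical helper for illustration: it only formats SQL text and
    does not contact a cluster.
    """
    if old_version < 0 or new_version <= old_version:
        raise ValueError("expected 0 <= old_version < new_version")

    # Time travel: read the table as it was at a specific version.
    new = f"(SELECT * FROM {table} VERSION AS OF {new_version})"
    old = f"(SELECT * FROM {table} VERSION AS OF {old_version})"

    return {
        # Rows present in the new version but not in the old one
        # (inserted rows, plus the new image of updated rows).
        "added_or_changed": f"{new} EXCEPT {old}",
        # Rows present in the old version but not in the new one
        # (deleted rows, plus the old image of updated rows).
        "removed_or_changed": f"{old} EXCEPT {new}",
    }


# Assumed version numbers; in practice the latest version would come
# from the `version` column of DESCRIBE HISTORY customer_churn_params.
queries = version_diff_queries("customer_churn_params", 12, 11)
print(queries["added_or_changed"])
```

On a cluster with an active SparkSession, each string could be passed to spark.sql(...) to get a DataFrame of differing rows. EXCEPT compares whole rows and drops duplicates, so an updated record appears in both result sets; joining the two versions on a key column would be needed to pair old and new values.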