Databricks Certified Data Engineer Professional — Question 160
A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.
In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
Answer options
- A. Rely on Delta Lake schema enforcement to prevent duplicate records.
- B. VACUUM the Delta table after each batch completes.
- C. Perform an insert-only merge with a matching condition on a unique key.
- D. Perform a full outer join on a unique key and overwrite existing data.
Correct answer: C
Explanation
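The correct answer is C. An insert-only merge is a MERGE statement that matches incoming records to the target Delta table on a unique key and defines only a WHEN NOT MATCHED THEN INSERT clause. Records whose key already exists in the table are matched and skipped, so only new records are written. Duplicates within the incoming batch still have to be removed beforehand, as the question notes, because the merge only compares source rows against the target. The other options do not compare incoming records against existing ones at insert time. Schema enforcement (A) validates column names and data types, not the uniqueness of rows. VACUUM (B) removes data files that the table no longer references and does not examine individual records. A full outer join followed by an overwrite (D) rewrites the entire table instead of deduplicating records as they are inserted.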
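The insert-only merge described in option C can be illustrated with a small sketch. This is a plain-Python model of the semantics, not Delta Lake code: the function names `dedupe_batch` and `insert_only_merge`, the table names `events` and `new_events`, and the key column `event_id` are all hypothetical. The comment at the top shows roughly what the corresponding Delta Lake SQL would look like under those assumed names.

```python
# Roughly equivalent Delta Lake SQL (table and column names are assumptions):
#   MERGE INTO events AS t
#   USING new_events AS s
#   ON t.event_id = s.event_id
#   WHEN NOT MATCHED THEN INSERT *


def dedupe_batch(batch, key):
    """Step 1: keep the first row per key within the incoming batch."""
    seen = set()
    unique_rows = []
    for row in batch:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows


def insert_only_merge(target, batch, key):
    """Step 2: append only rows whose key is absent from the target.

    Rows that match an existing key are skipped, never updated.
    Returns the list of rows that were actually inserted.
    """
    existing_keys = {row[key] for row in target}
    inserted = [row for row in batch if row[key] not in existing_keys]
    target.extend(inserted)
    return inserted


# Previously processed records (stand-in for the Delta table).
target = [{"event_id": 1, "v": "a"}, {"event_id": 2, "v": "b"}]

# Incoming batch: event 2 arrives late as a duplicate, event 3 repeats in-batch.
batch = [
    {"event_id": 2, "v": "b"},
    {"event_id": 3, "v": "c"},
    {"event_id": 3, "v": "c"},
]

inserted = insert_only_merge(target, dedupe_batch(batch, "event_id"), "event_id")
```

The two steps are kept separate on purpose: the merge condition only compares incoming rows with the target, so repeated keys inside the batch are handled before the merge runs.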