Databricks Certified Data Engineer Professional — Question 37
A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.
In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
Answer options
- A. Set the configuration delta.deduplicate = true.
- B. VACUUM the Delta table after each batch completes.
- C. Perform an insert-only merge with a matching condition on a unique key.
- D. Perform a full outer join on a unique key and overwrite existing data.
- E. Rely on Delta Lake schema enforcement to prevent duplicate records.
Correct answer: C
Explanation
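The correct answer is C. An insert-only merge is a MERGE with a matching condition on the unique key and only a WHEN NOT MATCHED THEN INSERT clause. Each incoming record is compared against the rows already in the Delta table: records whose key already exists are matched and skipped, and only records with new keys are inserted. Option A is wrong because Delta Lake has no delta.deduplicate setting. Option B is wrong because VACUUM removes data files that the table no longer references; it does not inspect or remove duplicate rows. Option D could produce a deduplicated result, but it rewrites the whole table on every batch instead of deduplicating records as they are inserted. Option E is wrong because schema enforcement validates column names and data types, not the uniqueness of values.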
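The behavior of an insert-only merge can be sketched in plain Python. This is only an illustrative model, not Delta Lake code: a dict stands in for the target table, and the names insert_only_merge, events, new_events, and event_id are hypothetical. The SQL string shows roughly what the equivalent Delta Lake statement might look like under those assumed names.

```python
# Illustrative model only: a dict keyed by the unique key stands in for the Delta table.
# Rough Delta SQL equivalent (table and column names are assumptions):
INSERT_ONLY_MERGE_SQL = """
MERGE INTO events AS t
USING new_events AS s
ON t.event_id = s.event_id
WHEN NOT MATCHED THEN INSERT *
"""


def insert_only_merge(target, batch, key):
    """Insert batch rows whose key is absent from target; return the number inserted."""
    # Step 1: deduplicate within the batch, keeping the first row seen per key.
    deduped = {}
    for row in batch:
        deduped.setdefault(row[key], row)

    # Step 2: insert only unmatched keys. There is no "when matched" action,
    # so rows already in the target are left untouched.
    inserted = 0
    for k, row in deduped.items():
        if k not in target:
            target[k] = row
            inserted += 1
    return inserted


# Existing table contents (previously processed records).
target = {1: {"event_id": 1, "value": "a"}}

# New batch: one late-arriving duplicate of key 1, and key 2 repeated twice.
batch = [
    {"event_id": 1, "value": "late duplicate"},
    {"event_id": 2, "value": "b"},
    {"event_id": 2, "value": "b"},
]

inserted = insert_only_merge(target, batch, "event_id")
```

In this sketch only the record with event_id 2 is inserted, once, and the existing row for event_id 1 keeps its original value. Step 1 mirrors the within-batch deduplication that the question mentions separately, since the merge condition only compares incoming rows against the table, not against each other.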