Databricks Certified Associate Developer for Apache Spark — Question 192
A data engineer works with streaming data, where records arrive randomly. The engineer needs to de-duplicate records based on order_id while ensuring that the latest records are kept in the stream.
Which strategy should the engineer use for de-duplicating records while considering late-arriving data and managing memory usage?
Answer options
- A. Apply withWatermark() and then use dropDuplicates() on order_id to remove duplicates.
- B. Use dropDuplicates() on timestamp to remove duplicates based on the most recent timestamp.
- C. Deduplication is not supported in streaming while accounting for late-arriving data.
- D. Use dropDuplicates() on order_id without watermarking to remove all duplicates.
Correct answer: A
Explanation
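The correct choice is A. withWatermark() sets a bound on how late a record may arrive, which lets Spark discard deduplication state older than the watermark instead of remembering every order_id it has ever seen. dropDuplicates() on order_id then removes repeated records within that bound. Note that dropDuplicates() keeps the first record seen for each key rather than the most recent one, and that in Structured Streaming the event-time column generally needs to be part of the deduplication key (or dropDuplicatesWithinWatermark(), available from Spark 3.5, used instead) for old state to actually be purged. Option B is incorrect because it deduplicates on timestamp rather than order_id: distinct orders that share a timestamp would be collapsed, and repeated orders with different timestamps would survive. Option C is incorrect because Structured Streaming supports deduplication both with and without a watermark. Option D does remove duplicates, but without a watermark the deduplication state grows without bound, so it fails the memory requirement.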
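To make the memory argument concrete, here is a toy model in plain Python of watermark-bounded deduplication. It is a sketch of the idea only, not Spark's implementation: the class name WatermarkDeduplicator, the per-record processing, and the 10-minute delay are all assumptions made for illustration (Spark advances the watermark per micro-batch, not per record).

```python
from datetime import datetime, timedelta


class WatermarkDeduplicator:
    """Toy model of watermark-bounded deduplication (hypothetical, not Spark code)."""

    def __init__(self, delay):
        self.delay = delay            # how late a record may arrive
        self.max_event_time = None    # highest event time observed so far
        self.seen = {}                # order_id -> event time of first occurrence

    def watermark(self):
        """Watermark = max event time seen minus the allowed delay."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process(self, order_id, event_time):
        """Return True if the record is emitted, False if it is dropped."""
        wm = self.watermark()
        if wm is not None and event_time < wm:
            return False  # older than the watermark: too late, dropped

        # Advance the maximum event time (and therefore the watermark).
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

        # Keep only the first record seen for each order_id.
        is_new = order_id not in self.seen
        if is_new:
            self.seen[order_id] = event_time

        # Evict state older than the watermark; this is what bounds memory.
        wm = self.watermark()
        self.seen = {k: t for k, t in self.seen.items() if t >= wm}
        return is_new


t0 = datetime(2024, 1, 1, 12, 0)
dedup = WatermarkDeduplicator(delay=timedelta(minutes=10))
results = [
    dedup.process("A", t0),                          # first A: emitted
    dedup.process("A", t0 + timedelta(minutes=1)),   # duplicate A: dropped
    dedup.process("B", t0 + timedelta(minutes=5)),   # first B: emitted
    dedup.process("C", t0 + timedelta(minutes=30)),  # moves watermark to 12:20
    dedup.process("A", t0 + timedelta(minutes=2)),   # behind watermark: dropped
]
```

After the last call only order_id "C" remains in state, because "A" and "B" fell behind the watermark and were evicted. Without the eviction step (option D), the seen dictionary would keep every order_id indefinitely.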