A data engineer works with streaming data, where records arrive randomly. The engineer ne…

Question

A data engineer works with streaming data, where records arrive randomly. The engineer needs to de-duplicate records based on order_id while ensuring that the latest records are kept in the stream. Which strategy should the engineer use for de-duplicating records while considering late-arriving data and managing memory usage?

Accepted Answer

Correct answer: A. Apply withWatermark() and then use dropDuplicates() on order_id to remove duplicates.

A is correct because withWatermark() tells Spark how long to wait for late-arriving data and bounds how long deduplication state has to be held in memory, while dropDuplicates() removes repeated records for the same order_id. Option B is incorrect because it focuses on timestamps rather than order_id, and C incorrectly states that deduplication isn't possible. Option D fails to consider watermarking, which is essential for handling late data.
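The pattern in option A can be sketched in PySpark roughly as follows. This is a minimal illustration, not the exam's reference solution: the event-time column name `event_time`, the 10-minute delay, and the helper name `dedupe_orders` are all assumptions, since the question does not specify them.

```python
# Sketch only: "event_time", "10 minutes" and dedupe_orders are hypothetical
# choices for illustration; the question does not name them.
def dedupe_orders(stream_df, event_time_col="event_time",
                  delay="10 minutes", keys=("order_id",)):
    """Apply a watermark, then drop rows whose key columns repeat."""
    # withWatermark(eventTime, delayThreshold): rows older than the
    # watermark are considered too late and may be dropped.
    watermarked = stream_df.withWatermark(event_time_col, delay)
    # dropDuplicates(subset): keeps one row per distinct key combination.
    return watermarked.dropDuplicates(list(keys))

# Hypothetical usage with a streaming DataFrame named `orders`:
#   deduped = dedupe_orders(orders)
#   deduped.writeStream.format("console").start()
```

One caveat worth checking against your Spark version: the Structured Streaming guide's own deduplication example includes the event-time column in the dropDuplicates() subset, and Spark 3.5 adds dropDuplicatesWithinWatermark() for deduplicating on a key alone within the watermark delay. If state growth matters, passing `keys=("order_id", "event_time")` or switching to that newer method may be the safer choice.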

Databricks Certified Associate Developer for Apache Spark — Question 192
