Databricks Certified Associate Developer for Apache Spark — Question 192

A data engineer works with streaming data, where records arrive randomly. The engineer needs to de-duplicate records based on order_id while ensuring that the latest records are kept in the stream.

Which strategy should the engineer use for de-duplicating records while considering late-arriving data and managing memory usage?

Answer options

Correct answer: A

Explanation

The correct choice is A. withWatermark() tells Spark how long to wait for late-arriving data and lets it discard deduplication state older than the watermark, which keeps memory usage bounded. dropDuplicates() then keeps a single record per order_id and drops the repeats. Option B is incorrect because it focuses on timestamps rather than order_id. Option C incorrectly states that deduplication isn't possible. Option D omits watermarking, which is essential for handling late data; without a watermark, the deduplication state grows without bound.
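
To make the memory argument concrete, here is a minimal pure-Python sketch of watermark-bounded de-duplication. It is a toy model, not Spark's implementation: the class name WatermarkDeduplicator, its fields, and the per-record watermark update are assumptions made for illustration (Spark advances the watermark per micro-batch). In PySpark the corresponding pattern would typically be written with withWatermark() followed by dropDuplicates(), as shown in the comment, where the column name event_time and the 10-minute threshold are likewise assumed.

```python
from datetime import datetime, timedelta

# Rough PySpark equivalent (column name and threshold are assumptions):
#   df.withWatermark("event_time", "10 minutes") \
#     .dropDuplicates(["order_id", "event_time"])
# Spark 3.5+ also offers dropDuplicatesWithinWatermark(["order_id"]).


class WatermarkDeduplicator:
    """Toy model of de-duplication with a watermark (hypothetical helper)."""

    def __init__(self, delay: timedelta):
        self.delay = delay            # how late a record may arrive
        self.max_event_time = None    # latest event time seen so far
        self.seen = {}                # order_id -> event time of the kept record

    @property
    def watermark(self):
        """Event times older than this are considered too late."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def _evict(self):
        """Drop state older than the watermark so memory stays bounded."""
        wm = self.watermark
        self.seen = {k: t for k, t in self.seen.items() if t >= wm}

    def process(self, order_id, event_time: datetime) -> bool:
        """Return True if the record is emitted, False if it is dropped."""
        wm = self.watermark
        if wm is not None and event_time < wm:
            return False              # arrived later than the watermark allows
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        self._evict()
        if order_id in self.seen:
            return False              # duplicate within the watermark window
        self.seen[order_id] = event_time
        return True
```

In this sketch, a record whose event time is far ahead of the earlier ones advances the watermark and evicts the older order_id entries from seen, which is the mechanism that keeps state from growing indefinitely. Without the eviction step, seen would retain every order_id ever observed.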