Databricks Certified Associate Developer for Apache Spark — Question 193
A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving records. The engineer has written the following code:
inputStream
  .withWatermark("event_time", "10 minutes")
  .groupBy(window("event_time", "15 minutes"))
  .count()
What happens to data that arrives after the watermark threshold?
Answer options
- A. Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.
- B. Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.
- C. Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.
- D. The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
Correct answer: B
Explanation
The correct answer is B. With withWatermark("event_time", "10 minutes"), Spark tracks the maximum event_time it has seen so far and sets the watermark 10 minutes behind it. Records whose event_time falls behind that watermark are considered too late and are dropped from the aggregation. Option A is incorrect because it suggests late records are automatically included, which contradicts the watermark's purpose. Option C is incorrect because late data is not shifted into the next window; it is ignored altogether. Option D is incorrect because it describes records that arrive within the 10-minute allowance, which are processed normally, rather than records that arrive after the threshold.
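To make the drop rule concrete, here is a minimal pure-Python sketch of the idea. It is not Spark's implementation: the names `aggregate`, `window_start`, and the sample timestamps are hypothetical, and it assumes the watermark is updated after every record and compared directly with each record's event time. Spark itself advances the watermark per micro-batch and does not guarantee that every record later than the threshold is dropped, so treat this only as an illustration of the reasoning behind answer B.

```python
from datetime import datetime, timedelta

WATERMARK_DELAY = timedelta(minutes=10)  # mirrors withWatermark("event_time", "10 minutes")
WINDOW_SIZE = timedelta(minutes=15)      # mirrors window("event_time", "15 minutes")
EPOCH = datetime(1970, 1, 1)


def window_start(ts):
    """Return the start of the 15-minute tumbling window containing ts."""
    return ts - ((ts - EPOCH) % WINDOW_SIZE)


def aggregate(events):
    """Count events per window, dropping those behind the watermark.

    events: event-time timestamps, listed in arrival order.
    Simplifying assumption: the watermark is recomputed after each record.
    """
    max_event_time = None
    counts = {}
    dropped = []
    for ts in events:
        # Watermark = latest event time seen so far minus the allowed delay.
        if max_event_time is not None and ts < max_event_time - WATERMARK_DELAY:
            dropped.append(ts)  # too late: ignored, not moved to another window
            continue
        start = window_start(ts)
        counts[start] = counts.get(start, 0) + 1
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
    return counts, dropped


# Hypothetical event times, in the order they arrive.
events = [
    datetime(2024, 1, 1, 12, 2),
    datetime(2024, 1, 1, 12, 14),
    datetime(2024, 1, 1, 12, 31),  # watermark becomes 12:21 after this record
    datetime(2024, 1, 1, 12, 25),  # 6 minutes behind the maximum: kept
    datetime(2024, 1, 1, 12, 5),   # behind the 12:21 watermark: dropped
]
counts, dropped = aggregate(events)
```

In this run the 12:25 record is late but still inside the 10-minute allowance, so it is counted in the 12:15 window. The 12:05 record is behind the watermark and is discarded, not reassigned to a later window.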