Databricks Certified Associate Developer for Apache Spark — Question 193

A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving records. The engineer has written the following code:

inputStream
.withWatermark ("event_time", "10 minutes")
.groupBy (window ("event_time", "15 minutes"))
.count ()

What happens to data that arrives after the watermark threshold?

Answer options

Correct answer: B

Explanation

The correct answer is B because records that arrive more than 10 minutes after the watermark threshold are considered late and are dropped from the aggregation. Option A is incorrect as it suggests late records are included, which contradicts the watermark's purpose. Option C is also wrong because late data does not shift to the next window; it's ignored altogether. Option D misrepresents the behavior of the watermark, which does not process data arriving later than the threshold.