Google Cloud Professional Data Engineer — Question 84
Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period.
However, you realize that in some instances data can arrive late or out of order. How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?
Answer options
- A. Set a single global window to capture all the data.
- B. Set sliding windows to capture all the lagged data.
- C. Use watermarks and timestamps to capture the lagged data.
- D. Ensure every data source type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
Correct answer: C
Explanation
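The correct answer is C. Each element carries an event-time timestamp, and the watermark is the pipeline's running estimate of how far event time has progressed, that is, the point by which it expects all data for a given window to have arrived. Because elements are assigned to windows by event timestamp rather than by arrival order, out-of-order data still lands in the correct window. An element that arrives after the watermark has passed the end of its window is considered late, and the pipeline can decide whether to accept or discard it (for example, through an allowed-lateness setting and triggers).
- Option A is incorrect: a single global window groups all elements together regardless of event time, so it provides no window boundary against which lateness can be measured.
- Option B is incorrect: sliding windows only change how elements are grouped (into overlapping intervals). They do not determine when a window is complete or what happens to data that arrives afterward.
- Option D is incorrect: timestamps are necessary but not sufficient. Without a watermark, the pipeline has no notion of when a window's data is complete, so it cannot tell which elements are late.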
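To make the interaction between timestamps and the watermark concrete, here is a minimal sketch in plain Python. It is a toy simulation, not the Dataflow or Apache Beam API: the names `process`, `WINDOW_SIZE`, `ALLOWED_LATENESS`, and `WATERMARK_DELAY` are hypothetical, and the watermark heuristic (largest event time seen so far minus a fixed delay) is an assumption chosen for illustration only.

```python
from collections import defaultdict

# All three values are illustrative assumptions, in seconds.
WINDOW_SIZE = 60        # length of each fixed event-time window
ALLOWED_LATENESS = 30   # how long after a window's end late data is still accepted
WATERMARK_DELAY = 10    # watermark trails the largest event time seen by this much


def process(events):
    """Simulate event-time windowing over (event_time, value) pairs in arrival order.

    Returns (sums_per_window_start, late_but_accepted, dropped).
    """
    sums = defaultdict(int)
    late, dropped = [], []
    watermark = float("-inf")  # nothing observed yet

    for event_time, value in events:
        start = event_time - (event_time % WINDOW_SIZE)
        end = start + WINDOW_SIZE

        if watermark >= end + ALLOWED_LATENESS:
            # The window closed for good before this element arrived.
            dropped.append((event_time, value))
        else:
            if watermark >= end:
                # Late, but still within the allowed lateness.
                late.append((event_time, value))
            # The element goes to the window of its event time,
            # no matter when it arrived.
            sums[start] += value

        # The watermark only ever moves forward.
        watermark = max(watermark, event_time - WATERMARK_DELAY)

    return dict(sums), late, dropped


# Sample input in arrival order. (40, 1) is out of order but on time,
# (58, 1) is late but accepted, and (20, 1) arrives too late and is dropped.
EVENTS = [(5, 1), (50, 1), (65, 1), (40, 1), (75, 1), (58, 1), (130, 1), (20, 1)]
```

Calling `process(EVENTS)` places the out-of-order element `(40, 1)` in the first window because the watermark (55) has not yet passed that window's end (60). The element `(58, 1)` arrives when the watermark is 65, so it counts as late but is still accepted, while `(20, 1)` arrives after the watermark has reached 120 and is discarded. Timestamps alone would assign every element to a window, but only the watermark lets the pipeline distinguish these three cases.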