Google Cloud Professional Data Engineer — Question 201
You are running a streaming pipeline with Dataflow and are using hopping windows to group the data as it arrives. You notice that some data is arriving late but is not being marked as late data, which results in inaccurate aggregations downstream. You need to find a solution that allows you to capture the late data in the appropriate window. What should you do?
Answer options
- A. Use watermarks to define the expected data arrival window. Allow late data as it arrives.
- B. Change your windowing function to tumbling windows to avoid overlapping window periods.
- C. Change your windowing function to session windows to define your windows based on certain activity.
- D. Expand your hopping window so that the late data has more time to arrive within the grouping.
Correct answer: A
Explanation
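A watermark is the pipeline's estimate of the point in event time up to which all data is expected to have arrived. An element whose timestamp falls in a window that the watermark has already passed is late, and it is discarded unless late data is allowed. Defining the watermark correctly and allowing late data lets Dataflow recognize late elements and still assign them to the window that matches their event time, which keeps the aggregations accurate. Switching to tumbling or session windows changes how elements are grouped, not how lateness is detected. Expanding the hopping window changes the meaning of the aggregation and still loses any element that arrives after the watermark has passed the end of its window.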
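
The interaction between hopping windows, the watermark, and allowed lateness can be illustrated with a small simulation. This is a minimal pure-Python sketch, not the Dataflow or Apache Beam API: the function names `hopping_windows` and `aggregate`, the `max_delay` watermark heuristic, and the sample events are all hypothetical and chosen only for illustration. In the Beam Python SDK, the comparable setting is, as far as I know, the `allowed_lateness` argument of `WindowInto`.

```python
from collections import defaultdict


def hopping_windows(event_time, size, period):
    """Return every [start, end) hopping window that contains event_time."""
    last_start = event_time - (event_time % period)
    # Walk backwards by one period while the window still covers event_time.
    starts = range(last_start, event_time - size, -period)
    return sorted((s, s + size) for s in starts)


def aggregate(events, size, period, allowed_lateness, max_delay=0):
    """Sum values per window, dropping elements that arrive too late.

    events: (event_time, value) pairs in arrival order.
    The watermark here is a simplified assumption: the largest event time
    seen so far minus max_delay.
    """
    sums = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for event_time, value in events:
        for window in hopping_windows(event_time, size, period):
            _, end = window
            if watermark >= end + allowed_lateness:
                # The window closed before this element arrived.
                dropped.append((event_time, window))
            else:
                sums[window] += value
        watermark = max(watermark, event_time - max_delay)
    return dict(sums), dropped


# The last element has event time 20 but arrives after event time 130.
events = [(10, 1), (70, 1), (130, 1), (20, 5)]

strict, strict_dropped = aggregate(events, size=60, period=30, allowed_lateness=0)
lenient, lenient_dropped = aggregate(events, size=60, period=30, allowed_lateness=120)

print(strict[(0, 60)], len(strict_dropped))    # 1 2
print(lenient[(0, 60)], len(lenient_dropped))  # 6 0
```

With no allowed lateness, the late element is dropped from both windows it belongs to, and the sum for window [0, 60) stays at 1. With 120 seconds of allowed lateness, the same element is added to its original windows and the sum becomes 6, which is the behavior option A aims for.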