Google Cloud Professional Data Engineer — Question 23
You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will be sent only once, but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?
Answer options
- A. Include ORDER BY DESC on timestamp column and LIMIT to 1.
- B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
- C. Use the LAG window function with PARTITION BY unique ID along with WHERE LAG IS NOT NULL.
- D. Use the ROW_NUMBER window function with PARTITION BY unique ID along with WHERE row equals 1.
Correct answer: D
Explanation
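The correct answer is D. ROW_NUMBER assigns a sequential integer, starting at 1, to the rows within each partition. Partitioning by the unique ID puts all copies of a row in the same partition, so filtering on row number equal to 1 keeps exactly one copy of each row and drops the rest. The other options do not deduplicate correctly. Option A returns only a single row for the whole table, not one row per unique ID. Option B groups the duplicates together but then sums their values, so a row that was delivered twice is counted twice. Option C does the opposite of what is needed: LAG returns NULL for the first row in each partition, so WHERE LAG IS NOT NULL discards every ID that arrived only once and keeps only the repeated copies.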
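The ROW_NUMBER pattern from option D can be sketched as follows. This is only an illustration: it runs the query against an in-memory SQLite database as a stand-in for BigQuery (it needs SQLite 3.25 or later for window functions), and the table name `events` and the columns `unique_id`, `event_timestamp`, and `value` are hypothetical names chosen for the example. The same SELECT should carry over to BigQuery with the real table name substituted.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (unique_id TEXT, event_timestamp TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("a", "2024-01-01T00:00:00", 10),
        ("a", "2024-01-01T00:00:00", 10),  # duplicate delivery of row "a"
        ("b", "2024-01-01T00:05:00", 20),
        ("c", "2024-01-01T00:07:00", 30),
        ("c", "2024-01-01T00:07:00", 30),  # duplicate delivery of row "c"
    ],
)

# Number the rows within each unique_id partition, then keep only row 1.
DEDUP_QUERY = """
SELECT unique_id, event_timestamp, value
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY unique_id
      ORDER BY event_timestamp DESC
    ) AS row_num
  FROM events
)
WHERE row_num = 1
ORDER BY unique_id
"""

deduped = conn.execute(DEDUP_QUERY).fetchall()
for row in deduped:
    print(row)
```

Five rows go in and three come out, one per unique ID. Ordering by the event timestamp inside the window is a design choice for the sketch: when the copies are identical any ordering works, and when they differ it decides which copy survives.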