Google Cloud Associate Data Practitioner — Question 26
You are working on a data pipeline that will validate and clean incoming data before loading it into BigQuery for real-time analysis. You want to ensure that data validation and cleaning are performed efficiently and can handle high volumes of data. What should you do?
Answer options
- A. Write custom scripts in Python to validate and clean the data outside of Google Cloud. Load the cleaned data into BigQuery.
- B. Use Cloud Run functions to trigger data validation and cleaning routines when new data arrives in Cloud Storage.
- C. Use Dataflow to create a streaming pipeline that includes validation and transformation steps.
- D. Load the raw data into BigQuery using Cloud Storage as a staging area, and use SQL queries in BigQuery to validate and clean the data.
Correct answer: C
Explanation
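The correct answer is C. Dataflow is a fully managed service for running Apache Beam pipelines, built to process large volumes of data in both batch and streaming modes, so validation and transformation steps can run on each record as it arrives and before it is written to BigQuery. Option A moves processing outside Google Cloud, which means you must provision and scale the scripts yourself and add a transfer step before loading. Option B can work for modest, file-based workloads, but Cloud Run functions triggered by Cloud Storage events run once per uploaded object, so they are a weaker fit for continuous, high-volume streams than a dedicated data processing service like Dataflow. Option D loads raw, unvalidated data first and cleans it afterward with SQL, which delays when clean data is available for real-time analysis and may increase cost because the raw data is stored and then reprocessed in BigQuery.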
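To make the "validation and transformation steps" in option C more concrete, here is a minimal sketch in plain Python of a per-record function. The field names (`event_id`, `user_id`, `amount`), the validation rules, and the function name `validate_and_clean` are hypothetical and not part of the question. In a Dataflow pipeline, a function like this would typically be applied to each element through an Apache Beam transform such as `beam.Map`, followed by a step that drops the `None` results.

```python
import json
import math
from typing import Optional

# Hypothetical schema: these field names are assumptions for illustration only.
REQUIRED_FIELDS = ("event_id", "user_id", "amount")


def validate_and_clean(raw: str) -> Optional[dict]:
    """Return a cleaned record, or None if the raw message is invalid."""
    # Validation: the message must be a JSON object.
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None

    # Validation: every required field must be present and non-empty.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return None

    # Validation: amount must be a finite, non-negative number.
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        return None
    if not math.isfinite(amount) or amount < 0:
        return None

    # Cleaning: trim whitespace, normalize case, round the amount.
    return {
        "event_id": str(record["event_id"]).strip(),
        "user_id": str(record["user_id"]).strip().lower(),
        "amount": round(amount, 2),
    }
```

Keeping the logic in a pure function makes it easy to unit test without a pipeline runner. The streaming source, the BigQuery sink, and the scaling behavior would come from the Dataflow pipeline that wraps it.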