You are monitoring your organization’s data lake hosted on BigQuery. The ingestion pipeli…

Question

You are monitoring your organization’s data lake hosted on BigQuery. The ingestion pipelines read data from Pub/Sub and write the data into tables on BigQuery. After a new version of the ingestion pipelines is deployed, the daily stored data increased by 50%. The volumes of data in Pub/Sub remained the same and only some tables had their daily partition data size doubled. You need to investigate and fix the cause of the data increase. What should you do?

Accepted Answer

Correct answer: C

1. Check for duplicate rows in the BigQuery tables that have the daily partition data size doubled.
2. Check the BigQuery Audit logs to find job IDs.
3. Use Cloud Monitoring to determine when the identified Dataflow jobs started and the pipeline code version.
4. When more than one pipeline ingests data into a table, stop all versions except the latest one.

Option C is correct because it investigates the root cause before applying a fix. The Pub/Sub volume is unchanged while some daily partitions doubled in size, which suggests the same messages are being written more than once, so checking for duplicate rows confirms the symptom. The audit logs and Cloud Monitoring then show which jobs and pipeline code versions are writing to the affected tables, and stopping every version except the latest removes the source of the extra data. The other options either focus on deduplication without investigating the root cause (A), address only code errors without a comprehensive approach (B), or involve rolling back changes without understanding the issue (D).
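The duplicate check in step 1 can be sketched as a small Python helper that builds a BigQuery Standard SQL query for one daily partition. This is a minimal sketch under stated assumptions, not part of the original answer: the table name, the partition column `event_date`, and the key column `message_id` are hypothetical, and the function names are invented for illustration. The second helper shows the same counting logic in memory, so the idea can be checked without a BigQuery project.

```python
from collections import Counter
from typing import Iterable, Sequence


def build_duplicate_check_sql(
    table: str,
    partition_column: str,
    partition_date: str,
    key_columns: Sequence[str],
) -> str:
    """Build a query listing keys that appear more than once in one daily partition.

    All arguments are assumed to come from a trusted operator; the values are
    interpolated directly, so this sketch is not safe for untrusted input.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    keys = ", ".join(key_columns)
    return (
        f"SELECT {keys}, COUNT(*) AS copies\n"
        f"FROM `{table}`\n"
        f"WHERE {partition_column} = DATE '{partition_date}'\n"
        f"GROUP BY {keys}\n"
        f"HAVING COUNT(*) > 1\n"
        f"ORDER BY copies DESC"
    )


def count_extra_copies(keys: Iterable[tuple]) -> int:
    """Count rows beyond the first occurrence of each key (in-memory analogue)."""
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values())


# Hypothetical table and column names, used only for illustration.
sql = build_duplicate_check_sql(
    table="my-project.lake.events",
    partition_column="event_date",
    partition_date="2024-01-15",
    key_columns=["message_id"],
)
print(sql)

# A partition where every message was written twice has as many extra
# copies as it has distinct messages.
sample_keys = [("m1",), ("m2",), ("m3",)] * 2
print(count_extra_copies(sample_keys))
```

If the query returns rows for the tables whose partitions doubled, that supports the hypothesis that more than one pipeline version is writing the same messages, which is what steps 2 to 4 then trace and stop.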

Google Cloud Professional Data Engineer — Question 228

Explanation