You need to create a data pipeline that copies time-series transaction data so that it ca…

Question

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis.
Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)

Accepted Answer

Correct answer: A, D. A. Denormalize the data as must as possible. — D. Develop a data pipeline where status updates are appended to BigQuery instead of updated. — Denormalizing the data (Option A) improves query performance by reducing the number of joins needed, which is beneficial for large datasets. Appending status updates (Option D) instead of updating existing records ensures that all historical data remains intact, making it easier for the data science team to analyze trends over time. The other options either complicate the data structure or do not align with the need for efficient querying.

Google Cloud Professional Data Engineer — Question 69

Answer options

Correct answer: A, D

Explanation