Google Cloud Professional Data Engineer — Question 329
A shipping company sends live package-tracking data to an Apache Kafka stream in real time. The data is then loaded into BigQuery. Analysts at the company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
Answer options
- A. Implement clustering in BigQuery on the ingest date column.
- B. Implement clustering in BigQuery on the package-tracking ID column.
- C. Tier older data onto Cloud Storage files and create a BigQuery table using Cloud Storage as an external data source.
- D. Re-create the table using data partitioning on the package delivery date.
Correct answer: B
Explanation
Clustering on the package-tracking ID column (option B) stores rows with the same tracking ID close together within each partition. Lifecycle analysis follows individual packages, so queries filter or group on the tracking ID, and BigQuery can skip storage blocks that cannot contain the requested IDs, which reduces the data scanned. Clustering on the ingest date (option A) adds little, because the table is already partitioned on that date and the analysts' queries are not organized around it. Tiering older data to Cloud Storage and querying it as an external table (option C) generally makes queries slower than native BigQuery storage and complicates access. Re-creating the table with partitioning on the package delivery date (option D) requires rebuilding the table and still does not help queries that look up events by tracking ID.
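The block-skipping effect behind option B can be illustrated with a small simulation. This is only a rough sketch, not BigQuery's actual storage engine: it assumes each block keeps the minimum and maximum tracking ID it holds, and that a lookup reads every block whose range could contain the target. The data set, the block size, and the `blocks_scanned` helper are all hypothetical.

```python
import random


def blocks_scanned(rows, block_size, target_id):
    """Count blocks whose [min, max] tracking-ID range could contain target_id."""
    scanned = 0
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        # A block can be skipped only if target_id falls outside its range.
        if min(block) <= target_id <= max(block):
            scanned += 1
    return scanned


random.seed(0)

# Hypothetical data: 1,000 packages with 10 tracking events each,
# represented only by their tracking ID.
events = [pkg for pkg in range(1000) for _ in range(10)]
random.shuffle(events)  # arrival order: IDs are scattered across blocks

unclustered = blocks_scanned(events, 100, target_id=500)
clustered = blocks_scanned(sorted(events), 100, target_id=500)

print("blocks read without clustering:", unclustered)
print("blocks read with clustering:", clustered)
```

When the events are sorted by tracking ID, all ten events for package 500 sit in a single block, so only that block is read. In arrival order, the same events are spread across many blocks whose ID ranges all overlap the target, so far fewer blocks can be skipped.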