Google Cloud Professional Data Engineer — Question 132
You need ads data to serve AI models and historical data for analytics. Longtail and outlier data points need to be identified. You want to cleanse the data in near-real time before running it through AI models. What should you do?
Answer options
- A. Use Cloud Storage as a data warehouse, shell scripts for processing, and BigQuery to create views for desired datasets.
- B. Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink.
- C. Use BigQuery to ingest, prepare, and then analyze the data, and then run queries to create views.
- D. Use Cloud Composer to identify longtail and outlier data points, and then output a usable dataset to BigQuery.
Correct answer: B
Explanation
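The correct answer is B. Dataflow is a managed service for running Apache Beam pipelines over streaming and batch data, so it can flag longtail and outlier data points programmatically as records arrive and write the cleansed results to BigQuery, which then serves both the AI models and the historical analytics. Option A cannot meet the near-real-time requirement: Cloud Storage is object storage rather than a data warehouse, and shell scripts are not a stream processing engine. Option C relies on queries and views over data that has already been loaded into BigQuery, so it analyzes the data after the fact rather than cleansing it in flight. Option D misuses Cloud Composer, which is a workflow orchestrator built on Apache Airflow: it schedules and coordinates jobs rather than processing the data itself.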
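To make "identify outlier data points programmatically" concrete, here is a minimal sketch of the per-window logic such a pipeline might apply. It is plain Python, not the Apache Beam or Dataflow API. The function name `split_outliers`, the 1.5 × IQR rule, and the sample values are all illustrative assumptions, since the question does not specify a detection method.

```python
import statistics


def split_outliers(values, k=1.5):
    """Split one window of numeric values into (clean, outliers).

    Hypothetical helper: uses the interquartile-range (IQR) rule, an
    assumption for illustration; the question names no specific method.
    """
    # Too few points to estimate quartiles meaningfully: keep everything.
    if len(values) < 4:
        return list(values), []

    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    clean = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return clean, outliers


# Made-up click counts for one window of ad events.
window = [10, 12, 11, 13, 12, 11, 500]
clean, outliers = split_outliers(window)
print(clean)     # [10, 12, 11, 13, 12, 11]
print(outliers)  # [500]
```

In a streaming pipeline, logic of this kind would typically run inside a transform applied to each window of events. The clean records would go to the BigQuery sink, and the flagged ones could be routed elsewhere for inspection.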