Google Cloud Professional Cloud Developer — Question 215
Your team develops services that run on Google Cloud. You want to process messages sent to a Pub/Sub topic, and then store them. Each message must be processed exactly once to avoid duplication of data and any data conflicts. You need to use the cheapest and simplest solution. What should you do?
Answer options
- A. Process the messages with a Dataproc job, and write the output to storage.
- B. Process the messages with a Dataflow streaming pipeline using Apache Beam's PubSubIO package, and write the output to storage.
- C. Process the messages with a Cloud Function, and write the results to a BigQuery location where you can run a job to deduplicate the data.
- D. Retrieve the messages with a Dataflow streaming pipeline, store them in Cloud Bigtable, and use another Dataflow streaming pipeline to deduplicate messages.
Correct answer: B
Explanation
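The correct answer is B. Pub/Sub delivers messages at least once by default, so a subscriber can receive the same message more than once. A Dataflow streaming pipeline that reads through Apache Beam's PubSubIO deduplicates redelivered messages and processes each one exactly once within the pipeline, which meets the requirement with a single managed service. Option A requires provisioning and managing a Dataproc cluster, which is more complex and potentially more costly. Option C uses a Cloud Function, which can be invoked more than once for the same message, so it depends on a separate BigQuery deduplication job. Option D adds unnecessary complexity and cost by running two Dataflow pipelines plus Cloud Bigtable just to deduplicate.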
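To make the exactly-once idea concrete, the following is a minimal conceptual sketch in plain Python of deduplicating redelivered messages by ID before storing them. It is not Dataflow's or Apache Beam's actual implementation, and it does not call any Google Cloud API. The names `Message`, `DedupingProcessor`, and `demo` are hypothetical and exist only for this illustration, and the uppercase transform stands in for whatever processing the service would do.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    # Hypothetical stand-in for a Pub/Sub message: an ID plus a payload.
    message_id: str
    data: str


@dataclass
class DedupingProcessor:
    """Processes each message ID once, even if the message is redelivered."""

    seen_ids: set = field(default_factory=set)
    stored: list = field(default_factory=list)

    def handle(self, msg: Message) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if msg.message_id in self.seen_ids:
            # Redelivery under at-least-once semantics: skip it.
            return False
        self.seen_ids.add(msg.message_id)
        # Placeholder processing step, then "store" the result in memory.
        self.stored.append(msg.data.upper())
        return True


def demo():
    # Message "1" arrives twice, simulating an at-least-once redelivery.
    deliveries = [Message("1", "a"), Message("2", "b"), Message("1", "a")]
    processor = DedupingProcessor()
    results = [processor.handle(m) for m in deliveries]
    return results, processor.stored


if __name__ == "__main__":
    print(demo())
```

The in-memory set here grows without bound and is lost on restart. Handling that state durably and at scale is the work that a managed streaming pipeline takes on, which is why option B avoids the hand-built deduplication steps in options C and D.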