Google Cloud Professional Data Engineer — Question 159
Your infrastructure team has set up an interconnect link between Google Cloud and the on-premises network. You are designing a high-throughput streaming pipeline to ingest data from an Apache Kafka cluster hosted on-premises. You want to store the data in BigQuery with as little latency as possible. What should you do?
Answer options
- A. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Use a Google-provided Dataflow template to read the data from Pub/Sub, and write the data to BigQuery.
- B. Use a proxy host in the VPC in Google Cloud connecting to Kafka. Write a Dataflow pipeline, read data from the proxy host, and write the data to BigQuery.
- C. Use Dataflow, write a pipeline that reads the data from Kafka, and writes the data to BigQuery.
- D. Set up a Kafka Connect bridge between Kafka and Pub/Sub. Write a Dataflow pipeline, read the data from Pub/Sub, and write the data to BigQuery.
Correct answer: C
Explanation
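The correct choice is C. Because the interconnect already links Google Cloud to the on-premises network, Dataflow workers can reach the Kafka brokers directly, so a single Dataflow pipeline can read from Kafka and write to BigQuery. This keeps the path short and the latency low. Options A and D both insert a Kafka Connect bridge and Pub/Sub between Kafka and Dataflow. That extra hop adds latency and another component to operate, whether the Dataflow stage is a Google-provided template (A) or a custom pipeline (D). Option B adds a proxy host in the VPC, which the interconnect makes unnecessary and which only complicates the architecture.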
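To make option C more concrete, here is a minimal sketch of such a pipeline using the Apache Beam Python SDK. It is an illustration under assumptions, not a reference implementation: the broker address, topic, table name, and the `event_id` / `user_id` / `amount` fields are hypothetical, and the messages are assumed to be UTF-8 JSON. Beam's `ReadFromKafka` is a cross-language transform, so running it also needs a Java runtime available for the expansion service.

```python
import json


def kafka_record_to_row(record):
    """Convert one Kafka (key, value) pair of bytes into a BigQuery row dict.

    The field names are hypothetical; adapt them to the real message schema.
    """
    key, value = record
    payload = json.loads(value.decode("utf-8"))
    return {
        "event_id": key.decode("utf-8") if key is not None else None,
        "user_id": payload["user_id"],
        "amount": float(payload["amount"]),
    }


def run(bootstrap_servers, topic, table, pipeline_args=None):
    """Build and run a streaming pipeline: Kafka -> Dataflow -> BigQuery."""
    # Imported lazily so the helper above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.io.kafka import ReadFromKafka
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args, streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            # Brokers are reached over the interconnect, with no Pub/Sub hop.
            | "ReadKafka" >> ReadFromKafka(
                consumer_config={"bootstrap.servers": bootstrap_servers},
                topics=[topic],
            )
            | "ToRow" >> beam.Map(kafka_record_to_row)
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                table,
                schema="event_id:STRING,user_id:STRING,amount:FLOAT",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
            )
        )


if __name__ == "__main__":
    # Hypothetical values; pass --runner=DataflowRunner etc. in pipeline_args.
    run("kafka.onprem.example:9092", "events", "my-project:my_dataset.events")
```

The conversion step is kept as a plain function so it can be checked without a cluster, for example `kafka_record_to_row((b"e1", b'{"user_id": "u7", "amount": "12.5"}'))` yields a row with `event_id` set to `"e1"`. Streaming inserts are used here as one low-latency write method; other write methods may suit different throughput or cost requirements.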