Google Cloud Professional Data Engineer — Question 110

You are building a new application and need to collect its data in a scalable way. Data arrives continuously from the application throughout the day, and you expect it to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
✑ Decoupling producer from consumer
✑ Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
✑ Near real-time SQL queries
✑ Maintain at least 2 years of historical data, which will be queried with SQL
Which pipeline should you use to meet these requirements?

Answer options

Correct answer: D

Explanation

The correct answer is D. Cloud Pub/Sub decouples the producer from the consumer, and a Cloud Dataflow pipeline processes the JSON events in near real time, converting them to Avro and writing them to both Cloud Storage and BigQuery. The Avro files in Cloud Storage provide space- and cost-efficient storage for the raw data that is kept indefinitely, while BigQuery serves the near real-time SQL queries and holds the two years of history that must remain queryable with SQL. Options A and B do not effectively meet the near real-time SQL query requirement, and Option C does not integrate directly with BigQuery for SQL queries.
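
To see why the storage-efficiency requirement matters, it can help to estimate the raw volume involved. The sketch below is a rough upper bound only: it assumes the stated 150 GB per day holds constant for the whole period, uses decimal units (1 TB = 1000 GB), and ignores leap days. The function name `raw_volume_tb` is a hypothetical helper, not part of any Google Cloud API, and the sketch makes no claim about how much smaller the Avro output would be.

```python
# Rough sizing of the raw JSON volume implied by the question.
DAILY_JSON_GB = 150        # from the question: expected daily volume by year end
RETENTION_DAYS = 2 * 365   # assumption: two years of history, ignoring leap days


def raw_volume_tb(daily_gb: float, days: int) -> float:
    """Total uncompressed volume in decimal terabytes (1 TB = 1000 GB)."""
    return daily_gb * days / 1000


two_year_tb = raw_volume_tb(DAILY_JSON_GB, RETENTION_DAYS)
print(f"{two_year_tb:.1f} TB of raw JSON over two years")  # 109.5 TB
```

At this assumed rate the two-year window alone is on the order of 100 TB of uncompressed JSON, and the raw data is to be kept indefinitely, which is why the storage format is part of the requirements rather than an afterthought.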