You want to rebuild your batch pipeline for structured data on Google Cloud. You are usin…

Question

You want to rebuild your batch pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over twelve hours to run. To expedite development and pipeline run time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How should you build the pipeline on Google Cloud while meeting speed and processing requirements?

Accepted Answer

Correct answer: C. Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.

Option C is correct because BigQuery is a serverless data warehouse that runs transformations in SQL, which meets both stated requirements: a serverless tool and SQL syntax. The raw data is already in Cloud Storage, so it can be loaded into BigQuery directly and the transformed results written to a new table. Options A and B involve using SparkSQL with Dataproc or Cloud SQL. Dataproc runs Spark on clusters, so it does not offer the same serverless benefits, and Cloud SQL is a managed relational database that is not designed for large-scale analytical transformations. Option D uses Apache Beam, which does not directly address the SQL syntax requirement and may not run as quickly as the same transformations in BigQuery.
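
The two steps in option C (load from Cloud Storage, then transform into a new table) might look roughly like the sketch below. It is only an illustration: the bucket, dataset, table, and column names are hypothetical, the CSV format and the aggregation are assumed, and the PySpark call in the comment is an invented stand-in for whatever the real pipeline does. The statements are built as plain strings so the example runs without credentials. The optional run_pipeline helper assumes the google-cloud-bigquery package is installed and default credentials are configured.

```python
# Sketch only: every bucket, dataset, table, and column name here is hypothetical.


def build_pipeline_sql(raw_uri, raw_table, out_table):
    """Return the BigQuery SQL statements for a load-then-transform pipeline."""
    # Step 1: ingest CSV files from Cloud Storage into a BigQuery table.
    # With no column list, BigQuery is left to infer the schema (an assumption
    # that suits a sketch; a real pipeline would likely declare columns).
    load_sql = (
        f"LOAD DATA INTO `{raw_table}`\n"
        f"FROM FILES (format = 'CSV', skip_leading_rows = 1, uris = ['{raw_uri}'])"
    )
    # Step 2: SQL counterpart of a hypothetical PySpark transformation such as
    #   df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
    # with the result written to a new table.
    transform_sql = (
        f"CREATE OR REPLACE TABLE `{out_table}` AS\n"
        "SELECT customer_id, SUM(amount) AS total_amount\n"
        f"FROM `{raw_table}`\n"
        "GROUP BY customer_id"
    )
    return [load_sql, transform_sql]


def run_pipeline(statements):
    """Run each statement in order (needs google-cloud-bigquery and credentials)."""
    from google.cloud import bigquery  # lazy import so the builder works offline

    client = bigquery.Client()
    for sql in statements:
        client.query(sql).result()  # wait for each job before starting the next


statements = build_pipeline_sql(
    "gs://example-bucket/raw/orders/*.csv",
    "example_dataset.orders_raw",
    "example_dataset.orders_by_customer",
)
for sql in statements:
    print(sql, end="\n\n")
```

Calling run_pipeline(statements) would submit both jobs to BigQuery. No cluster has to be provisioned or sized, which is the serverless property the question asks for.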

Google Cloud Professional Data Engineer — Question 290
