Question

You want to rebuild your ML pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over 12 hours to run. To speed up development and pipeline run time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How should you build the pipeline on Google Cloud while meeting the speed and processing requirements?

Accepted Answer

Correct answer: D. Ingest your data into BigQuery using BigQuery Load, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.

Option D is correct because BigQuery is serverless and runs transformations in SQL, which meets both stated requirements: there is no cluster to provision or tune, and the PySpark logic can be expressed as SQL queries over the data loaded from Cloud Storage. Option A does not directly address the need for a serverless tool with SQL syntax. Option B still relies on Dataproc, which is not fully serverless. Option C introduces Cloud SQL, which is less efficient than BigQuery for large-scale data processing.
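As a rough illustration of what option D could look like in practice, the sketch below builds the two SQL statements involved: a load from Cloud Storage into a raw table, and a transformation written to a new table. The bucket, dataset, table, and column names are all hypothetical, the CSV format is an assumption, and the PySpark aggregation it mirrors is only an assumed example, since the question does not show the actual transformations. The `LOAD DATA` statement is one way to run a load; a load job started from the console or the `bq` tool would serve the same purpose.

```python
# A minimal sketch, not a definitive pipeline.
# All bucket, dataset, table, and column names below are hypothetical.
RAW_URI = "gs://example-bucket/raw/orders/*.csv"
RAW_TABLE = "example_dataset.orders_raw"
OUT_TABLE = "example_dataset.orders_by_customer"


def load_statement(table: str, uri: str) -> str:
    """Build a BigQuery LOAD DATA statement that ingests CSV files from Cloud Storage."""
    return (
        f"LOAD DATA INTO {table}\n"
        f"FROM FILES (format = 'CSV', uris = ['{uri}']);"
    )


def transform_statement(source: str, target: str) -> str:
    """Build a CREATE TABLE AS SELECT statement that writes results to a new table.

    Rough PySpark equivalent (assumed, for illustration only):
        df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
    """
    return (
        f"CREATE OR REPLACE TABLE {target} AS\n"
        f"SELECT customer_id, SUM(amount) AS total_amount\n"
        f"FROM {source}\n"
        f"GROUP BY customer_id;"
    )


if __name__ == "__main__":
    # Print the statements so they can be pasted into the BigQuery console.
    print(load_statement(RAW_TABLE, RAW_URI))
    print(transform_statement(RAW_TABLE, OUT_TABLE))
```

The printed statements could be run in the BigQuery console or submitted through a client library. Keeping the transformation as a single `CREATE OR REPLACE TABLE ... AS SELECT` keeps the raw table untouched, which matches the "write the transformations to a new table" step in the answer.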

Google Cloud Professional Machine Learning Engineer — Question 53
