Google Cloud Professional Machine Learning Engineer — Question 99

You have built a model that is trained on data stored in Parquet files. You access the data through a Hive table hosted on Google Cloud. You preprocessed this data with PySpark and exported it as a CSV file to Cloud Storage. After preprocessing, you run additional steps to train and evaluate your model. You want to parametrize this model training in Kubeflow Pipelines. What should you do?

Answer options

Correct answer: C

Explanation

The correct answer is C because it uses Dataproc to run the large-scale data transformation on a managed cluster and saves the results directly to Cloud Storage, where the later training and evaluation steps can read them. Option A is incorrect because the data transformation step is essential for model training. Option B does not take advantage of a managed cluster for big data processing. Option D adds infrastructure and complexity without the direct benefit of Dataproc's capabilities.
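
To make the idea of a parametrized transformation step concrete, here is a minimal sketch of how the Dataproc PySpark job for such a pipeline step might be described. It only builds the job specification as a plain dictionary, using the snake_case field names of the Dataproc Job resource as exposed by the Python client. The function name, cluster name, bucket paths, table name, and script arguments are all hypothetical and are not taken from the question.

```python
# Sketch only: every name and path below is a hypothetical placeholder.

def build_pyspark_job(
    cluster_name: str,
    main_python_file_uri: str,
    hive_table: str,
    output_uri: str,
) -> dict:
    """Return a Dataproc job spec for a PySpark preprocessing script.

    The script arguments (--hive-table, --output-uri) are assumptions about
    how a preprocessing script could accept its inputs.
    """
    return {
        # Which Dataproc cluster should run the job.
        "placement": {"cluster_name": cluster_name},
        # The PySpark script and the parameters passed to it.
        "pyspark_job": {
            "main_python_file_uri": main_python_file_uri,
            "args": [
                "--hive-table", hive_table,
                "--output-uri", output_uri,
            ],
        },
    }


# Each value would come from a pipeline parameter, so one pipeline
# definition can be reused for different tables and output locations.
job = build_pyspark_job(
    cluster_name="preprocess-cluster",
    main_python_file_uri="gs://example-bucket/jobs/preprocess.py",
    hive_table="example_db.training_data",
    output_uri="gs://example-bucket/preprocessed/",
)
```

In a real pipeline, a specification like this would be submitted to Dataproc from inside the pipeline step, and the output location would be passed on to the training and evaluation steps.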