Question

You have built a model that is trained on data stored in Parquet files. You access the data through a Hive table hosted on Google Cloud. You preprocessed this data with PySpark and exported it as a CSV file into Cloud Storage. After preprocessing, you execute additional steps to train and evaluate your model. You want to parametrize this model training in Kubeflow Pipelines. What should you do?

Accepted Answer

Correct answer: C. Add a ContainerOp to your pipeline that spins a Dataproc cluster, runs a transformation, and then saves the transformed data in Cloud Storage.

Option C is correct because it runs the transformation on Dataproc, a managed Spark service that scales to large data, as a step of the pipeline, and saves the results directly to Cloud Storage. Option A is incorrect because the data transformation is essential for model training. Option B does not use a managed cluster for big data processing. Option D adds infrastructure to manage without the direct benefit of Dataproc's capabilities.
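The step in option C can be sketched as the three gcloud calls the container would run: create a cluster, submit the PySpark job, delete the cluster. This is a minimal sketch, not a definitive implementation. The function names, the project, region, cluster, and bucket values, and the `--input-table` and `--output-uri` arguments passed to the PySpark script are all hypothetical. Only the `gcloud dataproc` subcommands and their `--project`, `--region`, `--cluster`, `--num-workers`, and `--quiet` flags are real CLI options.

```python
import shlex


def dataproc_step_commands(project, region, cluster, pyspark_uri,
                           input_table, output_uri, num_workers=2):
    """Build the gcloud invocations for one parametrized preprocessing step.

    Returns three argument lists: create cluster, submit job, delete cluster.
    """
    common = [f"--project={project}", f"--region={region}"]
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              *common, f"--num-workers={num_workers}"]
    # Everything after the bare "--" is handed to the PySpark script itself.
    # These two script arguments are hypothetical; use whatever your script parses.
    submit = ["gcloud", "dataproc", "jobs", "submit", "pyspark", pyspark_uri,
              *common, f"--cluster={cluster}",
              "--", f"--input-table={input_table}", f"--output-uri={output_uri}"]
    # --quiet skips the interactive confirmation prompt.
    delete = ["gcloud", "dataproc", "clusters", "delete", cluster,
              *common, "--quiet"]
    return [create, submit, delete]


def as_shell_script(commands):
    """Join the commands into a script that deletes the cluster even if the job fails."""
    create, submit, delete = (shlex.join(c) for c in commands)
    return "\n".join([
        "set -e",                      # stop on the first failing command
        create,
        f"cleanup() {{ {delete}; }}",  # registered only after the cluster exists
        "trap cleanup EXIT",
        submit,
    ])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmds = dataproc_step_commands(
        project="my-project", region="us-central1", cluster="prep-cluster",
        pyspark_uri="gs://my-bucket/preprocess.py",
        input_table="sales.raw", output_uri="gs://my-bucket/preprocessed/",
    )
    print(as_shell_script(cmds))
```

In the Kubeflow Pipelines v1 SDK, a script like this could be passed to a `dsl.ContainerOp` as `command=["sh", "-c"]` and `arguments=[script]`, using an image that ships the gcloud CLI. The function arguments would then come from pipeline parameters, and the train and evaluate steps would read the CSV output from the Cloud Storage path.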

Google Cloud Professional Machine Learning Engineer — Question 99
