Google Cloud Professional Machine Learning Engineer — Question 177
You have been tasked with deploying prototype code to production. The feature engineering code is in PySpark and runs on Dataproc Serverless. The model training is executed by using a Vertex AI custom training job. The two steps are not connected, and the model training must currently be run manually after the feature engineering step finishes. You need to create a scalable and maintainable production process that runs end-to-end and tracks the connections between steps. What should you do?
Answer options
- A. Create a Vertex AI Workbench notebook. Use the notebook to submit the Dataproc Serverless feature engineering job. Use the same notebook to submit the custom model training job. Run the notebook cells sequentially to tie the steps together end-to-end.
- B. Create a Vertex AI Workbench notebook. Initiate an Apache Spark context in the notebook and run the PySpark feature engineering code. Use the same notebook to run the custom model training job in TensorFlow. Run the notebook cells sequentially to tie the steps together end-to-end.
- C. Use the Kubeflow Pipelines SDK to write code that specifies two components:
  - The first is a Dataproc Serverless component that launches the feature engineering job.
  - The second is a custom component wrapped in the create_custom_training_job_from_component utility that launches the custom model training job.
  Create a Vertex AI Pipelines job to link and run both components.
- D. Use the Kubeflow Pipelines SDK to write code that specifies two components:
  - The first component initiates an Apache Spark context that runs the PySpark feature engineering code.
  - The second component runs the TensorFlow custom model training code.
  Create a Vertex AI Pipelines job to link and run both components.
Correct answer: C
Explanation
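The correct answer is C. A pipeline written with the Kubeflow Pipelines SDK and run on Vertex AI Pipelines defines both steps as components of a single workflow, so training starts automatically when feature engineering finishes, and each run's parameters, artifacts, and step dependencies are recorded in Vertex ML Metadata. Option C also keeps the existing execution environments: the PySpark code still runs on Dataproc Serverless, and wrapping the training component with create_custom_training_job_from_component runs it as a Vertex AI custom training job. Options A and B depend on someone running notebook cells in order, which is neither scalable nor maintainable and records no lineage between steps; B additionally moves the Spark code off Dataproc Serverless. Option D does use a pipeline, but it starts a Spark context inside a pipeline component instead of submitting the job to Dataproc Serverless, and it runs the training code directly in a component instead of as a custom training job. Both steps would have to be re-engineered and would no longer use the managed services the prototype already relies on.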
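Option C can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes the kfp and google-cloud-pipeline-components packages are installed, and the bucket paths, the train_model body, the machine type, and the "--output" argument are hypothetical placeholders. The SDK imports are deferred into build_pipeline() only so the file can be loaded without the SDKs present. Note that .after() enforces ordering only; here the feature location is passed as a plain string parameter.

```python
"""Sketch: Dataproc Serverless feature engineering followed by a
Vertex AI custom training job, linked in one Kubeflow pipeline."""

# Hypothetical placeholders; replace with real Cloud Storage locations.
PYSPARK_MAIN_URI = "gs://example-bucket/code/feature_engineering.py"
FEATURES_URI = "gs://example-bucket/features/"


def build_pipeline():
    """Return a KFP pipeline function (SDK imports happen at call time)."""
    from kfp import dsl
    from google_cloud_pipeline_components.v1.custom_job import (
        create_custom_training_job_from_component,
    )
    from google_cloud_pipeline_components.v1.dataproc import (
        DataprocPySparkBatchOp,
    )

    @dsl.component(base_image="python:3.10")
    def train_model(features_uri: str):
        # Stand-in for the real training code (assumption).
        print(f"Training on features at {features_uri}")

    # Wrap the component so it executes as a Vertex AI custom training job.
    train_model_job = create_custom_training_job_from_component(
        train_model,
        display_name="train-model",
        machine_type="n1-standard-8",
    )

    @dsl.pipeline(name="feature-engineering-and-training")
    def pipeline(project: str, location: str = "us-central1"):
        # Step 1: submit the existing PySpark code as a Dataproc Serverless batch.
        feature_engineering = DataprocPySparkBatchOp(
            project=project,
            location=location,
            main_python_file_uri=PYSPARK_MAIN_URI,
            args=["--output", FEATURES_URI],
        )
        # Step 2: launch training only after feature engineering completes.
        train_model_job(
            features_uri=FEATURES_URI,
            project=project,
            location=location,
        ).after(feature_engineering)

    return pipeline
```

Compiling the returned function with kfp's compiler.Compiler().compile(...) produces a pipeline specification that can then be submitted as a Vertex AI Pipelines job.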