Google Cloud Professional Machine Learning Engineer — Question 326
You have developed a custom ML model using Vertex AI and want to deploy it for online serving. You need to optimize the model's serving performance by ensuring that the model can handle high throughput while minimizing latency. You want to use the simplest solution. What should you do?
Answer options
- A. Deploy the model to a Vertex AI endpoint resource to automatically scale the serving backend based on the throughput. Configure the endpoint's autoscaling settings to minimize latency.
- B. Implement a containerized serving solution using Cloud Run. Configure the concurrency settings to handle multiple requests simultaneously.
- C. Apply simplification techniques such as model pruning and quantization to reduce the model's size and complexity. Retrain the model using Vertex AI to improve its performance, latency, memory, and throughput.
- D. Enable request-response logging for the model hosted in Vertex AI. Use Looker Studio to analyze the logs, identify bottlenecks, and optimize the model accordingly.
Correct answer: A
Explanation
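The correct answer is A. A Vertex AI endpoint is the managed serving path for a model that already lives in Vertex AI: it adds or removes serving replicas as traffic changes, which sustains throughput under load and keeps latency low without extra infrastructure work. Option B can work, but it means building and maintaining your own serving container and request handling on Cloud Run, which is more effort than the simplest solution requires. Option C changes the model itself: pruning and quantization require rework and retraining, can cost accuracy, and do not make the serving backend scale. Option D only collects and analyzes logs; it can reveal bottlenecks but does not by itself improve throughput or latency.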
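The deployment described in option A can be sketched with the Vertex AI Python SDK (`google-cloud-aiplatform`). This is a minimal sketch, not a definitive setup: the `AutoscalingConfig` class and `deploy_with_autoscaling` function are hypothetical helpers, and the machine type, replica bounds, and CPU utilization target are assumed placeholder values that would need tuning for a real workload.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AutoscalingConfig:
    """Hypothetical holder for the autoscaling arguments passed to Model.deploy()."""
    machine_type: str = "n1-standard-4"           # assumed machine type
    min_replica_count: int = 1                    # replicas kept warm at all times
    max_replica_count: int = 5                    # upper bound under peak traffic
    autoscaling_target_cpu_utilization: int = 60  # scale out above this CPU percentage

    def __post_init__(self):
        # Reject settings that cannot describe a valid scaling range.
        if self.min_replica_count < 1:
            raise ValueError("min_replica_count must be at least 1")
        if self.max_replica_count < self.min_replica_count:
            raise ValueError("max_replica_count must be >= min_replica_count")
        if not 0 < self.autoscaling_target_cpu_utilization <= 100:
            raise ValueError("CPU utilization target must be in (0, 100]")


def deploy_with_autoscaling(project, location, model_id, config):
    """Deploy an uploaded Vertex AI model to a new endpoint with autoscaling."""
    # Imported here so the config above can be validated without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    model = aiplatform.Model(model_name=model_id)
    # With no endpoint argument, deploy() creates a new endpoint and returns it.
    return model.deploy(**asdict(config))
```

A call such as `deploy_with_autoscaling("my-project", "us-central1", "1234567890", AutoscalingConfig(max_replica_count=10))` would use placeholder identifiers; it requires valid credentials and a model already uploaded to Vertex AI. Raising `min_replica_count` trades cost for lower latency during traffic spikes, since fewer requests wait for new replicas to start.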