You developed an ML model with AI Platform, and you want to move it to production. You se…

Question

You developed an ML model with AI Platform, and you want to move it to production. You serve a few thousand queries per second and are experiencing latency issues. Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE). Your goal is to improve the serving latency without changing the underlying infrastructure. What should you do?

Accepted Answer

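Correct answer: D. Recompile TensorFlow Serving using the source to support CPU-specific optimizations. Instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes.

Building TensorFlow Serving from source lets the binary use the vector instruction sets available on the serving CPUs (for example AVX2 and FMA), and setting a minimum CPU platform ensures that every serving node actually supports those instructions. Each request is then computed faster while the pods stay CPU-only, so the infrastructure is unchanged. Increasing max_batch_size or max_enqueued_batches (options A and C) may help throughput, but requests then wait longer in the batching queue, so these settings do not reduce latency and can make it worse. Switching to the universal version of TensorFlow Serving (option B) does not help either: that build is compiled without platform-specific instruction sets so that it runs on most machines.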
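To make the "CPU-specific optimizations" part of option D concrete, here is a minimal Python sketch of how you might decide which compiler options to pass when building TensorFlow Serving from source. It is not part of the exam question. The helper names (`parse_cpu_flags`, `bazel_copts`) and the mapping table are illustrative assumptions. The sketch assumes a Linux x86 serving node that exposes its features on the `flags` line of `/proc/cpuinfo`, and a GCC-compatible compiler that accepts options such as `-mavx2` and `-mfma`.

```python
# Hypothetical helper (not from the original question): map CPU feature names
# as they appear in /proc/cpuinfo to GCC-style -m options, wrapped as Bazel
# --copt arguments for a from-source build.
CPUINFO_TO_GCC = {
    "sse4_1": "-msse4.1",
    "sse4_2": "-msse4.2",
    "avx": "-mavx",
    "avx2": "-mavx2",
    "fma": "-mfma",
    "avx512f": "-mavx512f",
}


def parse_cpu_flags(cpuinfo_text):
    """Return the feature names from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def bazel_copts(cpu_flags):
    """Return --copt options for every known feature the CPU reports."""
    return [
        f"--copt={gcc_option}"
        for name, gcc_option in CPUINFO_TO_GCC.items()
        if name in cpu_flags
    ]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux: no flags can be detected
    print(" ".join(bazel_copts(parse_cpu_flags(text))))
```

Run on a node from the serving pool, the script prints the options that node can support. A binary built with those options only runs on CPUs that have the same features, which is why option D also sets a minimum CPU platform for the GKE nodes.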

Google Cloud Professional Machine Learning Engineer — Question 46
