You are an ML engineer at a mobile gaming company. A data scientist on your team recently…

Question

You are an ML engineer at a mobile gaming company. A data scientist on your team recently trained a TensorFlow model, and you are responsible for deploying this model into a mobile application. You discover that the inference latency of the current model doesn’t meet production requirements. You need to reduce the inference time by 50%, and you are willing to accept a small decrease in model accuracy in order to reach the latency requirement. Without training a new model, which model optimization technique for reducing latency should you try first?

Accepted Answer

Correct answer: B. B. Dynamic range quantization — Dynamic range quantization is effective for reducing inference latency without the need for retraining, as it compresses the model's weights and activations to lower precision. While weight pruning, model distillation, and dimensionality reduction can also help with performance, they may require more complex adjustments or retraining, making them less suitable for immediate latency reduction.

Google Cloud Professional Machine Learning Engineer — Question 101

Answer options

Correct answer: B

Explanation