Google Cloud Professional Machine Learning Engineer — Question 101
You are an ML engineer at a mobile gaming company. A data scientist on your team recently trained a TensorFlow model, and you are responsible for deploying this model into a mobile application. You discover that the inference latency of the current model doesn’t meet production requirements. You need to reduce the inference time by 50%, and you are willing to accept a small decrease in model accuracy in order to reach the latency requirement. Without training a new model, which model optimization technique for reducing latency should you try first?
Answer options
- A. Weight pruning
- B. Dynamic range quantization
- C. Model distillation
- D. Dimensionality reduction
Correct answer: B
Explanation
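Dynamic range quantization is a post-training technique, so it needs no retraining. At conversion time it quantizes the model's weights from 32-bit floats to 8-bit integers, and for supported operations the activations are quantized dynamically at inference time. TensorFlow's documentation cites a model roughly 4x smaller and a typical 2x to 3x CPU speedup, at the cost of a small accuracy loss, which matches the stated trade-off and makes it the natural first thing to try. The other options fit the constraints less well: weight pruning mainly reduces model size and usually needs fine-tuning to recover accuracy, model distillation requires training a new student model, and dimensionality reduction changes the input features, so the model would have to be retrained on them.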
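To make the accuracy trade-off concrete, here is a minimal sketch of the arithmetic behind 8-bit weight quantization. It uses only the Python standard library. The helper names (`quantize_weights`, `dequantize_weights`), the sample weights, and the simplified per-tensor symmetric scheme are assumptions for illustration. This is not TensorFlow Lite's implementation, which may differ in detail.

```python
from typing import List, Tuple


def quantize_weights(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to int8 values using one symmetric per-tensor scale.

    Hypothetical helper for illustration only.
    """
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_weights(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


# Made-up example weights, not taken from any real model.
weights = [0.52, -1.27, 0.003, 0.9]
quantized, scale = quantize_weights(weights)
restored = dequantize_weights(quantized, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 weights:", quantized)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error is bounded by half a quantization step, which is where the small accuracy loss comes from. In TensorFlow itself, dynamic range quantization is requested at conversion time by setting `converter.optimizations = [tf.lite.Optimize.DEFAULT]` on a `tf.lite.TFLiteConverter` without supplying a representative dataset.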