Google Cloud Professional Machine Learning Engineer — Question 161
You work for a rapidly growing social media company. Your team builds TensorFlow recommender models in an on-premises CPU cluster. The data contains billions of historical user events and 100,000 categorical features. You notice that as the data increases, the model training time increases. You plan to move the models to Google Cloud. You want to use the most scalable approach that also minimizes training time. What should you do?
Answer options
- A. Deploy the training jobs by using TPU VMs with TPUv3 Pod slices, and use the TPUEmbedding API
- B. Deploy the training jobs in an autoscaling Google Kubernetes Engine cluster with CPUs
- C. Deploy a matrix factorization model training job by using BigQuery ML
- D. Deploy the training jobs by using Compute Engine instances with A100 GPUs, and use the tf.nn.embedding_lookup API
Correct answer: A
Explanation
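The correct answer is A. A recommender model with 100,000 categorical features is dominated by its embedding tables, and the TPUEmbedding API is built to shard large embedding tables across the chips of a TPU Pod slice. Combined with TPUv3 Pod slices, this lets training scale out as the data grows while keeping training time low. Option B scales the number of nodes, but it keeps the workload on CPUs, which is the setup that is already slowing down as the data increases. Option C replaces the team's existing TensorFlow recommender models with a BigQuery ML matrix factorization model, which is built around user-item interactions and is not a way to move the current models with their categorical features. Option D uses powerful hardware, but tf.nn.embedding_lookup is a general-purpose lookup operation rather than an accelerator-specific embedding API, and a set of Compute Engine GPU instances does not offer the scalability of a TPU Pod slice for this scenario.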
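To make the sharding idea concrete, here is a minimal sketch in plain Python of how an embedding table can be split across several devices and queried by ID. It is a conceptual illustration only and does not use or reproduce the TPUEmbedding API. The function names (`build_shards`, `lookup`), the modulo placement rule, and the sizes (1,000 rows, dimension 4, 8 shards) are all assumptions chosen for the example.

```python
from typing import Dict, List

Shard = Dict[int, List[float]]


def build_shards(vocab_size: int, dim: int, num_shards: int) -> List[Shard]:
    """Split an embedding table row-wise: row i lives on shard i % num_shards.

    The modulo rule is an assumed placement scheme for illustration only.
    """
    shards: List[Shard] = [dict() for _ in range(num_shards)]
    for row_id in range(vocab_size):
        # Toy deterministic values instead of trained weights.
        shards[row_id % num_shards][row_id] = [float(row_id)] * dim
    return shards


def lookup(shards: List[Shard], ids: List[int]) -> List[List[float]]:
    """Route each ID to the shard that owns its row and fetch the vector."""
    num_shards = len(shards)
    return [shards[i % num_shards][i] for i in ids]


# Hypothetical sizes: 1,000 categorical values, 4-dim vectors, 8 devices.
shards = build_shards(vocab_size=1000, dim=4, num_shards=8)
vectors = lookup(shards, [3, 11, 999])
rows_per_shard = [len(s) for s in shards]
```

Each shard holds only a fraction of the rows, so the memory needed per device shrinks as more devices are added. That property is what makes a sharded layout attractive when the embedding tables are too large or too slow to serve from a single machine.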