You are pre-training a large language model on Google Cloud. This model includes custom T…

Question

You are pre-training a large language model on Google Cloud. This model includes custom TensorFlow operations in the training loop. Model training will use a large batch size, and you expect training to take several weeks. You need to configure a training architecture that minimizes both training time and compute costs. What should you do?

Accepted Answer

Correct answer: A. Implement 8 workers of a2-megagpu-16g machines by using tf.distribute.MultiWorkerMirroredStrategy.

Option A is the best fit because each a2-megagpu-16g machine provides 16 NVIDIA A100 GPUs, and that much accelerator memory and bandwidth suits a large model trained with a large batch size. tf.distribute.MultiWorkerMirroredStrategy runs synchronous data-parallel training across every GPU on all 8 workers. The other options either rely on less capable hardware or use a TPU strategy, which is a poor match here: custom TensorFlow operations in the training loop are generally not supported on TPUs, and Google's guidance recommends GPUs for such models.
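To make the setup in option A more concrete, here is a minimal sketch of how one worker in an 8-machine cluster might be configured. It is an illustration under stated assumptions, not a reference implementation: the hostnames, the port, the per-replica batch size, and the helper names `build_tf_config`, `global_batch_size`, and `make_strategy` are all hypothetical. The `TF_CONFIG` layout and the `tf.distribute.MultiWorkerMirroredStrategy()` call are the standard TensorFlow mechanisms.

```python
import json
import os

NUM_WORKERS = 8        # from option A
GPUS_PER_WORKER = 16   # an a2-megagpu-16g machine has 16 A100 GPUs


def build_tf_config(worker_hosts, task_index):
    """Return the TF_CONFIG JSON string for one worker in the cluster."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must identify one of the worker hosts")
    return json.dumps({
        "cluster": {"worker": list(worker_hosts)},
        "task": {"type": "worker", "index": task_index},
    })


def global_batch_size(per_replica_batch,
                      num_workers=NUM_WORKERS,
                      gpus_per_worker=GPUS_PER_WORKER):
    """Synchronous data parallelism: one replica per GPU across all workers."""
    return per_replica_batch * num_workers * gpus_per_worker


def make_strategy(tf_config):
    """Create the strategy; TF_CONFIG must be set before it is constructed."""
    import tensorflow as tf  # imported lazily so the helpers above need only the stdlib
    os.environ["TF_CONFIG"] = tf_config
    return tf.distribute.MultiWorkerMirroredStrategy()


# Hypothetical hostnames and port, for illustration only.
hosts = [f"worker-{i}.example.internal:2222" for i in range(NUM_WORKERS)]
tf_config = build_tf_config(hosts, task_index=0)

# Assumed per-replica batch of 32 across 8 x 16 = 128 GPUs.
batch = global_batch_size(per_replica_batch=32)
```

Each worker would run the same script with its own `task_index`, then build and compile the model inside `with make_strategy(tf_config).scope():`. Managed training services often populate `TF_CONFIG` automatically, in which case only the strategy construction is needed.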

Google Cloud Professional Machine Learning Engineer — Question 211
