Google Cloud Professional Machine Learning Engineer — Question 199
You developed a Transformer model in TensorFlow to translate text. Your training data includes millions of documents in a Cloud Storage bucket. You plan to use distributed training to reduce training time. You need to configure the training job while minimizing the effort required to modify code and to manage the cluster’s configuration. What should you do?
Answer options
- A. Create a Vertex AI custom training job with GPU accelerators for the second worker pool. Use tf.distribute.MultiWorkerMirroredStrategy for distribution.
- B. Create a Vertex AI custom distributed training job with Reduction Server. Use N1 high-memory machine type instances for the first and second pools, and use N1 high-CPU machine type instances for the third worker pool.
- C. Create a training job that uses Cloud TPU VMs. Use tf.distribute.TPUStrategy for distribution.
- D. Create a Vertex AI custom training job with a single worker pool of A2 GPU machine type instances. Use tf.distribute.MirroredStrategy for distribution.
Correct answer: A
Explanation
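The correct answer is A. tf.distribute.MultiWorkerMirroredStrategy performs synchronous data-parallel training across multiple machines, and adopting it usually means little more than creating the strategy and building the model inside its scope. Vertex AI custom training provisions the worker pools and sets the TF_CONFIG environment variable on each replica, so there is no cluster to configure by hand. Option B adds a third worker pool of reducer nodes to configure, and Reduction Server is meant for GPU workers that use NCCL all-reduce, while the option does not mention accelerators. Option C requires adapting the code for tf.distribute.TPUStrategy and managing Cloud TPU VMs, which works against the goal of minimizing effort. Option D is valid, but tf.distribute.MirroredStrategy only spans the GPUs attached to a single machine, so it cannot scale across multiple workers the way option A does.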
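To make the "no manual cluster configuration" point concrete, here is a minimal sketch of how a training script could inspect the TF_CONFIG value that the platform sets on each replica. The sample value, host names, and the helper name describe_replica are assumptions made for illustration, not taken from the question or from any library. MultiWorkerMirroredStrategy reads TF_CONFIG itself, so a script would normally only need this kind of parsing for tasks such as deciding which replica writes checkpoints.

```python
import json
import os

# Illustrative TF_CONFIG for a job with one chief and two workers.
# The host names and ports are made up for this example.
SAMPLE_TF_CONFIG = json.dumps({
    "cluster": {
        "chief": ["host-0:2222"],
        "worker": ["host-1:2222", "host-2:2222"],
    },
    "task": {"type": "worker", "index": 1},
})


def describe_replica(tf_config_json):
    """Summarize this replica's role from a TF_CONFIG JSON string."""
    config = json.loads(tf_config_json or "{}")
    cluster = config.get("cluster", {})
    task = config.get("task", {})

    task_type = task.get("type", "worker")
    task_index = task.get("index", 0)

    # Count only the replicas that take part in training.
    num_replicas = len(cluster.get("chief", [])) + len(cluster.get("worker", []))

    # If the cluster has no explicit chief, worker 0 takes that role.
    is_chief = task_type == "chief" or (
        "chief" not in cluster and task_type == "worker" and task_index == 0
    )

    return {
        "task_type": task_type,
        "task_index": task_index,
        "num_training_replicas": max(num_replicas, 1),
        "is_chief": is_chief,
    }


if __name__ == "__main__":
    # Fall back to the sample when TF_CONFIG is not set, e.g. on a local run.
    print(describe_replica(os.environ.get("TF_CONFIG", SAMPLE_TF_CONFIG)))
```

In a real job, the replica count could be used to scale the global batch size, and the is_chief flag to restrict checkpoint and export writes to one replica. Both uses are common patterns rather than requirements stated in the question.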