Google Cloud Professional Machine Learning Engineer — Question 149
You are training an object detection machine learning model on a dataset that consists of three million X-ray images, each roughly 2 GB in size. You are using Vertex AI Training to run a custom training application on a Compute Engine instance with 32 cores, 128 GB of RAM, and 1 NVIDIA P100 GPU. You notice that model training is taking a very long time. You want to decrease training time without sacrificing model performance. What should you do?
Answer options
- A. Increase the instance memory to 512 GB, and increase the batch size.
- B. Replace the NVIDIA P100 GPU with a K80 GPU in the training job.
- C. Enable early stopping in your Vertex AI Training job.
- D. Use the tf.distribute.Strategy API and run a distributed training job.
Correct answer: D
Explanation
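The correct choice is D. The tf.distribute.Strategy API distributes training across multiple GPUs or multiple machines, so more data is processed in parallel on each training step and wall-clock training time drops without changing the model architecture. With three million images of roughly 2 GB each (about 6 PB in total), the single P100 is the bottleneck. Option A adds host memory but no compute: the GPU still does the same amount of work per image, and the usable batch size is limited by GPU memory rather than by instance RAM. Option B would slow training down, because the K80 is an older and less capable GPU than the P100. Option C can end a job sooner, but it does not make each training step any faster, and stopping before the model has converged risks sacrificing model performance.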
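To make the scale argument concrete, here is a rough back-of-the-envelope sketch in plain Python. The image count and image size come from the question. The per-replica throughput is a hypothetical placeholder, not a measured figure, and the linear speedup is an idealized assumption that ignores communication and input-pipeline overhead. The helper names are illustrative only and are not part of any library.

```python
# Figures taken from the question.
NUM_IMAGES = 3_000_000
IMAGE_SIZE_GB = 2

# Hypothetical assumption (not from the question): sustained throughput
# of one GPU replica, in images per second.
ASSUMED_IMAGES_PER_SEC_PER_REPLICA = 2.0


def dataset_size_pb(num_images, image_size_gb):
    """Total dataset size in decimal petabytes (1 PB = 1,000,000 GB)."""
    return num_images * image_size_gb / 1_000_000


def ideal_epoch_hours(num_images, images_per_sec_per_replica, num_replicas):
    """Hours per epoch assuming perfect linear scaling across replicas."""
    seconds = num_images / (images_per_sec_per_replica * num_replicas)
    return seconds / 3600


def global_batch_size(per_replica_batch, num_replicas):
    """In synchronous data parallelism, each step consumes one batch per replica."""
    return per_replica_batch * num_replicas


print(f"Dataset size: {dataset_size_pb(NUM_IMAGES, IMAGE_SIZE_GB)} PB")
for replicas in (1, 8, 64):
    hours = ideal_epoch_hours(
        NUM_IMAGES, ASSUMED_IMAGES_PER_SEC_PER_REPLICA, replicas
    )
    print(f"{replicas:>3} replica(s): about {hours:.1f} hours per epoch (idealized)")
```

Real speedups are usually below this ideal. In TensorFlow, the replicas would come from a strategy such as tf.distribute.MirroredStrategy (multiple GPUs on one machine) or tf.distribute.MultiWorkerMirroredStrategy (multiple machines), with the model built inside the strategy's scope.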