Google Cloud Professional Machine Learning Engineer — Question 67
You lead a data science team at a large international corporation. Most of the models your team trains are large-scale models using high-level TensorFlow APIs on AI Platform with GPUs. Your team usually takes a few weeks or months to iterate on a new version of a model. You were recently asked to review your team’s spending. How should you reduce your Google Cloud compute costs without impacting the model’s performance?
Answer options
- A. Use AI Platform to run distributed training jobs with checkpoints.
- B. Use AI Platform to run distributed training jobs without checkpoints.
- C. Migrate to training with Kubeflow on Google Kubernetes Engine, and use preemptible VMs with checkpoints.
- D. Migrate to training with Kubeflow on Google Kubernetes Engine, and use preemptible VMs without checkpoints.
Correct answer: C
Explanation
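The correct answer is C. Preemptible VMs on Google Kubernetes Engine cost substantially less than standard VMs, but Compute Engine can reclaim them at any time and they run for at most 24 hours. Checkpoints let an interrupted Kubeflow training job resume from the last saved state instead of restarting, so the savings do not come at the expense of the model's performance. Options A and B keep training on AI Platform and therefore miss the cost benefit of preemptible VMs; adding or removing checkpoints alone does not lower compute cost. Option D uses preemptible VMs without checkpoints, so a preemption during a long-running training job would discard all progress made so far.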
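The reason checkpoints matter on preemptible VMs can be illustrated with a minimal sketch. It uses only the Python standard library rather than TensorFlow's own checkpointing, and the function name `train`, the JSON checkpoint file, and the simulated preemption are all assumptions made for illustration, not part of any Google Cloud or Kubeflow API.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, save_every=10, preempt_at=None):
    """Toy training loop that resumes from ckpt_path when it exists.

    Returns the number of steps executed in this invocation.
    preempt_at is a hypothetical hook that simulates the VM being reclaimed.
    """
    state = {"step": 0, "weight": 0.0}
    if os.path.exists(ckpt_path):
        # Resume from the last saved state instead of starting over.
        with open(ckpt_path) as f:
            state = json.load(f)

    executed = 0
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise RuntimeError("simulated preemption")
        state["step"] += 1
        state["weight"] += 0.5  # stand-in for a real gradient update
        executed += 1
        if state["step"] % save_every == 0:
            # Write to a temporary file, then rename, so a preemption
            # in the middle of a save cannot corrupt the checkpoint.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return executed


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        train(100, path, save_every=10, preempt_at=25)
    except RuntimeError:
        pass  # the VM was reclaimed after step 25
    # The restarted job picks up at step 20, the last checkpoint.
    print(train(100, path, save_every=10))
```

In this sketch the preemption costs only the five steps taken since the last checkpoint. Without the checkpoint file (option D), the restarted job would begin again from step 0.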