Google Cloud Professional Machine Learning Engineer — Question 131
You work for a biotech startup that is experimenting with deep learning ML models based on properties of biological organisms. Your team frequently works on early-stage experiments with new architectures of ML models, and writes custom TensorFlow ops in C++. You train your models on large datasets and large batch sizes. Your typical batch size has 1024 examples, and each example is about 1 MB in size. The average size of a network with all weights and embeddings is 20 GB. What hardware should you choose for your models?
Answer options
- A. A cluster with 2 n1-highcpu-64 machines, each with 8 NVIDIA Tesla V100 GPUs (128 GB GPU memory in total), and a n1-highcpu-64 machine with 64 vCPUs and 58 GB RAM
- B. A cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM
- C. A cluster with an n1-highcpu-64 machine with a v2-8 TPU and 64 GB RAM
- D. A cluster with 4 n1-highcpu-96 machines, each with 96 vCPUs and 86 GB RAM
Correct answer: D
Explanation
The correct answer is D because it provides a balance of high vCPU count and adequate RAM, which is essential for handling large batch sizes and datasets efficiently. Options A and B, while having high GPU memory, do not provide sufficient CPU resources for the workload. Option C lacks the necessary CPU power and RAM capacity compared to option D.