Question

A data scientist is training a large PyTorch model by using Amazon SageMaker. It takes 10 hours on average to train the model on GPU instances. The data scientist suspects that training is not converging and that resource utilization is not optimal. What should the data scientist do to identify and address training issues with the LEAST development effort?

Accepted Answer

Correct answer: C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.

C is correct because each built-in rule maps to one of the two suspicions in the question: vanishing_gradient flags gradients that shrink toward zero, a common reason training fails to converge, and LowGPUUtilization flags underused GPUs. Built-in rules need no custom rule code, which is what keeps the development effort low, and the StopTrainingJob action can end a failing job early instead of letting it run for the full 10 hours. Options A and B focus on CPU metrics and custom metrics, which may not give direct insight into GPU utilization, while D addresses concerns that are less relevant to the situation described.
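As a rough illustration of what option C could look like in practice, here is a minimal sketch that assumes the SageMaker Python SDK v2 (`sagemaker.debugger`). The helper name `build_debugger_rules` is hypothetical. The stop-training action is attached only to the Debugger rule; the sketch does not assume that profiler rules accept actions.

```python
def build_debugger_rules():
    """Return built-in Debugger rules for a SageMaker training job.

    Hypothetical helper for illustration; assumes SageMaker Python SDK v2.
    """
    # Imported lazily so this module still loads where the SDK is not installed.
    from sagemaker.debugger import ProfilerRule, Rule, rule_configs

    # Built-in action that stops the training job when a rule is triggered.
    stop_on_issue = rule_configs.ActionList(rule_configs.StopTraining())

    return [
        # Debugger rule: fires when gradient magnitudes become extremely small.
        Rule.sagemaker(rule_configs.vanishing_gradient(), actions=stop_on_issue),
        # Profiler rule: fires when GPU utilization stays low or fluctuates.
        ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ]

# Assumed usage: pass the list to the estimator, for example
#   PyTorch(..., rules=build_debugger_rules())
# where the remaining estimator arguments depend on your own account and script.
```

The training script itself stays unchanged in this sketch, which is the point of choosing built-in rules when the goal is the least development effort.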

AWS Certified Machine Learning – Specialty — Question 207

Answer options

Correct answer: C

Explanation