AWS Certified Machine Learning – Specialty — Question 202
An ecommerce company wants to train a large image classification model with 10,000 classes. The company runs multiple model training iterations and needs to minimize operational overhead and cost. The company also needs to avoid loss of work and model retraining.
Which solution will meet these requirements?
Answer options
- A. Create the training jobs as AWS Batch jobs that use Amazon EC2 Spot Instances in a managed compute environment.
- B. Use Amazon EC2 Spot Instances to run the training jobs. Use a Spot Instance interruption notice to save a snapshot of the model to Amazon S3 before an instance is terminated.
- C. Use AWS Lambda to run the training jobs. Save model weights to Amazon S3.
- D. Use managed spot training in Amazon SageMaker. Launch the training jobs with checkpointing enabled.
Correct answer: D
Explanation
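The correct answer is D. Managed spot training in Amazon SageMaker runs training jobs on Spot capacity, which lowers cost, and SageMaker handles Spot interruptions and restarts the job itself, which keeps operational overhead low. With checkpointing enabled, checkpoints are synced to Amazon S3, so an interrupted job resumes from the last checkpoint instead of starting over. Option A can reduce cost, but AWS Batch does not checkpoint model state on its own; the company would have to build and maintain checkpoint and resume logic, and an interrupted job could otherwise lose its progress. Option B relies on the two-minute Spot Instance interruption notice, which requires custom handling code and may not leave enough time to save a large model, so it adds operational overhead and still risks losing work. Option C is not viable because AWS Lambda has a 15-minute maximum execution time and no GPU support, which rules out training a large image classification model.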
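To make option D concrete, here is a minimal sketch of the request parameters that turn on managed spot training and checkpointing in the SageMaker CreateTrainingJob API. The helper name `build_spot_training_request`, the instance type, the S3 prefixes, and all argument values are illustrative assumptions, not part of the question. The sketch only builds the request; it does not call AWS.

```python
def build_spot_training_request(job_name, image_uri, role_arn, bucket,
                                max_run_s=3600, max_wait_s=7200):
    """Build keyword arguments for the SageMaker CreateTrainingJob API.

    Hypothetical helper: every value passed in is a placeholder chosen
    for illustration.
    """
    # With managed spot training, the wait time covers both waiting for
    # Spot capacity and the run itself, so it must not be shorter.
    if max_wait_s < max_run_s:
        raise ValueError("max_wait_s must be >= max_run_s")

    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",  # assumed GPU instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 100,
        },
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        # Run the job on Spot capacity managed by SageMaker.
        "EnableManagedSpotTraining": True,
        # Files written to LocalPath are synced to S3Uri and restored
        # when the job restarts after an interruption.
        "CheckpointConfig": {
            "S3Uri": f"s3://{bucket}/checkpoints/{job_name}/",
            "LocalPath": "/opt/ml/checkpoints",
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
    }


request = build_spot_training_request(
    job_name="image-classifier-demo",
    image_uri="<training-image-uri>",
    role_arn="<execution-role-arn>",
    bucket="example-bucket",
)
```

The resulting dictionary could be passed to `boto3.client("sagemaker").create_training_job(**request)`. Note that SageMaker only syncs the checkpoint directory; the training script itself still has to write checkpoints to the local path and load the latest one on startup.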