AWS Certified Machine Learning – Specialty — Question 118
A Data Scientist is developing a binary classifier to predict whether a patient has a particular disease based on a series of test results. The Data Scientist has data on 400 patients randomly selected from the population. The disease is seen in 3% of the population.
Which cross-validation strategy should the Data Scientist adopt?
Answer options
- A. A k-fold cross-validation strategy with k=5
- B. A stratified k-fold cross-validation strategy with k=5
- C. A k-fold cross-validation strategy with k=5 and 3 repeats
- D. An 80/20 stratified split between training and validation
Correct answer: B
Explanation
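The correct answer is B. With 400 patients and a 3% prevalence, the sample is expected to contain only about 12 positive cases. A stratified k-fold cross-validation strategy keeps the class proportions in every fold close to those of the full dataset, so each of the 5 folds receives roughly 2 or 3 positives. Options A and C assign patients to folds without regard to the labels, so a fold can end up with very few positives or none at all, which makes per-fold metrics such as recall unstable or undefined. Repeating the procedure (option C) averages over more splits but does not remove that problem. Option D is stratified, but it evaluates the model on a single validation set of 80 patients containing only about 2 or 3 positives, so the resulting estimate is too noisy to rely on.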
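
The difference between options A and B can be illustrated with a small standard-library sketch. It assumes a sample with exactly 12 positives out of 400 (the expected count at 3% prevalence, not a figure given in the question), and the helper names `plain_kfold`, `stratified_kfold` and `positives_per_fold` are hypothetical, written only for this illustration. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random


def plain_kfold(n_samples, k, rng):
    """Shuffle all indices and deal them into k folds, ignoring the labels."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def stratified_kfold(labels, k, rng):
    """Deal each class's shuffled indices round-robin across the k folds,
    so every fold keeps roughly the class ratio of the full dataset."""
    folds = [[] for _ in range(k)]
    slot = 0  # running counter so fold sizes stay balanced across classes
    for cls in sorted(set(labels)):
        cls_indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(cls_indices)
        for idx in cls_indices:
            folds[slot % k].append(idx)
            slot += 1
    return folds


def positives_per_fold(folds, labels):
    """Count how many positive (label 1) samples land in each fold."""
    return [sum(labels[i] for i in fold) for fold in folds]


# Assumed illustrative data: 400 patients, 12 of them positive (3%).
labels = [1] * 12 + [0] * 388
rng = random.Random(0)  # fixed seed only to make the run repeatable

plain_folds = plain_kfold(len(labels), 5, rng)
strat_folds = stratified_kfold(labels, 5, rng)

plain = positives_per_fold(plain_folds, labels)
strat = positives_per_fold(strat_folds, labels)

print("plain k-fold, positives per fold:     ", plain)
print("stratified k-fold, positives per fold:", strat)
```

In the stratified case every fold holds 2 or 3 positives by construction, while the plain split depends on the shuffle and can be noticeably more uneven from one run to the next.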