AWS Certified Machine Learning – Specialty — Question 329
A banking company provides financial products to customers around the world. A machine learning (ML) specialist collected transaction data from internal customers. The ML specialist split the dataset into training, testing, and validation datasets. The ML specialist analyzed the training dataset by using Amazon SageMaker Clarify. The analysis found that the training dataset contained fewer examples of customers in the 40 to 55 year-old age group compared to the other age groups.
Which type of pretraining bias did the ML specialist observe in the training dataset?
Answer options
- A. Difference in proportions of labels (DPL)
- B. Class imbalance (CI)
- C. Conditional demographic disparity (CDD)
- D. Kolmogorov-Smirnov (KS)
Correct answer: B
Explanation
Class imbalance (CI) measures whether one facet group has significantly fewer training examples than the other groups, which is exactly the underrepresentation observed for the 40 to 55 age group. CI depends only on how many examples each facet has, not on the labels. In contrast, Difference in proportions of labels (DPL) compares the proportion of positive outcomes between facets, and Conditional demographic disparity (CDD) compares a facet's share of rejected outcomes with its share of accepted outcomes, conditioned on subgroups. Both analyze the distribution of labels rather than the representation of a demographic group. Kolmogorov-Smirnov (KS) is the maximum divergence between the label distributions of two facets, so it is also a label-based metric and does not capture a shortage of examples.
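To make the CI reasoning concrete, here is a minimal sketch in plain Python. It assumes the normalized form CI = (n_a - n_d) / (n_a + n_d), where n_d is the number of examples in the facet of interest (here, ages 40 to 55) and n_a is the number of all other examples. The `class_imbalance` helper and the `ages` list are hypothetical and only for illustration. They are not part of the question and are not SageMaker Clarify API calls.

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """Return CI = (n_a - n_d) / (n_a + n_d), a value in [-1, 1].

    Positive values mean facet d has fewer examples than facet a.
    """
    total = n_a + n_d
    if total == 0:
        raise ValueError("at least one example is required")
    return (n_a - n_d) / total


# Hypothetical customer ages, made up for illustration only.
ages = [23, 31, 44, 62, 29, 58, 35, 71, 50, 27]

# Facet d: customers aged 40 to 55 (inclusive); facet a: everyone else.
n_d = sum(1 for age in ages if 40 <= age <= 55)
n_a = len(ages) - n_d

ci = class_imbalance(n_a, n_d)
print(f"n_a={n_a}, n_d={n_d}, CI={ci:.2f}")
```

The calculation never reads a label column. That is why CI, rather than DPL, CDD, or KS, matches a finding stated purely in terms of how many examples a group has.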