AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 20
A company has a large, unstructured dataset. The dataset includes many duplicate records across several key attributes.
Which solution on AWS will detect duplicates in the dataset with the LEAST code development?
Answer options
- A. Use Amazon Mechanical Turk jobs to detect duplicates.
- B. Use Amazon QuickSight ML Insights to build a custom deduplication model.
- C. Use Amazon SageMaker Data Wrangler to pre-process and detect duplicates.
- D. Use the AWS Glue FindMatches transform to detect duplicates.
Correct answer: D
Explanation
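D is correct because the AWS Glue FindMatches ML transform is built for record matching and deduplication. It identifies records that refer to the same entity even when their fields do not match exactly, and you teach it by labeling example records rather than by writing matching rules or model code. The other options fit less well. A relies on human workers, which is slow and costly for a large dataset. B is not viable because QuickSight ML Insights provides anomaly detection, forecasting, and auto-narratives for dashboards, not custom deduplication models. C can drop exact duplicate rows with a built-in Data Wrangler transform, but matching near-duplicates across several attributes would likely require custom transform code.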
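To show how little code option D involves, here is a minimal sketch of defining a FindMatches transform with boto3's `create_ml_transform` call. The transform name, database, table, key column, and role ARN are hypothetical placeholders, and the sketch assumes the dataset has already been cataloged as a table in the AWS Glue Data Catalog. The request is built as a plain dictionary so it can be inspected without AWS credentials.

```python
def build_find_matches_request(name, database, table, primary_key, role_arn,
                               precision_recall=0.9):
    """Build keyword arguments for the Glue create_ml_transform API call.

    All argument values are supplied by the caller; nothing here is
    specific to a real account or dataset.
    """
    if not 0.0 <= precision_recall <= 1.0:
        raise ValueError("precision_recall must be between 0.0 and 1.0")
    return {
        "Name": name,
        "Role": role_arn,
        # The table must already exist in the Glue Data Catalog.
        "InputRecordTables": [
            {"DatabaseName": database, "TableName": table}
        ],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                # Column that uniquely identifies each source record.
                "PrimaryKeyColumnName": primary_key,
                # Values closer to 1.0 favor precision over recall.
                "PrecisionRecallTradeoff": precision_recall,
            },
        },
    }


def create_transform(request):
    """Send the request to AWS Glue and return the new transform ID."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    return glue.create_ml_transform(**request)["TransformId"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    request = build_find_matches_request(
        name="dedupe-customers",
        database="sales_db",
        table="customers",
        primary_key="customer_id",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    )
    print(request["Parameters"]["TransformType"])
```

Creating the transform is only the first step. Before it is used in a Glue job, a FindMatches transform is normally taught with a set of labeled example records, which can be done from the Glue console without further code.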