AWS Certified Data Engineer – Associate (DEA-C01) — Question 254
An investment company needs to manage and extract insights from a volume of semi-structured data that grows continuously.
A data engineer needs to deduplicate the semi-structured data by removing exact duplicate records and records that duplicate another record apart from common misspellings.
Which solution will meet these requirements with the LEAST operational overhead?
Answer options
- A. Use the FindMatches feature of AWS Glue to remove duplicate records.
- B. Use non-window functions in Amazon Athena to remove duplicate records.
- C. Use Amazon Neptune ML and an Apache Gremlin script to remove duplicate records.
- D. Use the global tables feature of Amazon DynamoDB to prevent duplicate data.
Correct answer: A
Explanation
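The correct answer is A. FindMatches is a machine learning transform in AWS Glue that identifies records referring to the same entity even when their fields do not match exactly, so it catches both exact duplicates and misspelled variants. It runs as part of a serverless AWS Glue job, typically after being taught with a set of labeled example records, which keeps operational overhead low. Option B falls short because standard SQL in Amazon Athena, such as SELECT DISTINCT or GROUP BY, removes only exact duplicates; catching misspellings would require custom matching logic. Option C would require loading the data into a graph database and then building and maintaining a graph machine learning model and Gremlin queries, which adds significant overhead. Option D does not address deduplication at all: DynamoDB global tables replicate a table across AWS Regions and neither detect nor remove duplicate records.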
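
To illustrate why misspelled duplicates call for fuzzy matching rather than exact matching, here is a minimal Python sketch. It is not how FindMatches works internally; it uses the standard library's difflib as a stand-in similarity measure. The sample records, the 0.9 threshold, and the helper names are all assumptions made for this illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sample records; names and values are made up for illustration.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "Boston"},
    {"id": 2, "name": "Jonathan Smith", "city": "Boston"},  # exact duplicate of 1
    {"id": 3, "name": "Jonathon Smith", "city": "Boston"},  # misspelling of 1
    {"id": 4, "name": "Maria Garcia", "city": "Austin"},
]


def key(rec):
    """Normalized comparison key: lowercased name and city."""
    return (rec["name"].lower(), rec["city"].lower())


def exact_dedup(recs):
    """Keep the first record per key, similar in effect to SELECT DISTINCT."""
    seen = set()
    kept = []
    for rec in recs:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            kept.append(rec)
    return kept


def similarity(a, b):
    """Character-level similarity in [0, 1] between two records' keys."""
    return SequenceMatcher(None, " ".join(key(a)), " ".join(key(b))).ratio()


def fuzzy_dedup(recs, threshold=0.9):
    """Keep a record only if it is not too similar to any record already kept.

    The threshold is an assumed value chosen for this toy data set.
    """
    kept = []
    for rec in recs:
        if all(similarity(rec, k) < threshold for k in kept):
            kept.append(rec)
    return kept


exact_ids = [r["id"] for r in exact_dedup(records)]
fuzzy_ids = [r["id"] for r in fuzzy_dedup(records)]
print(exact_ids)  # [1, 3, 4]  the misspelled record 3 survives
print(fuzzy_ids)  # [1, 4]     the misspelled record 3 is treated as a duplicate
```

Exact matching leaves the misspelled record in place, which is the limitation of option B. A managed transform such as FindMatches spares the data engineer from writing, tuning, and scaling pairwise comparison logic like this by hand.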