AWS Certified Data Analytics – Specialty — Question 127
A healthcare company ingests patient data from multiple data sources and stores it in an Amazon S3 staging bucket. An AWS Glue ETL job transforms the data, which is written to an S3-based data lake to be queried using Amazon Athena. The company wants to match patient records even when the records do not have a common unique identifier.
Which solution meets this requirement?
Answer options
- A. Use Amazon Macie pattern matching as part of the ETL job
- B. Train and use the AWS Glue PySpark filter class in the ETL job
- C. Partition tables and use the ETL job to partition the data on patient name
- D. Train and use the AWS Glue FindMatches ML transform in the ETL job
Correct answer: D
Explanation
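The correct answer is D. The AWS Glue FindMatches ML transform is designed to identify matching or duplicate records that have no common unique identifier and whose fields may not match exactly. You train it by labeling example records as matches or non-matches, and the trained transform then runs as a step in the Glue ETL job. The other options do not link records. Option A fails because Amazon Macie discovers and helps protect sensitive data in S3, and its pattern matching serves data classification rather than record linkage. Option B fails because the AWS Glue PySpark Filter class only keeps the records that satisfy a predicate function, so there is nothing to train and it cannot pair up related records. Option C fails because partitioning on patient name only changes how the data is laid out for queries, and records with spelling variations of the same name would still land in different partitions.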
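To make the distinction concrete, the following is a minimal sketch of similarity-based record matching in plain Python. It is not the FindMatches transform or its API: it is a toy stand-in that uses the standard library's difflib, and the sample records, field names, and 0.9 threshold are all invented for illustration. FindMatches learns its matching behavior from labeled examples rather than relying on a hand-picked threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical patient records from different sources; there is no shared ID.
records = [
    {"source": "clinic", "name": "Jonathan Smith", "dob": "1980-02-14", "zip": "98101"},
    {"source": "lab", "name": "Jonathon Smith", "dob": "1980-02-14", "zip": "98101"},
    {"source": "clinic", "name": "Maria Garcia", "dob": "1975-07-30", "zip": "73301"},
]

FIELDS = ("name", "dob", "zip")


def similarity(a, b):
    """Average string similarity across the compared fields (0.0 to 1.0)."""
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in FIELDS
    ]
    return sum(scores) / len(scores)


def find_matches(rows, threshold=0.9):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(rows), 2)
        if similarity(a, b) >= threshold
    ]


# Fuzzy comparison pairs the two spellings of the same patient.
matches = find_matches(records)

# An exact-equality test on name, which is all a filter predicate or a
# partition key on patient name effectively gives you, finds no pairs.
exact = [
    (i, j)
    for (i, a), (j, b) in combinations(enumerate(records), 2)
    if a["name"] == b["name"]
]

print("fuzzy matches:", matches)
print("exact matches:", exact)
```

The comparison of every pair of records is fine for a toy list but does not scale to a data lake, which is one more reason to use a managed transform inside the ETL job rather than hand-rolled matching logic.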