AWS Certified Data Engineer – Associate (DEA-C01) — Question 121
A company ingests data from multiple data sources and stores the data in an Amazon S3 bucket. An AWS Glue extract, transform, and load (ETL) job transforms the data and writes the transformed data to an Amazon S3-based data lake. The company uses Amazon Athena to query the data in the data lake.
The company needs to identify matching records even when the records do not have a common unique identifier.
Which solution will meet this requirement?
Answer options
- A. Use Amazon Macie pattern matching as part of the ETL job.
- B. Train and use the AWS Glue PySpark Filter class in the ETL job.
- C. Partition tables and use the ETL job to partition the data on a unique identifier.
- D. Train and use the AWS Lake Formation FindMatches transform in the ETL job.
Correct answer: D
Explanation
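The correct answer is D. The AWS Lake Formation FindMatches transform is a machine learning transform, available in AWS Glue, that is designed to identify matching records that lack a common unique identifier. You train it on labeled examples of matching and non-matching records and then run it in the ETL job. Option A is incorrect because Amazon Macie discovers and protects sensitive data in Amazon S3; it does not match or link records. Option B is incorrect because the AWS Glue PySpark Filter class selects the records that satisfy a predicate function; it is not trainable and does not identify matches. Option C is incorrect because partitioning on a unique identifier presupposes the very identifier that the records lack, and partitioning organizes data for query performance rather than matching records.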
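To make the underlying problem concrete, the sketch below shows what matching records without a shared identifier can look like. It is not FindMatches and does not call any AWS API: it swaps in a simple fuzzy string comparison from Python's standard library (`difflib.SequenceMatcher`) in place of the trained machine learning model. The sample records, the field names, the helper functions `similarity` and `find_matching_pairs`, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical records from different sources; note there is no shared ID column.
records = [
    {"name": "Jonathan Smith", "city": "Seattle", "email": "jon.smith@example.com"},
    {"name": "Jon Smith", "city": "Seattle", "email": "jon.smith@example.com"},
    {"name": "Maria Garcia", "city": "Austin", "email": "mgarcia@example.com"},
]

FIELDS = ("name", "city", "email")


def similarity(a, b):
    """Return the mean fuzzy similarity (0.0 to 1.0) across the compared fields."""
    scores = [
        SequenceMatcher(None, a[field].lower(), b[field].lower()).ratio()
        for field in FIELDS
    ]
    return sum(scores) / len(scores)


def find_matching_pairs(rows, threshold=0.85):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    The threshold is an arbitrary illustrative value, not a recommended setting.
    """
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if similarity(rows[i], rows[j]) >= threshold:
                pairs.append((i, j))
    return pairs


matches = find_matching_pairs(records)
print(matches)  # [(0, 1)]
```

The first two records are paired even though their names differ and no key links them. A hand-tuned threshold like this one is brittle and the pairwise loop does not scale, which is roughly the gap that a trained transform such as FindMatches is meant to fill inside the ETL job.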