A company uses Amazon S3 and AWS Glue Data Catalog to manage a data lake that contains co…

Question

A company uses Amazon S3 and AWS Glue Data Catalog to manage a data lake that contains contact information for customers. The company uses PySpark and AWS Glue jobs with a DynamicFrame to run a workflow that processes data within the data lake. A data engineer notices that the workflow is generating errors as a result of how customer postal codes are stored in the data lake. Some postal codes include unnecessary numbers or invalid characters. The data engineer needs a solution to address the errors and correct the postal codes in the data lake.

Accepted Answer

Correct answer: A. Create a schema definition for PySpark that matches the format the processing workflow requires for postal codes. Pass the schema to the DynamicFrame during processing.

Option A is correct because a schema definition that matches the expected postal code format lets the workflow process the data without errors. The other options do not address formatting or cleaning the postal codes: B concerns job state sharing, C concerns predicate settings, and D concerns implementation options that do not resolve the formatting issue.
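To make "the format the processing workflow requires" concrete, here is a minimal sketch of a record-level postal code rule in plain Python. It is not part of the question or the accepted answer. It assumes US ZIP or ZIP+4 codes, a hypothetical column named `postal_code`, and a policy of truncating extra digits and nulling values that cannot be repaired. A real job would need to confirm each of those choices.

```python
import re

# Assumption: the workflow expects US ZIP (5 digits) or ZIP+4 (5 digits, hyphen, 4 digits).
ZIP5_LEN = 5
ZIP9_LEN = 9


def clean_postal_code(raw):
    """Return a normalized ZIP or ZIP+4 string, or None if the value cannot be repaired."""
    if raw is None:
        return None
    # Drop every non-digit character (letters, spaces, stray punctuation).
    digits = re.sub(r"\D", "", str(raw))
    if len(digits) >= ZIP9_LEN:
        # Keep the first nine digits as ZIP+4; extra trailing digits are discarded.
        return f"{digits[:5]}-{digits[5:9]}"
    if len(digits) >= ZIP5_LEN:
        # Keep the first five digits; an incomplete +4 suffix is discarded.
        return digits[:5]
    # Fewer than five digits: not recoverable under this assumed policy.
    return None


def fix_record(record):
    """Normalize the hypothetical 'postal_code' field of one dict-like record."""
    record["postal_code"] = clean_postal_code(record.get("postal_code"))
    return record


# In an AWS Glue job, a function like this could be applied per record, e.g.:
#   from awsglue.transforms import Map
#   cleaned = Map.apply(frame=dyf, f=fix_record)   # 'dyf' is an existing DynamicFrame
```

A per-record function keeps the format rule in one place, so the same rule can be checked locally before it runs inside the Glue job.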

AWS Certified Data Engineer – Associate (DEA-C01) — Question 224
