AWS Certified Data Engineer – Associate (DEA-C01) — Question 235
A company processes 500 GB of audience and advertising data daily. The data is stored as CSV files in Amazon S3, with schemas registered in the AWS Glue Data Catalog. The company needs to convert these files to Apache Parquet format and store them in an S3 bucket.
The workflow has three processes. The first two are long-running, need 15 GiB of memory capacity, and must run concurrently. The third, a correlation process, must begin only after the first two have completed.
Which solution will meet these requirements with the LEAST operational overhead?
Answer options
- A. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the workflow by using AWS Glue. Configure AWS Glue to begin the third process after the first two processes have finished.
- B. Use Amazon EMR to run each process in the workflow. Create an Amazon Simple Queue Service (Amazon SQS) queue to handle messages that indicate the completion of the first two processes. Configure an AWS Lambda function to process the SQS queue by running the third process.
- C. Use AWS Glue workflows to run the first two processes in parallel. Ensure that the third process starts after the first two processes have finished.
- D. Use AWS Step Functions to orchestrate a workflow that uses multiple AWS Lambda functions. Ensure that the third process starts after the first two processes have finished.
Correct answer: C
Explanation
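The correct answer is C. AWS Glue is serverless and reads schemas directly from the Glue Data Catalog, so converting CSV to Parquet needs no cluster or orchestration environment to manage. A Glue workflow can start the first two jobs from a single trigger so that they run in parallel, and a conditional trigger with AND logic starts the third job only after both have succeeded. Glue jobs also suit long-running work, and a standard G.1X worker provides 16 GB of memory. Option A would work, but Amazon MWAA adds an Airflow environment to provision and maintain on top of the same Glue jobs. Option B requires managing EMR clusters and wiring together SQS and Lambda just to express one dependency, which is the most operational overhead of the four. Option D fails the stated requirements: a Lambda function is limited to 10,240 MB of memory and a 15-minute timeout, so it cannot run a long-running process that needs 15 GiB.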
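To make option C concrete, here is a minimal sketch of the two triggers a Glue workflow needs for this fan-out and join pattern. It only builds the keyword arguments for the boto3 Glue `create_trigger` call. The workflow name, trigger names, and job names are hypothetical and not part of the question, and the three Glue jobs are assumed to exist already.

```python
def build_workflow_triggers(workflow, first_jobs, final_job):
    """Return create_trigger kwargs for a fan-out/join Glue workflow.

    All names passed in are illustrative assumptions; the Glue jobs
    themselves must already be defined.
    """
    # On-demand trigger: starts every job in first_jobs at the same time.
    start = {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "ON_DEMAND",
        "Actions": [{"JobName": job} for job in first_jobs],
    }
    # Conditional trigger: fires only when ALL watched jobs have SUCCEEDED.
    join = {
        "Name": f"{workflow}-join",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,  # activate the trigger as soon as it is created
        "Predicate": {
            "Logical": "AND",
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in first_jobs
            ],
        },
        "Actions": [{"JobName": final_job}],
    }
    return [start, join]


def apply_workflow(glue_client, workflow, triggers):
    """Create the workflow and its triggers with a boto3 Glue client."""
    glue_client.create_workflow(Name=workflow)
    for trigger in triggers:
        glue_client.create_trigger(**trigger)


if __name__ == "__main__":
    # Hypothetical job names, printed for inspection only (no AWS calls).
    for t in build_workflow_triggers(
        "ad-data", ["convert-audience", "convert-advertising"], "correlate"
    ):
        print(t["Name"], t["Type"], [a["JobName"] for a in t["Actions"]])
```

With real credentials, `apply_workflow(boto3.client("glue"), ...)` would register the workflow, and `start_workflow_run(Name=...)` on the same client would launch it. The `"Logical": "AND"` predicate is what enforces that the correlation job waits for both conversion jobs.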