AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 146
A company collects customer data every day. The company stores the data as compressed files in an Amazon S3 bucket that is partitioned by date. Every month, analysts download the data, process the data to check the data quality, and then upload the data to Amazon QuickSight dashboards.
An ML engineer needs to implement a solution to automatically check the data quality before the data is sent to QuickSight.
Which solution will meet these requirements with the LEAST operational overhead?
Answer options
- A. Run an AWS Glue crawler every month to update the AWS Glue Data Catalog. Use AWS Glue Data Quality rules to check the data quality.
- B. Use an AWS Glue trigger to run an AWS Glue crawler every month to update the AWS Glue Data Catalog. Create an AWS Glue job that loads the data into a PySpark DataFrame. Configure the job to apply custom functions and to evaluate the data quality.
- C. Run Python scripts on an AWS Lambda function every month to evaluate data quality. Configure the S3 bucket to invoke the Lambda function when objects are added to the S3 bucket.
- D. Configure the S3 bucket to send event notifications to an Amazon Simple Queue Service (Amazon SQS) queue when objects are uploaded. Use Amazon CloudWatch insights every month for the SQS queue to evaluate the data quality.
Correct answer: A
Explanation
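Option A is correct. A monthly AWS Glue crawler keeps the AWS Glue Data Catalog in sync with the date-partitioned data in S3, and AWS Glue Data Quality evaluates declarative rules against the cataloged table, so there is no custom validation code to write or maintain. Option B uses the same crawler but replaces the managed rules with custom PySpark functions in a Glue job, which the team would have to write, test, and maintain. Option C also depends on custom Python scripts, and its design is inconsistent: an S3 event trigger invokes the Lambda function for each new object rather than once a month. Option D does not evaluate data quality at all: the S3 event notifications in the SQS queue describe which objects were uploaded, not what they contain, and CloudWatch Logs Insights queries log data rather than the files in S3.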
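To make the "declarative rules" point concrete, here is a minimal sketch of what Option A could look like in practice. It builds a small ruleset in the Data Quality Definition Language (DQDL) and prepares the request for the boto3 Glue `create_data_quality_ruleset` call. The database name `customer_db`, the table name `daily_customer_data`, the ruleset name, and the columns `customer_id` and `email` are all hypothetical; the question does not name any of them. The rules themselves are only illustrative choices.

```python
def build_ruleset(rules):
    """Join individual DQDL rule strings into one ruleset document."""
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


# Illustrative rules only; the column names are assumptions, not from the question.
RULES = [
    "RowCount > 0",
    'IsComplete "customer_id"',
    'IsUnique "customer_id"',
    'Completeness "email" > 0.95',
]


def ruleset_request(name, database, table, rules):
    """Build the keyword arguments for glue.create_data_quality_ruleset."""
    return {
        "Name": name,
        "Ruleset": build_ruleset(rules),
        "TargetTable": {"DatabaseName": database, "TableName": table},
    }


def register_ruleset(request):
    """Create the ruleset in AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the rest of the sketch runs without boto3

    glue = boto3.client("glue")
    return glue.create_data_quality_ruleset(**request)


if __name__ == "__main__":
    # Hypothetical names for the cataloged table that the crawler maintains.
    request = ruleset_request(
        "customer-data-checks", "customer_db", "daily_customer_data", RULES
    )
    print(request["Ruleset"])
```

The rules are plain text, which is why this option carries so little overhead: changing a check means editing one line of the ruleset, not redeploying a PySpark job or a Lambda function. Running the evaluation each month would be a separate step, for example with `start_data_quality_ruleset_evaluation_run`, and is not shown here.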