AWS Certified Data Engineer – Associate (DEA-C01) — Question 198
A company builds a new data pipeline to process data for business intelligence reports. Users have noticed that data is missing from the reports.
A data engineer needs to add a data quality check for columns that contain null values and for referential integrity at a stage before the data is added to storage.
Which solution will meet these requirements with the LEAST operational overhead?
Answer options
- A. Use Amazon SageMaker Data Wrangler to create a Data Quality and Insights report.
- B. Use AWS Glue ETL jobs to perform a data quality evaluation transform on the data. Use an IsComplete rule on the requested columns. Use a ReferentialIntegrity rule for each join.
- C. Use AWS Glue ETL jobs to perform a SQL transform on the data to determine whether the requested columns contain null values. Use a second SQL transform to check referential integrity.
- D. Use Amazon SageMaker Data Wrangler and a custom Python transform to create custom rules to check for null values and referential integrity.
Correct answer: B
Explanation
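Option B is the best choice because AWS Glue Data Quality is built into AWS Glue ETL jobs. The Evaluate Data Quality transform runs rules written in Data Quality Definition Language (DQDL), and both IsComplete and ReferentialIntegrity are built-in rule types: IsComplete checks that a column contains no null values, and ReferentialIntegrity checks that values in the primary dataset also exist in a reference dataset. The checks run inside the pipeline before the data is written to storage, and there is no custom validation code to maintain. Option C can produce the same checks, but the data engineer must write and maintain custom SQL for every column and join. Options A and D add Amazon SageMaker Data Wrangler, a tool aimed at machine learning data preparation. The Data Quality and Insights report in option A profiles a dataset rather than enforcing rules inside the ETL pipeline, and option D requires custom Python rules, which adds more code to maintain.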
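As an illustration of option B, the following is a minimal sketch of how a DQDL ruleset for these two checks could be assembled. The helper `build_ruleset`, the column names, and the reference alias `customers` are hypothetical and are not part of the question. In an actual Glue job, the resulting string would be passed to the Evaluate Data Quality transform, with the reference dataset supplied as an additional data source under that alias.

```python
def build_ruleset(required_columns, foreign_keys):
    """Build a DQDL ruleset string (hypothetical helper, for illustration only).

    required_columns: names of columns that must not contain nulls.
    foreign_keys: (primary_column, reference_alias, reference_column) tuples,
                  one per join that needs a referential integrity check.
    """
    # One IsComplete rule per column that must be fully populated.
    rules = [f'IsComplete "{col}"' for col in required_columns]
    # One ReferentialIntegrity rule per join; "= 1.0" requires every value
    # in the primary column to exist in the reference column.
    rules += [
        f'ReferentialIntegrity "{col}" "{alias}.{ref_col}" = 1.0'
        for col, alias, ref_col in foreign_keys
    ]
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


# Assumed example schema: an orders dataset joined to a customers dataset.
ruleset = build_ruleset(
    required_columns=["order_id", "customer_id"],
    foreign_keys=[("customer_id", "customers", "id")],
)
print(ruleset)
```

Because the rules are declarative, adding a check for another column or join means adding one line to the ruleset rather than writing and testing a new SQL or Python transform.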