AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 209

An ML engineer needs to build a processing pipeline to identify and remove personally identifiable information (PII) from petabytes of unstructured data. The ML engineer will use the processed data to train ML models in Amazon SageMaker AI.

Which solution will meet these requirements?

Answer options

Correct answer: A

Explanation

Option A is correct because it uses the serverless, Apache Spark-based engine in AWS Glue, which is designed to process large datasets in a distributed fashion and includes a built-in capability for detecting and transforming PII. The other options may offer some PII detection functionality, but they do not provide the same combination of integration and scalability needed to process petabytes of unstructured data.
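
To make "detecting and transforming PII" concrete, here is a minimal sketch in plain Python of the kind of per-record redaction such a pipeline performs. It is an illustration only, not the AWS Glue API: the names PII_PATTERNS, redact_pii, and redact_records are hypothetical, and the two regular expressions are simplified assumptions that cover only email addresses and US Social Security number formats. In a Spark-based job, logic like this would run in parallel across partitions of the dataset rather than over an in-memory list.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# A managed PII detector recognizes many more entity types than these two.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each detected PII span with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def redact_records(records):
    """Apply redaction to every record; a Spark job would do this per partition."""
    return [redact_pii(record) for record in records]
```

For example, redact_pii("Contact jane.doe@example.com, SSN 123-45-6789.") returns "Contact [EMAIL], SSN [SSN].", and text without a matching pattern passes through unchanged.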