AWS Certified Machine Learning Engineer – Associate (MLA-C01) — Question 209
An ML engineer needs to build a processing pipeline to identify and remove personally identifiable information (PII) from petabytes of unstructured data. The ML engineer will use the processed data to train ML models in Amazon SageMaker AI.
Which solution will meet these requirements?
Answer options
- A. Use the Apache Spark-based serverless engine from AWS Glue interactive sessions. Use the Detect PII transform feature to identify and remove the PII data.
- B. Use AWS Glue Data Wrangler within Amazon SageMaker Canvas to detect and remove the PII.
- C. Use the Amazon SageMaker Clarify API to detect and mask the PII data.
- D. Use the DetectEntities API action in Amazon Comprehend to identify and remove the PII data.
Correct answer: A
Explanation
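Option A is correct because AWS Glue runs on a serverless Apache Spark engine that distributes work across many workers, which is what petabyte-scale processing requires, and its Detect PII transform can both identify PII and remove or redact it as part of the same job. The processed output can then be written to Amazon S3 for training in SageMaker AI. The other options fall short for specific reasons. Option B is wrong because Data Wrangler is a SageMaker feature, not an AWS Glue one, and it is geared toward interactive data preparation rather than petabyte-scale batch processing. Option C is wrong because SageMaker Clarify detects bias and explains model predictions; it does not detect PII. Option D is wrong because DetectEntities returns general named entities such as people, organizations, and dates, while the PII-specific action in Amazon Comprehend is DetectPiiEntities. In either case the API only reports entities in the text passed to it in each request, so removing the PII and scaling to petabytes would require custom orchestration code.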
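To make the gap in option D concrete, here is a minimal Python sketch of what "identify and remove" looks like when built on a per-document detection API. It assumes a client object exposing `detect_pii_entities(Text=..., LanguageCode=...)` with the response shape that boto3's Amazon Comprehend client returns (`Entities` with `Score`, `Type`, `BeginOffset`, `EndOffset`). The function name `redact_pii` and the `min_score` threshold are hypothetical, chosen for illustration only.

```python
def redact_pii(text, comprehend, min_score=0.5):
    """Replace each detected PII span with its entity type in brackets.

    `comprehend` is assumed to behave like boto3.client("comprehend").
    `min_score` is an illustrative confidence cutoff, not an AWS default.
    """
    response = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    entities = [e for e in response["Entities"] if e["Score"] >= min_score]

    # Replace from right to left so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[:e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"]:]
    return text


# Possible usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("comprehend")
#   print(redact_pii("Contact Jane at jane@example.com", client))
```

Every document needs its own request and its own redaction pass, so at petabyte scale the caller would also have to handle chunking, throttling, retries, and parallelism. The Glue approach in option A moves that work into a managed Spark job.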