AWS Certified Machine Learning – Specialty — Question 321
A global company receives and processes hundreds of documents daily. The documents are in printed .pdf format or .jpg format.
A machine learning (ML) specialist wants to build an automated document processing workflow to extract text from specific fields from the documents and to classify the documents. The ML specialist wants a solution that requires low maintenance.
Which solution will meet these requirements with the LEAST operational effort?
Answer options
- A. Use a PaddleOCR model in Amazon SageMaker to detect and extract the required text and fields. Use a SageMaker text classification model to classify the document.
- B. Use a PaddleOCR model in Amazon SageMaker to detect and extract the required text and fields. Use Amazon Comprehend to classify the document.
- C. Use Amazon Textract to detect and extract the required text and fields. Use Amazon Rekognition to classify the document.
- D. Use Amazon Textract to detect and extract the required text and fields. Use Amazon Comprehend to classify the document.
Correct answer: D
Explanation
Amazon Textract is a fully managed service that extracts text and structured data, such as form fields, from scanned documents, so there is no OCR model to host, scale, or patch. Options A and B instead require deploying and maintaining a custom PaddleOCR model on Amazon SageMaker, which adds operational overhead. Amazon Comprehend is a managed natural language processing service that can classify documents based on their text, whereas Amazon Rekognition (option C) is built for computer vision tasks on images and videos rather than text classification. Combining Textract and Comprehend therefore gives a fully managed, low-maintenance workflow that meets the requirements with the least operational effort.
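To make option D concrete, here is a minimal sketch of how the two services might be chained. It is an illustration under stated assumptions, not a reference implementation: the helper names `extract_fields_and_text` and `classify_text` are hypothetical, the clients are assumed to come from `boto3.client("textract")` and `boto3.client("comprehend")`, the field extraction assumes Textract's Queries feature is used, and the classification step assumes a Comprehend custom classifier has already been trained and deployed behind an endpoint.

```python
def extract_fields_and_text(textract, document_bytes, queries):
    """Ask Textract for specific fields and return (fields, full_text).

    `textract` is assumed to be a boto3 Textract client.
    `queries` maps a hypothetical alias (e.g. "invoice_number") to a
    natural-language question about the document.
    """
    response = textract.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["QUERIES"],
        QueriesConfig={
            "Queries": [
                {"Text": question, "Alias": alias}
                for alias, question in queries.items()
            ]
        },
    )
    blocks = response["Blocks"]
    blocks_by_id = {block["Id"]: block for block in blocks}

    # Each QUERY block points to its QUERY_RESULT block(s) via an ANSWER relationship.
    fields = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        alias = block["Query"].get("Alias", block["Query"]["Text"])
        for relationship in block.get("Relationships", []):
            if relationship["Type"] == "ANSWER":
                for answer_id in relationship["Ids"]:
                    fields[alias] = blocks_by_id[answer_id].get("Text", "")

    # LINE blocks carry the recognized text, which is what gets classified.
    full_text = "\n".join(b["Text"] for b in blocks if b["BlockType"] == "LINE")
    return fields, full_text


def classify_text(comprehend, text, endpoint_arn):
    """Return the highest-scoring class name from a custom classifier endpoint.

    `comprehend` is assumed to be a boto3 Comprehend client, and
    `endpoint_arn` the ARN of an already-deployed custom classifier endpoint.
    Assumes a multi-class classifier, which returns a "Classes" list.
    """
    response = comprehend.classify_document(Text=text, EndpointArn=endpoint_arn)
    best = max(response["Classes"], key=lambda c: c["Score"])
    return best["Name"]
```

In use, the bytes of a .jpg or single-page .pdf would be passed to `extract_fields_and_text`, and the returned text handed to `classify_text`. Multi-page PDFs would likely need Textract's asynchronous operations instead, which this sketch does not cover.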