AWS Certified Big Data – Specialty — Question 17
A new algorithm has been written in Python to identify SPAM e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage.
Which AWS service strategy is best for this use case?
Answer options
- A. Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
- B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
- C. Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
- D. Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.
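Correct answer: B
Explanation
The correct answer is B. Amazon EMR can run the existing Python algorithm as a Hadoop Streaming step, which distributes the script across the cluster as a mapper (and optionally a reducer) and reads the e-mails directly from Amazon S3, so capacity grows with the number of nodes rather than requiring the 5 PB dataset to be moved. Option A fails because ElastiCache is an in-memory cache that is impractical at 5 PB, and Amazon Machine Learning does not import externally built models. Option C would require indexing all 5 PB into Amazon Elasticsearch Service, which is a search engine and does not execute a custom Python algorithm over the text. Option D only schedules and orchestrates work: a Python job launched from AWS Data Pipeline is not parallelized across the dataset on its own.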
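To illustrate what a streaming program step might run, here is a minimal sketch of a Hadoop Streaming mapper in Python. It assumes each input line holds the free text of one e-mail, which the question does not specify. The `is_spam` function and the `SPAM_MARKERS` list are hypothetical placeholders for the real algorithm, which is not described in the question.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper sketch: classifies one e-mail per input line."""
import sys

# Hypothetical markers; the real algorithm from the question is unknown.
SPAM_MARKERS = ("free money", "act now", "winner")


def is_spam(text):
    """Placeholder classifier: flags text containing any marker phrase."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)


def map_lines(lines):
    """Yield 'label<TAB>1' records, the key/value line format used by streaming."""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        label = "spam" if is_spam(text) else "ham"
        yield f"{label}\t1"


if __name__ == "__main__":
    # Hadoop Streaming feeds input records on stdin and collects stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Each mapper instance processes only its own slice of the input, so the same script that worked on the 1 million e-mail sample can run unchanged on many nodes at once. A reducer that sums the counts per label would complete the job.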