AWS Certified Solutions Architect – Professional — Question 897
A Solutions Architect is designing the storage layer for a data warehousing application. The data files are large, but they have statically placed metadata at the beginning of each file that describes the size and placement of the file's index. The data files are read in by a fleet of Amazon EC2 instances that store the index size, index location, and other category information about the data file in a database. That database is used by Amazon EMR to group files together for deeper analysis.
What would be the MOST cost-effective, highly available storage solution for this workflow?
Answer options
- A. Store the data files in Amazon S3 and use Range GET for each file's metadata, then index the relevant data.
- B. Store the data files in Amazon EFS mounted by the EC2 fleet and EMR nodes.
- C. Store the data files on Amazon EBS volumes and allow the EC2 fleet and EMR to mount and unmount the volumes where they are needed.
- D. Store the content of the data files in Amazon DynamoDB tables with the metadata, index, and data as their own keys.
Correct answer: A
Explanation
Amazon S3 is the most cost-effective, highly available option for storing large files. A Range GET lets the EC2 instances read only the metadata at the beginning of each file, and then only the index bytes that metadata points to, without downloading the whole object. Amazon EFS costs more per GB than S3, and Amazon EBS volumes are tied to a single Availability Zone and are normally attached to one instance at a time, so sharing data across dynamic EC2 and EMR fleets adds cost and operational complexity. Amazon DynamoDB is unsuitable for large files: items are limited to 400 KB, and storing and reading bulk data through it is expensive.
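The Range GET approach in option A can be sketched as follows. This is a minimal illustration, not a definitive implementation: the 16-byte header layout (an 8-byte big-endian index offset followed by an 8-byte index size), the constant `HEADER_BYTES`, and the helper names `range_header`, `read_index_info`, and `read_index` are all assumptions made for the example, since the question does not specify the file format. The `s3_client` argument is expected to behave like a boto3 S3 client, whose `get_object` call accepts a `Range` parameter.

```python
import struct

# Hypothetical layout: the first 16 bytes of each data file hold the
# index offset and index size as two big-endian unsigned 64-bit integers.
HEADER_BYTES = 16


def range_header(start, length):
    """Build an HTTP Range value; both ends of a byte range are inclusive."""
    return f"bytes={start}-{start + length - 1}"


def read_index_info(s3_client, bucket, key):
    """Fetch only the fixed-position metadata at the start of the object."""
    resp = s3_client.get_object(
        Bucket=bucket, Key=key, Range=range_header(0, HEADER_BYTES)
    )
    header = resp["Body"].read()
    index_offset, index_size = struct.unpack(">QQ", header)
    return index_offset, index_size


def read_index(s3_client, bucket, key):
    """Fetch the index itself with a second ranged read."""
    offset, size = read_index_info(s3_client, bucket, key)
    resp = s3_client.get_object(
        Bucket=bucket, Key=key, Range=range_header(offset, size)
    )
    return resp["Body"].read()
```

With boto3 installed and credentials configured, the client would typically be created with `boto3.client("s3")` and passed in as `s3_client`. Each file then costs two small requests instead of a full download, and the EC2 fleet can write the returned offset and size to the database that EMR queries.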