AWS Certified Big Data – Specialty — Question 19
There are thousands of text files on Amazon S3. The total size of the files is 1 PB. The files contain retail order information for the past 2 years. A data engineer needs to run multiple interactive queries to manipulate the data. The data engineer has AWS access to spin up an Amazon EMR cluster. The data engineer needs to use an application on the cluster to process this data and return the results in an interactive time frame.
Which application on the cluster should the data engineer use?
Answer options
- A. Oozie
- B. Apache Pig with Tachyon
- C. Apache Hive
- D. Presto
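Correct answer: D
Explanation
Presto is a distributed SQL engine built for low-latency, interactive queries over large datasets. It is a supported application on Amazon EMR and can query data in Amazon S3 directly, typically through tables defined in a Hive metastore or the AWS Glue Data Catalog. That matches the requirement to run multiple ad hoc queries and get results in an interactive time frame. Apache Hive can query the same data, but it is oriented toward batch workloads and usually has higher per-query latency. Apache Pig is a batch data-flow scripting tool, and Tachyon (now Alluxio) is an in-memory storage layer, not a query engine. Oozie is a workflow scheduler and does not process queries itself.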
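
Whichever engine is chosen, it is installed on the cluster by naming it in the application list at launch. The following is a minimal sketch, not a recommended configuration: it only builds the keyword arguments for boto3's EMR `run_job_flow` call. The helper name `build_emr_request`, the cluster name, the release label, the instance types, and the node count are all assumptions for illustration; the two role names are the EMR default roles and must already exist in the account.

```python
def build_emr_request(application: str, core_nodes: int = 10) -> dict:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    `application` is an EMR application name such as "Presto" or "Hive".
    All sizing values below are illustrative assumptions, not tuned for 1 PB.
    """
    return {
        "Name": f"interactive-{application.lower()}",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.36.0",                  # assumed EMR release
        "Applications": [{"Name": application}],       # engine to install
        "Instances": {
            "InstanceGroups": [
                {
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",       # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "InstanceRole": "CORE",
                    "InstanceType": "r5.4xlarge",      # assumed instance type
                    "InstanceCount": core_nodes,
                },
            ],
            # Keep the cluster running so queries can be issued interactively.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",          # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",              # default EMR service role
    }


request = build_emr_request("Presto")
# To actually launch the cluster (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Setting `KeepJobFlowAliveWhenNoSteps` to `True` keeps the cluster up between queries, which suits interactive use; a batch-only cluster would normally terminate after its steps finish.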