AWS Certified Data Engineer – Associate (DEA-C01) — Question 250
A company is planning to migrate on-premises Apache Hadoop clusters to Amazon EMR. The company also needs to migrate a data catalog into a persistent storage solution.
The company currently stores the data catalog in an on-premises Apache Hive metastore on the Hadoop clusters. The company requires a serverless solution to migrate the data catalog.
Which solution will meet these requirements MOST cost-effectively?
Answer options
- A. Use AWS Database Migration Service (AWS DMS) to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3 to produce the data catalog.
- B. Configure a Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company's data catalog as an external data catalog.
- C. Configure an external Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the company's data catalog.
- D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company's data catalog.
Correct answer: B
Explanation
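Option B is correct because AWS Glue Data Catalog is a serverless, Hive-compatible metadata store that persists independently of any EMR cluster. Amazon EMR can be configured to use the Glue Data Catalog as the external metastore for Hive, so the migrated table definitions survive cluster termination, and there is no database server to provision or patch. Glue Data Catalog is billed on stored objects and requests rather than on running instances, which makes it the most cost-effective of the four options.
- Option A is incorrect because AWS DMS is built to replicate database tables, not catalog metadata. Copying the metastore's backing database into Amazon S3 produces files of metastore rows, and scanning those files does not reproduce the original table definitions.
- Option C is incorrect because an external metastore on Amazon Aurora MySQL is persistent but not serverless as described. It adds a database that the company must pay for and manage.
- Option D is incorrect because a metastore hosted on the EMR cluster itself is neither serverless nor persistent. By default it lives on the primary node and is lost when the cluster terminates.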
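To make option B concrete, the following is a minimal sketch of the cluster configuration that points Hive (and, optionally, Spark SQL) on Amazon EMR at the Glue Data Catalog. The `hive-site` and `spark-hive-site` classifications and the factory class name come from the Amazon EMR documentation; the helper name `glue_catalog_configurations` is hypothetical and exists only for illustration. The sketch only builds the configuration JSON and does not create a cluster.

```python
import json

# Client factory class that the Amazon EMR documentation specifies for
# using the AWS Glue Data Catalog as the Hive metastore.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_spark=True):
    """Build an EMR `Configurations` list that uses Glue as the metastore.

    Hypothetical helper for illustration; it returns plain Python data
    that can be serialized to JSON and supplied when creating a cluster.
    """
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    # Hive reads the setting from the hive-site classification.
    configs = [{"Classification": "hive-site", "Properties": dict(props)}]
    if include_spark:
        # Spark SQL reads the same setting from spark-hive-site.
        configs.append(
            {"Classification": "spark-hive-site", "Properties": dict(props)}
        )
    return configs


if __name__ == "__main__":
    # Print the JSON document that would be passed as the cluster configuration.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

The resulting JSON could be saved to a file and passed as the configurations argument when the cluster is created, for example through the console, the AWS CLI, or an SDK. Because the catalog lives in Glue rather than on the cluster, any later cluster launched with the same configuration sees the same tables.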