Google Cloud Professional Data Engineer — Question 247
Your data science team needs to perform interactive SQL queries on large datasets stored in Apache Parquet format within a Cloud Storage bucket. The team is familiar with Apache Hive and wants to leverage existing HiveQL queries. You need to provide an environment for the team to run their interactive HiveQL queries directly against the data in Cloud Storage. You want to keep operational overhead to a minimum. What should you do?
Answer options
- A. Install and configure an Apache Hadoop and Hive cluster manually on a group of Compute Engine instances.
- B. Load the Parquet data into a BigQuery native table and use the BigQuery Connector for Hive to run the queries.
- C. Configure BigQuery with an external table definition pointing to the Parquet files.
- D. Deploy a Dataproc cluster with Hive services enabled.
Correct answer: D
Explanation
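The correct answer is D. Dataproc is a managed Hadoop and Spark service whose standard images include Hive, and its preinstalled Cloud Storage connector lets Hive tables point at gs:// paths. The team can therefore run their existing HiveQL queries directly against the Parquet files without building or maintaining the cluster software themselves. Option A offers the same Hive compatibility, but the team would have to install, patch, and scale Hadoop and Hive on Compute Engine, which is exactly the operational overhead the question asks you to avoid. Option B copies the data into BigQuery storage, so the queries no longer run directly against Cloud Storage, and the connector still needs a Hive environment to run in. Option C does query the Parquet files in place with very little overhead, but BigQuery uses GoogleSQL rather than HiveQL, so the team's existing queries may need to be rewritten.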
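To make option D concrete, here is a minimal Python sketch of the two pieces involved: a HiveQL statement that defines an external table over Parquet files in Cloud Storage, and a `gcloud dataproc jobs submit hive` command that runs a query on an existing cluster. The bucket, table, column, cluster, and region names are hypothetical placeholders, not values from the question, and the helper functions are illustrative only.

```python
import shlex


def external_table_ddl(table, columns, gcs_location):
    """Build HiveQL for an external Parquet table backed by a gs:// path.

    `columns` is a list of (name, hive_type) pairs supplied by the caller.
    """
    if not gcs_location.startswith("gs://"):
        raise ValueError("expected a gs:// location")
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS PARQUET LOCATION '{gcs_location}'"
    )


def submit_hive_job_command(cluster, region, query):
    """Build the argument list for `gcloud dataproc jobs submit hive`."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "hive",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--execute={query}",
    ]


# Hypothetical names, chosen only for illustration.
ddl = external_table_ddl(
    "events",
    [("user_id", "STRING"), ("event_ts", "TIMESTAMP")],
    "gs://example-bucket/events/",
)
cmd = submit_hive_job_command("example-cluster", "us-central1", ddl)

# Print a shell-safe version of the command instead of executing it.
print(shlex.join(cmd))
```

Because the table is external, the Parquet files stay in the bucket and Hive reads them in place. For interactive sessions the team would more likely connect to HiveServer2 on the cluster (for example with Beeline) than submit one job per query; the job submission form is shown here only because it is easy to script.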