Google Cloud Professional Data Engineer — Question 247

Your data science team needs to perform interactive SQL queries on large datasets stored in Apache Parquet format within a Cloud Storage bucket. The team is familiar with Apache Hive and wants to leverage existing HiveQL queries. You need to provide an environment for the team to run their interactive HiveQL queries directly against the data in Cloud Storage. You want to keep operational overhead to a minimum. What should you do?

Answer options

Correct answer: D

Explanation

The correct answer is D. Dataproc is a managed service whose standard images include Hive and the Cloud Storage connector, so a Dataproc cluster lets the team run their existing HiveQL queries directly against the Parquet files in Cloud Storage (for example, through external tables with gs:// locations) without loading the data elsewhere, and with little operational overhead. Option A requires significant manual setup and ongoing management. Option B requires loading the data and adds the complexity of the BigQuery Connector. Option C is feasible, but it does not offer the same interactivity or compatibility with existing HiveQL queries as a Dataproc cluster.
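
To make "querying the data in place" concrete, here is a minimal sketch of how a Hive external table over Parquet files in Cloud Storage might be defined and submitted to a Dataproc cluster. The table name, columns, bucket path, cluster name, and region are all hypothetical placeholders that do not come from the question. The helper functions are illustrative only: they build strings and call no Google Cloud API. The sketch assumes the `gcloud dataproc jobs submit hive` command with its `--cluster`, `--region`, and `--execute` flags.

```python
import shlex


def external_parquet_table_ddl(table, columns, gcs_uri):
    """Build HiveQL DDL for an external table over Parquet files in Cloud Storage.

    `columns` is a list of (name, hive_type) pairs. All values are supplied by
    the caller; nothing here is taken from the exam question itself.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI")
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{gcs_uri}'"
    )


def hive_submit_command(cluster, region, query):
    """Return the argument list for submitting a HiveQL statement as a Dataproc job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "hive",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--execute={query}",
    ]


# Hypothetical names, for illustration only.
ddl = external_parquet_table_ddl(
    "events",
    [("user_id", "STRING"), ("ts", "TIMESTAMP")],
    "gs://example-bucket/events/",
)
cmd = hive_submit_command("example-cluster", "us-central1", ddl + ";")

print(ddl)
print(shlex.join(cmd))
```

Because the table is external, Hive reads the Parquet files where they already sit, and dropping the table leaves the files in the bucket untouched. Job submission is shown here because it is easy to script. For interactive sessions the team would more likely connect to HiveServer2 on the cluster with a client such as Beeline and reuse the same DDL.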