Google Cloud Professional Data Engineer — Question 233
You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?
Answer options
- A. Import the ORC files to Bigtable tables for the data scientist team.
- B. Import the ORC files to BigQuery tables for the data scientist team.
- C. Copy the ORC files on Cloud Storage, then deploy a Dataproc cluster for the data scientist team.
- D. Copy the ORC files on Cloud Storage, then create external BigQuery tables for the data scientist team.
Correct answer: D
Explanation
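The correct answer is D. BigQuery external tables can read ORC files in Cloud Storage directly and recognize a Hive-partitioned directory layout, so the data scientists can explore the data with SQL much as they did with Hive. The files stay in Cloud Storage, so there is no second copy of the data to pay for in BigQuery storage and no cluster to keep running. Option A does not fit because Bigtable is a NoSQL store built for low-latency key lookups, not for ad hoc SQL exploration. Option B would work functionally, but loading the files into native BigQuery tables adds a load step and BigQuery storage charges on top of the copy already held in Cloud Storage. Option C is the closest match to the on-premises setup, since Dataproc can run Hive itself, but the cluster adds compute cost and management overhead.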
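As a rough illustration of option D, the sketch below builds a BigQuery `CREATE EXTERNAL TABLE` statement for Hive-partitioned ORC files. The helper name `build_external_table_ddl`, the table ID, and the bucket path are hypothetical and chosen only for this example. It assumes the files sit under a single Cloud Storage prefix with `key=value` partition directories, and it relies on `WITH PARTITION COLUMNS` without a column list so that BigQuery detects the partition keys from the paths.

```python
def build_external_table_ddl(table_id: str, uri_prefix: str) -> str:
    """Return DDL for a BigQuery external table over Hive-partitioned ORC files.

    table_id   -- hypothetical fully qualified table name (project.dataset.table)
    uri_prefix -- hypothetical Cloud Storage prefix that sits directly above
                  the key=value partition directories
    """
    # Normalize the prefix so it never ends with a slash.
    prefix = uri_prefix.rstrip("/")
    return (
        f"CREATE EXTERNAL TABLE `{table_id}`\n"
        # No column list: partition keys are inferred from the directory names.
        "WITH PARTITION COLUMNS\n"
        "OPTIONS (\n"
        "  format = 'ORC',\n"
        f"  uris = ['{prefix}/*'],\n"
        f"  hive_partition_uri_prefix = '{prefix}'\n"
        ")"
    )


if __name__ == "__main__":
    # Example values only; replace them with a real table ID and bucket path.
    print(
        build_external_table_ddl(
            "my-project.analytics.events",
            "gs://my-bucket/warehouse/events",
        )
    )
```

The generated statement can be run in the BigQuery console or any BigQuery client. After that, the data scientists query the table with ordinary SQL, and filtering on the partition columns limits how many files each query has to read.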