Google Cloud Professional Data Engineer — Question 164
You need to modernize your existing on-premises data strategy. Your organization currently uses:
• Apache Hadoop clusters for processing multiple large data sets, including on-premises Hadoop Distributed File System (HDFS) for data replication.
• Apache Airflow to orchestrate hundreds of ETL pipelines with thousands of job steps.
You need to set up a new architecture in Google Cloud that can handle your Hadoop workloads and requires minimal changes to your existing orchestration processes. What should you do?
Answer options
- A. Use Bigtable for your large workloads, with connections to Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.
- B. Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Orchestrate your pipelines with Cloud Composer.
- C. Use Dataproc to migrate Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Convert your ETL pipelines to Dataflow.
- D. Use Dataproc to migrate your Hadoop clusters to Google Cloud, and Cloud Storage to handle any HDFS use cases. Use Cloud Data Fusion to visually design and deploy your ETL pipelines.
Correct answer: B
Explanation
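The correct answer is B. Dataproc is Google Cloud's managed service for Hadoop and Spark, so existing Hadoop jobs can run on it largely unchanged. The Cloud Storage connector lets those jobs read and write gs:// paths where they previously used hdfs:// paths, so Cloud Storage can take over the HDFS role as durable storage. Cloud Composer is managed Apache Airflow, so the existing DAGs can be moved over with minimal changes to the orchestration. Option A is wrong because Bigtable is a NoSQL database, not a platform for running Hadoop workloads. Options C and D use the right compute and storage services, but converting the pipelines to Dataflow (C) or rebuilding them in Cloud Data Fusion (D) would mean rewriting hundreds of ETL pipelines instead of reusing the existing Airflow orchestration.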
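As a rough illustration of why option B needs so few changes, the sketch below builds a Dataproc Hadoop job request whose arguments point at Cloud Storage instead of HDFS. The helper names (`hdfs_to_gcs`, `build_hadoop_job`), the project, cluster, bucket, and the paths are all hypothetical; only the shape of the job dictionary (`reference`, `placement`, `hadoop_job`) follows the Dataproc Job resource. It assumes the data has already been copied into the bucket under the same paths it had in HDFS.

```python
from urllib.parse import urlparse


def hdfs_to_gcs(uri: str, bucket: str) -> str:
    """Map an hdfs:// URI to a gs:// URI in `bucket`, keeping the path.

    Hypothetical helper: assumes the bucket mirrors the HDFS directory layout.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        return uri  # leave flags and non-HDFS arguments untouched
    return f"gs://{bucket}{parsed.path}"


def build_hadoop_job(project_id, cluster_name, jar_uri, args, bucket):
    """Build a Dataproc Hadoop job request as a plain dictionary."""
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "hadoop_job": {
            "main_jar_file_uri": jar_uri,
            # Only the storage URIs change; the job logic stays the same.
            "args": [hdfs_to_gcs(arg, bucket) for arg in args],
        },
    }


if __name__ == "__main__":
    # All names below are placeholders, not values from the question.
    job = build_hadoop_job(
        project_id="my-project",
        cluster_name="etl-cluster",
        jar_uri="gs://my-bucket/jars/etl.jar",
        args=["hdfs://namenode:8020/data/in", "hdfs:///data/out", "--verbose"],
        bucket="my-bucket",
    )
    print(job["hadoop_job"]["args"])
```

In a Cloud Composer environment, a dictionary like this could be passed as the `job` argument of the `DataprocSubmitJobOperator` from the Airflow Google provider package, so an existing DAG keeps its structure and only the task that launches the Hadoop job changes.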