Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache…

Question

Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?

Accepted Answer

Correct answer: A. A. Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless. — The correct answer is A because migrating data to Cloud Storage and using Dataproc Metastore allows for seamless integration with Spark while maintaining the existing Parquet format. This minimizes changes to the ETL process and leverages managed services effectively. Options B, C, and D either introduce unnecessary complexity or do not align with the requirement to minimize ETL data processing changes.

Google Cloud Professional Data Engineer — Question 151

Answer options

Correct answer: A

Explanation