Google Cloud Professional Data Engineer — Question 151

Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used on future transformation pipelines so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?

Answer options

Correct answer: A

Explanation

The correct answer is A because migrating data to Cloud Storage and using Dataproc Metastore allows for seamless integration with Spark while maintaining the existing Parquet format. This minimizes changes to the ETL process and leverages managed services effectively. Options B, C, and D either introduce unnecessary complexity or do not align with the requirement to minimize ETL data processing changes.