You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job…

Question

You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complex analytical workload with many shuffle operations, and its input data are Parquet files (200-400 MB each on average). You see some performance degradation after the migration to Dataproc and would like to optimize the job. Your organization is very cost-sensitive, so you want to keep running this workload on Dataproc with preemptible workers (and only 2 non-preemptible workers).
What should you do?

Accepted Answer

Correct answer: D. Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size. Option D is correct because a shuffle-heavy Spark job writes its intermediate shuffle data to the workers' local disks, which on Dataproc means the boot disk unless local SSDs are attached. Persistent disk throughput scales with disk size, so larger SSD boot disks can significantly improve performance for workloads with heavy disk I/O. Options A and B suggest changing file sizes or formats, which does not address the disk I/O bottleneck as directly. Option C might improve performance, but it adds cost and so conflicts with the cost-sensitive requirement.
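As a rough illustration of what option D could look like in practice, the sketch below assembles a `gcloud dataproc clusters create` command that uses SSD persistent disks as boot disks and overrides the boot disk size for the preemptible (secondary) workers. The cluster name, region, secondary worker count, and 500 GB disk size are hypothetical values chosen for the example, and the flag names are assumed to match your installed gcloud version, so check them against `gcloud dataproc clusters create --help` before use. The helper `dataproc_create_command` is not part of any library; it only builds the argument list and does not run anything.

```python
import shlex


def dataproc_create_command(cluster, region, secondary_workers, boot_disk_gb):
    """Build (but do not run) a gcloud command for an SSD-backed Dataproc cluster.

    All argument values are illustrative; flag names are assumed to match
    the installed gcloud release.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        # The 2 non-preemptible (primary) workers from the question.
        "--num-workers=2",
        "--worker-boot-disk-type=pd-ssd",
        f"--worker-boot-disk-size={boot_disk_gb}GB",
        # Preemptible secondary workers, with the boot disk overridden.
        f"--num-secondary-workers={secondary_workers}",
        "--secondary-worker-type=preemptible",
        "--secondary-worker-boot-disk-type=pd-ssd",
        f"--secondary-worker-boot-disk-size={boot_disk_gb}GB",
    ]


# Hypothetical values, for illustration only.
cmd = dataproc_create_command("analytics-cluster", "us-central1", 8, 500)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag visible and easy to adjust. The disk size to pick depends on how much shuffle data the job produces, which this sketch does not estimate.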

Google Cloud Professional Data Engineer — Question 14
