Google Cloud Professional Data Engineer — Question 47

Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud
Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google
BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?

Answer options

Correct answer: B

Explanation

Using pre-emptible virtual machines (VMs) for the cluster is the most cost-effective option because they are significantly cheaper than regular VMs and are suitable for workloads that can tolerate interruptions. Migrating to Google Cloud Dataflow (option A) may not be necessary and could complicate the workflow. Choosing higher-memory nodes (option C) may improve speed but at a higher cost, which is not optimal for cost reduction. Finally, using SSDs (option D) can enhance performance but does not directly address cost optimization as it involves additional expense.