You have an upstream process that writes data to Cloud Storage. This data is then read by…

Question

You have an upstream process that writes data to Cloud Storage. This data is then read by an Apache Spark job that runs on Dataproc. These jobs are run in the us-central1 region, but the data could be stored anywhere in the United States. You need to have a recovery process in place in case of a catastrophic single region failure. You need an approach with a maximum of 15 minutes of data loss (RPO=15 mins). You want to ensure that there is minimal latency when reading the data. What should you do?

Accepted Answer

Correct answer: D. D. 1. Create a dual-region Cloud Storage bucket in the us-central1 and us-south1 regions.
2. Enable turbo replication.
3. Run the Dataproc cluster in a zone in the us-central1 region, reading from the bucket in the same region.
4. In case of a regional failure, redeploy the Dataproc clusters to the us-south1 region and read from the same bucket. — Option D is correct because it uses a dual-region bucket with turbo replication, ensuring data is available in both regions while maintaining low latency by reading from the local bucket. Options A and C involve using a separate bucket for recovery, which could introduce latency issues, while option B relies on a multi-region bucket but does not optimize for the required RPO and recovery efficiency.

Google Cloud Professional Data Engineer — Question 222

Answer options

Correct answer: D

Explanation