Question

You are the Operations Lead for an ongoing incident with one of your services. The service usually runs at around 70% capacity. You notice that one node is returning 5xx errors for all requests. There has also been a noticeable increase in support cases from customers. You need to remove the offending node from the load balancer pool so that you can isolate and investigate the node. You want to follow Google-recommended practices to manage the incident and reduce the impact on users. What should you do?

Accepted Answer

Correct answer: A

1. Communicate your intent to the incident team.
2. Perform a load analysis to determine if the remaining nodes can handle the increase in traffic offloaded from the removed node, and scale appropriately.
3. When any new nodes report healthy, drain traffic from the unhealthy node, and remove the unhealthy node from service.

Option A is correct because it keeps the incident team informed before any change is made, checks whether the remaining nodes can absorb the extra traffic before the node is removed, and drains the unhealthy node only after any newly added nodes report healthy. The load analysis matters here because the service already runs at around 70% capacity, so removing a node pushes every remaining node closer to saturation. Options B and D introduce new nodes before addressing the unhealthy one, which may not be the best practice in an ongoing incident. Option C fails to perform a load analysis before taking action, which could lead to further issues if the remaining nodes cannot handle the traffic.
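The load analysis in step 2 can be sketched as simple arithmetic. The snippet below is only an illustration under stated assumptions: the question does not give the pool size, so the node count is hypothetical, as is the 80% target utilization, and traffic is assumed to be spread evenly across nodes. The function names are made up for this example and do not belong to any Google Cloud API.

```python
import math


def utilization_after_removal(nodes: int, utilization: float, removed: int = 1) -> float:
    """Per-node utilization once `removed` nodes stop serving.

    Assumes the load balancer spreads traffic evenly (an assumption,
    not something stated in the question).
    """
    remaining = nodes - removed
    if remaining <= 0:
        raise ValueError("at least one node must remain in the pool")
    return utilization * nodes / remaining


def extra_nodes_needed(nodes: int, utilization: float, target: float, removed: int = 1) -> int:
    """Healthy nodes to add so the pool stays at or below `target` utilization."""
    total_load = utilization * nodes           # load in units of one node's capacity
    required = math.ceil(total_load / target)  # nodes needed to stay under the target
    return max(0, required - (nodes - removed))


if __name__ == "__main__":
    # Hypothetical pool sizes; 0.70 is the utilization given in the question,
    # 0.80 is an assumed safe ceiling.
    for pool in (4, 10):
        after = utilization_after_removal(pool, 0.70)
        extra = extra_nodes_needed(pool, 0.70, target=0.80)
        print(f"{pool} nodes: {after:.0%} per node after removal, add {extra} node(s) first")
```

Under these assumptions, a small pool of 4 nodes would leave the 3 remaining nodes at roughly 93%, so a replacement should report healthy before the bad node is drained. A pool of 10 would settle near 78% and could absorb the removal without scaling.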

Google Cloud Professional Cloud DevOps Engineer — Question 139
