Google Cloud Professional Machine Learning Engineer — Question 271
You work for a large bank that serves customers through an application hosted in Google Cloud that is running in the US and Singapore. You have developed a PyTorch model to classify transactions as potentially fraudulent or not. The model is a three-layer perceptron that uses both numerical and categorical features as input, and hashing happens within the model.
You deployed the model to the us-central1 region on n1-highcpu-16 machines, and predictions are served in real time. The model's current median response latency is 40 ms. You want to reduce latency, especially in Singapore, where some customers are experiencing the longest delays. What should you do?
Answer options
- A. Attach an NVIDIA T4 GPU to the machines being used for online inference.
- B. Change the machines being used for online inference to n1-highcpu-32.
- C. Deploy the model to Vertex AI private endpoints in the us-central1 and asia-southeast1 regions, and allow the application to choose the appropriate endpoint.
- D. Create another Vertex AI endpoint in the asia-southeast1 region, and allow the application to choose the appropriate endpoint.
Correct answer: C
Explanation
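The correct answer is C. The model is served only from us-central1, so every request from Singapore has to travel to the US and back; deploying to asia-southeast1 as well lets Singapore traffic reach a nearby endpoint. Because the application already runs in Google Cloud, it can also use Vertex AI private endpoints, which Vertex AI documents as a low-latency, private connection for online prediction. Option A targets compute time, but a three-layer perceptron is small, so inference time is unlikely to be the bottleneck, and a GPU does nothing about geographic latency. Option B adds CPU capacity but likewise does not reduce the distance to users. Option D also places an endpoint in asia-southeast1, so it addresses the distance problem, but it uses a standard public endpoint and gives up the additional network-latency benefit of private endpoints, which makes it less optimal than C.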
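Both C and D rely on the application choosing "the appropriate endpoint". A minimal sketch of that selection logic follows, assuming the application knows which region it is running in. The project name, endpoint IDs, the fallback to us-central1, and the function name `choose_endpoint` are all hypothetical; only the two region names come from the question. The sketch covers routing only and does not call Vertex AI.

```python
# Hypothetical endpoint resource names, one per serving region.
# "my-project" and the numeric endpoint IDs are placeholders, not real values.
ENDPOINTS = {
    "us-central1": "projects/my-project/locations/us-central1/endpoints/1111111111",
    "asia-southeast1": "projects/my-project/locations/asia-southeast1/endpoints/2222222222",
}

# Assumption: unknown regions fall back to the original US deployment.
DEFAULT_REGION = "us-central1"


def choose_endpoint(app_region: str) -> str:
    """Return the endpoint resource name co-located with the application region.

    Falls back to DEFAULT_REGION when no endpoint exists in app_region.
    """
    return ENDPOINTS.get(app_region, ENDPOINTS[DEFAULT_REGION])


if __name__ == "__main__":
    # The Singapore application instance is routed to asia-southeast1.
    print(choose_endpoint("asia-southeast1"))
    # A region without its own endpoint falls back to us-central1.
    print(choose_endpoint("europe-west1"))
```

In a real deployment the returned resource name would be passed to whatever client sends the prediction request; with private endpoints, that client must run inside a network that has private connectivity to the endpoint.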