AWS Certified Generative AI – Professional (AIP-C01) — Question 25
A company is developing a generative AI (GenAI) application that analyzes customer service calls in real time and generates suggested responses for human customer service agents. The application must process 500,000 concurrent calls during peak hours with less than 200 ms of end-to-end latency for each suggestion. The company uses an existing architecture to transcribe customer call audio streams. The application must not exceed a predefined monthly compute budget and must maintain auto scaling capabilities.
Which solution will meet these requirements?
Answer options
- A. Deploy a large, complex reasoning model on Amazon Bedrock. Purchase provisioned throughput and optimize for batch processing.
- B. Deploy a low-latency, real-time optimized model on Amazon Bedrock. Purchase provisioned throughput and set up automatic scaling policies.
- C. Deploy a large language model (LLM) on an Amazon SageMaker AI real-time endpoint that uses dedicated GPU instances.
- D. Deploy a mid-sized language model on an Amazon SageMaker AI serverless endpoint that is optimized for batch processing.
Correct answer: B
Explanation
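The correct answer is B because it is the only option that addresses all three constraints: latency, budget, and auto scaling. A low-latency, real-time optimized model fits the sub-200 ms target. Provisioned throughput reserves dedicated model capacity at a fixed hourly rate, which helps keep spend predictable against the monthly budget. Automatic scaling policies satisfy the auto scaling requirement. Option A fails on latency: a large, complex reasoning model responds more slowly, and optimizing for batch processing works against real-time suggestions. Option C mentions neither auto scaling nor cost control, and dedicated GPU instances sized for 500,000 concurrent calls risk exceeding the compute budget. Option D fails because a serverless endpoint can incur cold-start delays, and the option is optimized for batch processing rather than real-time responses.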
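The elimination reasoning can be sketched as a small requirements check. This is an illustrative study aid only: the `Option` class, its field names, and the true/false values are assumptions that encode one reading of the four option texts. They are not an AWS API and not measured service behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One answer option, reduced to four yes/no properties (hypothetical model)."""
    label: str
    low_latency_model: bool   # model is described as low-latency / real-time optimized
    real_time_serving: bool   # serving path is not tuned for batch processing
    auto_scaling: bool        # option states scaling policies, or the service scales itself
    budgeted_capacity: bool   # capacity is bought up front (provisioned throughput)


# Assumed reading of the option texts in this question.
OPTIONS = [
    Option("A", low_latency_model=False, real_time_serving=False,
           auto_scaling=False, budgeted_capacity=True),
    Option("B", low_latency_model=True, real_time_serving=True,
           auto_scaling=True, budgeted_capacity=True),
    Option("C", low_latency_model=False, real_time_serving=True,
           auto_scaling=False, budgeted_capacity=False),
    Option("D", low_latency_model=False, real_time_serving=False,
           auto_scaling=True, budgeted_capacity=False),
]


def meets_requirements(option: Option) -> bool:
    """An option qualifies only if it satisfies every stated requirement."""
    return all((
        option.low_latency_model,
        option.real_time_serving,
        option.auto_scaling,
        option.budgeted_capacity,
    ))


def viable(options: list[Option]) -> list[str]:
    """Return the labels of the options that pass every check."""
    return [o.label for o in options if meets_requirements(o)]


print(viable(OPTIONS))
```

Under these assumed values, only B passes every check. Changing any single property of B to `False` removes it from the result, which mirrors how the question requires all constraints to hold at once.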