AWS Certified Machine Learning – Specialty — Question 221
A company is building a pipeline that periodically retrains its machine learning (ML) models by using new streaming data from devices. The company's data engineering team wants to build a data ingestion system that has high throughput, durable storage, and scalability. The company can tolerate up to 5 minutes of latency for data ingestion. The company needs a solution that can apply basic data transformation during the ingestion process.
Which solution will meet these requirements with the MOST operational efficiency?
Answer options
- A. Configure the devices to send streaming data to an Amazon Kinesis data stream. Configure an Amazon Kinesis Data Firehose delivery stream to automatically consume the Kinesis data stream, transform the data with an AWS Lambda function, and save the output into an Amazon S3 bucket.
- B. Configure the devices to send streaming data to an Amazon S3 bucket. Configure an AWS Lambda function that is invoked by S3 event notifications to transform the data and load the data into an Amazon Kinesis data stream. Configure an Amazon Kinesis Data Firehose delivery stream to automatically consume the Kinesis data stream and load the output back into the S3 bucket.
- C. Configure the devices to send streaming data to an Amazon S3 bucket. Configure an AWS Glue job that is invoked by S3 event notifications to read the data, transform the data, and load the output into a new S3 bucket.
- D. Configure the devices to send streaming data to an Amazon Kinesis Data Firehose delivery stream. Configure an AWS Glue job that connects to the delivery stream to transform the data and load the output into an Amazon S3 bucket.
Correct answer: A
Explanation
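Option A is the most operationally efficient. Amazon Kinesis Data Streams provides high-throughput, durable, scalable ingestion, and Amazon Kinesis Data Firehose is a fully managed consumer: it reads from the data stream without custom consumer code, invokes an AWS Lambda function to apply the basic transformation, and delivers the output to Amazon S3. Firehose buffers records before writing to S3, and that delivery delay fits within the 5-minute latency tolerance. Options B and C have devices write directly to Amazon S3, which is object storage rather than a streaming ingestion endpoint. Option B then adds an unnecessary loop (S3 to Lambda to Kinesis to Firehose and back to S3), and option C depends on an AWS Glue job being started by S3 event notifications, which cannot target Glue directly and would need an intermediary such as Lambda or Amazon EventBridge. Option D does not work as described: a Firehose delivery stream is a delivery mechanism, not a source that an AWS Glue job can connect to. Glue streaming jobs read from sources such as Kinesis data streams or Apache Kafka.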
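To make the transformation step in option A concrete, the following is a minimal sketch of a Lambda function that Firehose could invoke. It follows the Firehose data transformation contract: each incoming record carries a `recordId` and base64-encoded `data`, and each returned record must echo the `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The payload fields (`device_id`, `temperature`) and the `transform` helper are hypothetical, chosen only for illustration; the question does not specify the device data format.

```python
import base64
import json


def transform(payload: dict) -> dict:
    """Hypothetical basic transformation: normalize one device reading."""
    return {
        "device_id": str(payload["device_id"]),
        "temperature_c": round(float(payload["temperature"]), 2),
    }


def lambda_handler(event, context):
    """Firehose transformation handler: returns one result per input record."""
    output = []
    for record in event["records"]:
        try:
            # Firehose passes each record's payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Newline-delimit the JSON so objects in S3 are easy to parse.
            line = json.dumps(transform(payload)) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except (KeyError, ValueError, TypeError):
            # Malformed input: flag the record and return it unchanged.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Because Firehose handles batching, retries, and delivery to S3, the function only has to transform records, which is what keeps the operational overhead of option A low.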