A smart home automation company must efficiently ingest and process messages from various…

Question

A smart home automation company must efficiently ingest and process messages from various connected devices and sensors. The majority of these messages consist of a large number of small files. These messages are ingested using Amazon Kinesis Data Streams and sent to Amazon S3 using a Kinesis data stream consumer application. The Amazon S3 message data is then passed through a processing pipeline built on Amazon EMR running scheduled PySpark jobs.
The data platform team manages data processing and is concerned about the efficiency and cost of downstream data processing. They want to continue to use PySpark.
Which solution improves the efficiency of the data processing jobs and is well architected?

Accepted Answer

Correct answer: D. Set up AWS Glue Python jobs to merge the small data files in Amazon S3 into larger files and transform them to Apache Parquet format. Migrate the downstream PySpark jobs from Amazon EMR to AWS Glue.

D is correct because merging the small files into larger ones and converting them to Apache Parquet directly addresses the inefficiency: each job has far fewer S3 objects to list and open, and the columnar format lets it read only the columns it needs. Migrating the jobs to AWS Glue also lets the team keep using PySpark. Option A, while beneficial, does not specifically address the merging of small files. Option B does not provide a scalable solution for batch processing of large amounts of data. Option C shifts processing to Amazon Redshift, which abandons the existing PySpark jobs that the team wants to keep.
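
The merge-and-convert step described in answer D can be sketched in PySpark. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes the consumer application wrote newline-delimited JSON, and the 128 MiB target size and the function names `target_file_count` and `compact_to_parquet` are hypothetical. The caller passes in an existing SparkSession (in an AWS Glue Spark job this is typically taken from the GlueContext).

```python
def target_file_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Return how many output files to write so each is roughly target_file_bytes.

    The 128 MiB default is an assumed target, not a value from the question.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division; always write at least one file.
    return max(1, -(-total_bytes // target_file_bytes))


def compact_to_parquet(spark, source_path, dest_path, num_files):
    """Read many small JSON files and rewrite them as about num_files Parquet files.

    spark is an existing SparkSession; the paths are hypothetical S3 prefixes.
    """
    df = spark.read.json(source_path)      # assumes newline-delimited JSON input
    (df.repartition(num_files)             # shuffle rows into num_files partitions
       .write.mode("overwrite")            # replace any previous compacted output
       .parquet(dest_path))                # roughly one Parquet file per partition
```

For example, `compact_to_parquet(spark, "s3://example-bucket/raw/", "s3://example-bucket/compacted/", target_file_count(10 * 1024**3))` would repartition 10 GiB of input into 80 partitions. The bucket names are placeholders, and because Parquet output is usually smaller than the JSON input, the written files would likely come in under the target size.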

AWS Certified Data Analytics – Specialty — Question 48
