Question

A gaming company uses Amazon Kinesis Data Streams to collect clickstream data. The company uses Amazon Data Firehose delivery streams to store the data in JSON format in Amazon S3. Data scientists at the company use Amazon Athena to query the most recent data to obtain business insights. The company wants to reduce Athena costs but does not want to recreate the data pipeline. Which solution will meet these requirements with the LEAST management effort?

Accepted Answer

Correct answer: A. Change the Firehose output format to Apache Parquet. Provide a custom S3 object YYYYMMDD prefix expression and specify a large buffer size. For the existing data, create an AWS Glue extract, transform, and load (ETL) job. Configure the ETL job to combine small JSON files, convert the JSON files to large Parquet files, and add the YYYYMMDD prefix. Use the ALTER TABLE ADD PARTITION statement to reflect the partition on the existing Athena table.

Option A is the best choice because it changes only the Firehose output format and leaves the existing pipeline in place. Athena bills by the amount of data scanned, so compressed, columnar Parquet files reduce the bytes each query reads, and the YYYYMMDD partitions let queries on recent data skip older files. The large buffer size avoids producing many small objects, and the Glue ETL job is needed only to convert the data that already exists. The other options add complexity, such as running additional jobs or creating new clusters, which increases management effort and cost.
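The partition step in option A can be sketched as a small helper that builds the ALTER TABLE ADD PARTITION statements for a range of dates. This is only an illustration under stated assumptions: the table name `clickstream`, the bucket `example-clickstream-bucket`, and the partition column `dt` are hypothetical and do not come from the question, and the sketch assumes each day's objects sit directly under a YYYYMMDD prefix at the bucket root.

```python
from datetime import date, timedelta


def add_partition_sql(table: str, bucket: str, day: date, column: str = "dt") -> str:
    """Build one Athena DDL statement that registers a YYYYMMDD partition.

    The table, bucket, and column names are hypothetical placeholders.
    """
    key = day.strftime("%Y%m%d")  # matches the YYYYMMDD prefix used in S3
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({column} = '{key}') "
        f"LOCATION 's3://{bucket}/{key}/'"
    )


def backfill_statements(table: str, bucket: str, start: date, end: date) -> list[str]:
    """Return one statement per day from start to end, inclusive."""
    days = (end - start).days + 1
    return [
        add_partition_sql(table, bucket, start + timedelta(days=i))
        for i in range(days)
    ]


if __name__ == "__main__":
    # Print the statements for a three-day backfill window.
    for stmt in backfill_statements(
        "clickstream", "example-clickstream-bucket", date(2024, 1, 1), date(2024, 1, 3)
    ):
        print(stmt)
```

Each generated statement would then be run in Athena, for example from the query editor. IF NOT EXISTS keeps the statement safe to rerun if a partition has already been registered.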

AWS Certified Data Engineer – Associate (DEA-C01) — Question 155

Explanation