AWS Certified Machine Learning – Specialty — Question 306
A data engineer wants to perform exploratory data analysis (EDA) on a petabyte of data. The data engineer does not want to manage compute resources and wants to pay only for queries that are run. The data engineer must write the analysis by using Python from a Jupyter notebook.
Which solution will meet these requirements?
Answer options
- A. Use Apache Spark from within Amazon Athena.
- B. Use Apache Spark from within Amazon SageMaker.
- C. Use Apache Spark from within an Amazon EMR cluster.
- D. Use Apache Spark through an integration with Amazon Redshift.
Correct answer: A
Explanation
Amazon Athena for Apache Spark provides a serverless Spark environment. You write Python (PySpark) code in a Jupyter-compatible notebook, there is no cluster to provision or manage, and you are charged only for the compute consumed while your code runs. This satisfies all three requirements in the question. In contrast, an Amazon EMR cluster (option C) must be provisioned, sized, and managed by the user. The Amazon SageMaker and Amazon Redshift options (B and D) do not offer the same combination of serverless Spark, pay-per-use pricing, and a built-in notebook.
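The elimination reasoning above can be sketched as a small requirements check. This is only an illustrative sketch: the `Option` class, its field names, and the true/false flags are assumptions that paraphrase the explanation at exam level. They are not an authoritative feature comparison of the AWS services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One answer option with coarse, exam-level flags (assumed, simplified)."""
    label: str
    description: str
    no_compute_to_manage: bool   # requirement 1: no compute resources to manage
    pay_only_when_running: bool  # requirement 2: pay only for queries that run
    python_in_jupyter: bool      # requirement 3: Python from a Jupyter notebook


# Flags paraphrase the explanation; the first two columns decide the answer.
OPTIONS = [
    Option("A", "Apache Spark in Amazon Athena", True, True, True),
    Option("B", "Apache Spark in Amazon SageMaker", False, False, True),
    Option("C", "Apache Spark on an Amazon EMR cluster", False, False, True),
    Option("D", "Apache Spark with Amazon Redshift integration", False, False, True),
]


def meets_requirements(option: Option) -> bool:
    """Return True only if the option satisfies all three stated requirements."""
    return (
        option.no_compute_to_manage
        and option.pay_only_when_running
        and option.python_in_jupyter
    )


matching = [o.label for o in OPTIONS if meets_requirements(o)]
print(matching)  # ['A']
```

Under these assumed flags, only option A passes all three checks, which matches the stated correct answer.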