AWS Certified Machine Learning – Specialty — Question 65

A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing.
The Data Scientist has been given the following requirements for the cloud solution:
✑ Combine multiple data sources.
✑ Reuse existing PySpark logic.
✑ Run the solution on the existing schedule.
✑ Minimize the number of servers that will need to be managed.
Which architecture should the Data Scientist use to build this solution?

Answer options

Correct answer: B

Explanation

Option B is correct because it meets every requirement. AWS Glue ETL jobs run PySpark scripts, so the existing logic can be reused to combine the multiple data sources. Glue triggers can start the job on the existing schedule. Glue is serverless, so there are no servers or clusters to manage. Options A and C involve AWS Lambda, which is a poor fit on its own: a Lambda invocation is limited to 15 minutes and has no built-in Spark runtime, so it is not suited to running large PySpark workloads. Option D uses Kinesis Data Analytics, which is designed for streaming data rather than scheduled batch ETL.
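
As a rough illustration of the scheduling part of option B, the sketch below builds the arguments for a scheduled Glue trigger and passes them to the boto3 Glue client's create_trigger call. The job name "consolidate-sources", the daily 02:00 UTC schedule, and the helper function names are assumptions made for this example and do not come from the question. It also assumes the Glue job that holds the PySpark script already exists.

```python
from typing import Any, Dict


def build_scheduled_trigger(job_name: str, cron: str) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateTrigger API.

    job_name is a hypothetical, already-created Glue ETL job that runs
    the existing PySpark script. cron uses the six-field Glue format:
    minutes, hours, day-of-month, month, day-of-week, year.
    """
    return {
        "Name": f"{job_name}-schedule",
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,  # activate the trigger as soon as it is created
    }


def create_trigger(params: Dict[str, Any]) -> None:
    """Send the request to AWS. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("glue").create_trigger(**params)


if __name__ == "__main__":
    # Assumed schedule: every day at 02:00 UTC.
    params = build_scheduled_trigger("consolidate-sources", "0 2 * * ? *")
    print(params)
    # create_trigger(params)  # uncomment to create the trigger in a real account
```

Keeping the argument builder separate from the AWS call lets the schedule be checked locally before anything is created in an account.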