Question

A Data Scientist needs to migrate an existing on-premises ETL process to the cloud. The current process runs at regular time intervals and uses PySpark to combine and format multiple large data sources into a single consolidated output for downstream processing.
The Data Scientist has been given the following requirements for the cloud solution:
✑ Combine multiple data sources.
✑ Reuse existing PySpark logic.
✑ Run the solution on the existing schedule.
✑ Minimize the number of servers that will need to be managed.
Which architecture should the Data Scientist use to build this solution?

Accepted Answer

Correct answer: B. Write the raw data to Amazon S3. Create an AWS Glue ETL job to perform the ETL processing against the input data. Write the ETL job in PySpark to leverage the existing logic. Create a new AWS Glue trigger to trigger the ETL job based on the existing schedule. Configure the output target of the ETL job to write to a "processed" location in Amazon S3 that is accessible for downstream use.

Option B meets every requirement. The AWS Glue ETL job combines the multiple data sources, it is written in PySpark so the existing logic can be reused, a scheduled AWS Glue trigger runs it on the existing schedule, and AWS Glue is serverless, so there are no servers to manage. Options A and C rely on AWS Lambda, which may not handle large data volumes efficiently because of its limits on execution time and memory. Option D uses Amazon Kinesis Data Analytics, which is designed for streaming data rather than scheduled batch ETL.
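As an illustration of the scheduling part of option B, the following is a minimal sketch of how a scheduled AWS Glue trigger could be defined with boto3. The job name, the trigger name suffix, and the cron expression are hypothetical placeholders, not values from the question. The sketch assumes the Glue ETL job already exists and that AWS credentials are configured wherever `create_trigger` is called.

```python
# Hypothetical values: replace with the real Glue job name and schedule.
JOB_NAME = "consolidate-sources-etl"   # assumed name of an existing Glue ETL job
SCHEDULE = "cron(0 2 * * ? *)"         # assumed schedule: every day at 02:00 UTC


def build_trigger_request(job_name: str, schedule: str) -> dict:
    """Build the keyword arguments for the Glue CreateTrigger API call."""
    return {
        "Name": f"{job_name}-schedule",      # hypothetical naming convention
        "Type": "SCHEDULED",                 # time-based trigger
        "Schedule": schedule,                # Glue cron syntax, evaluated in UTC
        "Actions": [{"JobName": job_name}],  # job to start when the trigger fires
        "StartOnCreation": True,             # activate the trigger immediately
    }


def create_trigger(request: dict) -> dict:
    """Send the request to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    return boto3.client("glue").create_trigger(**request)


if __name__ == "__main__":
    # Print the request only; call create_trigger(...) to apply it in AWS.
    print(build_trigger_request(JOB_NAME, SCHEDULE))
```

Separating the request builder from the API call keeps the schedule definition easy to inspect before anything is created in the account. The cron expression would need to be adjusted to match the on-premises schedule.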

AWS Certified Machine Learning – Specialty — Question 65
