AWS Certified Data Engineer – Associate (DEA-C01) — Question 230
A data engineer is configuring an AWS Glue Apache Spark extract, transform, and load (ETL) job. The job contains a sort-merge join of two large and equally sized DataFrames.
The job is failing with the following error: No space left on device.
Which solution will resolve the error?
Answer options
- A. Use the AWS Glue Spark shuffle manager.
- B. Deploy an Amazon Elastic Block Store (Amazon EBS) volume for the job to use.
- C. Convert the sort-merge join in the job to be a broadcast join.
- D. Convert the DataFrames to DynamicFrames, and perform a DynamicFrame join in the job.
Correct answer: A
Explanation
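The correct answer is A. A sort-merge join shuffles both DataFrames, and each worker writes the shuffle files, plus any data spilled while sorting, to its local disk. With two large inputs, that intermediate data can fill the disk, which produces the "No space left on device" error. The AWS Glue Spark shuffle manager can write shuffle files to Amazon S3 instead of local disk, so the job is no longer limited by local disk capacity. Option B does not work because AWS Glue is serverless: the service manages the workers, and there is no supported way to attach an Amazon EBS volume to them. Option C is unsuitable because a broadcast join copies one entire DataFrame to every executor, and with two large, equally sized DataFrames neither side is small enough to broadcast without risking out-of-memory failures. Option D does not help because a DynamicFrame join still shuffles the data through the same local disks; it changes the data structure, not where the intermediate data is stored.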
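
As an illustration of option A, here is a minimal sketch of the job parameters that, to the best of our knowledge of the AWS Glue documentation, turn on the shuffle manager's Amazon S3 backend: `--write-shuffle-files-to-s3` and the Spark setting `spark.shuffle.glue.s3ShuffleBucket`. The helper name `shuffle_to_s3_arguments` and the bucket `s3://example-shuffle-bucket/prefix/` are hypothetical and are not part of the original question. Check the parameter names against the documentation for your Glue version before relying on them.

```python
from typing import Dict


def shuffle_to_s3_arguments(shuffle_bucket_uri: str) -> Dict[str, str]:
    """Build Glue job arguments that send Spark shuffle files to Amazon S3.

    shuffle_bucket_uri is a placeholder for a bucket/prefix you own.
    """
    if not shuffle_bucket_uri.startswith("s3://"):
        raise ValueError("shuffle_bucket_uri must be an s3:// URI")
    return {
        # Enables the Glue Spark shuffle manager's S3 backend.
        "--write-shuffle-files-to-s3": "true",
        # Tells the shuffle manager where to write the shuffle files.
        "--conf": f"spark.shuffle.glue.s3ShuffleBucket={shuffle_bucket_uri}",
    }


# Hypothetical bucket name, used only for illustration.
args = shuffle_to_s3_arguments("s3://example-shuffle-bucket/prefix/")
for key, value in args.items():
    print(key, value)
```

A dictionary like this could be entered as job parameters in the AWS Glue console or passed as `DefaultArguments` when creating the job through an SDK. The job's IAM role would also need read and write access to the shuffle bucket.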