A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes tha…

Question

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?

Accepted Answer

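Correct answer: D. Skew caused by more data being assigned to a subset of spark-partitions.

Skew occurs when a disproportionate amount of data is assigned to a few partitions, so the tasks that process those partitions take far longer than the rest. The metrics in the question match this pattern: the minimum and median durations are roughly equal, which means most tasks handled a similar amount of data, while the maximum is about 100 times the minimum, which means at least one task handled far more. Because a stage cannot finish until its slowest task completes, those few long-running tasks determine the duration of the stage and, in turn, of the job. The other options do not explain the observation as well: task queueing, spill, network latency, and credential validation would typically slow tasks broadly rather than produce an extreme gap between the median and the maximum.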
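To see how a single hot key produces this min/median/max pattern, here is a minimal sketch in plain Python. It is not Spark code: the `key % num_partitions` partitioner, the key distribution, and the helper names `partition_sizes` and `skew_summary` are all assumptions made for illustration, and it treats task duration as roughly proportional to the number of records in a partition.

```python
from collections import Counter
from statistics import median


def partition_sizes(keys, num_partitions):
    """Count records per partition under a toy partitioner (key % n).

    Assumption: this stands in for hash partitioning; it is not the
    hash function Spark itself uses.
    """
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_summary(sizes):
    """Return (min, median, max) of the per-partition record counts."""
    return min(sizes), median(sizes), max(sizes)


# Hypothetical data set: 8 keys with 100 records each ...
num_partitions = 8
keys = [k for k in range(8) for _ in range(100)]
# ... plus one hot key (3) that receives 9,900 additional records.
keys += [3] * 9_900

sizes = partition_sizes(keys, num_partitions)
smallest, middle, largest = skew_summary(sizes)

# Minimum and median match, while the maximum is 100 times the minimum.
print(smallest, middle, largest, largest / smallest)
```

Seven of the eight partitions hold 100 records and one holds 10,000, so the minimum and median are equal while the maximum is 100 times larger. This is the same shape as the task durations described in the question.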

Databricks Certified Data Engineer Professional — Question 230
