Databricks Certified Data Engineer Professional — Question 152
A Spark job is taking longer than expected. In the Spark UI, a data engineer reviews the Min, Median, and Max task durations for a particular stage. The minimum and median durations are roughly the same, but the maximum duration is roughly 100 times the minimum.
Which situation is causing increased duration of the overall job?
Answer options
- A. Task queueing resulting from improper thread pool assignment.
- B. Spill resulting from attached volume storage being too small.
- C. Network latency due to some cluster nodes being in different regions from the source data.
- D. Skew caused by more data being assigned to a subset of Spark partitions.
Correct answer: D
Explanation
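The correct answer is D. When the minimum and median task durations are close, most tasks in the stage process a similar amount of data. A maximum roughly 100 times larger means a small number of tasks are processing far more data than the rest, which is the signature of data skew: a subset of Spark partitions holds a disproportionate share of the records. Because a stage completes only when its slowest task finishes, those few oversized partitions determine the duration of the stage and therefore of the job. The other options can cause delays, but they do not explain this pattern as well: task queueing, spill from undersized storage, and cross-region network latency would be expected to slow many tasks rather than only a few outliers.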
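As a rough illustration of how a single hot key can produce the duration pattern described in the question, the sketch below counts rows per partition for a made-up dataset. It uses plain Python rather than Spark, and it assigns keys to partitions with a simple modulo, which is a simplification of Spark's actual hash partitioner. The key values, row counts, and helper names (`partition_sizes`, `summarize`) are hypothetical. Row count stands in for task duration, on the assumption that task time grows roughly in proportion to the rows processed.

```python
from collections import Counter
from statistics import median


def partition_sizes(keys, num_partitions):
    """Count rows per partition, assigning each key by key % num_partitions.

    This is a simplified stand-in for hash partitioning, not Spark's real
    partitioner.
    """
    counts = Counter(k % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def summarize(sizes):
    """Return the min, median, and max of the per-partition row counts."""
    return {"min": min(sizes), "median": median(sizes), "max": max(sizes)}


# Hypothetical data: keys 0..7 with 100 rows each, plus 9,900 extra rows
# for the hot key 3, so key 3 has 10,000 rows in total.
keys = [k for k in range(8) for _ in range(100)] + [3] * 9_900

sizes = partition_sizes(keys, num_partitions=8)
stats = summarize(sizes)
skew_ratio = stats["max"] / stats["min"]

print(sizes)
print(stats, skew_ratio)
```

Here the minimum and median partition sizes are equal while the largest partition holds 100 times as many rows, mirroring the Min, Median, and Max durations in the question.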