Databricks Certified Data Engineer Associate — Question 123
A data engineer has inherited a Databricks pipeline from a previous team. The pipeline is missing SLAs and costs more than the allotted budget. Analysis shows that the cluster is not being fully utilized and that the dataset is skewed.
How should the data engineer resolve this issue?
Answer options
- A. Use coalesce() on the dataset to merge partitions and reduce skew.
- B. Increase the number of executors for the job.
- C. Repartition the dataset to have it be more optimally spread across all nodes.
- D. Increase the executor memory for the job.
Correct answer: C
Explanation
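The correct answer is C. Repartitioning performs a full shuffle that redistributes rows across partitions, so the work is spread more evenly across the nodes. That addresses both symptoms in the question: the skew itself and the idle capacity it causes. Option A is wrong because coalesce() only reduces the number of partitions by merging existing ones without a full shuffle; it cannot split an oversized partition, so the skew remains or gets worse. Option B is wrong because adding executors raises cost while the job still waits on the few tasks that process the oversized partitions, leaving the new executors mostly idle. Option D is wrong because more executor memory may help the overloaded tasks avoid spilling, but it does not rebalance the data and also increases cost.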
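The difference between options A and C can be illustrated with a toy model in plain Python. This is only a sketch and not Spark code: the `coalesce` and `repartition` functions below are simplified stand-ins written for this example, and the 100-row dataset is hypothetical. In PySpark the corresponding calls are `DataFrame.coalesce(numPartitions)` and `DataFrame.repartition(numPartitions)`, whose real placement logic is more involved than what is modeled here.

```python
# Toy model of partitioning in plain Python (no Spark required).
# Each inner list stands in for the rows held by one partition.

def coalesce(partitions, n):
    """Merge whole partitions into n groups without splitting any of them.
    Loosely mimics coalesce(), which avoids a full shuffle."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    """Redistribute every row round-robin across n partitions.
    Loosely mimics the full shuffle done by repartition(n)."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

def sizes(partitions):
    """Return the row count of each partition."""
    return [len(p) for p in partitions]

# Hypothetical skewed dataset: one partition holds 97 of the 100 rows.
skewed = [list(range(97)), [97], [98], [99]]

coalesced_sizes = sizes(coalesce(skewed, 2))
repartitioned_sizes = sizes(repartition(skewed, 4))

print(sizes(skewed))         # [97, 1, 1, 1]
print(coalesced_sizes)       # [98, 2]
print(repartitioned_sizes)   # [25, 25, 25, 25]
```

In this model, merging partitions leaves one task with almost all of the work, while a full redistribution gives each of the four tasks an equal share. Note that in real Spark, repartitioning by a column whose values are themselves skewed can reproduce the imbalance, so the choice of partition count or partitioning column still matters.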