Question

A data engineer has inherited a Databricks pipeline from a previous team. The pipeline is missing SLAs and costs more than the allotted budget. On analysis, it is noted that the cluster is not being fully utilized, and the dataset is getting skewed. How should the data engineer resolve this issue?

Accepted Answer

Correct answer: C. Repartition the dataset to have it be more optimally spread across all nodes.

Repartitioning performs a full shuffle that redistributes rows more evenly across partitions, so each node receives a comparable share of the work. This relieves the skew and puts the idle part of the cluster to use, which addresses both the missed SLAs and the wasted spend. Option A, coalesce(), only merges existing partitions to reduce their number; it does not rebalance skewed data. Option B, increasing the number of executors, adds capacity to a cluster that is already underutilized, so it raises cost without changing how the data is distributed. Option D, increasing executor memory, likewise does not address the imbalance in the dataset.
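To make the difference between options A and C concrete, here is a minimal plain-Python sketch that models partitions as lists of rows. It is not Spark code: `round_robin_repartition` and `coalesce_partitions` are hypothetical helpers that only approximate what `df.repartition(n)` and `df.coalesce(n)` do in PySpark, and the partition sizes are made-up illustration data.

```python
from typing import List

Partition = List[int]


def partition_sizes(partitions: List[Partition]) -> List[int]:
    """Return the row count of each partition."""
    return [len(p) for p in partitions]


def round_robin_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Simplified model of a full shuffle: deal every row out across n partitions."""
    out: List[Partition] = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for row in part:
            out[i % n].append(row)
            i += 1
    return out


def coalesce_partitions(partitions: List[Partition], n: int) -> List[Partition]:
    """Simplified model of coalesce: merge whole partitions, never split them."""
    out: List[Partition] = [[] for _ in range(n)]
    for idx, part in enumerate(partitions):
        out[idx % n].extend(part)
    return out


# Hypothetical skewed dataset: one partition holds 90% of the 1000 rows.
skewed = [list(range(0, 900)), list(range(900, 950)), list(range(950, 1000)), []]

print(partition_sizes(skewed))                              # [900, 50, 50, 0]
print(partition_sizes(round_robin_repartition(skewed, 4)))  # [250, 250, 250, 250]
print(partition_sizes(coalesce_partitions(skewed, 2)))      # [950, 50]
```

The largest partition determines how long a stage takes, since the task processing it finishes last while the other tasks sit idle. In this model, repartitioning cuts the largest partition from 900 rows to 250, while coalescing leaves the oversized partition intact and can even make it larger.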

Databricks Certified Data Engineer Associate — Question 123
