Databricks Certified Data Engineer Associate — Question 123

A data engineer has inherited a Databricks pipeline from a previous team. The pipeline is missing its SLAs and costs more than the allotted budget. Analysis shows that the cluster is not fully utilized and that the data is skewed across partitions.

How should the data engineer resolve this issue?

Answer options

Correct answer: C

Explanation

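The correct answer is C. Repartitioning the dataset triggers a full shuffle that redistributes rows more evenly across partitions, and therefore across the cluster's executors. This relieves the skew, puts the idle parts of the cluster to work, and shortens the run time, which addresses both the missed SLAs and the cost overrun. Option A, coalesce(), only reduces the number of partitions by merging existing ones without a full shuffle, so it does not rebalance skewed data and can make the imbalance worse. Option B, increasing the number of executors, adds cost to a cluster that is already underutilized and leaves the data distribution unchanged. Option D, increasing executor memory, likewise raises cost without addressing the imbalance in the dataset.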
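
The difference between merging partitions and redistributing rows can be illustrated with a small pure-Python simulation. This is a sketch only, not Spark code: the functions coalesce, repartition, and sizes below are hypothetical stand-ins written for this example, and the partition sizes are made-up numbers. In Spark the corresponding calls are DataFrame.coalesce(n) and DataFrame.repartition(n); real coalesce groups partitions by locality rather than by index, so the merge rule here is a simplification.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole partitions into n groups; rows are never split up.

    Simplified stand-in for a no-shuffle merge (assumption: partitions
    are grouped by index, which is not how Spark chooses groups).
    """
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every row round-robin across n partitions (full shuffle)."""
    out: List[Partition] = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def sizes(partitions: List[Partition]) -> List[int]:
    """Return the row count of each partition."""
    return [len(p) for p in partitions]


# Hypothetical skewed dataset: one partition holds 970 of the 1000 rows.
skewed = [list(range(970)), list(range(10)), list(range(10)), list(range(10))]

print(sizes(skewed))                  # [970, 10, 10, 10]
print(sizes(coalesce(skewed, 2)))     # [980, 20]
print(sizes(repartition(skewed, 4)))  # [250, 250, 250, 250]
```

In this toy model the merge leaves one partition with almost all of the rows, so a single task would still dominate the run time, while the round-robin redistribution gives every partition the same amount of work. The trade-off in a real pipeline is that the full shuffle has its own cost, which is why repartitioning is worthwhile mainly when skew is the bottleneck, as in this question.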