Databricks Certified Data Engineer Professional — Question 18

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.
Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Answer options

Correct answer: D

Explanation

Option D is correct. Spark evaluates lazily: most transformations only add to the logical query plan, and no work runs until an action, such as the display() call, triggers a job. Because results from an earlier run may be cached, repeating the same cell interactively does not give a meaningful average execution time. Options A and C misread the testing methodology by suggesting that Scala or a local build is necessary, which is not the case. Option B suggests using production-sized resources in notebooks, which may still not reflect production performance accurately. Option E's recommendation about Photon is irrelevant to measuring execution time accurately.
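
The reasoning behind D can be illustrated with a small toy model. This is a sketch in plain Python, not Spark or any Databricks API: the names LazyPlan, ToyEngine, and rows_processed are hypothetical and exist only for this illustration. It assumes a simplified engine in which transformations merely record steps, a display-like action runs them, and a cache returns earlier results. Real Spark and Databricks caching is more nuanced, but the effect on repeated timings is similar in spirit.

```python
class LazyPlan:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Spark)."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def transform(self, fn):
        # Like a transformation: only records the step, performs no work.
        return LazyPlan(self.source, self.steps + (fn,))


class ToyEngine:
    """Toy engine that executes a plan on demand and caches the result."""

    def __init__(self):
        self.cache = {}
        self.rows_processed = 0  # crude proxy for execution cost

    def display(self, plan):
        # Like an action: this is the only place where work happens.
        key = (id(plan.source), plan.steps)
        if key in self.cache:
            # Repeated call: served from cache, no work is done.
            return self.cache[key]
        rows = list(plan.source)
        for fn in plan.steps:
            rows = [fn(r) for r in rows]
            self.rows_processed += len(rows)
        self.cache[key] = rows
        return rows


engine = ToyEngine()
source = list(range(1000))

# Building the plan costs nothing; only the plan grows.
plan = LazyPlan(source).transform(lambda x: x + 1).transform(lambda x: x * 2)
work_after_building = engine.rows_processed

# The first display() pays for every recorded step.
first = engine.display(plan)
work_after_first = engine.rows_processed

# The second display() of the same logic adds no work at all.
second = engine.display(plan)
work_after_second = engine.rows_processed
```

In this toy model the second call costs nothing, so averaging it with the first would understate the true cost. That is the same reason repeated interactive runs of one cell say little about production performance.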