A user new to Databricks is trying to troubleshoot long execution times for some pipeline…

Question

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively. Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Accepted Answer

Correct answer: D. Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

Option D is correct because Spark evaluates transformations lazily: most transformations only extend the logical query plan, and no work is done until an action such as display() triggers a job. The time observed for a cell therefore reflects whatever plan that action executes, not the cost of the single transformation added in that cell. Re-running the same cell interactively compounds the problem, because caching can make repeat runs faster than a first run, so their average does not represent how the code will perform in production. Options A, B, and C propose adjustments that do not measure the code under conditions resembling production, because they rely on incorrect assumptions about the execution context or environment.
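The reasoning behind option D can be illustrated with a small pure-Python toy model. This is only a sketch under stated assumptions: `LazyFrame`, `transform`, and the `engine` counters are hypothetical names invented for this example and are not part of the Spark or Databricks API, and real Spark caching is considerably more nuanced. The sketch counts work deterministically rather than timing it, to show that transformations do no work, that the action does all of it, and that a repeated action does almost none.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Spark)."""

    def __init__(self, source, plan=(), engine=None):
        self.source = source
        self.plan = plan  # tuple of pending transformations
        # Counters and cache shared by every frame derived from the same root.
        self.engine = engine if engine is not None else {
            "jobs": 0, "rows_computed": 0, "cache": {}}

    def transform(self, fn):
        # Transformation: only extends the plan; nothing is computed here.
        return LazyFrame(self.source, self.plan + (fn,), self.engine)

    def display(self):
        # Action: triggers a "job" that executes the whole accumulated plan.
        self.engine["jobs"] += 1
        if self.plan not in self.engine["cache"]:
            rows = list(self.source)
            for fn in self.plan:
                rows = [fn(row) for row in rows]
                self.engine["rows_computed"] += len(rows)
            self.engine["cache"][self.plan] = rows
        # A repeated action is served from the cache with no recomputation.
        return self.engine["cache"][self.plan]


def add_one(x):
    return x + 1


def double(x):
    return x * 2


df = LazyFrame(range(5))
step = df.transform(add_one).transform(double)
work_after_transforms = df.engine["rows_computed"]  # plan built, no work yet

first_result = step.display()
work_after_first = df.engine["rows_computed"]       # the action did all the work

second_result = step.display()
work_after_second = df.engine["rows_computed"]      # unchanged on the repeat run

print(work_after_transforms, work_after_first, work_after_second)
```

In this toy model, timing the cells that only call `transform` would measure nothing, and averaging repeated `display()` calls would mostly measure cache hits, which is the distortion the accepted answer describes.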

Databricks Certified Data Engineer Professional — Question 179
