Databricks Certified Data Engineer Professional — Question 140

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly nested structure of the data and the large number of fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Answer options

Correct answer: D

Explanation

The correct answer is D. When Databricks infers a schema, it chooses types permissive enough to process all of the data it observes, so a malformed value can widen a field's type instead of raising an error. Declaring types manually therefore gives stronger assurance of data quality enforcement than relying on inferred types alone. Options A and C are incorrect: A claims that string types are always the most efficient, and C claims that inferred types always match those used by downstream systems. Option B is right that Delta Lake stores data as Parquet, but that fact does not address the data quality benefit of manually defined types.
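
The difference between inferred and declared types can be illustrated with a small plain-Python sketch. This is not Spark's or Databricks' actual inference logic. The sample records, the `infer_type` rule, and the `DECLARED` schema are all hypothetical and exist only to show how a permissive inferred type can hide a bad value that a declared type would expose.

```python
# Hypothetical records: the second one carries a malformed temperature reading.
records = [
    {"device_id": 1, "temperature": 21.5},
    {"device_id": 2, "temperature": "n/a"},
]


def is_long(v):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(v, int) and not isinstance(v, bool)


def is_double(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def infer_type(values):
    """Toy inference rule: pick the narrowest type that fits every value."""
    if all(is_long(v) for v in values):
        return "long"
    if all(is_double(v) for v in values):
        return "double"
    return "string"  # permissive fallback that accepts anything


def infer_schema(rows):
    fields = sorted({key for row in rows for key in row})
    return {f: infer_type([row[f] for row in rows if f in row]) for f in fields}


# Assumed manual declaration written by the engineer.
DECLARED = {"device_id": "long", "temperature": "double"}
CHECKS = {"long": is_long, "double": is_double, "string": lambda v: isinstance(v, str)}


def violations(rows, schema):
    """Return (row index, field) pairs whose value breaks the declared type."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, type_name in schema.items()
        if field in row and not CHECKS[type_name](row[field])
    ]


inferred = infer_schema(records)     # temperature is widened to "string"
bad = violations(records, DECLARED)  # the declared schema flags the bad record
print(inferred)
print(bad)
```

In this sketch, inference quietly widens `temperature` to a string so that every record can be processed, while the declared schema flags the second record. In a real pipeline the declaration would more likely be an explicit schema passed to the reader, covering at least the fields the downstream applications depend on.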