Databricks Certified Data Engineer Professional — Question 153
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter:
longitude < 20 & longitude > -20
Which statement describes how data will be filtered?
Answer options
- A. Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
- B. No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
- C. The Delta Engine will scan the Parquet file footers to identify each row that meets the filter criteria.
- D. Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
Correct answer: D
Explanation
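The correct answer is D. For each data file, Delta Lake records statistics in the Delta Log, including the minimum and maximum value of each indexed column (by default the first 32 columns of the table, which covers longitude here). When the query filters on longitude, the engine compares the filter range against each file's minimum and maximum and skips any file whose range cannot contain a matching record. Option A is incorrect because the filter is on longitude, not on the partition column date, so no partitions can be pruned; the statistics apply to individual data files. Option B is incorrect because file skipping relies on these per-file column statistics, not on any relationship between longitude and the partition column. Option C is incorrect because the skipping decision uses statistics stored in the Delta Log rather than the Parquet file footers, and footers hold summary statistics, not a way to identify individual matching rows.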
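The skipping logic described in the explanation can be illustrated with a small plain-Python simulation. This is only a sketch of the idea, not Delta Lake's actual implementation: the `FileStats` class, the `may_contain_match` helper, the file paths, and the minimum/maximum values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for the per-file statistics kept in the Delta Log."""
    path: str
    min_longitude: float
    max_longitude: float


def may_contain_match(f: FileStats, low: float, high: float) -> bool:
    # A file can hold a row with low < longitude < high only if its
    # [min, max] range overlaps the open interval (low, high).
    return f.min_longitude < high and f.max_longitude > low


# Made-up statistics for four data files spread over two date partitions.
files = [
    FileStats("date=2024-01-01/part-0.parquet", -120.0, -75.0),
    FileStats("date=2024-01-01/part-1.parquet", -30.0, 10.0),
    FileStats("date=2024-01-02/part-0.parquet", 5.0, 15.0),
    FileStats("date=2024-01-02/part-1.parquet", 25.0, 140.0),
]

# Filter from the question: longitude < 20 and longitude > -20.
to_scan = [f.path for f in files if may_contain_match(f, -20.0, 20.0)]
print(to_scan)
```

In this sketch both date partitions still contribute a file to the scan, which is why the partitioning by date does not help with this filter. The skipping happens per data file: one file in each partition is ruled out by its longitude range alone. A file that passes the check only might contain matching rows, so it still has to be read and filtered.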