Databricks Certified Data Engineer Associate — Question 136
A data engineer needs to optimize the data layout and query performance of an e-commerce transactions Delta table. The table is partitioned by "purchase_date", a date column, which helps time-based queries but does not optimize searches for user statistics by "customer_id", a high-cardinality column.
The table is usually queried with filters on "customer_id" within specific date ranges. Because each customer's rows are spread across multiple files in each partition, these queries scan entire partitions, which increases runtime and cost.
How should the data engineer optimize the data layout for efficient reads?
Answer options
- A. Alter the table implementing liquid clustering on "customer_id" while keeping the existing partitioning.
- B. Alter the table implementing liquid clustering by "customer_id" and "purchase_date".
- C. Alter the table to partition by "customer_id".
- D. Enable delta caching on the cluster so that frequent reads are cached for performance.
Correct answer: B
Explanation
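The correct answer is B. Liquid clustering on "customer_id" and "purchase_date" colocates rows with similar values of both keys in the same files, so a query that filters on a customer within a date range can use file-level statistics to skip most files instead of scanning whole partitions. Liquid clustering also handles high-cardinality columns such as "customer_id" well, which partitioning does not. Option A is not possible because liquid clustering cannot be combined with partitioning (or with ZORDER) on the same table; the clustering keys replace the partition scheme. Option C would create a very large number of small partitions and files because "customer_id" is high-cardinality, and it would give up the date-based pruning the table has today. Option D can speed up repeated reads of the same data, but it does not change how the data is laid out, so uncached reads still scan full partitions.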
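As a rough illustration of option B, the sketch below builds the SQL that would rewrite the table with liquid clustering on both columns. It is a minimal sketch under stated assumptions: the table names `transactions` and `transactions_clustered` and the helper `build_liquid_clustering_statements` are hypothetical, and the statements are only printed here. Because the source table is partitioned, the sketch assumes the data is rewritten into a new clustered table with `CREATE OR REPLACE TABLE ... CLUSTER BY ... AS SELECT` rather than altered in place.

```python
from typing import List, Sequence


def build_liquid_clustering_statements(
    source_table: str,
    target_table: str,
    cluster_columns: Sequence[str],
) -> List[str]:
    """Return SQL statements that copy a table into a liquid-clustered table.

    All table and column names are supplied by the caller; the ones used
    below are hypothetical and only mirror the question's wording.
    """
    if not cluster_columns:
        raise ValueError("at least one clustering column is required")
    keys = ", ".join(cluster_columns)
    return [
        # No PARTITIONED BY clause: clustering keys replace the partition scheme.
        f"CREATE OR REPLACE TABLE {target_table} "
        f"CLUSTER BY ({keys}) AS SELECT * FROM {source_table}",
        # OPTIMIZE triggers clustering of the data that was just written.
        f"OPTIMIZE {target_table}",
    ]


statements = build_liquid_clustering_statements(
    "transactions",            # hypothetical existing partitioned table
    "transactions_clustered",  # hypothetical new clustered table
    ["customer_id", "purchase_date"],
)

for statement in statements:
    print(statement)
    # In a Databricks notebook, where a `spark` session is assumed to exist,
    # each statement could be executed with: spark.sql(statement)
```

Keeping the SQL generation separate from execution lets the statements be reviewed before they run against a real workspace.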