Question

A data engineer needs to optimize the data layout and query performance for an e-commerce transactions Delta table. The table is partitioned by "purchase_date", a date column, which helps with time-based queries but does not optimize searches by "customer_id", a high-cardinality column. The table is usually queried with filters on "customer_id" within specific date ranges, but because each customer's rows are spread across multiple files in each partition, these queries result in full partition scans, increasing runtime and cost. How should the data engineer optimize the data layout for efficient reads?

Accepted Answer

Correct answer: B. Alter the table, implementing liquid clustering by "customer_id" and "purchase_date".

Liquid clustering on both "customer_id" and "purchase_date" stores rows with similar values for those columns in the same files, so a query that filters on a customer within a date range can skip most files instead of scanning whole partitions. Option A only improves lookups on "customer_id" without considering the date range, while option C would not retain the benefits of time-based querying. Option D improves read performance but does not address the underlying data layout issue.
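As a rough illustration of what answer B could look like in practice, the sketch below builds the SQL statements as plain strings. The table names `sales.transactions` and `sales.transactions_clustered` and the helper functions are hypothetical and do not come from the question. Note also that Databricks documents liquid clustering as incompatible with partitioning, so for a table that is already partitioned the second variant (writing the data into a new clustered table) may be the one that applies. Treat this as a sketch under those assumptions, not a definitive migration procedure.

```python
def alter_cluster_by(table: str, columns: list[str]) -> list[str]:
    """Build statements that set clustering keys in place and then trigger clustering.

    Assumes the table is eligible for ALTER TABLE ... CLUSTER BY
    (for example, an unpartitioned Delta table).
    """
    cols = ", ".join(columns)
    return [
        f"ALTER TABLE {table} CLUSTER BY ({cols})",
        # Setting the keys does not rewrite existing files; OPTIMIZE triggers clustering.
        f"OPTIMIZE {table}",
    ]


def rewrite_as_clustered(source: str, target: str, columns: list[str]) -> str:
    """Build a CTAS statement that copies the data into a new clustered table.

    Hypothetical fallback for a table that is currently partitioned.
    """
    cols = ", ".join(columns)
    return f"CREATE TABLE {target} CLUSTER BY ({cols}) AS SELECT * FROM {source}"


# Hypothetical table names; the clustering keys are the two columns from the question.
keys = ["customer_id", "purchase_date"]
in_place = alter_cluster_by("sales.transactions", keys)
rewrite = rewrite_as_clustered("sales.transactions", "sales.transactions_clustered", keys)

for statement in in_place + [rewrite]:
    print(statement)
```

On a Databricks cluster, each string would typically be passed to `spark.sql(...)`. Keeping the statements as strings here makes the example runnable without a Spark session.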

Databricks Certified Data Engineer Associate — Question 136
