A data team is working to optimize an existing large, fast-growing table 'orders' with hi…

Question

A data team is working to optimize an existing large, fast-growing table 'orders' with high-cardinality columns, which experiences significant data skew and requires frequent concurrent writes. The team notices that the columns 'user_id', 'event_timestamp', and 'product_id' are heavily used in analytical queries and filters, although those keys may change in the future as business requirements change. Which partitioning strategy should the team choose to optimize the table for immediate data skipping, incremental management over time, and flexibility?

Accepted Answer

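Correct answer: C. Cluster the table with: ALTER TABLE orders CLUSTER BY (user_id, product_id, event_timestamp)

Clustering the table (liquid clustering) by 'user_id', 'product_id', and 'event_timestamp' addresses all three requirements in the question. The clustering keys drive data skipping for queries that filter on those columns, clustering is applied incrementally as the table grows, and the keys can be redefined later without rewriting the existing data. Liquid clustering is also the approach Databricks recommends for tables with high-cardinality filter columns, significant skew, fast growth, and concurrent writes. Options A and B rely on partitioning and Z-ordering, which do not offer the same flexibility: partitioning on high-cardinality columns tends to produce many small partitions, and changing a partition column means rewriting the table. Option D also relies on Z-ordering and does not provide the incremental clustering management that liquid clustering offers.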
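The workflow behind option C can be sketched as follows. This is a minimal illustration, not a definitive procedure: it assumes 'orders' is an unpartitioned Delta table on a Databricks runtime that supports liquid clustering, and the reduced key list in the last statement is only an example of a later change, not something stated in the question.

```sql
-- Enable liquid clustering on the existing table (keys taken from the question).
ALTER TABLE orders CLUSTER BY (user_id, product_id, event_timestamp);

-- Apply clustering incrementally; run this periodically as new data arrives.
OPTIMIZE orders;

-- Inspect the table's current clustering keys (clusteringColumns field).
DESCRIBE DETAIL orders;

-- Assumed later change: if query patterns shift, redefine the keys in place.
-- This statement does not rewrite existing files; new writes and
-- subsequent OPTIMIZE runs use the new keys.
ALTER TABLE orders CLUSTER BY (user_id, event_timestamp);
```

Because the keys are table metadata rather than part of the physical directory layout, changing them is a metadata operation. That is the flexibility the question asks for and that a fixed partition scheme lacks.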

Databricks Certified Data Engineer Professional — Question 214
