Databricks Certified Data Engineer Professional — Question 214
A data team is working to optimize an existing large, fast-growing table 'orders' with high-cardinality columns, which experiences significant data skew and requires frequent concurrent writes. The team notices that the columns 'user_id', 'event_timestamp' and 'product_id' are heavily used in analytical queries and filters, although those keys may change in the future as business requirements change.
Which partitioning strategy should the team choose to optimize the table for immediate data skipping, incremental management over time, and flexibility?
Answer options
- A. Partition the table with: ALTER TABLE orders PARTITION BY user_id, product_id, event_timestamp
- B. Use z-order after partitioning the table: OPTIMIZE orders ZORDER BY (user_id, product_id) WHERE event_timestamp = current_date() - 1 DAY
- C. Cluster the table with: ALTER TABLE orders CLUSTER BY user_id, product_id, event_timestamp
- D. Z-order the table with OPTIMIZE orders ZORDER BY (user_id, product_id, event_timestamp)
Correct answer: C
Explanation
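The correct answer is C because liquid clustering (CLUSTER BY) matches every requirement in the scenario. Databricks recommends it for tables that are filtered on high-cardinality columns, have skewed data distribution, grow quickly, receive concurrent writes, or have access patterns that change over time. Clustering is applied incrementally when OPTIMIZE runs, and the clustering keys can be redefined later without rewriting the existing data, which covers the need for flexibility. Option A is wrong because partitioning on high-cardinality columns such as 'user_id' and 'event_timestamp' produces a very large number of small partitions and aggravates skew, and the partition columns of an existing Delta table cannot be changed without rewriting the table. Options B and D rely on Z-ordering, which only takes effect when OPTIMIZE ... ZORDER BY is rerun and requires reorganizing existing data files if the Z-order columns change. Option B also locks the table into a fixed partition scheme.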
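The behavior behind option C can be sketched in Databricks SQL as follows. This is a minimal illustration, not part of the exam item. It assumes 'orders' is an unpartitioned Delta table on a Databricks Runtime version that supports liquid clustering, and the reduced key list in the third statement is a hypothetical future change. Note that Databricks SQL expects the clustering keys in parentheses, which the answer option omits.

```sql
-- Enable liquid clustering on the existing table (the approach in option C).
-- Assumption: orders is an unpartitioned Delta table.
ALTER TABLE orders CLUSTER BY (user_id, product_id, event_timestamp);

-- Apply clustering; OPTIMIZE works incrementally on a clustered table,
-- rewriting only the data that still needs to be clustered.
OPTIMIZE orders;

-- Hypothetical later change: query patterns shift, so the keys are redefined.
-- Existing data is not rewritten by this statement; newly written data and
-- subsequent OPTIMIZE runs use the new keys.
ALTER TABLE orders CLUSTER BY (product_id, event_timestamp);

-- Clustering can also be turned off again if it is no longer wanted.
ALTER TABLE orders CLUSTER BY NONE;
```

The ability to run the third statement without a full table rewrite is what distinguishes clustering from partitioning and Z-ordering in this scenario.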