Databricks Certified Data Engineer Professional — Question 156

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the following schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team wants to identify whether any of 30 keywords exist in this field.

A junior data engineer suggests that converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer’s suggestion is correct?

Answer options

Correct answer: A

Explanation

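The correct answer is A. Delta Lake's data skipping relies on per-file statistics such as minimum value, maximum value, and null count. For a high-cardinality free-text column like review, those statistics do not help a query that checks whether any of 30 keywords appears somewhere in the text, so converting to Delta Lake alone is unlikely to speed up this workload. Option B is incorrect because Delta Lake collects statistics on the first 32 columns by default (configurable through the delta.dataSkippingNumIndexedCols table property), not just the first four. Option C is misleading: ZORDER improves data skipping for selective equality or range filters, but co-locating rows by the sort order of the review text does not help a substring search. Option D is false because Delta Lake does not create a term matrix for free-text fields.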
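
To illustrate why min/max statistics help an equality filter but not a keyword search, here is a minimal sketch in plain Python. It is a toy model, not Delta Lake's implementation: the file names, rows, and helper functions are all hypothetical, and real Delta statistics are stored per data file in the transaction log.

```python
# Toy model of min/max data skipping. File names and rows are made up.
files = {
    "part-0": [(1.0, "Awful battery life"), (2.0, "Zippers broke quickly")],
    "part-1": [(4.0, "Great battery, fast shipping"), (5.0, "Love it")],
}


def min_max(values):
    """Return the (min, max) pair for a column within one file."""
    values = list(values)
    return min(values), max(values)


# Per-file statistics for the rating and review columns.
rating_stats = {name: min_max(r for r, _ in rows) for name, rows in files.items()}
review_stats = {name: min_max(t for _, t in rows) for name, rows in files.items()}


def files_in_range(stats, value):
    """Keep only files whose [min, max] range could contain the value."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]


def files_containing_keyword(keyword):
    """Ground truth: files where at least one review contains the keyword."""
    return [
        name
        for name, rows in files.items()
        if any(keyword in text.lower() for _, text in rows)
    ]


# Equality filter on a numeric column: the statistics rule out part-0.
rating_scan = files_in_range(rating_stats, 5.0)

# Keyword search: a range check on the string bounds selects no files,
# yet both files contain the keyword, so the bounds cannot be used to prune.
keyword = "battery"
range_check = files_in_range(review_stats, keyword)
truth = files_containing_keyword(keyword)

print("rating = 5.0 scans:", rating_scan)
print("range check for keyword:", range_check)
print("files that really contain keyword:", truth)
```

Because the lexicographic bounds of a text column say nothing about substrings inside each value, an engine has to read every file for this kind of filter, whatever the storage format.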