Databricks Certified Generative AI Engineer Associate — Question 46
A Generative AI Engineer has written scalable PySpark code to ingest unstructured PDF documents and chunk them in preparation for storing in a Databricks Vector Search index. Currently, their DataFrame has two columns: the original filename as a string and an array of text chunks from that document.
What set of steps should the Generative AI Engineer perform to store the chunks in a ready-to-ingest manner for Databricks Vector Search?
Answer options
- A. Use PySpark’s autoloader to apply a UDF across all chunks, formatting them in a JSON structure for Vector Search ingestion.
- B. Flatten the dataframe to one chunk per row, create a unique identifier for each row, and enable change feed on the output Delta table.
- C. Utilize the original filename as the unique identifier and save the dataframe as is.
- D. Create a unique identifier for each document, flatten the dataframe to one chunk per row and save to an output Delta table.
Correct answer: B
Explanation
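B is correct because a Delta Sync index in Databricks Vector Search reads from a source Delta table that has one row per item to embed, a primary key column that is unique for each row, and Delta Lake's Change Data Feed enabled so the index can pick up inserts, updates, and deletes incrementally. Flattening the array gives one chunk per row, the per-row identifier supplies the primary key, and enabling change feed on the output table satisfies the sync requirement. Option A does not work because Auto Loader is a Databricks file-ingestion feature, not a way to apply a UDF, and Vector Search does not ingest hand-formatted JSON. Option C keeps all of a document's chunks in a single array, so the individual chunks cannot be embedded or retrieved as separate rows, and the filename identifies a document rather than a chunk. Option D is close, but it creates the identifier per document before flattening, so every chunk row from the same document shares one ID, and it does not mention enabling change feed on the output Delta table.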