Databricks Certified Generative AI Engineer Associate — Question 31

A Generative AI Engineer is building a RAG application that will rely on context retrieved from source documents that are currently in PDF format. These PDFs can contain both text and images. The engineer wants to develop a solution using the fewest lines of code.
Which Python package should be used to extract the text from the source documents?

Answer options

A. flask
B. beautifulsoup
C. unstructured
D. numpy

Correct answer: C

Explanation

The correct answer is C. The 'unstructured' package is designed to parse many document types, including PDFs, and return their text content in only a few lines of code, which matches the requirement to minimize code. Option A (flask) is a web framework, and Option B (beautifulsoup) is a parser for HTML and XML, so neither is suited to extracting text from PDFs. Option D (numpy) is a library for numerical computation and has no document-parsing functionality.
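
As an illustration of the "few lines of code" point, here is a minimal sketch of PDF text extraction with 'unstructured'. It assumes the package is installed with its PDF extras (for example `pip install "unstructured[pdf]"`) and uses `partition_pdf` from `unstructured.partition.pdf`. The function names `extract_pdf_text` and `join_element_text` and the file name in the usage note are hypothetical, chosen only for this example.

```python
def join_element_text(elements):
    """Join the non-empty text of each parsed element, one element per line."""
    lines = []
    for el in elements:
        # Each element is expected to expose a .text attribute; skip blanks.
        text = (getattr(el, "text", None) or "").strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


def extract_pdf_text(pdf_path):
    """Return the text of a PDF as a single string (sketch, not tuned)."""
    # Imported lazily so the helper above works without the package installed.
    from unstructured.partition.pdf import partition_pdf

    # partition_pdf returns a list of elements (titles, paragraphs, etc.).
    elements = partition_pdf(filename=pdf_path)
    return join_element_text(elements)
```

A call such as `extract_pdf_text("source_document.pdf")` would then return the extracted text, ready to be chunked and embedded for retrieval. How embedded images are handled depends on the partitioning options and installed dependencies, which this sketch leaves at their defaults.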