Databricks Certified Generative AI Engineer Associate — Question 73
A Generative AI Engineer is building a RAG application that will rely on context retrieved from source documents that are currently in HTML format. They want to develop a solution using the fewest lines of code.
Which Python package should be used to extract the text from the source documents?
Answer options
- A. pytesseract
- B. numpy
- C. pypdf2
- D. beautifulsoup
Correct answer: D
Explanation
The correct answer is D, beautifulsoup. Beautiful Soup is designed for parsing HTML and extracting its text, and it does so in only a few lines of code. The other options do not fit this scenario: A (pytesseract) performs OCR on images, B (numpy) is a numerical computing library unrelated to HTML processing, and C (pypdf2) handles PDF files.
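As a rough illustration of why Beautiful Soup keeps the code short, here is a minimal sketch. It assumes the package is installed as `beautifulsoup4` and imported as `bs4`. The helper name `html_to_text` and the sample HTML string are hypothetical and not part of the question. Dropping `script` and `style` elements is an optional cleanup step that a RAG pipeline may or may not want.

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document (hypothetical helper)."""
    # "html.parser" is Python's built-in parser, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements so their contents do not end up in the text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Join the remaining text fragments with single spaces, trimming whitespace.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample document for illustration only.
sample_html = (
    "<html><head><title>Doc</title><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p></body></html>"
)
text = html_to_text(sample_html)
print(text)
```

The extracted string can then be chunked and embedded like any other plain text source in the retrieval step.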