Databricks Certified Generative AI Engineer Associate — Question 73

A Generative AI Engineer is building a RAG application that will rely on context retrieved from source documents that are currently in HTML format. They want to develop a solution using the fewest lines of code.

Which Python package should be used to extract the text from the source documents?

Answer options

A. pytesseract
B. numpy
C. pypdf2
D. beautifulsoup

Correct answer: D

Explanation

The correct answer is D, beautifulsoup, which is designed for parsing HTML and extracting text from it. Option A (pytesseract) performs OCR on images, option B (numpy) is a numerical computing library unrelated to HTML processing, and option C (pypdf2) handles PDF files, so none of them suits this scenario.
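
As an illustration of why beautifulsoup keeps the code short, the following is a minimal sketch of HTML text extraction. It assumes the package is installed as beautifulsoup4 and imported as bs4; the helper name html_to_text and the sample HTML string are hypothetical and only serve the example.

```python
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the readable text of an HTML document as one string.

    Hypothetical helper for illustration; not part of any library.
    """
    # Parse with Python's built-in parser, so no extra parser dependency is needed.
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements, whose contents are not readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Join the remaining text nodes with spaces, trimming whitespace around each.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample document standing in for a real source file.
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Title</h1><p>Hello world.</p></body></html>"
)

print(html_to_text(sample_html))
```

In a RAG pipeline, the returned string would typically be split into chunks before embedding. Removing the script and style elements is a design choice made here so that CSS and JavaScript do not end up in the retrieved context.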