A news company is developing an article search tool for its editors. The search tool shou…

Question

A news company is developing an article search tool for its editors. The search tool should look for the articles that are most relevant and representative for particular words that are queried among a corpus of historical news documents. The editors test the first version of the tool and report that the tool seems to look for word matches in general. The editors have to spend additional time to filter the results to look for the articles where the queried words are most important. A group of data scientists must redesign the tool so that it isolates the most frequently used words in a document. The tool also must capture the relevance and importance of words for each document in the corpus. Which solution meets these requirements?

Accepted Answer

Correct answer: B. Build a term frequency for each word in the articles that is weighted with the article's length. Build an inverse document frequency for each word that is weighted with all articles in the corpus. Define a final highlight score as the product of both of these frequencies. Configure the tool to retrieve the articles where this highlight score is higher for the queried words.

Term Frequency-Inverse Document Frequency (TF-IDF), as described in Option B, is the standard statistical method for measuring how important a word is to a document within a collection. Term frequency captures how often a word is used in an article, normalized by the article's length. Inverse document frequency down-weights words that appear in most articles of the corpus. Option D calculates only term frequency (TF) and ignores how rare a word is across the wider corpus (IDF), so it would still surface generic matches. Topic modeling (Option A) and semantic word embeddings (Option C) do not directly address the requirement to weight individual word importance using document-level and corpus-level frequencies.
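The scoring described in Option B can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names (`tokenize`, `tf_idf_scores`, `search`), the whitespace tokenizer, the unsmoothed `log(N / df)` form of IDF, and the three-sentence sample corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for a real tokenizer.
    return text.lower().split()


def tf_idf_scores(corpus):
    """Return one {word: highlight score} dict per article."""
    docs = [tokenize(article) for article in corpus]
    n_docs = len(docs)
    # Document frequency: number of articles that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF (count normalized by article length) times IDF (log of N / df).
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


def search(corpus, query, top_k=3):
    """Return indices of the articles with the highest summed score for the query."""
    scores = tf_idf_scores(corpus)
    terms = tokenize(query)
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: sum(scores[i].get(term, 0.0) for term in terms),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical sample corpus, for illustration only.
corpus = [
    "the election results surprised the election analysts",
    "the weather was mild and the election was far away",
    "the markets rallied after the budget vote",
]
print(search(corpus, "election", top_k=1))
```

In this sample, "the" occurs in every article, so its IDF is log(3/3) = 0 and it contributes nothing to any score. The first article ranks highest for "election" because the word makes up a larger share of that article than of the second. A production tool would more likely rely on an established implementation with smoothing and proper tokenization than on a hand-rolled version like this one.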

AWS Certified Machine Learning – Specialty — Question 322
