AWS Certified Machine Learning – Specialty — Question 183

A data scientist is using the Amazon SageMaker Neural Topic Model (NTM) algorithm to build a model that recommends tags from blog posts. The raw blog post data is stored in an Amazon S3 bucket in JSON format. During model evaluation, the data scientist discovered that the model recommends certain stopwords such as "a," "an," and "the" as tags to certain blog posts, along with a few rare words that are present only in certain blog entries. After a few iterations of tag review with the content team, the data scientist notices that the rare words are unusual but feasible. The data scientist also must ensure that the tag recommendations of the generated model do not include the stopwords.
What should the data scientist do to meet these requirements?

Answer options

Correct answer: D

Explanation

The correct answer is D. scikit-learn's CountVectorizer can drop stopwords (through its stop_words parameter) while it converts the blog post text into word counts, so "a," "an," and "the" never enter the vocabulary and cannot be recommended as tags. Because the content team judged the rare words unusual but feasible, those words should stay in the vocabulary; only the stopwords need to be filtered out. Options A and B do not directly address the stopwords and instead remove or transform data in ways that do not meet the requirement. Option C is incorrect because Object Detection is an image algorithm and does not apply to tag recommendation from text.