AWS Certified Machine Learning – Specialty — Question 55
A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions.
Here is an example from the dataset:
"The quck BROWN FOX jumps over the lazy dog."
Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Choose three.)
Answer options
- A. Perform part-of-speech tagging and keep the action verb and the nouns only.
- B. Normalize all words by making the sentence lowercase.
- C. Remove stop words using an English stopword dictionary.
- D. Correct the typography on "quck" to "quick."
- E. One-hot encode all words in the sentence.
- F. Tokenize the sentence into words.
Correct answer: B, C, F
Explanation
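Options B, C, and F form a standard, repeatable preprocessing pipeline for Word2Vec. Lowercasing (B) collapses case variants such as "BROWN" and "brown" into a single vocabulary entry. Removing stop words (C) with a fixed English stopword dictionary drops high-frequency words such as "the" that carry little semantic signal. Tokenizing (F) splits each sentence into the sequence of words that Word2Vec takes as input. The remaining options are not needed: keeping only verbs and nouns (A) discards context words that Word2Vec relies on to learn embeddings; correcting "quck" by hand (D) fixes a single sentence and is not a repeatable operation across 1 million sentences; and one-hot encoding (E) is unnecessary because Word2Vec implementations typically build their own vocabulary from the tokens and learn dense vectors.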
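As a rough illustration of steps B, F, and C applied to the example sentence, here is a minimal Python sketch. The `preprocess` function name, the regular-expression tokenizer, and the tiny `STOP_WORDS` set are assumptions made for this example; a real pipeline would more likely load a full English stopword dictionary and use a library tokenizer.

```python
import re
from typing import List

# Tiny illustrative stopword set (an assumption for this sketch);
# a real pipeline would load a full English stopword dictionary.
STOP_WORDS = {"a", "an", "the", "over", "of", "and", "in", "to"}


def preprocess(sentence: str) -> List[str]:
    """Lowercase, tokenize, and remove stop words from one sentence."""
    # B: normalize case so "BROWN" and "brown" map to the same token.
    lowered = sentence.lower()
    # F: tokenize into words; this simple regex also drops punctuation.
    tokens = re.findall(r"[a-z0-9']+", lowered)
    # C: remove stop words using the fixed dictionary above.
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The quck BROWN FOX jumps over the lazy dog.")
print(tokens)  # ['quck', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

The misspelled "quck" passes through unchanged, consistent with option D not being part of the pipeline. Because the same deterministic function runs on every sentence, the preparation is repeatable across the full dataset, and the resulting token lists are the kind of input a Word2Vec trainer expects.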