A company wants to segment a large group of customers into subgroups based on shared char…

Question

A company wants to segment a large group of customers into subgroups based on shared characteristics. The company’s data scientist is planning to use the Amazon SageMaker built-in k-means clustering algorithm for this task. The data scientist needs to determine the optimal number of subgroups (k) to use. Which data visualization approach will MOST accurately determine the optimal value of k?

Accepted Answer

Correct answer: D. Run the k-means clustering algorithm for a range of k. For each value of k, calculate the sum of squared errors (SSE). Plot a line chart of the SSE for each value of k. The optimal value of k is the point after which the curve starts decreasing in a linear fashion.

Option D describes the elbow method. SSE generally keeps falling as k grows, so the lowest SSE is not the goal; the useful signal is the bend in the curve. Before the bend, each additional cluster removes a large share of the error. After it, the curve settles into a roughly linear decline, which indicates diminishing returns on clustering quality. Option A focuses incorrectly on PCA components for separation, option B misapplies PCA's explained variance, and option C relies on t-SNE, which is not well suited to determining k for k-means clustering.
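The SSE-versus-k procedure described in option D can be sketched in a few lines. This is a minimal illustration, not the SageMaker built-in algorithm: it uses a small pure-Python k-means with a deterministic farthest-first seeding (an assumption chosen to keep the run reproducible), synthetic data with three blobs, and a hypothetical `elbow` helper that picks the k where the slope of the SSE curve changes most.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (an illustrative choice): start at points[0],
    then repeatedly add the point farthest from its nearest chosen seed."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_sse(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the final sum of squared errors."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow(sse):
    """Hypothetical helper: return the interior k with the largest second
    difference, i.e. the sharpest change in slope of the SSE curve."""
    ks = sorted(sse)
    return max(
        ks[1:-1],
        key=lambda k: (sse[k - 1] - sse[k]) - (sse[k] - sse[k + 1]),
    )


# Synthetic customers: three well-separated 2-D blobs of 50 points each.
rng = random.Random(0)
centers = [(0.0, 0.0), (12.0, 0.0), (6.0, 10.0)]
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in centers
    for _ in range(50)
]

# Run k-means for a range of k and record the SSE for each value.
sse = {k: kmeans_sse(points, k) for k in range(1, 8)}
best_k = elbow(sse)

for k in sorted(sse):
    print(f"k={k}  SSE={sse[k]:.1f}")
print("elbow at k =", best_k)
```

Plotting the `sse` values against k with any charting library gives the line chart the question asks about. With three well-separated blobs, the bend is expected at k = 3. On real customer data the bend is often less sharp, so the second-difference rule is best treated as a starting point for reading the chart rather than a final answer.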

AWS Certified Machine Learning – Specialty — Question 208

Explanation