A company is evaluating several large language models (LLMs) for a text summarization tas…

Question

A company is evaluating several large language models (LLMs) for a text summarization task. The company needs to select a metric to evaluate the quality of the summaries that the LLMs generate. Which metric will meet this requirement?

Accepted Answer

Correct answer: C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE). ROUGE is a set of metrics designed to evaluate automatic summarization and machine translation by measuring how much a generated text overlaps with one or more reference texts, for example in shared n-grams (ROUGE-N) or longest common subsequence (ROUGE-L). In contrast, Recall and Area under the ROC curve (AUC) are standard metrics for classification models, and Mean squared error (MSE) measures the performance of regression models. None of those three compares generated text against a reference summary, so ROUGE is the correct metric for assessing the quality of LLM-generated summaries.
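
To make the overlap idea concrete, here is a minimal sketch of ROUGE-N computed by hand. It is an illustration only, not an official implementation: the function name `rouge_n`, the lowercase whitespace tokenization, and the two example sentences are all assumptions made for this example. Production ROUGE packages typically add stemming, more careful tokenization, and multi-reference support.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Illustrative ROUGE-N: n-gram overlap between a candidate and a reference.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference summary and model-generated summary.
reference = "the cat sat on the mat"
summary = "the cat lay on the mat"

scores = rouge_n(summary, reference, n=1)
print(scores)
```

In this example five of the six reference unigrams also appear in the generated summary, so ROUGE-1 recall is 5/6. Running the same function on each model's output against the same reference summaries gives a like-for-like basis for comparing the LLMs.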

AWS Certified AI Practitioner (AIF-C01) — Question 328
