Databricks Certified Generative AI Engineer Associate — Question 55

A Generative AI Engineer has built an LLM-based system that will automatically translate user text between two languages. They now want to benchmark multiple LLMs on this task and pick the best one. They have an evaluation set of known high-quality translation examples, and they want to evaluate each LLM against that set using a performant metric.

Which metric should they choose for this evaluation?

Answer options

Correct answer: A

Explanation

BLEU (Bilingual Evaluation Understudy) was designed specifically for evaluating machine translation output: it scores a candidate translation by its n-gram overlap with one or more reference translations, which is exactly what the evaluation set provides. NDCG is a ranking metric and recall is a retrieval and classification metric, so neither is tailored to translation tasks, while ROUGE is typically used for evaluating summarization rather than translation accuracy.
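
To make the comparison concrete, here is a minimal sketch of sentence-level BLEU in plain Python: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It assumes a single reference, whitespace tokenization, and no smoothing; the names bleu, ngrams, model_a, and model_b and the example sentences are hypothetical and are not part of the question. In practice an established implementation such as sacreBLEU, computed at corpus level, would normally be used instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference (sketch)."""
    cand = candidate.split()  # assumption: whitespace tokenization
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical outputs from two candidate LLMs for one evaluation example.
reference = "the cat is on the mat"
outputs = {
    "model_a": "the cat is on the mat",
    "model_b": "the cat is on a mat",
}
scores = {name: bleu(text, reference) for name, text in outputs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Averaging such scores over the whole evaluation set, or better, computing corpus-level BLEU, gives one number per LLM that can be compared directly.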