How do LLMs judge the quality of summaries produced by other models, and how closely do those judgments match human ratings?
SEAHORSE is a dataset introduced for multilingual, multifaceted summarization evaluation. The dataset consists of 96K summaries with human ratings along six quality dimensions: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness.
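A minimal sketch of loading the SEAHORSE ratings for analysis is shown below. The file name and column names are assumptions for illustration only; the actual schema should be taken from the dataset's documentation.

```python
# Sketch: load SEAHORSE human ratings into a DataFrame for analysis.
# NOTE: the file name and column names below are assumed, not taken from
# the official release; adjust them to the actual dataset schema.
import pandas as pd

ratings = pd.read_csv("seahorse_train.tsv", sep="\t")

# Hypothetical column names for the six quality dimensions.
DIMENSIONS = [
    "comprehensibility", "repetition", "grammar",
    "attribution", "main_ideas", "conciseness",
]

# Quick look at the distribution of human ratings per dimension.
print(ratings[DIMENSIONS].describe())
```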
The main idea of this project is to use Large Language Models (LLMs) to evaluate the quality of summaries generated by other models. These LLM evaluations are compared against the human ratings in the SEAHORSE dataset, and the agreement between human and model judgments is analyzed. The quality dimensions considered are listed below (a prompting sketch follows the list):
- Comprehensibility: The summary can be read and understood by the rater.
- Repetition: The summary is free of unnecessarily repeated information.
- Grammar: The summary is grammatically correct.
- Attribution: All of the information provided by the summary is fully attributable to the source article.
- Main Ideas: The summary captures the main idea(s) of the source article.
- Conciseness: The summary concisely represents the information in the source article.
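The sketch below illustrates how an LLM could be prompted to rate a single summary along these six dimensions. `call_llm` is a hypothetical helper standing in for whatever chat or completion API is used, and the prompt wording is illustrative rather than the project's exact prompt.

```python
# Sketch: ask an LLM to rate one summary on the six SEAHORSE dimensions.
# `call_llm` is a hypothetical function (prompt -> response string).
import json

DIMENSIONS = [
    "comprehensibility", "repetition", "grammar",
    "attribution", "main ideas", "conciseness",
]

def rate_summary(article: str, summary: str, call_llm) -> dict:
    prompt = (
        "Rate the following summary of the article on each dimension "
        f"({', '.join(DIMENSIONS)}), answering 1 if the summary satisfies "
        "the dimension and 0 otherwise. Reply with a JSON object keyed by "
        "dimension name.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}"
    )
    response = call_llm(prompt)   # hypothetical LLM call
    return json.loads(response)   # expects a JSON dict of 0/1 ratings
```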
The objective of this project is to evaluate how closely LLMs align with human judgment when assessing the quality of summaries generated by other models, providing insight into the reliability and limitations of LLMs as evaluators of summarization quality. The current evaluation metric is the mean squared error (MSE) between human ratings and model-generated ratings, as sketched below.
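The following is a minimal sketch of that objective, assuming the human and model ratings are already aligned per dimension as lists of 0/1 scores; the toy values at the end are illustrative only.

```python
# Sketch: per-dimension MSE between human and LLM-generated ratings.
# Assumes `human` and `model` map each dimension to aligned lists of scores.
import numpy as np

def mse_per_dimension(human: dict, model: dict) -> dict:
    return {
        dim: float(np.mean((np.array(human[dim]) - np.array(model[dim])) ** 2))
        for dim in human
    }

# Toy example with two dimensions and three summaries (not real data).
human = {"grammar": [1, 1, 0], "attribution": [1, 0, 0]}
model = {"grammar": [1, 0, 0], "attribution": [1, 0, 1]}
print(mse_per_dimension(human, model))
# {'grammar': 0.333..., 'attribution': 0.333...}
```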