Questions about truthful_qa and ContinuationLogitsModelAdapter usage #280

Haruka1307 · 2025-01-12T10:27:45Z

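Hi, I have a few questions about the evaluation code:
1. Both the hellaswag and truthful_qa evaluations use ContinuationLogitsModelAdapter, but hellaswag normalizes by completion length at the end, while truthful_qa does not.
2. Does the ContinuationLogitsModelAdapter evaluation logic amount to feeding in each option and scoring its likelihood? If so, why doesn't mmlu, which is also a multiple-choice task, use the same logic? What determines which model adapter a task is evaluated with?
3. I want to construct some few-shot examples for the LLM. How should I build the prompts for truthful_qa and hellaswag so that they stay consistent with the evaluation logic?

Thank you for your help!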

Yunnglin · 2025-01-13T03:47:09Z

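The two datasets have different formats and different post-processing logic, but both do apply normalization in the end (truthful_qa's mc2 uses multiple-answer logic, while hellaswag is single-answer).
Multiple-choice questions can indeed all be evaluated with logits, or with model generation plus post-processing. Later, this will be made a configurable option so that users can choose for themselves.
truthful_qa and hellaswag do not support few-shot. If you want to add it, you need to modify the gen_prompt logic yourself and prepend the few-shot examples to the context.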
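
As a rough illustration of prepending few-shot examples to the context, here is a minimal sketch. The thread does not show the signature of gen_prompt, so `build_context` is a hypothetical standalone helper, and the separator and the example sentences are assumptions made for illustration only. The idea is that each demonstration is written as "context + correct continuation", in the same shape the continuation scorer later sees, and the unanswered query context comes last.

```python
def build_context(few_shot_examples, query_context, separator="\n\n"):
    """Prepend (context, correct_continuation) demonstrations to a query context.

    Hypothetical helper, not part of any library: it only shows the string
    layout that a modified gen_prompt could produce.
    """
    demos = [f"{ctx} {answer}" for ctx, answer in few_shot_examples]
    return separator.join(demos + [query_context])


# Made-up demonstration and query, for illustration only.
few_shot = [("A man is sitting on a roof.", "He starts pulling up shingles.")]
context = build_context(few_shot, "A woman is outside with a bucket.")
print(context)
```

Each candidate continuation would then be scored against this longer context exactly as in the zero-shot case. With an empty demonstration list the helper returns the query context unchanged.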

Haruka1307 · 2025-01-13T08:42:48Z

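I just don't quite understand why, given that both are multiple-choice tasks, hellaswag's answers are normalized by length while truthful_qa's are not.
3. What I mean is that I want to construct some few-shot examples as samples for a downstream task, while the final evaluation is still hellaswag 0-shot. But since the evaluation is not in a (query, response) format, I'm worried this will introduce bias.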

Yunnglin · 2025-01-17T09:40:54Z

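This may be an inconsistency in the code. In theory, normalizing logits by length makes answers of different lengths comparable. We need to look into how other frameworks handle this logic.
I didn't quite understand this question. What does the downstream task refer to? Model evaluation should already be the final task.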
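
To make the two kinds of normalization discussed in this thread concrete, here is a minimal sketch. It is not the framework's actual code: the function names are hypothetical, the log-likelihood values are made up, and dividing by character length is just one common convention (other frameworks may divide by token or byte count). `pick_single_choice` shows length normalization for a single-answer task such as hellaswag, and `mc2_score` shows the probability normalization commonly used for the TruthfulQA mc2 metric, where the probability mass on true answers is divided by the mass on all answers.

```python
import math


def pick_single_choice(log_likelihoods, completions, length_normalize=True):
    """Return the index of the best completion for a single-answer task."""
    scores = [
        ll / len(text) if length_normalize else ll  # per-character average
        for ll, text in zip(log_likelihoods, completions)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def mc2_score(true_log_likelihoods, false_log_likelihoods):
    """Share of the total probability mass assigned to the true answers."""
    p_true = sum(math.exp(ll) for ll in true_log_likelihoods)
    p_false = sum(math.exp(ll) for ll in false_log_likelihoods)
    return p_true / (p_true + p_false)


completions = ["yes", "a much longer continuation"]
log_likelihoods = [-6.0, -13.0]  # made-up summed token log-probabilities

print(pick_single_choice(log_likelihoods, completions, length_normalize=False))
print(pick_single_choice(log_likelihoods, completions, length_normalize=True))
```

With these made-up numbers the raw sum favors the short completion (index 0), while the per-character average favors the longer one (index 1). This is the sense in which length normalization changes which answers are comparable. The mc2 score, by contrast, is normalized across options rather than by length.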
