Hi, I have a few questions about the evaluation code:
1. Both the hellaswag and truthful_qa evaluations use ContinuationLogitsModelAdapter, but hellaswag normalizes the final score by the completion length, while truthful_qa does not.
2. Is the ContinuationLogitsModelAdapter evaluation logic essentially feeding in each choice and judging its likelihood? If so, why doesn't mmlu, which is also multiple-choice, use the same logic? What is the rationale for assigning different model adapters to different tasks? (A minimal sketch of my understanding is included below.)
3. I want to construct some few-shot examples for the LLM. For truthful_qa and hellaswag, how should I build the prompts so that they stay consistent with the evaluation logic?

Thanks in advance for your help!
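To make sure I'm reading the adapter correctly, here is a minimal sketch of my understanding of the two scoring variants. The model, prompt text, and simplified token handling are placeholders of mine, not the actual ContinuationLogitsModelAdapter code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model just to make the sketch runnable; the point is the scoring logic.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(context: str, continuation: str):
    """Sum of log-probabilities of the continuation tokens given the context."""
    # NB: real frameworks handle the token boundary between context and
    # continuation more carefully than this naive re-tokenization.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    start = ctx_ids.shape[1]                                 # continuation starts here
    cont_ids = full_ids[0, start:]
    # logits at position t-1 predict the token at position t
    token_lp = log_probs[0, start - 1:-1].gather(1, cont_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item(), cont_ids.shape[0]

context = "Question: ...\nAnswer:"
choices = [" a short ending.", " a much longer candidate ending with many more tokens."]
scores = [continuation_logprob(context, c) for c in choices]

# truthful_qa-style: pick the choice with the highest raw summed log-prob
pred_raw = max(range(len(choices)), key=lambda i: scores[i][0])
# hellaswag-style: normalize by the number of continuation tokens first
pred_norm = max(range(len(choices)), key=lambda i: scores[i][0] / scores[i][1])
```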
`gen_prompt`, `context`
What I don't understand is why, when both are multiple-choice tasks, hellaswag's answers are normalized by length but truthful_qa's are not. Regarding point 3: what I mean is that I want to construct some few-shot demonstrations for the downstream task while still evaluating hellaswag 0-shot, but since the evaluation is not in a query/response format, I'm worried this would introduce a bias. Roughly what I had in mind is sketched below.
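Concretely, something like the following, where the demonstration format is my own guess rather than the framework's template, and `build_fewshot_context` is a hypothetical helper of mine; it reuses the `continuation_logprob` sketch from above:

```python
def build_fewshot_context(fewshot_examples, test_context):
    """fewshot_examples: list of (context, gold_ending) pairs; hypothetical helper."""
    demos = "\n\n".join(ctx + ending for ctx, ending in fewshot_examples)
    return demos + "\n\n" + test_context

fewshot = [
    ("A man is sitting on a roof. He", " starts pulling up roofing from the roof."),
]
test_ctx = "A woman is outside with a bucket and a dog. The dog runs around. She"
endings = [" rinses the bucket off with soap.", " gets into the bathtub with the dog."]

# The demonstrations simply become part of the context; only the candidate endings
# of the test item are scored, so the evaluation itself is still plain continuation
# scoring rather than a query/response chat format.
prompt = build_fewshot_context(fewshot, test_ctx)
scores = [continuation_logprob(prompt, e) for e in endings]
```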