about gpt-4-0125-preview reference answer #21
I see now that reference answers are only prepared for questions 100~130, so we do not need to run gen_api_answer.py, since those 30 reference answers are already given.
I just tested the FuseChat-2.0 model provided in your link. The result is (1st turn: 7.6125, 2nd turn: 6.425, mean: 7.01875) instead of the (7.70, 7.05, 7.38) you report. May I know what is wrong and how I can reproduce your numbers? I did not use vllm, and I kept all environment versions the same as yours. I generated the FuseChat-2.0 answers with 8 GPUs.
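For reference, the arithmetic behind the two score triples can be checked with a short sketch. It assumes the reported mean is the plain average of the two per-turn scores (which holds when both turns cover the same number of questions); the function name `mean_score` is hypothetical and not part of the evaluation scripts.

```python
def mean_score(turn1: float, turn2: float) -> float:
    """Average of the two per-turn scores (assumes equal question counts per turn)."""
    return (turn1 + turn2) / 2


# Scores quoted in this thread.
reproduced = mean_score(7.6125, 6.425)  # my run
reported = mean_score(7.70, 7.05)       # numbers reported by the authors

print(f"reproduced mean: {reproduced:.5f}")
print(f"reported mean:   {reported:.3f}")
print(f"gap:             {reported - reproduced:.5f}")
```

Most of the gap comes from the second turn (6.425 versus 7.05), which is where the comparison should start.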
Regarding MT-Bench, the evaluation results may fluctuate. When conducting the evaluation, we only changed the reference and judge model from gpt-4-0613 to gpt-4-0125-preview. Below are several reasons that could lead to differences in reproducibility:
Also, we provide our judgement file for openchat_3.5 and FuseChat-2.0 (FuseChat-7B-SCE).
Thank you so much! I did make an error when setting the chat template, and the performance improved after correcting it.
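Hello,
I would like to ask: when testing on MT-bench, are the reference answers obtained with the command gen_api_answer.py --model gpt-4-0125-preview?
That generates 80 reference answers. Are the ones for questions 100~130 then replaced with the 30 corrected answers from the official comment at https://github.com/lm-sys/FastChat/pull/3158?
To summarize: the judge model is gpt-4-0125-preview, but how are the reference answers for the 80 questions obtained, and are the judge results reproducible or do they fluctuate?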
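The replacement step asked about here could be sketched as follows. This is only an illustration of overriding generated reference answers with corrected ones, not the maintainers' actual procedure. It assumes each answer is a record keyed by a `question_id` field; the helper name `merge_reference_answers` and the toy records are hypothetical.

```python
def merge_reference_answers(generated: list[dict], official: list[dict]) -> list[dict]:
    """Return the generated answers, with any entry that has a matching
    question_id in `official` replaced by the official record."""
    official_by_id = {record["question_id"]: record for record in official}
    return [official_by_id.get(record["question_id"], record) for record in generated]


# Toy records (hypothetical content) standing in for the answer files.
generated = [
    {"question_id": 99, "answer": "generated answer"},
    {"question_id": 101, "answer": "generated answer"},
]
official = [
    {"question_id": 101, "answer": "corrected answer"},
]

merged = merge_reference_answers(generated, official)
for record in merged:
    print(record["question_id"], record["answer"])
```

Records without a corrected counterpart are kept unchanged, so the output has the same length and order as the generated list.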