Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

about gpt-4-0125-preview reference answer #21

Open
duguodong7 opened this issue Sep 10, 2024 · 4 comments
Open

about gpt-4-0125-preview reference answer #21

duguodong7 opened this issue Sep 10, 2024 · 4 comments
Labels
good first issue Good for newcomers

Comments

@duguodong7
Copy link

hello,

我想咨询一下在MT-bench上测试时,使用的reference answer 是通过 gen_api_answer.py --model gpt-4-0125-preview这个命令来获取的吗?
生成的reference answer有80个,然后把其中100~130个用official commenthttps://github.com/lm-sys/FastChat/pull/3158这里的正确的30个进行替换吗?
总结一下;judge model 是用gpt-4-0125-preview, 但是80个问题的reference answer 是怎么获取呢,judge的结果是可复现的还是会有波动呢?

@duguodong7
Copy link
Author

Then I see that reference answer is only prepare for 100~130, we do not neet to run gen_api_answer.py since the reference 30 is given.
However, with the provided gpt-4-0125-preview.jsonl and gpt-4-0125-preview as judge model, we still can not obtaion a same result for openchat_3.5 as shown in paper, we are trying more tests. The result we got is (7.4375 6.9875 7.2125) in stead of (7.14 6.55 6.84) as shown in paper.

@duguodong7
Copy link
Author

I just tested the result of FuseChat-2.0 provided in your link, the result is (1st turn: 7.6125 2nd turn: 6.425 mean: 7.01875) in stead of (7.70 7.05 7.38) what you report, may I know what is wrong and how can I reproduce it ? I did not use vllm and also I maintain all environment version the same with you. I generate the result of FuseChat-2.0 with 8 GPUs.

@yangzy39
Copy link
Contributor

yangzy39 commented Sep 10, 2024

Regarding MT-Bench, the evaluation results may fluctuate. When conducting the evaluation, we only changed the reference and judge model from gpt-4-0613 to gpt-4-0125-preview. Below are several reasons that could lead to differences in reproducibility:

  1. Ensuring that the reference answer and API used during the evaluation are from gpt-4-0125-preview.
  2. Ensuring that the correct chat template is used when evaluating our model, which may require modifying the matching code in FastChat (https://github.com/lm-sys/FastChat/blob/main/fastchat/model/model_adapter.py).
  3. Ensuring the openchat_3.5 model you tested is from https://huggingface.co/openchat/openchat_3.5.

Also, we provide our judgement file for openchat_3.5 and FuseChat-2.0(FuseChat-7B-SCE).
judgement.zip

@duguodong7
Copy link
Author

Thank you so much! I did make an error when setting the chat template, and the performance improved after correcting it.

@18907305772 18907305772 added the good first issue Good for newcomers label Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants