Discrepancy in gpt-4o-mini Results on MSMARCO Compared to Reported Results #6

Thank you for providing this excellent benchmark and sharing the evaluation results across various models!

When we tested the MSMARCO task using the evaluation code provided in this GitHub repository, we observed that the results seem to differ significantly from the numbers reported in the paper. Specifically, we used the same model (gpt-4o-mini-2024-07-18) as mentioned, but for the different lengths in MSMARCO we only achieved results of about 67 (8k), 58 (16k), 46 (32k), 30 (64k), and 22 (128k).

Could you kindly advise whether there is anything extra we need to do in order to reproduce the results from the paper? Currently, we are using the code exactly as provided in this repository.

Thank you!

Comments
Hi, thank you for your interest in our work! Can you please show me the arguments that were saved in your output files and the name of the output file?
Of course! The arguments saved in the output file are as follows:

"args": {
"config": "configs/rerank.yaml",
"tag": "eval",
"model_name_or_path": "gpt-4o-mini-2024-07-18",
"use_vllm": false,
"datasets": "msmarco_rerank_psg",
"demo_files": "data/msmarco/test_reranking_data_k10_dep3.jsonl",
"test_files": "data/msmarco/test_reranking_data_k1000_dep3.jsonl",
"output_dir": "output/gpt-4o-mini-2024-07-18",
"overwrite": false,
"max_test_samples": 100,
"num_workers": 4,
"num_depths": 10,
"popularity_threshold": 3,
"shots": 2,
"input_max_length": 131072,
"do_sample": false,
"generation_max_length": 200,
"generation_min_length": 0,
"temperature": 1.0,
"top_p": 1.0,
"stop_newline": false,
"seed": 42,
"no_cuda": false,
"no_bf16": false,
"no_torch_compile": false,
"use_chat_template": false,
"rope_theta": null,
"debug": false,
"count_tokens": false,
"stop_new_line": true
}

And the filename is Let me know if you need further information!
I re-ran these experiments today, and I got:
It's strange: I directly used the command `bash scripts/run_api.sh`, but the results are still quite different. Could you please share your output files so I can check which step may have gone wrong? Your help is truly appreciated!
You can find the result files here: https://drive.google.com/file/d/1PDRFhRXn4YcZ5IH9250gC5xe4CixXK4S/view?usp=sharing
Thank you very much! I compared the two and found that the only difference lies in the demos. Since the `data.py` code uses a hash to pick the demos, this difference is normal (or it may need an extra hash-seed setting). Additionally, there are some fluctuations in the API results. I'm not exactly sure where the difference comes from. I tried using the same inputs as you (including the demos), but the results still vary quite a bit. I suspect the API might be the cause, but I need to check further. If I find anything else later, I'll be sure to share it with you! Thanks again!
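For reference, here is a minimal sketch of how request-side variance can be reduced when calling the API directly. It assumes the official `openai` Python client rather than the repository's own wrapper; the model name and `max_tokens` are taken from the config above, everything else is illustrative.

```python
# Minimal sketch (not the repository's code): pinning request-side sources of
# variance when calling the OpenAI chat completions API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18",
    messages=[{"role": "user", "content": "Rank the following passages ..."}],  # placeholder prompt
    temperature=0.0,   # greedy-style decoding instead of sampling
    seed=42,           # best-effort reproducibility across requests
    max_tokens=200,
)

print(response.system_fingerprint)          # changes when the serving backend changes
print(response.choices[0].message.content)
```

Even with `temperature=0.0` and a fixed `seed`, the service only promises best-effort determinism, so some run-to-run drift in the metrics is still expected.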
Thanks for catching this! I will update the code with a deterministic hash function to get the demos.
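As an illustration of that direction (a hypothetical helper, not the actual `data.py` change): Python's built-in `hash()` is salted per process, so a content-based hash such as `hashlib.sha256` yields demo indices that are stable across runs.

```python
# Hypothetical sketch of deterministic demo selection.
import hashlib

def pick_demo_indices(query_id: str, num_demos: int, pool_size: int) -> list[int]:
    """Map a query id to a stable set of demo indices (repeats possible;
    a real implementation would deduplicate)."""
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    # Derive one index per demo from successive 4-byte chunks of the digest.
    return [int.from_bytes(digest[4 * i: 4 * (i + 1)], "big") % pool_size
            for i in range(num_demos)]

# Illustrative qid; the same id always maps to the same demos.
print(pick_demo_indices("qid_1185869", num_demos=2, pool_size=1000))
```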
Hi, thank you very much for your attention and response! I've encountered an issue while aligning the evaluation results for MSMARCO. Specifically, there seems to be a problem with the qrels used in the metric calculations. In the code, this occurs between lines 305 and 321. The sampling logic used there,

keys = random.sample(sorted(keys), min(max_test_samples, len(keys)))
data = data.filter(lambda x: x[key] in keys)

appears correct. However, the sampled data still contains multiple entries that share the same qid. More critically, the qrels are built as

qrels[d["qid"]] = {c["id"]: c["label"] for c in d["ctxs"]}

Since qrels is keyed by qid, entries with a duplicated qid collapse into a single one before being passed to

pytrec_eval.RelevanceEvaluator(qrels, {map_string, ndcg_string, recall_string, precision_string, "recip_rank"})

Could you please provide insights or suggestions on how to address this issue? Thank you once again for your time and for your valuable work!
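A toy illustration of this concern (made-up qids and labels, not the benchmark data): because `qrels` is a dictionary keyed by qid, samples that share a qid contribute only one `qrels` entry to pytrec_eval.

```python
# Toy example: duplicate qids collapse into a single qrels entry.
import pytrec_eval

samples = [
    {"qid": "q1", "ctxs": [{"id": "d1", "label": 1}, {"id": "d2", "label": 0}]},
    {"qid": "q1", "ctxs": [{"id": "d2", "label": 0}, {"id": "d1", "label": 1}]},  # same qid, permuted
    {"qid": "q2", "ctxs": [{"id": "d3", "label": 1}]},
]

qrels = {}
for d in samples:
    qrels[d["qid"]] = {c["id"]: c["label"] for c in d["ctxs"]}  # later assignment overwrites earlier

print(len(samples), len(qrels))  # 3 samples, but only 2 qrels entries

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "recip_rank"})
run = {"q1": {"d1": 0.9, "d2": 0.1}, "q2": {"d3": 0.8}}
print(evaluator.evaluate(run))   # per-query metric dictionaries
```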
Thanks for pointing this out; I should have added more documentation on this. Yes, there are multiple instances of the same qid. The sampling step therefore follows this logic: we first sample some number of qids and then keep the three different permutations of each sampled qid. Hope this makes sense! I checked the code and it seems that I didn't set the random seed in load_msmarco, which could lead to different subsets. I will update the results in the next iteration with the same random seed set for all runs.
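For concreteness, a rough sketch of that sampling logic with a fixed seed (hypothetical function and field names, not `load_msmarco` itself):

```python
# Sketch: sample qids with a fixed seed, then keep every record for each
# sampled qid so the different permutations stay together.
import random

def sample_by_qid(data: list[dict], max_test_samples: int, seed: int = 42) -> list[dict]:
    random.seed(seed)  # without this, each run can pick a different subset
    qids = sorted({d["qid"] for d in data})
    kept = set(random.sample(qids, min(max_test_samples, len(qids))))
    # All permutations of a kept qid are retained, matching the design above.
    return [d for d in data if d["qid"] in kept]
```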