llava_onevision_moviechat performance #86
Comments
It seems that after we opened the PR for llava_onevision_moviechat to lmms-eval, LLaVA-NeXT updated their code again. Did you try using the llava in our repo? Additionally, generate_until_multi_round() was also added after we submitted the MovieChat series PR to lmms-eval. We will update our code in the coming days.
Thank you for your prompt response! I replaced the llava folder in LLaVA-NeXT with the llava folder from this repository and ran pip install -e . in the LLaVA-NeXT directory. Unfortunately, I obtained a similar result (accuracy: 39.1%, score: 2.77). To summarize, the only modification I made was changing LLAVA_NeXT.llava to llava in llava_onevision_moviechat.py. Since I used the llava folder from this repository, there was no need to modify llava_qwen.py. Are there any other issues with the modifications I made or with the script I used? python3 -m accelerate.commands.launch Alternatively, is it possible to run llava_onevision_moviechat using this repository as well?
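For context, the import-path change mentioned above is roughly of the following form; the exact import lines in llava_onevision_moviechat.py may differ, so treat this only as an illustrative sketch:

```python
# Illustrative only: the real file may import additional symbols.

# Before (resolving llava through the LLaVA-NeXT package layout):
# from LLAVA_NeXT.llava.model.builder import load_pretrained_model

# After (resolving llava from the llava folder shipped in this repository):
from llava.model.builder import load_pretrained_model
```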
Can you provide some examples from your log? Maybe it is named 'moviechat_global.json'.
Yes, these are the first three lines of "logs/date_time_samples_moviechat_global.jsonl":

{"doc_id": 0, "doc": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "time": 240, "pred": "boat"}, "target": "A boat.", "arguments": {"0": "do_sample", "1": "max_new_tokens", "2": "temperature", "3": "top_p", "4": "num_beams", "5": "stopping_criteria"}, "resps": [["boat"]], "filtered_resps": ["boat"], "doc_hash": "74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b", "prompt_hash": "64c2d8221c3a606afa1db32244f19680c722941f872fdbd720305dad38ce17e6", "target_hash": "3830d9d685549b63df5bea558b60a9730f1f7a1257474eeca2c9e23c3024f64a", "gpt_eval_score": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "pred": "boat", "score": 5, "review": "{'pred': 'yes', 'score': 5}"}, "gpt_eval_acc": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "pred": "boat", "acc": "yes", "review": "{'pred': 'yes', 'score': 5}"}, "input": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.What is the vehicle?"}
@Espere-1119-Song I tried to evaluate moviechat_global by setting --tasks moviechat_global, but it seems that the dataset card (https://huggingface.co/datasets/Enxin/lmms_MovieChat_test) is only for break mode.
I'm sorry, I misunderstood because I had only checked the first few samples of the results. Actually, I found that the moviechat_global task processes all samples from both the break and global tasks. As you can see in the dataset viewer at the link above, the files are treated as a single dataset, so I could get results for the global task by filtering doc_id (>= 1,907). In summary, the dataset card should be separated per task. Thank you.
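For reference, a minimal sketch of that filtering is below, assuming the samples file has the structure shown in the log excerpt earlier in this thread (the date_time part of the file name stands in for the actual run timestamp):

```python
import json

# Placeholder name; substitute the actual timestamped samples file.
log_path = "logs/date_time_samples_moviechat_global.jsonl"

global_samples = []
with open(log_path) as f:
    for line in f:
        sample = json.loads(line)
        # doc_id >= 1907 corresponds to the global task in the combined
        # break + global dump.
        if sample["doc_id"] >= 1907:
            global_samples.append(sample)

# Recompute accuracy and score over the global-only subset, using the same
# per-sample fields the GPT evaluator wrote into the log.
acc = sum(s["gpt_eval_acc"]["acc"] == "yes" for s in global_samples) / len(global_samples)
score = sum(s["gpt_eval_score"]["score"] for s in global_samples) / len(global_samples)
print(f"global-only accuracy: {acc:.3f}, average score: {score:.2f}")
```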
Thank you for your detailed analysis and for pointing this out! MovieChat-global might indeed not have handled all samples across the global and breakpoint tasks as expected. We'll double-check this in the evaluation process to ensure everything aligns correctly. Your suggestion to separate the dataset card for each task is very helpful, and we appreciate the time you took to test and share your results.
Thanks for your helpful comments. I ran into the same issue and solved most of the problems the same way you did.
I figured out that I had installed an editable version of LLaVA-NeXT, so llava resolved to that directory. Thanks again for sharing :)
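In case it helps others, a quick way to check which llava package Python actually resolves is a sketch like this (it just prints the package location, so you can see whether an editable LLaVA-NeXT install is shadowing the repo's llava folder):

```python
import llava

# If this prints a path inside your LLaVA-NeXT checkout rather than the
# repository you intended, the editable install is shadowing the llava
# folder you meant to use.
print(llava.__file__)
```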
I ran the code from the lmms-eval repository (https://github.com/EvolvingLMMs-Lab/lmms-eval) after making the following corrections:
1. Imported IGNORE_INDEX from llava.constants in llava_qwen.py.
2. Pasted the generate_moviechat() function from this repository into llava_qwen.py.
3. Modified LLAVA_NeXT.llava to llava in llava_onevision_moviechat.py.
4. Added the generate_until_multi_round() function in llava_onevision_moviechat.py to address the error indicating that this abstract method was not implemented. While I declared it, I did not actually use this function (a placeholder sketch is shown after this list).
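The placeholder in step 4 might look like the sketch below; the signature is an assumption based on how other lmms-eval model wrappers declare generate_until(), so adapt it to whatever the abstract base class in your lmms-eval version expects.

```python
from typing import List

from lmms_eval.api.instance import Instance


class LlavaOneVisionMovieChat:  # illustrative stand-in for the actual model class
    def generate_until_multi_round(self, requests: List[Instance]) -> List[str]:
        # Placeholder only: multi-round generation is not exercised by the
        # MovieChat tasks, so this simply signals that it is unsupported.
        raise NotImplementedError(
            "Multi-round generation is not implemented for llava_onevision_moviechat."
        )
```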
After these adjustments, I tested the llava_onevision_moviechat model using the following script:
However, the accuracy I achieved was only 39%, as shown below. Could you kindly help me identify what might have gone wrong?
{
"results": {
"moviechat_global": {
"alias": "moviechat_global",
"gpt_eval_score,none": 2.7583781547372777,
"gpt_eval_score_stderr,none": "N/A",
"gpt_eval_acc,none": 0.39098055440628876,
"gpt_eval_acc_stderr,none": "N/A"
}
},
"group_subtasks": {
"moviechat_global": []
},
"configs": {
"moviechat_global": {
"task": "moviechat_global",
"dataset_path": "Enxin/lmms_MovieChat_test",
"dataset_kwargs": {
"token": true
},
"test_split": "test",
"full_docs": false,
"process_results_use_image": false,
"doc_to_visual": "<function moviechat_doc_to_visual at 0x7f8ab8418af0>",
"doc_to_text": "<function moviechat_doc_to_text at 0x7f8ab8431670>",
"doc_to_target": "<function moviechat_doc_to_answer at 0x7f8ab8431f70>",
"process_results": "<function moviechat_process_results_generic at 0x7f8ab843baf0>",
"description": "",
"target_delimiter": " ",
"fewshot_delimiter": "\n\n",
"num_fewshot": 0,
"metric_list": [
{
"metric": "gpt_eval_score",
"aggregation": "<function moviechat_aggregate_score at 0x7f8ab8442430>",
"higher_is_better": true
},
{
"metric": "gpt_eval_acc",
"aggregation": "<function moviechat_aggregate_acc at 0x7f8ab8442d30>",
"higher_is_better": true
}
],
"output_type": "generate_until",
"generation_kwargs": {
"until": [
"\n\n"
],
"do_sample": false
},
"repeats": 1,
"should_decontaminate": false,
"metadata": {
"version": 0.0,
"gpt_eval_model_name": "gpt-3.5-turbo-0125"
},
"lmms_eval_specific_kwargs": {
"default": {
"pre_prompt": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.",
"post_prompt": ""
},
"pre_prompt": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.",
"post_prompt": ""
}
}
},
"versions": {
"moviechat_global": 0.0
},
"n-shot": {
"moviechat_global": 0
},
"higher_is_better": {
"moviechat_global": {
"gpt_eval_score": true,
"gpt_eval_acc": true
}
},
"n-samples": {
"moviechat_global": {
"original": 2417,
"effective": 2417
}
},
"config": {
"model": "llava_onevision_moviechat",
"model_args": "",
"batch_size": "1",
"batch_sizes": [],
"device": null,
"use_cache": null,
"limit": null,
"bootstrap_iters": 100000,
"gen_kwargs": "",
"random_seed": 0,
"numpy_seed": 1234,
"torch_seed": 1234,
"fewshot_seed": 1234
},
"git_hash": "d2056e6",
"date": "20241127_134436",
"task_hashes": {
"moviechat_global": "51d9d796ea5bc78838d989e9f7802a1cdf68efab207ef3ef66d59a9993836fef"
},
"model_source": "llava_onevision_moviechat",
"model_name": "",
"model_name_sanitized": "",
"system_instruction": null,
"system_instruction_sha": null,
"fewshot_as_multiturn": false,
"chat_template": null,
"chat_template_sha": null,
"start_time": 2483129.288830893,
"end_time": 2503653.673547,
"total_evaluation_time_seconds": "20524.384716107044"
}