llava_onevision_moviechat performance #86

Open
prote376 opened this issue Nov 27, 2024 · 10 comments

@prote376

I ran the code from the lmms-eval repository (https://github.com/EvolvingLMMs-Lab/lmms-eval) after making the following corrections:
1. Imported IGNORE_INDEX from llava.constants in llava_qwen.py.
2. Pasted the generate_moviechat() function from this repository into llava_qwen.py.
3. Changed LLAVA_NeXT.llava to llava in llava_onevision_moviechat.py.
4. Added a generate_until_multi_round() function to llava_onevision_moviechat.py to address the error indicating that this abstract method was not implemented. I only declared it (see the stub sketched after this list); it is never actually called.
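
For step 4, the stub can be minimal, e.g. the sketch below (an illustration only; the exact signature should mirror the generate_until_multi_round abstract method declared in lmms_eval.api.model for your lmms-eval version, and the body is never invoked by this task):

    # Added inside the model class in llava_onevision_moviechat.py.
    # Placeholder only: it satisfies the abstract-method check and is never
    # called by the moviechat_global task.
    def generate_until_multi_round(self, requests) -> list:
        raise NotImplementedError(
            "Multi-round generation is not used by llava_onevision_moviechat."
        )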

After these adjustments, I tested the llava_onevision_moviechat model using the following script:

python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava_onevision_moviechat \
    --tasks moviechat_global \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_onevision_moviechat \
    --output_path ./logs/

However, the accuracy I achieved was only 39%, as shown below. Could you kindly help me identify what might have gone wrong?

{
  "results": {
    "moviechat_global": {
      "alias": "moviechat_global",
      "gpt_eval_score,none": 2.7583781547372777,
      "gpt_eval_score_stderr,none": "N/A",
      "gpt_eval_acc,none": 0.39098055440628876,
      "gpt_eval_acc_stderr,none": "N/A"
    }
  },
  "group_subtasks": {
    "moviechat_global": []
  },
  "configs": {
    "moviechat_global": {
      "task": "moviechat_global",
      "dataset_path": "Enxin/lmms_MovieChat_test",
      "dataset_kwargs": {
        "token": true
      },
      "test_split": "test",
      "full_docs": false,
      "process_results_use_image": false,
      "doc_to_visual": "<function moviechat_doc_to_visual at 0x7f8ab8418af0>",
      "doc_to_text": "<function moviechat_doc_to_text at 0x7f8ab8431670>",
      "doc_to_target": "<function moviechat_doc_to_answer at 0x7f8ab8431f70>",
      "process_results": "<function moviechat_process_results_generic at 0x7f8ab843baf0>",
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "gpt_eval_score",
          "aggregation": "<function moviechat_aggregate_score at 0x7f8ab8442430>",
          "higher_is_better": true
        },
        {
          "metric": "gpt_eval_acc",
          "aggregation": "<function moviechat_aggregate_acc at 0x7f8ab8442d30>",
          "higher_is_better": true
        }
      ],
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "\n\n"
        ],
        "do_sample": false
      },
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0,
        "gpt_eval_model_name": "gpt-3.5-turbo-0125"
      },
      "lmms_eval_specific_kwargs": {
        "default": {
          "pre_prompt": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.",
          "post_prompt": ""
        },
        "pre_prompt": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.",
        "post_prompt": ""
      }
    }
  },
  "versions": {
    "moviechat_global": 0.0
  },
  "n-shot": {
    "moviechat_global": 0
  },
  "higher_is_better": {
    "moviechat_global": {
      "gpt_eval_score": true,
      "gpt_eval_acc": true
    }
  },
  "n-samples": {
    "moviechat_global": {
      "original": 2417,
      "effective": 2417
    }
  },
  "config": {
    "model": "llava_onevision_moviechat",
    "model_args": "",
    "batch_size": "1",
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": "",
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "d2056e6",
  "date": "20241127_134436",
  "task_hashes": {
    "moviechat_global": "51d9d796ea5bc78838d989e9f7802a1cdf68efab207ef3ef66d59a9993836fef"
  },
  "model_source": "llava_onevision_moviechat",
  "model_name": "",
  "model_name_sanitized": "",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 2483129.288830893,
  "end_time": 2503653.673547,
  "total_evaluation_time_seconds": "20524.384716107044"
}

@Espere-1119-Song
Collaborator

It seems that after we opened the PR adding llava_onevision_moviechat to lmms-eval, LLaVA-NeXT updated their code again. Did you try using the llava package from our repo?

Additionally, generate_until_multi_round() was also added after we submitted the MovieChat series PR to lmms-eval. We will update our code in the coming days.

@prote376
Author

Thank you for your prompt response!

I replaced the llava folder in LLaVA-NeXT with the llava folder from this repository and ran pip install -e . in the LLaVA-NeXT directory.

Unfortunately, I obtained a similar result (accuracy: 39.1%, score: 2.77).

To summarize, the only modification I made was changing LLAVA_NeXT.llava to llava in llava_onevision_moviechat.py.

Since I used the llava folder from this repository, there was no need to modify llava_qwen.py.

Are there any other issues with the modifications I made or the script I used?

python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava_onevision_moviechat \
    --tasks moviechat_global \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_onevision_moviechat \
    --output_path ./logs/

Alternatively, is it possible to run llava_onevision_moviechat using this repository as well?

@Espere-1119-Song
Collaborator

Can you provide some examples from your log? The file is probably named something like 'moviechat_global.json'.

@prote376
Author

prote376 commented Dec 1, 2024

Yes, these are the first three lines of "logs/date_time_samples_moviechat_global.jsonl":

{"doc_id": 0, "doc": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "time": 240, "pred": "boat"}, "target": "A boat.", "arguments": {"0": "do_sample", "1": "max_new_tokens", "2": "temperature", "3": "top_p", "4": "num_beams", "5": "stopping_criteria"}, "resps": [["boat"]], "filtered_resps": ["boat"], "doc_hash": "74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b", "prompt_hash": "64c2d8221c3a606afa1db32244f19680c722941f872fdbd720305dad38ce17e6", "target_hash": "3830d9d685549b63df5bea558b60a9730f1f7a1257474eeca2c9e23c3024f64a", "gpt_eval_score": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "pred": "boat", "score": 5, "review": "{'pred': 'yes', 'score': 5}"}, "gpt_eval_acc": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "pred": "boat", "acc": "yes", "review": "{'pred': 'yes', 'score': 5}"}, "input": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.What is the vehicle?"}
{"doc_id": 8, "doc": {"video_name": "1.mp4", "question": "Did a man in bathtub appear?", "answer": "Yes.", "time": 7608, "pred": "The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it."}, "target": "Yes.", "arguments": {"0": "do_sample", "1": "max_new_tokens", "2": "temperature", "3": "top_p", "4": "num_beams", "5": "stopping_criteria"}, "resps": [["The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it."]], "filtered_resps": ["The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it."], "doc_hash": "74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b", "prompt_hash": "24d209f5b3ea62f80b9be7a946db5441aa40843c1a54d3b16775cc7bc1adf0e3", "target_hash": "5f9a2b795615ba6a3d5455fd5624d773fbca5bcd16249c421fd37411dc9837da", "gpt_eval_score": {"video_name": "1.mp4", "question": "Did a man in bathtub appear?", "answer": "Yes.", "pred": "The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it.", "score": 2, "review": "{'pred': 'no', 'score': 2}"}, "gpt_eval_acc": {"video_name": "1.mp4", "question": "Did a man in bathtub appear?", "answer": "Yes.", "pred": "The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it.", "acc": "no", "review": "{'pred': 'no', 'score': 2}"}, "input": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.Did a man in bathtub appear?"}
{"doc_id": 16, "doc": {"video_name": "2.mp4", "question": "What is the person doing?", "answer": "Talking to a speaker.", "time": 5760, "pred": "The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot."}, "target": "Talking to a speaker.", "arguments": {"0": "do_sample", "1": "max_new_tokens", "2": "temperature", "3": "top_p", "4": "num_beams", "5": "stopping_criteria"}, "resps": [["The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot."]], "filtered_resps": ["The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot."], "doc_hash": "74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b", "prompt_hash": "f79410b5f7b504797eb21c1ee3cf9d454928388d23feeb613452b131fc3483c7", "target_hash": "9171d5b842f6e0a8dccd9c533129b15211e66ea2ffa69371fdcece97d6e803e3", "gpt_eval_score": {"video_name": "2.mp4", "question": "What is the person doing?", "answer": "Talking to a speaker.", "pred": "The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot.", "score": 2, "review": "{'pred': 'no', 'score': 2}"}, "gpt_eval_acc": {"video_name": "2.mp4", "question": "What is the person doing?", "answer": "Talking to a speaker.", "pred": "The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot.", "acc": "no", "review": "{'pred': 'no', 'score': 2}"}, "input": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.What is the person doing?"}

@prote376
Author

prote376 commented Dec 11, 2024

@Espere-1119-Song I tried to evaluate MovieChat global mode by setting --tasks moviechat_global. However, it seems that the dataset card (https://huggingface.co/datasets/Enxin/lmms_MovieChat_test) only covers the breakpoint mode.

@Espere-1119-Song
Collaborator

In fact, our repo includes both the global mode and the breakpoint mode.
[screenshot attachment]

@prote376
Author

@Espere-1119-Song

I'm sorry, I misunderstood because I had only checked the first few samples of the results.

Actually, I found that the moviechat_global task processed all samples from both the breakpoint and global tasks.

As you can see at the URL below, the dataset viewer treats the files as a single dataset.
It is likely the same in the evaluation process.
https://huggingface.co/datasets/Enxin/lmms_MovieChat_test/viewer/default/test?p=19

So I could get results for the global task alone by filtering on doc_id (>= 1,907), as sketched below.
The accuracy was 72.35%. It is still lower than 79%, but the gap is understandable given the characteristics of LLM-based evaluation.
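
For reference, this is roughly how I computed the global-only numbers from the per-sample log (a sketch only; the path is a placeholder for your own *_samples_moviechat_global.jsonl file, and the field names follow the log lines posted above):

    import json

    # Placeholder path: point this at your own date_time_samples_moviechat_global.jsonl.
    LOG_PATH = "logs/date_time_samples_moviechat_global.jsonl"

    kept = []
    with open(LOG_PATH) as f:
        for line in f:
            rec = json.loads(line)
            # doc_id >= 1907 corresponds to the global-mode questions in the merged dataset.
            if rec["doc_id"] >= 1907:
                kept.append(rec)

    hits = sum(1 for rec in kept if rec["gpt_eval_acc"]["acc"] == "yes")
    scores = [rec["gpt_eval_score"]["score"] for rec in kept]
    print(f"global-only samples: {len(kept)}")
    print(f"accuracy: {hits / len(kept):.4f}")
    print(f"mean score: {sum(scores) / len(scores):.4f}")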

In summary, the dataset card should be separated for each task.

Thank you.

@Espere-1119-Song
Collaborator

Thank you for your detailed analysis and for pointing this out! The moviechat_global task may indeed be mixing samples from the global and breakpoint sets rather than handling them separately as expected. We'll double-check the evaluation process to make sure everything aligns correctly. Your suggestion to separate the dataset card for each task is very helpful, and we appreciate the time you took to test and share your results.

@LiJiaqi96

> [quoting @prote376's original post above]

Thanks for your helpful comments. I ran into the same issue and solved most of the problems the same way you did.
One remaining question: where should the generate_moviechat() function be pasted? I moved the llava folder into lmms-eval as illustrated in https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/run_examples.md, so the llava_qwen.py under lmms-eval already has the generate_moviechat() function.
Could you please help me with this issue? Many thanks!

@LiJiaqi96

> [quoting @prote376's original post and my previous comment]

I figured out that I had installed an editable version of LLaVA-NeXT, so the llava import resolves to that directory (a quick check like the one below confirmed it). Thanks again for sharing :)
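
In case it helps anyone else, this is the check I mean (nothing lmms-eval-specific; it just prints where the imported package lives):

    # Prints the path of the llava package that Python actually imports,
    # e.g. the editable LLaVA-NeXT checkout vs. a copy inside lmms-eval.
    import llava
    print(llava.__file__)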
