llava_onevision_moviechat performance #86

Open
prote376 opened this issue Nov 27, 2024 · 10 comments

@prote376

I ran the code from the lmms-eval repository (https://github.com/EvolvingLMMs-Lab/lmms-eval) after making the following corrections:
1. Imported IGNORE_INDEX from llava.constants in llava_qwen.py.
2. Pasted the generate_moviechat() function from this repository into llava_qwen.py.
3. Changed LLAVA_NeXT.llava to llava in llava_onevision_moviechat.py.
4. Added a generate_until_multi_round() function to llava_onevision_moviechat.py to address the error indicating that this abstract method was not implemented. I only declared it (see the stub sketched after this list); it is never actually called.
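
For step 4, the stub can be minimal, e.g. the sketch below (an illustration only; the exact signature should mirror the generate_until_multi_round abstract method declared in lmms_eval.api.model for your lmms-eval version, and the body is never invoked by this task):

    # Added inside the model class in llava_onevision_moviechat.py.
    # Placeholder only: it satisfies the abstract-method check and is never
    # called by the moviechat_global task.
    def generate_until_multi_round(self, requests) -> list:
        raise NotImplementedError(
            "Multi-round generation is not used by llava_onevision_moviechat."
        )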

After these adjustments, I tested the llava_onevision_moviechat model using the following script:

python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava_onevision_moviechat \
    --tasks moviechat_global \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_onevision_moviechat \
    --output_path ./logs/

However, the accuracy I achieved was only 39%, as shown below. Could you kindly help me identify what might have gone wrong?

{
  "results": {
    "moviechat_global": {
      "alias": "moviechat_global",
      "gpt_eval_score,none": 2.7583781547372777,
      "gpt_eval_score_stderr,none": "N/A",
      "gpt_eval_acc,none": 0.39098055440628876,
      "gpt_eval_acc_stderr,none": "N/A"
    }
  },
  "group_subtasks": {
    "moviechat_global": []
  },
  "configs": {
    "moviechat_global": {
      "task": "moviechat_global",
      "dataset_path": "Enxin/lmms_MovieChat_test",
      "dataset_kwargs": {
        "token": true
      },
      "test_split": "test",
      "full_docs": false,
      "process_results_use_image": false,
      "doc_to_visual": "<function moviechat_doc_to_visual at 0x7f8ab8418af0>",
      "doc_to_text": "<function moviechat_doc_to_text at 0x7f8ab8431670>",
      "doc_to_target": "<function moviechat_doc_to_answer at 0x7f8ab8431f70>",
      "process_results": "<function moviechat_process_results_generic at 0x7f8ab843baf0>",
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "gpt_eval_score",
          "aggregation": "<function moviechat_aggregate_score at 0x7f8ab8442430>",
          "higher_is_better": true
        },
        {
          "metric": "gpt_eval_acc",
          "aggregation": "<function moviechat_aggregate_acc at 0x7f8ab8442d30>",
          "higher_is_better": true
        }
      ],
      "output_type": "generate_until",
      "generation_kwargs": {
        "until": [
          "\n\n"
        ],
        "do_sample": false
      },
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0,
        "gpt_eval_model_name": "gpt-3.5-turbo-0125"
      },
      "lmms_eval_specific_kwargs": {
        "default": {
          "pre_prompt": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.",
          "post_prompt": ""
        },
        "pre_prompt": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.",
        "post_prompt": ""
      }
    }
  },
  "versions": {
    "moviechat_global": 0.0
  },
  "n-shot": {
    "moviechat_global": 0
  },
  "higher_is_better": {
    "moviechat_global": {
      "gpt_eval_score": true,
      "gpt_eval_acc": true
    }
  },
  "n-samples": {
    "moviechat_global": {
      "original": 2417,
      "effective": 2417
    }
  },
  "config": {
    "model": "llava_onevision_moviechat",
    "model_args": "",
    "batch_size": "1",
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": "",
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "d2056e6",
  "date": "20241127_134436",
  "task_hashes": {
    "moviechat_global": "51d9d796ea5bc78838d989e9f7802a1cdf68efab207ef3ef66d59a9993836fef"
  },
  "model_source": "llava_onevision_moviechat",
  "model_name": "",
  "model_name_sanitized": "",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 2483129.288830893,
  "end_time": 2503653.673547,
  "total_evaluation_time_seconds": "20524.384716107044"
}

@Espere-1119-Song
Collaborator

It seems that after we opened the PR adding llava_onevision_moviechat to lmms-eval, LLaVA-NeXT updated their code again. Did you try using the llava package from our repo?

Additionally, generate_until_multi_round() was also added after we submitted the MovieChat series PR to lmms-eval. We will update our code in the coming days.

@prote376
Author

Thank you for your prompt response!

I replaced the llava folder in LLaVA-NeXT with the llava folder from this repository and ran pip install -e . in the LLaVA-NeXT directory.

Unfortunately, I obtained a similar result (accuracy: 39.1%, score: 2.77).

To summarize, the only modification I made was changing LLAVA_NeXT.llava to llava in llava_onevision_moviechat.py.

Since I used the llava folder from this repository, there was no need to modify llava_qwen.py.

Are there any other issues with the modifications I made or the script I used?

python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava_onevision_moviechat \
    --tasks moviechat_global \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_onevision_moviechat \
    --output_path ./logs/

Alternatively, is it possible to run llava_onevision_moviechat using this repository as well?

@Espere-1119-Song
Collaborator

Can you provide some examples from your log? The file is probably named something like 'moviechat_global.json'.

@prote376
Author

prote376 commented Dec 1, 2024

Yes, these are the first three lines of "logs/date_time_samples_moviechat_global.jsonl":

{"doc_id": 0, "doc": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "time": 240, "pred": "boat"}, "target": "A boat.", "arguments": {"0": "do_sample", "1": "max_new_tokens", "2": "temperature", "3": "top_p", "4": "num_beams", "5": "stopping_criteria"}, "resps": [["boat"]], "filtered_resps": ["boat"], "doc_hash": "74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b", "prompt_hash": "64c2d8221c3a606afa1db32244f19680c722941f872fdbd720305dad38ce17e6", "target_hash": "3830d9d685549b63df5bea558b60a9730f1f7a1257474eeca2c9e23c3024f64a", "gpt_eval_score": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "pred": "boat", "score": 5, "review": "{'pred': 'yes', 'score': 5}"}, "gpt_eval_acc": {"video_name": "1.mp4", "question": "What is the vehicle?", "answer": "A boat.", "pred": "boat", "acc": "yes", "review": "{'pred': 'yes', 'score': 5}"}, "input": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.What is the vehicle?"}
{"doc_id": 8, "doc": {"video_name": "1.mp4", "question": "Did a man in bathtub appear?", "answer": "Yes.", "time": 7608, "pred": "The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it."}, "target": "Yes.", "arguments": {"0": "do_sample", "1": "max_new_tokens", "2": "temperature", "3": "top_p", "4": "num_beams", "5": "stopping_criteria"}, "resps": [["The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it."]], "filtered_resps": ["The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it."], "doc_hash": "74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b", "prompt_hash": "24d209f5b3ea62f80b9be7a946db5441aa40843c1a54d3b16775cc7bc1adf0e3", "target_hash": "5f9a2b795615ba6a3d5455fd5624d773fbca5bcd16249c421fd37411dc9837da", "gpt_eval_score": {"video_name": "1.mp4", "question": "Did a man in bathtub appear?", "answer": "Yes.", "pred": "The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it.", "score": 2, "review": "{'pred': 'no', 'score': 2}"}, "gpt_eval_acc": {"video_name": "1.mp4", "question": "Did a man in bathtub appear?", "answer": "Yes.", "pred": "The image you've provided shows a sailboat on the water. There is no man in a bathtub visible in this image. The scene depicts a single sailboat with its sails partially unfurled, set against a backdrop of a clear sky with some clouds. The boat appears to be in motion, as suggested by the angle of the sails and the wake behind it.", "acc": "no", "review": "{'pred': 'no', 'score': 2}"}, "input": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.Did a man in bathtub appear?"}
{"doc_id": 16, "doc": {"video_name": "2.mp4", "question": "What is the person doing?", "answer": "Talking to a speaker.", "time": 5760, "pred": "The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot."}, "target": "Talking to a speaker.", "arguments": {"0": "do_sample", "1": "max_new_tokens", "2": "temperature", "3": "top_p", "4": "num_beams", "5": "stopping_criteria"}, "resps": [["The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot."]], "filtered_resps": ["The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot."], "doc_hash": "74234e98afe7498fb5daf1f36ac2d78acc339464f950703b8c019892f982b90b", "prompt_hash": "f79410b5f7b504797eb21c1ee3cf9d454928388d23feeb613452b131fc3483c7", "target_hash": "9171d5b842f6e0a8dccd9c533129b15211e66ea2ffa69371fdcece97d6e803e3", "gpt_eval_score": {"video_name": "2.mp4", "question": "What is the person doing?", "answer": "Talking to a speaker.", "pred": "The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot.", "score": 2, "review": "{'pred': 'no', 'score': 2}"}, "gpt_eval_acc": {"video_name": "2.mp4", "question": "What is the person doing?", "answer": "Talking to a speaker.", "pred": "The person in the image appears to be sitting inside a vehicle, likely a car. The individual is looking slightly to the side, and their expression seems contemplative or focused on something outside of the frame. The lighting suggests it might be nighttime or in a dimly lit area. The presence of what looks like a dashboard or part of a vehicle's interior in the background supports the idea that this is an interior shot.", "acc": "no", "review": "{'pred': 'no', 'score': 2}"}, "input": "You are able to understand the visual content that the user provides.Follow the instructions carefully and explain your answers in detail.What is the person doing?"}

@prote376
Author

prote376 commented Dec 11, 2024

@Espere-1119-Song I tried to evaluate MovieChat global mode by setting --tasks moviechat_global. However, it seems that the dataset card (https://huggingface.co/datasets/Enxin/lmms_MovieChat_test) only covers the breakpoint mode.

@Espere-1119-Song
Collaborator

In fact, our repo includes both the global mode and the breakpoint mode.
[screenshot attachment]

@prote376
Author

@Espere-1119-Song

I'm sorry, I misunderstood because I had only checked the first few samples of the results.

Actually, I found that the moviechat_global task processed all samples from both the breakpoint and global tasks.

As you can see at the URL below, the dataset viewer treats the files as a single dataset.
It is likely the same in the evaluation process.
https://huggingface.co/datasets/Enxin/lmms_MovieChat_test/viewer/default/test?p=19

So I could get results for the global task alone by filtering on doc_id (>= 1,907), as sketched below.
The accuracy was 72.35%. It is still lower than 79%, but the gap is understandable given the characteristics of LLM-based evaluation.
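
For reference, this is roughly how I computed the global-only numbers from the per-sample log (a sketch only; the path is a placeholder for your own *_samples_moviechat_global.jsonl file, and the field names follow the log lines posted above):

    import json

    # Placeholder path: point this at your own date_time_samples_moviechat_global.jsonl.
    LOG_PATH = "logs/date_time_samples_moviechat_global.jsonl"

    kept = []
    with open(LOG_PATH) as f:
        for line in f:
            rec = json.loads(line)
            # doc_id >= 1907 corresponds to the global-mode questions in the merged dataset.
            if rec["doc_id"] >= 1907:
                kept.append(rec)

    hits = sum(1 for rec in kept if rec["gpt_eval_acc"]["acc"] == "yes")
    scores = [rec["gpt_eval_score"]["score"] for rec in kept]
    print(f"global-only samples: {len(kept)}")
    print(f"accuracy: {hits / len(kept):.4f}")
    print(f"mean score: {sum(scores) / len(scores):.4f}")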

In summary, the dataset card should be separated for each task.

Thank you.

@Espere-1119-Song
Collaborator

Thank you for your detailed analysis and for pointing this out! The moviechat_global task may indeed be mixing samples from the global and breakpoint sets rather than handling them separately as expected. We'll double-check the evaluation process to make sure everything aligns correctly. Your suggestion to separate the dataset card for each task is very helpful, and we appreciate the time you took to test and share your results.

@LiJiaqi96

> [quoting @prote376's original post above]

Thanks for your helpful comments. I ran into the same issue and solved most of the problems the same way you did.
One remaining question: where should the generate_moviechat() function be pasted? I moved the llava folder into lmms-eval as illustrated in https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/run_examples.md, so the llava_qwen.py under lmms-eval already has the generate_moviechat() function.
Could you please help me with this issue? Many thanks!

@LiJiaqi96

> [quoting @prote376's original post and my previous comment]

I figured out that I had installed an editable version of LLaVA-NeXT, so the llava import resolves to that directory (a quick check like the one below confirmed it). Thanks again for sharing :)
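
In case it helps anyone else, this is the check I mean (nothing lmms-eval-specific; it just prints where the imported package lives):

    # Prints the path of the llava package that Python actually imports,
    # e.g. the editable LLaVA-NeXT checkout vs. a copy inside lmms-eval.
    import llava
    print(llava.__file__)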
