
Add MultiWOZ 2.4 DST evaluation with leave-one-out cross-validation support #18

Open

WeixuanZ wants to merge 11 commits into master
Conversation

@WeixuanZ commented Feb 15, 2023

We aim to conduct DST evaluation on the MultiWOZ 2.4 corpus. This PR shows our proposed extension to the existing code to achieve this.

@WeixuanZ closed this Feb 15, 2023
@WeixuanZ (Author) commented Feb 15, 2023

Oops, meant to create a PR in my fork. I'll reopen this PR once it is reviewed in my fork.

@WeixuanZ reopened this Feb 26, 2023
@WeixuanZ (Author) commented

@smartyfh I would greatly appreciate your comments on this!

@smartyfh commented

> @smartyfh I would greatly appreciate your comments on this!

Thank you, Weixuan. That would be great. Please let me know what I can do.

success (bool): Whether to include Inform & Success rates metrics.
richness (bool): Whether to include lexical richness metric.
dst (bool, optional): Whether to include DST metrics. Defaults to False.
enable_normalization (bool, optional): Whether to use slot name and value normalization. Defaults to True.


@WeixuanZ you should state what normalisation is applied here (e.g. "same normalisation as for the 2.2 version"), or something along these lines.
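A possible rewording of that docstring line, assuming the normalisation in question is the one already used for the MultiWOZ 2.2 setup (hypothetical wording, not the final text):

```python
enable_normalization (bool, optional): Whether to apply slot name and
    value normalization (the same normalization as used for MultiWOZ
    2.2). Defaults to True.
```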

@WeixuanZ (Author) commented

> Thank you, Weixuan. That would be great. Please let me know what I can do.

@smartyfh Thanks! It would be great if you could have a look at our dataset loading logic and let us know if there is anything that we may have missed or done differently. In particular, we would appreciate your insights on whether slots with the value none should be dropped (https://github.com/WeixuanZ/MultiWOZ_Evaluation/blob/55fb7c26a7b6ecc6d62b7a068c1f890bc9e3f2e4/mwzeval/utils.py#L214).

@smartyfh commented

> In particular, we would appreciate your insights on whether slots with the value none should be dropped.

I don't fully understand why the NONE value should be removed. When we evaluate DST performance, we should take all slots into account. It seems easier to keep all the slots and their values; if we remove the NONE values, we have to take care of extra post-processing when calculating the evaluation metrics.

@alexcoca commented

> I don't fully understand why the NONE value should be removed. [...]

Hi @smartyfh, thanks so much for engaging! To clarify, does the none value indicate that the user has not yet mentioned a slot value, or is it a special value indicating slot "deletion"? Our reason for removing it is that the authors of D3ST (https://arxiv.org/pdf/2201.08904.pdf) ignored it during pre-processing, so we ought to do the same during post-processing. To make the evaluator implementation-agnostic, should we add a flag that controls whether none is removed? That way, future users who did not pre-process their data to remove none slot values will be able to evaluate their models fairly too.

@smartyfh commented

> To clarify, does the none value indicate that the user has not yet mentioned a slot value, or is it a special value indicating slot "deletion"? [...]

Hi @alexcoca, my pleasure. NONE is not a special value: the value is NONE both when a slot has not been mentioned and when its value has been deleted. "Not Mentioned" is another value that is also used to indicate not-mentioned slots, so it is safe to change "not mentioned" to "none". Regarding the last question, adding a flag sounds like a good option. Cheers!
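A minimal sketch of the flag discussed above; the function name and signature are illustrative, not the PR's actual code. It folds "not mentioned" into "none" per @smartyfh's note and drops none-valued slots only when the caller asks for D3ST-style behaviour:

```python
# Hypothetical helper, not part of the PR: normalize a belief state and
# optionally drop "none"-valued slots (D3ST-style pre-processing).
def normalize_state(state: dict, drop_none: bool = False) -> dict:
    normalized = {}
    for slot, value in state.items():
        if value == "not mentioned":
            value = "none"  # safe per the discussion above
        if drop_none and value == "none":
            continue  # models trained on D3ST-style data never emit these
        normalized[slot] = value
    return normalized
```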

@WeixuanZ changed the title from "Add MultiWOZ 2.4 DST evaluation" to "Add MultiWOZ 2.4 DST evaluation with leave-one-out cross-validation support" Mar 3, 2023
@WeixuanZ (Author) commented Mar 3, 2023


Thanks so much, @smartyfh!

@@ -306,10 +306,10 @@ def block_domains(input_states: dict, reference_states: dict, blocked_domains: s
# drop the blocked slots from the reference state

Loop through the references first: we should predict for every turn in the test set, so input_states should contain all the dialogues. If something went wrong during parsing or prediction and the user is missing predictions for some dialogues and/or turns, the code should fail. As currently implemented, there will be a silent bug.
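A minimal sketch of the suggested iteration order, with illustrative names (pair_predictions is not in the PR): walk the reference states first and fail loudly when a prediction is missing, instead of silently skipping it.

```python
# Hypothetical helper: iterate references first so missing predictions
# raise an error rather than being silently ignored.
def pair_predictions(input_states: dict, reference_states: dict):
    for dialogue_id, turns_ref in reference_states.items():
        if dialogue_id not in input_states:
            raise KeyError(
                f"No predictions for dialogue {dialogue_id}; every "
                "dialogue in the test set must be predicted."
            )
        yield dialogue_id, input_states[dialogue_id], turns_ref
```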

@@ -306,10 +306,10 @@ def block_domains(input_states: dict, reference_states: dict, blocked_domains: s
# drop the blocked slots from the reference state
new_turn_ref = {}

Add an assertion to check that the turn and turn_ref lists have the same length.
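Continuing the sketch above, the check could look like this (names remain illustrative):

```python
# Guard against misaligned turn lists before comparing them pairwise.
for dialogue_id, turns_pred, turns_ref in pair_predictions(
    input_states, reference_states
):
    assert len(turns_pred) == len(turns_ref), (
        f"Dialogue {dialogue_id}: {len(turns_pred)} predicted turns "
        f"but {len(turns_ref)} reference turns"
    )
```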

@@ -293,11 +293,11 @@ def get_dst(input_data, reference_states, include_loocv_metrics=False, fuzzy_rat
"""
DOMAINS = {"hotel", "train", "restaurant", "attraction", "taxi"}


We can make some API improvements. Instead of include_loocv_metrics, which most users won't understand, we can have left_out_domain=None, in which case we return:

  • the joint goal accuracy with respect to all domains: a turn is marked as correct if the states of all domains are predicted correctly. This would be called test-jga.
  • the JGA with respect to each individual domain, under the key [domain_name]_jga. A turn is marked correct if the states of the given domain are all correct; errors in predicting states of other domains are ignored.
  • the joint accuracy of each of the four-domain combinations, to facilitate comparison with the leave-one-out setting. Here we simply ignore the predictions for the left-out domain and the turns where only the left-out domain appears. We would name these fields except_{domain}_jga.

left_out_domain should be a string that the user can set to one of the 5 domains in DOMAINS, and we should assert that the input is valid at the very beginning. The keys reported should be:

  • test-jga: the joint accuracy with respect to all domains. This number should be directly comparable with the setting where left_out_domain=None.
  • [domain_name]_jga: as before, the joint accuracy of each individual domain. These numbers should be comparable with their equivalents when left_out_domain=None.
  • except_{left_out_domain}_jga: the joint accuracy with respect to all the domains seen in training. If we also report it when left_out_domain=None, the user can see whether the left-out domain helped improve performance on the other domains.

I think this is largely what the current output evaluation returns, but we should document the above very clearly in the docstring, both to make sure the implementation is correct and so that reviewers of the PR who are MultiWOZ experts can validate our approach in full knowledge of our logic.
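A runnable sketch of the proposed interface and metric keys, assuming per-turn states are flat {"domain-slot": value} dicts; the names joint_goal_accuracy and get_dst_metrics are illustrative, and the sketch glosses over details such as skipping turns in which only the left-out domain appears:

```python
DOMAINS = {"hotel", "train", "restaurant", "attraction", "taxi"}

def joint_goal_accuracy(preds, refs, domains):
    """JGA restricted to `domains`: a turn counts as correct only if
    every slot belonging to the selected domains matches the reference."""
    def restrict(state):
        return {s: v for s, v in state.items() if s.split("-")[0] in domains}
    correct = sum(restrict(p) == restrict(r) for p, r in zip(preds, refs))
    return correct / len(refs) if refs else 0.0

def get_dst_metrics(preds, refs, left_out_domain=None):
    if left_out_domain is not None:
        assert left_out_domain in DOMAINS, f"unknown domain {left_out_domain!r}"
    results = {"test-jga": joint_goal_accuracy(preds, refs, DOMAINS)}
    for domain in DOMAINS:
        results[f"{domain}_jga"] = joint_goal_accuracy(preds, refs, {domain})
    # except_{domain}_jga: one key when a domain is left out, otherwise
    # all five four-domain combinations for leave-one-out comparisons.
    left_out = {left_out_domain} if left_out_domain else DOMAINS
    for domain in left_out:
        results[f"except_{domain}_jga"] = joint_goal_accuracy(
            preds, refs, DOMAINS - {domain}
        )
    return results
```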

@kingb12 commented Oct 2, 2023

Hi all!

I was interested in using the MultiWOZ 2.4 evaluator implemented here. For my understanding, is the implementation in this PR complete and correct, with any remaining changes being API and documentation improvements? If so, I may try to use it, and could possibly help resolve the remaining issues if the PR is no longer active.

Thanks all for building such a useful tool! The discussion here has also been helpful for my understanding of the evaluation process.

@alexcoca commented Oct 3, 2023

@kingb12 This sounds like a great idea. Let's take a fresh look at the work and maybe ask one or two additional MultiWOZ experts to validate it, to be extra sure the evaluator is correct. @Tomiinek, apart from yourself, who do you think would be suited to sign off on this evaluation PR?

@Tomiinek (Owner) commented Oct 6, 2023

Hey guys,

I don't think I have enough capacity to review this meaningfully and thoroughly (it has been a long time since I did anything related to dialogue). If you are going to test it and finish the remaining bits, I would be more than happy to merge it; just ping me.

Maybe @tuetschek, @vojtsek or @oplatek could chime in.

@kingb12 commented Oct 6, 2023

Thanks all! I can work on the remaining feedback from the open PR, testing, etc., unless someone else would prefer to. I'm also working on a few other things, so it may take me a week or so, but I wanted to gauge whether this would be helpful. I appreciate the responses and comments!
