
Add MultiWOZ 2.4 DST evaluation with leave-one-out cross-validation support #18

Open

WeixuanZ wants to merge 11 commits into master
Conversation

@WeixuanZ commented Feb 15, 2023

We aim to conduct DST evaluation on the MultiWOZ 2.4 corpus. This PR shows our proposed extension to the existing code to achieve this.

@WeixuanZ closed this Feb 15, 2023
@WeixuanZ (Author) commented Feb 15, 2023

Oops, meant to create a PR in my fork. I'll reopen this PR once it is reviewed in my fork.

@WeixuanZ reopened this Feb 26, 2023
@WeixuanZ (Author) commented

@smartyfh I would greatly appreciate your comments on this!

@smartyfh commented

> @smartyfh I would greatly appreciate your comments on this!

Thank you, Weixuan. That would be great. Please let me know what I can do.

success (bool): Whether to include Inform & Success rates metrics.
richness (bool): Whether to include lexical richness metric.
dst (bool, optional): Whether to include DST metrics. Defaults to False.
enable_normalization (bool, optional): Whether to use slot name and value normalization. Defaults to True.


@WeixuanZ you should state what normalisation is applied here (e.g. "same normalisation as for the 2.2 version"), or something along these lines.
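A possible rewording of that docstring line, assuming the normalisation in question is the one already used for the MultiWOZ 2.2 setup (hypothetical wording, not the final text):

```python
enable_normalization (bool, optional): Whether to apply slot name and
    value normalization (the same normalization as used for MultiWOZ
    2.2). Defaults to True.
```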

@WeixuanZ (Author) commented

> Thank you, Weixuan. That would be great. Please let me know what I can do.

@smartyfh Thanks! It would be great if you could have a look at our dataset loading logic and let us know if there is anything that we may have missed or done differently. In particular, we would appreciate your insights on whether slots with the value none should be dropped (https://github.com/WeixuanZ/MultiWOZ_Evaluation/blob/55fb7c26a7b6ecc6d62b7a068c1f890bc9e3f2e4/mwzeval/utils.py#L214).

@smartyfh commented

> In particular, we would appreciate your insights on whether slots with the value none should be dropped.

I don't fully understand why the NONE value should be removed. When we evaluate DST performance, we should take all slots into account. It seems easier to keep all the slots and their values; if we remove the NONE values, we have to take care of extra post-processing when calculating the evaluation metrics.

@alexcoca commented

> I don't fully understand why the NONE value should be removed. [...]

Hi @smartyfh, thanks so much for engaging! To clarify, does the none value indicate that the user has not yet mentioned a slot value, or is it a special value indicating slot "deletion"? Our reason for removing it is that the authors of D3ST (https://arxiv.org/pdf/2201.08904.pdf) ignored it during pre-processing, so we ought to do the same during post-processing. To make the evaluator implementation-agnostic, should we add a flag that controls whether none is removed? That way, future users who did not pre-process their data to remove none slot values will be able to evaluate their models fairly too.

@smartyfh commented

> To clarify, does the none value indicate that the user has not yet mentioned a slot value, or is it a special value indicating slot "deletion"? [...]

Hi @alexcoca, my pleasure. NONE is not a special value: the value is NONE both when a slot has not been mentioned and when its value has been deleted. "Not Mentioned" is another value that is also used to indicate not-mentioned slots, so it is safe to change "not mentioned" to "none". Regarding the last question, adding a flag sounds like a good option. Cheers!
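A minimal sketch of the flag discussed above; the function name and signature are illustrative, not the PR's actual code. It folds "not mentioned" into "none" per @smartyfh's note and drops none-valued slots only when the caller asks for D3ST-style behaviour:

```python
# Hypothetical helper, not part of the PR: normalize a belief state and
# optionally drop "none"-valued slots (D3ST-style pre-processing).
def normalize_state(state: dict, drop_none: bool = False) -> dict:
    normalized = {}
    for slot, value in state.items():
        if value == "not mentioned":
            value = "none"  # safe per the discussion above
        if drop_none and value == "none":
            continue  # models trained on D3ST-style data never emit these
        normalized[slot] = value
    return normalized
```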

@WeixuanZ changed the title from "Add MultiWOZ 2.4 DST evaluation" to "Add MultiWOZ 2.4 DST evaluation with leave-one-out cross-validation support" Mar 3, 2023
@WeixuanZ (Author) commented Mar 3, 2023


Thanks so much, @smartyfh!

@@ -306,10 +306,10 @@ def block_domains(input_states: dict, reference_states: dict, blocked_domains: s
# drop the blocked slots from the reference state

Loop through the references first: we should predict for every turn in the test set, so input_states should contain all the dialogues. If something went wrong during parsing or prediction and the user is missing predictions for some dialogues and/or turns, the code should fail. As currently implemented, there will be a silent bug.
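A minimal sketch of the suggested iteration order, with illustrative names (pair_predictions is not in the PR): walk the reference states first and fail loudly when a prediction is missing, instead of silently skipping it.

```python
# Hypothetical helper: iterate references first so missing predictions
# raise an error rather than being silently ignored.
def pair_predictions(input_states: dict, reference_states: dict):
    for dialogue_id, turns_ref in reference_states.items():
        if dialogue_id not in input_states:
            raise KeyError(
                f"No predictions for dialogue {dialogue_id}; every "
                "dialogue in the test set must be predicted."
            )
        yield dialogue_id, input_states[dialogue_id], turns_ref
```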

@@ -306,10 +306,10 @@ def block_domains(input_states: dict, reference_states: dict, blocked_domains: s
# drop the blocked slots from the reference state
new_turn_ref = {}

Add an assertion to check that the turn and turn_ref lists have the same length.
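Continuing the sketch above, the check could look like this (names remain illustrative):

```python
# Guard against misaligned turn lists before comparing them pairwise.
for dialogue_id, turns_pred, turns_ref in pair_predictions(
    input_states, reference_states
):
    assert len(turns_pred) == len(turns_ref), (
        f"Dialogue {dialogue_id}: {len(turns_pred)} predicted turns "
        f"but {len(turns_ref)} reference turns"
    )
```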

@@ -293,11 +293,11 @@ def get_dst(input_data, reference_states, include_loocv_metrics=False, fuzzy_rat
"""
DOMAINS = {"hotel", "train", "restaurant", "attraction", "taxi"}


We can make some API improvements. Instead of include_loocv_metrics, which most users won't understand, we can have left_out_domain=None, in which case we return:

  • the joint goal accuracy with respect to all domains: a turn is marked as correct if the states of all domains are predicted correctly. This would be called test-jga.
  • the JGA with respect to each individual domain, under the key [domain_name]_jga. A turn is marked correct if the states of the given domain are all correct; errors in predicting states of other domains are ignored.
  • the joint accuracy of each of the four-domain combinations, to facilitate comparison with the leave-one-out setting. Here we simply ignore the predictions for the left-out domain and the turns where only the left-out domain appears. We would name these fields except_{domain}_jga.

left_out_domain should be a string that the user can set to one of the 5 domains in DOMAINS, and we should assert that the input is valid at the very beginning. The keys reported should be:

  • test-jga: the joint accuracy with respect to all domains. This number should be directly comparable with the setting where left_out_domain=None.
  • [domain_name]_jga: as before, the joint accuracy of each individual domain. These numbers should be comparable with their equivalents when left_out_domain=None.
  • except_{left_out_domain}_jga: the joint accuracy with respect to all the domains seen in training. If we also report it when left_out_domain=None, the user can see whether the left-out domain helped improve performance on the other domains.

I think this is largely what the current output evaluation returns, but we should document the above very clearly in the docstring, both to make sure the implementation is correct and so that reviewers of the PR who are MultiWOZ experts can validate our approach in full knowledge of our logic.
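A runnable sketch of the proposed interface and metric keys, assuming per-turn states are flat {"domain-slot": value} dicts; the names joint_goal_accuracy and get_dst_metrics are illustrative, and the sketch glosses over details such as skipping turns in which only the left-out domain appears:

```python
DOMAINS = {"hotel", "train", "restaurant", "attraction", "taxi"}

def joint_goal_accuracy(preds, refs, domains):
    """JGA restricted to `domains`: a turn counts as correct only if
    every slot belonging to the selected domains matches the reference."""
    def restrict(state):
        return {s: v for s, v in state.items() if s.split("-")[0] in domains}
    correct = sum(restrict(p) == restrict(r) for p, r in zip(preds, refs))
    return correct / len(refs) if refs else 0.0

def get_dst_metrics(preds, refs, left_out_domain=None):
    if left_out_domain is not None:
        assert left_out_domain in DOMAINS, f"unknown domain {left_out_domain!r}"
    results = {"test-jga": joint_goal_accuracy(preds, refs, DOMAINS)}
    for domain in DOMAINS:
        results[f"{domain}_jga"] = joint_goal_accuracy(preds, refs, {domain})
    # except_{domain}_jga: one key when a domain is left out, otherwise
    # all five four-domain combinations for leave-one-out comparisons.
    left_out = {left_out_domain} if left_out_domain else DOMAINS
    for domain in left_out:
        results[f"except_{domain}_jga"] = joint_goal_accuracy(
            preds, refs, DOMAINS - {domain}
        )
    return results
```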

@kingb12 commented Oct 2, 2023

Hi all!

I was interested in using the MultiWOZ 2.4 evaluator implemented here. For my understanding, is the implementation in this PR complete and correct, with any remaining changes being API and documentation improvements? If so, I may try to use it, and could possibly help resolve the remaining issues if the PR is no longer active.

Thanks all for building such a useful tool! The discussion here has also been helpful for my understanding of the evaluation process.

@alexcoca commented Oct 3, 2023

@kingb12 This sounds like a great idea. Let's take a fresh look at the work and maybe ask one or two additional MultiWOZ experts to validate it, to be extra sure the evaluator is correct. @Tomiinek, apart from yourself, who do you think would be suited to sign off on this evaluation PR?

@Tomiinek (Owner) commented Oct 6, 2023

Hey guys,

I don't think I have enough capacity to review this meaningfully and thoroughly (it has been a long time since I did anything related to dialogue). If you are going to test it and finish the remaining bits, I would be more than happy to merge it; just ping me.

Maybe @tuetschek, @vojtsek or @oplatek could chime in.

@kingb12 commented Oct 6, 2023

Thanks all! I can work on the remaining feedback from the open PR, testing, etc., unless someone else would prefer to. I'm also working on a few other things, so it may take me a week or so, but I wanted to gauge whether this would be helpful. I appreciate the responses and comments!
