Add MultiWOZ 2.4 DST evaluation with leave-one-out cross-validation support #18
base: master
Conversation
Oops, meant to create a PR in my fork. I'll reopen this PR once it is reviewed in my fork.
@smartyfh I would greatly appreciate your comments on this!
Thank you, Weixuan. That would be great. Please let me know what I can do.
success (bool): Whether to include Inform & Success rate metrics.
richness (bool): Whether to include lexical richness metrics.
dst (bool, optional): Whether to include DST metrics. Defaults to False.
enable_normalization (bool, optional): Whether to use slot name and value normalization. Defaults to True.
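For context, a sketch of how these flags might be passed to the evaluator; the `Evaluator` entry point and module path are assumptions based on the upstream MultiWOZ evaluation package, not confirmed by this diff:

```python
from mwzeval.metrics import Evaluator  # assumed module path

# Hypothetical invocation enabling the new DST metrics proposed in this PR.
evaluator = Evaluator(
    bleu=True,
    success=True,
    richness=True,
    dst=True,                   # new flag from this PR
    enable_normalization=True,  # new flag from this PR
)
```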
@WeixuanZ you should state what normalisation is applied here (e.g. "same normalisation as per the 2.2 version"), or something along these lines.
@smartyfh Thanks! It would be great if you could have a look at our dataset loading logic and let us know if there is anything that we may have missed/done differently. In particular, your insights on whether slots with value NONE should be kept or removed would be especially appreciated.
I don't fully understand why the NONE value should be removed. When we evaluate the performance of DST, we should take all slots into account. It seems to be easier to keep all the slots and their values. If we remove the NONE values, we need to take care of post-processing when calculating evaluation metrics.
Hi @smartyfh, thanks so much for engaging! To clarify, does the NONE value carry any special meaning, or is it equivalent to "Not Mentioned"? And would adding a flag to control whether NONE slots are kept be a reasonable option?
Hi @alexcoca, my pleasure. NONE is not a special value. When either a slot is not mentioned or its value has been deleted, the value is NONE. "Not Mentioned" is another value that is also used to indicate "not mentioned" slots, so it is safe to change "not mentioned" to "none". Regarding the last question, it sounds like a good option to add a flag. Cheers!
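A minimal sketch of how this advice could translate into the loading logic; the function name and the `keep_none_slots` flag are illustrative, not part of the PR:

```python
def normalize_state(state: dict, keep_none_slots: bool = True) -> dict:
    """Collapse 'not mentioned' into 'none' and optionally drop 'none' slots.

    Per the discussion above, NONE and 'Not Mentioned' both indicate
    unmentioned slots, so they can safely be merged into a single value.
    """
    normalized = {}
    for slot, value in state.items():
        if value.lower() == "not mentioned":
            value = "none"
        if value == "none" and not keep_none_slots:
            continue  # drop unmentioned slots when the flag is disabled
        normalized[slot] = value
    return normalized
```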
Thanks so much @smartyfh!
@@ -306,10 +306,10 @@ def block_domains(input_states: dict, reference_states: dict, blocked_domains: s
# drop the blocked slots from the reference state |
Loop through the references first: we should predict for every turn in the test set, so `input_states` should contain all the dialogues. If something went wrong during parsing or prediction and the user is missing predictions for some dialogues and/or turns, the code should fail. As currently implemented, there will be a silent bug.
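A sketch of the suggested iteration order, under the assumption that both mappings are keyed by dialogue ID; the helper name is hypothetical:

```python
def check_predictions_complete(input_states: dict, reference_states: dict) -> None:
    """Fail loudly if any reference dialogue or turn lacks a prediction."""
    for dialogue_id, ref_turns in reference_states.items():
        if dialogue_id not in input_states:
            raise KeyError(f"Missing predictions for dialogue {dialogue_id!r}")
        pred_turns = input_states[dialogue_id]
        if len(pred_turns) != len(ref_turns):
            raise ValueError(
                f"Dialogue {dialogue_id!r}: {len(pred_turns)} predicted turns "
                f"vs {len(ref_turns)} reference turns"
            )
```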
@@ -306,10 +306,10 @@ def block_domains(input_states: dict, reference_states: dict, blocked_domains: s
# drop the blocked slots from the reference state
new_turn_ref = {}
Add an assertion to check that the `turn` and `turn_ref` lists have the same length.
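For instance (a sketch; the variable names are taken from the review comment, and the surrounding loop is assumed):

```python
assert len(turn) == len(turn_ref), (
    f"Predicted and reference turn lists differ in length: "
    f"{len(turn)} vs {len(turn_ref)}"
)
```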
@@ -293,11 +293,11 @@ def get_dst(input_data, reference_states, include_loocv_metrics=False, fuzzy_rat
"""
DOMAINS = {"hotel", "train", "restaurant", "attraction", "taxi"}
We can make some API improvements. Instead of `include_loocv_metrics`, which most users won't understand, we can have `left_out_domain=None`, in which case we can return:

- the joint goal accuracy with respect to all domains; a turn is marked as correct if the states of all domains are predicted correctly. This would be called `test-jga`.
- the JGA with respect to each individual domain, under the key `[domain_name]_jga`. A turn is marked correct if the states from a given domain are all correct; errors in predicting states in other domains are ignored.
- the joint accuracy of all four-domain combinations, to facilitate comparisons with the leave-one-out setting. Here we just ignore the predictions from the left-out domain and the turns where only the left-out domain appears. We would name these fields `except_{domain}_jga`.

`left_out_domain` should be a string that the user can set to one of the five domains in `DOMAINS`, and we should assert that the input is correct at the very beginning. The keys reported should be:

- `test-jga`, the joint accuracy computed with respect to all domains. This number should be directly comparable with the setting when `left_out_domain=None`.
- `[domain_name]_jga`; as before, this is the joint accuracy of each individual domain. The numbers should be comparable with their equivalents when `left_out_domain=None`.
- `except_{left_out_domain}_jga`, the joint accuracy with respect to all the domains seen in training. If we also report it when `left_out_domain=None`, the user sees whether the left-out domain "helped" improve performance in the other domains or not.
I think this is largely what the current output evaluation returns, but we should document the above very carefully in the docstring, both so that we can make sure the implementation is correct and so that reviewers of the PR who are MultiWOZ experts can validate our approach in full knowledge of our logic.
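A hedged sketch of how this proposal could look, assuming per-turn states are stored as `{domain: {slot: value}}` dictionaries aligned between predictions and references; the function and helper names are illustrative, and only the metric keys come from the comment above:

```python
from typing import Optional

DOMAINS = {"hotel", "train", "restaurant", "attraction", "taxi"}


def _joint_accuracy(pairs: list) -> float:
    """Fraction of (prediction, reference) pairs that match exactly."""
    return sum(p == r for p, r in pairs) / len(pairs) if pairs else 0.0


def get_dst_metrics(predictions: list, references: list,
                    left_out_domain: Optional[str] = None) -> dict:
    """Compute the JGA variants described above over aligned per-turn states."""
    if left_out_domain is not None:
        assert left_out_domain in DOMAINS, f"Unknown domain: {left_out_domain!r}"

    metrics = {}

    # test-jga: a turn counts as correct only if every domain's state matches.
    metrics["test-jga"] = _joint_accuracy(list(zip(predictions, references)))

    # [domain_name]_jga: per-domain accuracy, ignoring errors in other domains.
    for domain in DOMAINS:
        pairs = [
            (pred.get(domain, {}), ref.get(domain, {}))
            for pred, ref in zip(predictions, references)
            if domain in ref
        ]
        metrics[f"{domain}_jga"] = _joint_accuracy(pairs)

    # except_{domain}_jga: drop one domain from both states and skip turns in
    # which only that domain appears, for comparison with leave-one-out runs.
    excluded = DOMAINS if left_out_domain is None else {left_out_domain}
    for domain in excluded:
        pairs = [
            (
                {d: s for d, s in pred.items() if d != domain},
                {d: s for d, s in ref.items() if d != domain},
            )
            for pred, ref in zip(predictions, references)
            if set(ref) - {domain}
        ]
        metrics[f"except_{domain}_jga"] = _joint_accuracy(pairs)

    return metrics
```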
Hi all! I was interested in using the MultiWOZ 2.4 evaluator implemented here. For my understanding, is the implementation in this PR complete and correct, with the remaining changes being API and documentation improvements? If so, I may try to use it, and could possibly even help resolve the remaining issues if the PR is no longer active. Thanks all for building such a useful tool! The discussion here has also been helpful for my understanding of the evaluation process.
Hey guys, I think I do not have enough capacity to meaningfully and thoroughly review this (it has been a long time since I did anything related to dialogues). If you are going to test it and finish the remaining bits, I would be more than happy to merge it ... just ping me. Maybe @tuetschek, @vojtsek or @oplatek could chime in.
Thanks all! I can work on the remaining feedback from the open PR, testing, etc., unless someone else would prefer to. I'm also working on a few other things, so it may take a week or so, but I wanted to gauge whether this would be helpful. Appreciate the response and comments!
We aim to conduct DST evaluation on the MultiWOZ 2.4 corpus. This PR shows our proposed extension to the existing code to achieve this.