Add MultiWOZ 2.4 DST evaluation with leave-one-out cross-validation support #18
base: master
We can make some API improvements. Instead of `include_loocv_metrics`, which most users won't understand, we can have `left_out_domain=None`, in which case we can return:

- `test-jga`.
- `[domain_name]_jga`. A turn is marked correct if the states from a given domain are all correct. Errors in predicting states in other domains are ignored.
- `except_{domain}_jga`.

`left_out_domain` should be a string that the user can set to one of the 5 domains in `DOMAINS`, and we should assert the input is correct at the very beginning. The keys reported should be:

- `test-jga`, where the joint accuracy with respect to all domains is computed. This number should be directly comparable with the setting when `left_out_domain=None`.
- `[domain_name]_jga` - as before, this is the joint accuracy of each individual domain. The numbers should be comparable with the equivalents when `left_out_domain=None`.
- `except_{left_out_domain}_jga` - this is the joint accuracy with respect to all the domains seen in training. If we also report it when `left_out_domain=None`, then the user sees whether the left-out domain "helped" improve performance in the other domains or not.

I think this is largely what the current evaluation output returns, but we should document it carefully and clearly in the docstring, both to make sure the implementation is correct and so that reviewers of the PR who are MultiWOZ experts can validate our approach in full knowledge of our logic.
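To make the proposal concrete, here is a minimal sketch of how the reported keys could be computed. It is not the PR's implementation: the function name `compute_jga`, the flat list-of-turns input, the `domain -> {slot: value}` state layout, and the contents of `DOMAINS` are all assumptions for illustration. It also assumes that with `left_out_domain=None` an `except_{domain}_jga` key is reported for every domain, which is one reading of the comment above.

```python
from typing import Dict, List, Optional

# Assumption: the five domains usually evaluated for MultiWOZ DST.
# The DOMAINS constant in the PR may be defined differently.
DOMAINS = ["attraction", "hotel", "restaurant", "taxi", "train"]

# Hypothetical state layout: domain -> {slot: value}
TurnState = Dict[str, Dict[str, str]]


def compute_jga(
    predictions: List[TurnState],
    references: List[TurnState],
    left_out_domain: Optional[str] = None,
) -> Dict[str, float]:
    """Joint goal accuracy over all domains, per domain, and excluding a domain."""
    # Validate the input at the very beginning, as the review asks.
    if left_out_domain is not None and left_out_domain not in DOMAINS:
        raise ValueError(f"left_out_domain must be one of {DOMAINS}, got {left_out_domain!r}")
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equally long")

    def jga(domains: List[str]) -> float:
        # A turn is correct if every listed domain matches exactly;
        # errors in domains outside the list are ignored.
        correct = sum(
            all(pred.get(d, {}) == ref.get(d, {}) for d in domains)
            for pred, ref in zip(predictions, references)
        )
        return correct / len(references)

    metrics = {"test-jga": jga(DOMAINS)}
    for domain in DOMAINS:
        metrics[f"{domain}_jga"] = jga([domain])

    # With no left-out domain, report the "except" score for every domain so it
    # can be compared against a leave-one-out run later.
    excluded = DOMAINS if left_out_domain is None else [left_out_domain]
    for domain in excluded:
        metrics[f"except_{domain}_jga"] = jga([d for d in DOMAINS if d != domain])
    return metrics
```

The sketch raises `ValueError` instead of using a bare `assert` so that the check still runs under `python -O`; either choice satisfies the request to validate the argument up front.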
Loop through the references first: we should predict for every turn in the test set, so `input_states` should contain all the dialogues. If something went wrong during parsing or prediction and the user is missing predictions for some dialogues and/or turns, the code should fail. As currently implemented, this is a silent bug.
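A rough sketch of reference-driven iteration that fails loudly is shown below. The helper name `align_dialogues`, the `reference_states` argument, and the `dialogue_id -> list of turns` layout are hypothetical; only `input_states` comes from the review.

```python
from typing import Any, Dict, List, Tuple


def align_dialogues(
    reference_states: Dict[str, List[Any]],
    input_states: Dict[str, List[Any]],
) -> List[Tuple[str, List[Any], List[Any]]]:
    """Pair predicted and reference turns, driven by the reference dialogues."""
    # Iterating over the references (not the predictions) means a dialogue
    # that was never predicted cannot be skipped silently.
    missing = [dial_id for dial_id in reference_states if dial_id not in input_states]
    if missing:
        raise KeyError(
            f"no predictions for {len(missing)} reference dialogue(s), e.g. {missing[:3]}"
        )
    return [
        (dial_id, input_states[dial_id], ref_turns)
        for dial_id, ref_turns in reference_states.items()
    ]
```

This only covers missing dialogues; missing turns within a dialogue need a separate length check on each pair.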
Add an assertion to check that the `turn` and `turn_ref` lists have the same length.
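One possible shape for that check is sketched here. The helper name `check_same_length` and the `dial_id` argument are hypothetical, and the sketch assumes `turn` and `turn_ref` are plain Python lists.

```python
from typing import Any, List


def check_same_length(dial_id: str, turn: List[Any], turn_ref: List[Any]) -> None:
    """Fail if the predicted and reference lists differ in length."""
    if len(turn) != len(turn_ref):
        # Raised explicitly so the check is not stripped under `python -O`.
        raise AssertionError(
            f"dialogue {dial_id!r}: got {len(turn)} predicted entries "
            f"but {len(turn_ref)} reference entries"
        )
```

Including the dialogue ID and both lengths in the message makes it easier to trace a parsing or prediction failure back to its source.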