
Mixing inputs that have/don't have upos, xpos, feats #1306

Closed

Conversation

@Jemoka (Member) commented Nov 6, 2023

Description

Some languages, like German, run out of memory when training with the new PyTorch Dataset scheme: the overhead of loading multiple datasets into separate DataLoaders and then mixing them didn't work well. We did that because some entire input files don't have upos/xpos/ufeats, and we don't want to calculate loss for those columns.
Instead, this PR creates a _ShadowDataset object that wraps the datasets and masks out the loss (by turning the upos/xpos/etc. into padding tokens at batch time) for exactly those sentences which came from datasets that don't have upos/xpos/ufeats.
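The batch-time masking the description refers to can be sketched as follows. This is a minimal illustration of the idea, not stanza's actual code; `PAD_ID = 0` and the helper name are assumptions.

```python
import torch

PAD_ID = 0  # assumption: the index the tag vocab reserves for padding

def mask_missing_tags(tags, has_tags):
    """Replace a tag column with PAD_ID for sentences whose source dataset
    never had that column (hypothetical helper illustrating the PR's idea).

    tags:     LongTensor [batch, seq_len] of upos/xpos/ufeats indices
    has_tags: BoolTensor [batch], True if the sentence's dataset has the column
    """
    tags = tags.clone()
    # rows set to PAD_ID are skipped by a loss configured with ignore_index=PAD_ID
    tags[~has_tags] = PAD_ID
    return tags
```

Because the loss ignores the padding index, sentences from annotation-free datasets contribute nothing to the upos/xpos/ufeats gradients while still training the shared encoder.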

Unit test coverage

Passes all tests in stanza.tests.pos.test_tagger

@Jemoka Jemoka requested a review from AngledLuffa November 6, 2023 07:05

return DataBatch(words, words_mask, wordchars, wordchars_mask, upos, xpos, ufeats,
                 pretrained, orig_idx, word_orig_idx, lens, word_lens, text, idx)
return _ShadowDataset(self).to_loader(**kwargs)
@Jemoka (Member, Author):

btw: this shouldn't break any previous APIs, because the old .to_loader() still works; it just makes a shadow dataset on your behalf around the one Dataset

@AngledLuffa (Collaborator):

This hasn't been publicly released yet, so we should be free to change it however we like

@AngledLuffa (Collaborator):

Still, it's a pretty intuitive solution: the one Dataset version is just the N Datasets version reduced to 1 dataset

@AngledLuffa (Collaborator):

If you look in pos/model.py, you can see the part where it is checking the has_upos feature on a whole batch etc. It is there that we would need to mask the loss based on which sentences do or don't have that column defined

@AngledLuffa (Collaborator):

I can take that on, unless it's something you want to experiment with

@Jemoka (Member, Author) commented Nov 6, 2023:

> I can take that on, unless it's something you want to experiment with

happy to help experiment


# sort sentences by lens for easy RNN operations
lens = [torch.sum(x != PAD_ID) for x in words]
(words, wordchars, upos, xpos,
@AngledLuffa (Collaborator):

surely the has_whatever needs to be sorted here as well, or the items won't be aligned
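The alignment concern above can be made concrete: whatever permutation sorts the batch by length must also be applied to the per-sentence flags. A small hypothetical helper (not stanza's code; names are assumptions):

```python
def sort_with_flags(lens, columns, flags):
    """Sort parallel batch columns by descending sentence length and apply
    the same permutation to per-sentence flags such as has_upos, so the
    flags stay aligned with the reordered sentences."""
    order = sorted(range(len(lens)), key=lens.__getitem__, reverse=True)
    sorted_columns = [[col[i] for i in order] for col in columns]
    sorted_flags = [[flag[i] for i in order] for flag in flags]
    return order, sorted_columns, sorted_flags
```

If the flags are left unsorted, the mask would be applied to the wrong sentences after the RNN-friendly reordering.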

vocab = Dataset.init_vocab(train_docs, args)
train_data = [Dataset(i, args, pretrain, vocab=vocab, evaluation=False)
              for i in train_docs]
# here we make sure the model will learn to output _ for empty columns
@AngledLuffa (Collaborator):

I still think this block is necessary, unless it's being computed somewhere else that I missed. The idea is: if dataset X has the column and dataset Y does not, then we want X to have the has bit set and Y to have it set to False. But if neither X nor Y has the column, they both need to be marked as having it, specifically so that the model will learn blank features, xpos, etc.
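The rule in the comment above can be written down directly. A hypothetical helper, per tag column, matching the described behavior (the function name is an assumption):

```python
def resolve_has_column(per_dataset_has):
    """per_dataset_has: one bool per training dataset for a given column
    (upos, xpos, or ufeats).

    If at least one dataset has the column, keep the flags as-is so only
    those datasets contribute loss. If NO dataset has it, flip every flag
    to True so the model still learns to emit the blank value '_'."""
    if not any(per_dataset_has):
        return [True] * len(per_dataset_has)
    return list(per_dataset_has)
```

Without the all-False special case, a column absent from every dataset would never receive any training signal, and the model could emit arbitrary tags instead of `_`.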

@Jemoka (Member, Author) commented Nov 7, 2023:

> If you look in pos/model.py, you can see the part where it is checking the has_upos feature on a whole batch etc. It is there that we would need to mask the loss based on which sentences do or don't have that column defined

Yes, but the cross entropy is set to ignore padding indices, so if we set those upos/xpos etc. to padding there, they will not contribute to the loss. Therefore, at batch time, we set the indices for which has_upos etc. is False to PAD_ID, which means no loss will be calculated for them.
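The mechanism described here is PyTorch's `ignore_index` on the cross-entropy loss. A minimal sketch, assuming `PAD_ID = 0` is the reserved padding index (the tensor values are made up for illustration):

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumption: padding index reserved by the tag vocab

criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

# 4 tokens, 3 tag classes
logits = torch.tensor([[2.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0],
                       [0.0, 0.0, 2.0],
                       [1.0, 1.0, 1.0]])
# gold tags: positions 1 and 3 were set to PAD_ID at batch time,
# so they are skipped entirely when the loss is averaged
gold = torch.tensor([1, PAD_ID, 2, PAD_ID])

loss = criterion(logits, gold)
# identical to averaging the loss over only the unmasked tokens
unmasked = nn.CrossEntropyLoss()(logits[[0, 2]], gold[[0, 2]])
assert torch.isclose(loss, unmasked)
```

So no explicit per-sentence loss masking is needed in the model: setting the gold tags to PAD_ID at batch time is sufficient.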

@AngledLuffa force-pushed the pos_dataloader_mixing branch 2 times, most recently from 091611c to 80b6a6c on November 7, 2023 19:05
@Jemoka (Member, Author) commented Nov 7, 2023:

Closing in favor of e4c2273, which is a dataloader-level mix. Feel free to reopen if we want to pursue a dataset-level mix again.

@Jemoka Jemoka closed this Nov 7, 2023
@AngledLuffa AngledLuffa deleted the pos_dataloader_mixing branch November 20, 2023 22:37