How to use the BaaL framework for named entity recognition? #121

I'm seeking guidance on how to use the BaaL framework for a named entity recognition NLP task. In this task, every training sample is made up of a sequence of tokens, and each token has a label. So each sample has not one but many labels, and the number of labels per sample is not fixed.

Can the BaaL framework deal with this use case? I'm asking because most places in the documentation seem to assume there is one label per training sample.

Comments
Hello, as far as I know, this is not a subject that academia has worked on. BaaL should be able to handle it, as we do not impose any format. I will post an example tomorrow. If you have any questions, we would be happy to help!
Hey @biro-mark, there is nothing in the literature, but this is very similar to the multi-label case, and the way we handle it in BaaL is similar to our segmentation example. I didn't see a great improvement there, but that is because the models are already very good in NLP and it is hard to get a noticeable gain. In any case, once @Dref360 posts the example you can try it the same way and see whether you find better ways to deal with this. I am very interested to know the result. :)
This is a quick example of how to use our heuristics (BALD, Entropy) with NER and with segmentation. I hope this helps; if I missed anything, please tell me.

```python
from baal.active.heuristics import BALD
import torch

NUM_TOKENS = 128
NUM_ITERATIONS = 20
DATASET_LEN = 1000
NUM_CLASSES = 10

# The result of your MC iterations (see [1] below)
mc_sampling = torch.randn([DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS])

bald = BALD()
uncertainty = bald.get_uncertainties(mc_sampling)

# Uncertainty has shape [DATASET_LEN, NUM_TOKENS] (see [2] below)
uncertainty.shape
```

[1] We propose …

[2] This gives us the uncertainty per token. If we want to know the overall uncertainty of a sentence, we could use:

```python
bald = BALD(reduction='mean')
uncertainty = bald.get_uncertainties(mc_sampling)

# Uncertainty has shape [DATASET_LEN]
uncertainty.shape
```

You can also supply your own reduction function.
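For illustration, here is a minimal sketch of what a custom reduction could look like. It assumes `reduction` also accepts a callable that maps the per-token uncertainties to one score per sample; `max_over_tokens` is a made-up name for this example:

```python
import numpy as np
import torch
from baal.active.heuristics import BALD

DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS = 1000, 10, 128, 20
mc_sampling = torch.randn([DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS])

def max_over_tokens(uncertainties):
    # Hypothetical reduction: `uncertainties` has shape [DATASET_LEN, NUM_TOKENS];
    # keep the score of the most uncertain token in each sentence.
    return np.max(uncertainties, axis=-1)

# Assumption: `reduction` also accepts a callable, not only strings such as 'mean'.
bald = BALD(reduction=max_over_tokens)
uncertainty = bald.get_uncertainties(mc_sampling)  # expected shape: [DATASET_LEN]
```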
Hello, just to mention that there is some work on active learning for sequence labelling tasks such as NER (not much, though). For example, here is a recent paper: Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates. It would be very nice if BaaL could provide direct support for this type of task.
Hello again! I'm trying to adapt the example in nlp_bert_mcdropout.py for the NER task. If I understand correctly, the changes to be made would be:
Would that be enough? I'm just familiarising myself with BaaL and I think it's great, so any pointers on how to correctly use it for a sequence labelling task would be welcome.
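A rough sketch of the kind of model-side change this would involve: swap the sequence-classification head for a token-classification head and patch the model for MC Dropout. This is only a sketch under my own assumptions (the model name and label count are placeholders), not the exact diff to nlp_bert_mcdropout.py:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
from baal.bayesian.dropout import patch_module

NUM_LABELS = 9  # placeholder: e.g. the conll2003 NER tag set

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=NUM_LABELS
)

# Replace Dropout layers with ones that stay active at inference time,
# so repeated forward passes give the MC samples the heuristics need.
model = patch_module(model)

# Each stochastic forward pass yields logits of shape
# [batch_size, seq_len, NUM_LABELS]; stacking several passes gives a
# [dataset, classes, tokens, iterations] array like the one BALD expects above.
```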
Reopening for visibility. Yes, I think that would be it. @parmidaatg worked on multi-label in the past and might give more insight.
@feralvam yes, that should be enough to have a running AL loop. You might want to change the metrics accordingly as well, but that is all. It would be great if you would like to submit a PR for your example script in BaaL. We are trying to expand our support and that would help the community a lot :) Let us know how your experiment goes.
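One way to read "change the metrics accordingly" (my interpretation, using the seqeval package rather than anything Baal-specific) is to score entity-level F1 instead of a per-sample classification metric:

```python
from seqeval.metrics import f1_score

# Hypothetical evaluation step: convert predicted/gold label ids back to
# tag strings (ignoring padding and sub-word positions) before scoring.
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
print(f1_score(y_true, y_pred))  # entity-level F1 for the sequence labelling task
```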
I'd be happy to submit an example for NER after I manage to make it work. There seem to be other parts of the code that need to be changed; for instance, the HuggingFace dataset wrapper used in the classification example. In addition, I believe … Perhaps taking a quick look at the conll2003 dataset in the Datasets library could help to get a better idea of other changes that could be needed that I haven't found yet.
Yes, the example that I made with the HF wrapper only supports classification. Normally you shouldn't need that wrapper anyway; if you handle your dataset yourself, you should be able to just use …
Perhaps we can work together on your example and upgrade BaaL with any changes necessary to support NER. Out of the box, I'd say it should work just using …
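For illustration, a minimal sketch of handling the dataset yourself and wrapping it for the active learning loop. The use of `ActiveLearningDataset` is my assumption about what the truncated sentence refers to, and `DummyTokenDataset` is a stand-in for a tokenized conll2003-style split:

```python
import torch
from torch.utils.data import Dataset
from baal.active import ActiveLearningDataset

class DummyTokenDataset(Dataset):
    """Stand-in for a tokenized conll2003-style split (placeholder)."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return {
            "input_ids": torch.zeros(128, dtype=torch.long),
            "attention_mask": torch.ones(128, dtype=torch.long),
            "labels": torch.zeros(128, dtype=torch.long),
        }

al_dataset = ActiveLearningDataset(DummyTokenDataset())
al_dataset.label_randomly(100)                   # seed the initial labelled pool
print(len(al_dataset), al_dataset.n_unlabelled)  # labelled vs. unlabelled counts
```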
Thanks! I really appreciate all your quick replies. I just started working on this yesterday, so hopefully I'll be submitting something soon for you to take a look at.
So, here's a first attempt at merging the NER example from HuggingFace with BaaL's sequence classification example.
I'm getting an error from the data_collator (I had to use the one for token classification). I still need to debug it to find what the real issue is, but I thought that getting some feedback on the general structure of the example could help. I can submit it as a WIP PR if that helps too.
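A sketch of the overall loop I have in mind, based on the sequence-classification example; exact argument names may differ, and `al_dataset` refers to the `ActiveLearningDataset` from the sketch earlier in the thread:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, TrainingArguments)
from baal.active.heuristics import BALD
from baal.bayesian.dropout import patch_module
from baal.transformers_trainer_wrapper import BaalTransformersTrainer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
training_args = TrainingArguments(output_dir="/tmp/ner_al",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1)

model = patch_module(
    AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)
)

# `al_dataset` is the ActiveLearningDataset defined above (placeholder).
trainer = BaalTransformersTrainer(
    model=model,
    args=training_args,
    train_dataset=al_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)

heuristic = BALD(reduction="mean")
for _ in range(5):                                    # a few active learning steps
    trainer.train()
    probs = trainer.predict_on_dataset(al_dataset.pool, iterations=20)
    # Depending on the output layout, the class axis may need to be moved
    # to axis 1 before handing the array to the heuristic.
    al_dataset.label(heuristic(probs)[:50])           # label the 50 most uncertain
```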
I found the source of the error. It was due to which columns of the dataset actually get to the collator. The problem was in this line of …
Apparently, the collator expects all columns in the batch to have been padded properly. Only the columns generated by the tokenizer have that characteristic; the rest, therefore, should be removed before the data is sent to the collator. HuggingFace's …
Our training dataset is actually an instance of …. The easy solution for the example was to remove the unused columns "manually"; in this case, that was basically all of them, since the tokenizer is the one that creates the columns the model actually needs.
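A minimal sketch of the "manual" column removal described above, assuming a conll2003-style split from the Datasets library; `tokenize_and_align_labels` is a hypothetical helper and skips proper sub-word label alignment:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw = load_dataset("conll2003")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Hypothetical helper: tokenize pre-split words and attach labels,
    # producing only input_ids / attention_mask / labels.
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    tokenized["labels"] = examples["ner_tags"]  # simplification: no sub-word alignment
    return tokenized

# Dropping the original columns here means only the tokenizer-generated,
# pad-able columns ever reach the data collator.
tokenized_train = raw["train"].map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw["train"].column_names,
)
```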
Here's a new version of the code:
However, now I get another error when executing …
I'll keep investigating and let you know what I discover. As always, any pointers are welcome.
So, I don't get the error if I manually set the …. While making that change avoids the exception, there's something I wanted to ask. In lines 117-118 of …
According to the documentation, the returned value should be: …
Hey @feralvam, for your question, we provide …
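To illustrate the shapes involved (this is my own reading, not necessarily what the truncated reply refers to): with the MC iterations stacked on the last axis, per-token predictions can be handed straight to the heuristic for ranking.

```python
import numpy as np
from baal.active.heuristics import BALD

DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS = 1000, 9, 128, 20

# Example predict_on_dataset-style output for token classification: one score
# per class and token, repeated for each MC Dropout iteration (dummy data here).
predictions = np.random.rand(DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS)

# Per-sentence ranking for the acquisition step.
ranks = BALD(reduction="mean")(predictions)
print(ranks[:10])  # indices of the most uncertain sentences first
```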
@feralvam did you get any results on this?
Hi!
Hey, what did you do in the end regarding lines 117-118 of …?
Hi @shaked571. I didn't change it in the end, since the first dimension was the one I was expecting and everything else was aggregated. Having said that, I stopped working on this problem (for the foreseeable future, at least), so I never fully verified whether it actually affected the final result or not.
Is this example available now as part of BaaL?
I would be interested in Baal for NER as well.
We've just moved to a new documentation system (mkdocs) that should help us better structure our tutorials. If someone could run some experiments showing that BALD performs at least better than Random on NER, I would include it on our website. Cheers,
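A sketch of the comparison being asked for, reusing the dummy per-token MC predictions from earlier in the thread; the only point is that both heuristics rank the same pool, so label-efficiency curves can be compared under the same budget (`QUERY_SIZE` is a made-up setting):

```python
import numpy as np
from baal.active.heuristics import BALD, Random

DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS = 1000, 9, 128, 20
pool_predictions = np.random.rand(DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS)

QUERY_SIZE = 50
for name, heuristic in [("BALD", BALD(reduction="mean")), ("Random", Random())]:
    # Each heuristic returns pool indices ranked from most to least informative;
    # label the top QUERY_SIZE, retrain, and track test F1 per labelling budget.
    to_label = heuristic(pool_predictions)[:QUERY_SIZE]
    print(name, to_label[:5])
```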
I think we made good progress on this in #262. I'll close this one, but for now the code should work for y'all to run experiments with.