How to use the BaaL framework for named entity recognition? #121

Closed
biro-mark opened this issue Apr 13, 2021 · 23 comments

@biro-mark commented Apr 13, 2021

I'm seeking guidance on how to use the BaaL framework for a named entity recognition NLP task. In this task, every training sample is made up of a sequence of tokens and each token has a label. So each sample has not one, but many labels. And the number of labels per sample is not fixed.

Can the BaaL framework deal with this use case? I'm asking because most places in the documentation seem to assume there is one label per training sample.

biro-mark added the enhancement (New feature or request) label Apr 13, 2021
biro-mark changed the title from How to use the BAAL framework for named entity recognition? to How to use the BaaL framework for named entity recognition? Apr 13, 2021
@Dref360 (Member) commented Apr 13, 2021

Hello,

As far as I know, this is not a subject that academia has worked on.

BaaL should be able to handle it, as we do not impose any format.

I will post an example tomorrow.

If you have any questions, we would be happy to help!

@parmidaatg (Collaborator)

Hey @biro-mark, there is nothing in the literature, but this is very similar to a multi-label task, and the way we handle it in BaaL is close to our segmentation example. I didn't see a great improvement on this, but that is because NLP models are already very good, which makes a noticeable improvement hard to get. In any case, once @Dref360 posts the example, you can try the same approach and see whether you find better ways to deal with this. I am very interested to know the result. :)

@Dref360 (Member) commented Apr 14, 2021

Here is a quick example of how to use our heuristics (BALD, Entropy) for NER and for segmentation. I hope this helps; if I missed anything, please tell me.

from baal.active.heuristics import BALD
import torch

NUM_TOKENS=128
NUM_ITERATIONS=20
DATASET_LEN = 1000
NUM_CLASSES = 10

# The result of your MC iterations (see [1] below)
mc_sampling = torch.randn([DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS])
bald = BALD()
uncertainty = bald.get_uncertainties(mc_sampling)

# Uncertainty has shape [DATASET_LEN, NUM_TOKENS] (see [2] below)
uncertainty.shape

[1] We propose ModelWrapper to do this for you, or you can provide the values yourself. More info here.
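
For instance, a minimal sketch of that route (a generic toy model and dataset, not a NER setup; the exact predict_on_dataset signature may differ between versions):

import torch
from baal.modelwrapper import ModelWrapper
from baal.bayesian.dropout import patch_module

# patch_module swaps Dropout for MCDropout so it stays active at inference time.
model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.Dropout(0.5), torch.nn.Linear(16, 4))
wrapped = ModelWrapper(patch_module(model), criterion=torch.nn.CrossEntropyLoss())

# A stand-in pool of (input, target) pairs.
pool = torch.utils.data.TensorDataset(torch.randn(100, 8), torch.zeros(100, dtype=torch.long))

# Stacks the predictions of 20 stochastic forward passes: shape [100, 4, 20].
mc_sampling = wrapped.predict_on_dataset(pool, batch_size=16, iterations=20, use_cuda=False)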

[2] This gives us the uncertainty per token. If we want to know the overall uncertainty of a sentence, we can use a reduction like this:

bald = BALD(reduction='mean')
uncertainty = bald.get_uncertainties(mc_sampling)

# Uncertainty has shape [DATASET_LEN] 
uncertainty.shape

You can also supply your own reduction function.
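
For example, a sketch of a custom reduction that scores a sentence by its most uncertain token instead of the average (assuming the callable receives the per-token scores and returns one score per sample, like the built-in reductions):

import numpy as np

def max_over_tokens(scores: np.ndarray) -> np.ndarray:
    # scores: [DATASET_LEN, NUM_TOKENS] -> [DATASET_LEN]
    return scores.reshape(scores.shape[0], -1).max(-1)

bald = BALD(reduction=max_over_tokens)
uncertainty = bald.get_uncertainties(mc_sampling)  # shape [DATASET_LEN]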

@feralvam commented Nov 3, 2021

Hello,

Just to mention that there is some work on active learning for sequence labelling tasks such as NER (not much, though). For example, here is a recent paper: Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates. It would be very nice if BaaL could provide direct support for this type of task.

@feralvam commented Nov 3, 2021

Hello again!

I'm trying to adapt the example in nlp_bert_mcdropout.py for the NER task. If I understand correctly, the changes to be made would be:

  • Change the dataset: instead of glue, use one for the NER task
  • Change the model: instead of BertForSequenceClassification, use BertForTokenClassification (as in the example provided by HuggingFace)
  • Change the heuristic: following the previous example from @Dref360, in line 69 we would need to pass hyperparams["reduction"] (if we want the flexibility of having it as an argument), or we could directly set reduction="mean" (see the sketch after this list)

Would that be enough? I'm just getting familiar with BaaL and I think it's great, so any pointers on how to correctly use it for a sequence labelling task would be welcome.
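
For the third point, the change would presumably look like this (a sketch; hyperparams mirrors the argparse dict already used in the script, and reduction is the only addition):

from baal.active import get_heuristic

hyperparams = {"heuristic": "bald", "shuffle_prop": 0.05, "reduction": "mean"}

# Average the per-token uncertainties into a single score per sentence.
heuristic = get_heuristic(
    name=hyperparams["heuristic"],
    shuffle_prop=hyperparams["shuffle_prop"],
    reduction=hyperparams["reduction"],
)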

@Dref360 (Member) commented Nov 3, 2021

Reopening for visibility

Yes, I think that would be it. @parmidaatg worked on multi-label in the past and might have more insight.

Dref360 reopened this Nov 3, 2021
@parmidaatg (Collaborator)

@feralvam yes, that should be enough to have a running AL loop. You might want to change the metrics accordingly as well, but that is all. It would be great if you would submit a PR with your example script to BaaL; we are trying to expand our support and that would help the community a lot :) Let us know how your experiment goes.

@feralvam commented Nov 4, 2021

I'd be happy to submit an example for NER once I manage to make it work.

There seem to be other parts of the code that need to change. For instance, the HuggingFaceDatasets class.

Its _tokenize function needs to be adapted in a similar way to here, to handle the case where the texts are provided already tokenized; the labels for each token then need to be aligned accordingly.

In addition, I believe __getitem__ also needs to change when returning label, since it assumes a single label per instance (unless I am mistaken).

Perhaps taking a quick look at the conll2003 dataset in the Datasets library could help to get a better idea of other changes that might be needed that I haven't found yet.
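
For reference, here is a quick way to inspect that schema (requires the datasets library):

from datasets import load_dataset

ds = load_dataset("conll2003")
# ner_tags is a Sequence(ClassLabel(...)): one label id per token.
print(ds["train"].features["ner_tags"])
example = ds["train"][0]
print(example["tokens"])    # list of words
print(example["ner_tags"])  # list of label ids of the same length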

@parmidaatg (Collaborator) commented Nov 4, 2021

Yes, the example that I made with the HF wrapper only supports classification. Normally you shouldn't need that wrapper anyway: if you handle your dataset yourself, you should be able to use the ActiveLearningDataset wrapper directly. The HF wrapper is only a means for people who are not that familiar with NLP to run a quick experiment. I'd be happy to see a PR from you if you want to adapt the HF wrapper; otherwise, we will work on it eventually.

@parmidaatg (Collaborator) commented Nov 4, 2021

Perhaps we can work together on your example and upgrade BaaL with any changes necessary to support NER. Out of the box, I'd say it should work using just ActiveLearningDataset and our Trainer wrapper, which changes the predict in the HF trainer. But since I haven't run NER myself, I'll keep an eye on this subject, and if you submit any bug, I can try to fix it ASAP. What do you think?

@feralvam commented Nov 4, 2021

Thanks! I really appreciate all your quick replies. I just started working on this yesterday, so hopefully I'll be submitting something soon for you to take a look at.

parmidaatg self-assigned this Nov 4, 2021
@feralvam commented Nov 4, 2021

So, here's a first attempt at merging the NER example from HuggingFace into the BaaL example for sequence classification:

import argparse
import numpy as np
import random
from copy import deepcopy

import torch
import torch.backends
from tqdm import tqdm

# These packages are optional and not needed for BaaL main package.
# You can have access to `datasets` and `transformers` if you install
# BaaL with --dev setup.
from datasets import load_dataset, load_metric
from transformers import AutoConfig, AutoTokenizer, TrainingArguments
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification

from baal.active import get_heuristic
from baal.active import ActiveLearningDataset
from baal.active.active_loop import ActiveLearningLoop
from baal.bayesian.dropout import patch_module
from baal.transformers_trainer_wrapper import BaalTransformersTrainer

"""
Minimal example to use BaaL for NLP Token Classification (as for Named Entity Recognition).
"""


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", default=100, type=int)
    parser.add_argument("--batch_size", default=32, type=int)
    parser.add_argument("--initial_pool", default=1000, type=int)
    parser.add_argument("--model", default="bert-base-uncased", type=str)
    parser.add_argument("--n_data_to_label", default=100, type=int)
    parser.add_argument("--heuristic", default="bald", type=str)
    parser.add_argument("--iterations", default=20, type=int)
    parser.add_argument("--shuffle_prop", default=0.05, type=float)
    parser.add_argument("--reduction", default="mean", type=str)
    parser.add_argument("--learning_epoch", default=20, type=int)
    return parser.parse_args()


def get_datasets(initial_pool, tokenizer):
    raw_datasets = load_dataset("conll2003")
    features = raw_datasets["train"].features

    # In the conll2003 dataset, the labels are a `Sequence[ClassLabel]`
    label_list = features["ner_tags"].feature.names
    # No need to convert the labels since they are already ints.
    label_to_id = {i: i for i in range(len(label_list))}

    # Map that sends a B-Xxx label to its I-Xxx counterpart
    # (carried over from the HuggingFace example; not used further below)
    b_to_i_label = []
    for idx, label in enumerate(label_list):
        if label.startswith("B-") and label.replace("B-", "I-") in label_list:
            b_to_i_label.append(label_list.index(label.replace("B-", "I-")))
        else:
            b_to_i_label.append(idx)

    # Tokenize all texts and align the labels with them.
    def tokenize_and_align_labels(examples):
        tokenized_inputs = tokenizer(
            examples["tokens"],
            padding="max_length",
            truncation=True,
            max_length=128,
            # We use this argument because the texts in our dataset are lists of words (with a label for each word).
            is_split_into_words=True,
        )
        labels = []
        for i, label in enumerate(examples["ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                # Special tokens have a word id that is None. We set the label to -100 so they are automatically
                # ignored in the loss function.
                if word_idx is None:
                    label_ids.append(-100)
                # We set the label for the first token of each word.
                elif word_idx != previous_word_idx:
                    label_ids.append(label_to_id[label[word_idx]])
                # For the other tokens in a word, we set the label to -100
                else:
                    label_ids.append(-100)
                previous_word_idx = word_idx

            labels.append(label_ids)
        tokenized_inputs["labels"] = labels
        return tokenized_inputs

    # Active Training set
    train_dataset = raw_datasets["train"].map(
        tokenize_and_align_labels,
        batched=True,
        num_proc=4,
        load_from_cache_file=True,
        desc="Running tokenizer on train dataset",
    )
    active_set = ActiveLearningDataset(train_dataset)

    valid_set = raw_datasets["validation"].map(
        tokenize_and_align_labels,
        batched=True,
        num_proc=4,
        load_from_cache_file=True,
        desc="Running tokenizer on validation dataset",
    )

    # We start labeling randomly.
    active_set.label_randomly(initial_pool)
    return active_set, valid_set, label_list, label_to_id


def main():
    args = parse_args()
    use_cuda = torch.cuda.is_available()
    torch.backends.cudnn.benchmark = True
    random.seed(1337)
    torch.manual_seed(1337)
    if not use_cuda:
        print("warning, the experiments would take ages to run on cpu")

    hyperparams = vars(args)

    heuristic = get_heuristic(name=hyperparams["heuristic"], shuffle_prop=hyperparams["shuffle_prop"], reduction=hyperparams["reduction"])

    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=hyperparams["model"], use_fast=True)

    active_set, test_set, label_list, label_to_id = get_datasets(hyperparams["initial_pool"], tokenizer)

    config = AutoConfig.from_pretrained(
        hyperparams["model"],
        num_labels=len(label_list),
        label2id=label_to_id,
        id2label={i: l for l, i in label_to_id.items()},
        finetuning_task="ner",
    )

    model = AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path=hyperparams["model"], config=config)

    # change dropout layer to MCDropout
    model = patch_module(model)

    if use_cuda:
        model.cuda()

    init_weights = deepcopy(model.state_dict())

    training_args = TrainingArguments(
        output_dir="./results",  # output directory
        num_train_epochs=hyperparams["learning_epoch"],  # total # of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=8,  # batch size for evaluation
        weight_decay=0.01,  # strength of weight decay
        logging_dir="./logs",  # directory for storing logs
    )

    # Data collator
    data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=None)

    # Metrics
    metric = load_metric("seqeval")

    def compute_metrics(p):
        predictions, labels = p
        predictions = np.argmax(predictions, axis=2)

        # Remove ignored index (special tokens)
        true_predictions = [[label_list[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]
        true_labels = [[label_list[l] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

        results = metric.compute(predictions=true_predictions, references=true_labels)
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }

    # We wrap the huggingface Trainer to create an Active Learning Trainer
    model = BaalTransformersTrainer(
        model=model,
        args=training_args,
        train_dataset=active_set,
        eval_dataset=test_set,
        tokenizer=None,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    logs = {}
    logs["epoch"] = 0

    # In this case, NLP data is fast to process and we do not need to use a smaller batch_size
    active_loop = ActiveLearningLoop(
        active_set,
        model.predict_on_dataset,
        heuristic,
        hyperparams.get("n_data_to_label", 1),
        iterations=hyperparams["iterations"],
    )

    for epoch in tqdm(range(args.epoch)):
        # we use the default setup of HuggingFace for training (ex: epoch=1).
        # The setup is adjustable when the BaalTransformersTrainer is defined.
        model.train()

        # Validation!
        eval_metrics = model.evaluate()

        # We reorder the unlabelled pool at the frequency of learning_epoch
        # This helps with speed while not changing the quality of uncertainty estimation.
        should_continue = active_loop.step()

        # We reset the model weights to relearn from the new trainset.
        model.load_state_dict(init_weights)
        model.lr_scheduler = None
        if not should_continue:
            break
        active_logs = {
            "epoch": epoch,
            "labeled_data": active_set._labelled,
            "Next Training set size": len(active_set),
        }

        logs = {**eval_metrics, **active_logs}
        print(logs)


if __name__ == "__main__":
    main()

I'm getting an error from the data_collator (I had to use the one for token classification). I still need to debug it to find the real issue, but I thought that getting some feedback on the general structure of the example could help. I can also submit it as a WIP PR if that helps.

@feralvam commented Nov 5, 2021

I found the source of the error. It was due to which columns of the dataset actually reach the collator. The problem was in this line of DataCollatorForTokenClassification:

batch = {k: torch.tensor(v, dtype=torch.int64) for k, v in batch.items()}

Apparently, the collator expects all columns in the batch to have been padded properly. Only columns generated by the tokenizer have that characteristic; the rest, therefore, should be removed before the data is sent to the collator. HuggingFace's Trainer removes these unused columns internally. However, it has a line that checks that the dataset is an instance of datasets.Dataset:

if is_datasets_available() and isinstance(train_dataset, datasets.Dataset):
    train_dataset = self._remove_unused_columns(train_dataset, description="training")

Our training dataset is actually an instance of ActiveLearningDataset, which does not inherit from datasets.Dataset. As a result, that condition is never true and the columns are not removed.

The easy solution for the example was to remove the unused columns "manually". In this case, that was basically all of them, since the tokenizer creates the ones the model actually needs.

features = raw_datasets["train"].features
# Active Training Set
train_dataset = raw_datasets["train"].map(
    tokenize_and_align_labels,
    batched=True,
    num_proc=hyperparams["preprocessing_num_workers"],
    load_from_cache_file=True,
    remove_columns=list(features.keys()),
    desc="Running tokenizer on train dataset",
)
active_set = ActiveLearningDataset(train_dataset)

Here's a new version of the code:

import argparse
import numpy as np
import random
from copy import deepcopy

import torch
import torch.backends
from tqdm import tqdm

# These packages are optional and not needed for BaaL main package.
# You can have access to `datasets` and `transformers` if you install
# BaaL with --dev setup.
from datasets import load_dataset, load_metric
from transformers import AutoConfig, AutoTokenizer, TrainingArguments
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification

from baal.active import get_heuristic
from baal.active import ActiveLearningDataset
from baal.active.active_loop import ActiveLearningLoop
from baal.bayesian.dropout import patch_module
from baal.transformers_trainer_wrapper import BaalTransformersTrainer

"""
Minimal example to use BaaL for NLP Token Classification (as for Named Entity Recognition).
"""


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", default=100, type=int)
    parser.add_argument("--batch_size", default=32, type=int)
    parser.add_argument("--initial_pool", default=1000, type=int)
    parser.add_argument("--model", default="bert-base-uncased", type=str)
    parser.add_argument("--n_data_to_label", default=100, type=int)
    parser.add_argument("--heuristic", default="bald", type=str)
    parser.add_argument("--iterations", default=20, type=int)
    parser.add_argument("--shuffle_prop", default=0.05, type=float)
    parser.add_argument("--reduction", default="mean", type=str)
    parser.add_argument("--learning_epoch", default=20, type=int)
    parser.add_argument("--preprocessing_num_workers", default=None, type=int)
    parser.add_argument("--pad_to_max_length", default=False, type=bool)
    parser.add_argument("--max_seq_length", default=None, type=int)

    return parser.parse_args()


def get_datasets(hyperparams, tokenizer):
    raw_datasets = load_dataset("conll2003")
    features = raw_datasets["train"].features

    # In the conll2003 dataset, the labels are a `Sequence[ClassLabel]`
    label_list = features["ner_tags"].feature.names
    # No need to convert the labels since they are already ints.
    label_to_id = {i: i for i in range(len(label_list))}

    # Map that sends a B-Xxx label to its I-Xxx counterpart
    # (carried over from the HuggingFace example; not used further below)
    b_to_i_label = []
    for idx, label in enumerate(label_list):
        if label.startswith("B-") and label.replace("B-", "I-") in label_list:
            b_to_i_label.append(label_list.index(label.replace("B-", "I-")))
        else:
            b_to_i_label.append(idx)

    # Preprocessing the dataset
    # Padding strategy
    padding = "max_length" if hyperparams["pad_to_max_length"] else False

    # Tokenize all texts and align the labels with them.
    def tokenize_and_align_labels(examples):
        tokenized_inputs = tokenizer(
            examples["tokens"],
            padding=padding,
            truncation=True,
            max_length=hyperparams["max_seq_length"],
            add_special_tokens=True,
            return_token_type_ids=False,
            # We use this argument because the texts in our dataset are lists of words (with a label for each word).
            is_split_into_words=True,
        )
        labels = []
        for i, label in enumerate(examples["ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                # Special tokens have a word id that is None. We set the label to -100 so they are automatically
                # ignored in the loss function.
                if word_idx is None:
                    label_ids.append(-100)
                # We set the label for the first token of each word.
                elif word_idx != previous_word_idx:
                    label_ids.append(label_to_id[label[word_idx]])
                # For the other tokens in a word, we set the label to -100
                else:
                    label_ids.append(-100)
                previous_word_idx = word_idx

            labels.append(label_ids)
        tokenized_inputs["labels"] = labels
        return tokenized_inputs

    # Active Training Set
    train_dataset = raw_datasets["train"].map(
        tokenize_and_align_labels,
        batched=True,
        num_proc=hyperparams["preprocessing_num_workers"],
        load_from_cache_file=True,
        remove_columns=list(features.keys()),
        desc="Running tokenizer on train dataset",
    )
    active_set = ActiveLearningDataset(train_dataset)

    # Validation Set
    valid_set = raw_datasets["validation"].map(
        tokenize_and_align_labels,
        batched=True,
        num_proc=hyperparams["preprocessing_num_workers"],
        load_from_cache_file=True,
        remove_columns=list(features.keys()),
        desc="Running tokenizer on validation dataset",
    )

    # We start labeling randomly.
    active_set.label_randomly(hyperparams["initial_pool"])
    return active_set, valid_set, label_list, label_to_id


def main():
    args = parse_args()
    use_cuda = torch.cuda.is_available()
    torch.backends.cudnn.benchmark = True
    random.seed(1337)
    torch.manual_seed(1337)
    if not use_cuda:
        print("warning, the experiments would take ages to run on cpu")

    hyperparams = vars(args)

    heuristic = get_heuristic(name=hyperparams["heuristic"], shuffle_prop=hyperparams["shuffle_prop"], reduction=hyperparams["reduction"])

    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=hyperparams["model"], use_fast=True)

    active_set, test_set, label_list, label_to_id = get_datasets(hyperparams, tokenizer)

    config = AutoConfig.from_pretrained(
        hyperparams["model"],
        num_labels=len(label_list),
        label2id=label_to_id,
        id2label={i: l for l, i in label_to_id.items()},
        finetuning_task="ner",
    )

    model = AutoModelForTokenClassification.from_pretrained(pretrained_model_name_or_path=hyperparams["model"], config=config)

    # change dropout layer to MCDropout
    model = patch_module(model)

    if use_cuda:
        model.cuda()

    init_weights = deepcopy(model.state_dict())

    training_args = TrainingArguments(
        output_dir="./results",  # output directory
        num_train_epochs=hyperparams["learning_epoch"],  # total # of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=8,  # batch size for evaluation
        weight_decay=0.01,  # strength of weight decay
        logging_dir="./logs",  # directory for storing logs
    )

    # Data collator
    data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=None)

    # Metrics
    metric = load_metric("seqeval")

    def compute_metrics(p):
        predictions, labels = p
        predictions = np.argmax(predictions, axis=2)

        # Remove ignored index (special tokens)
        true_predictions = [[label_list[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]
        true_labels = [[label_list[l] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

        results = metric.compute(predictions=true_predictions, references=true_labels)
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }

    # We wrap the huggingface Trainer to create an Active Learning Trainer
    model = BaalTransformersTrainer(
        model=model,
        args=training_args,
        train_dataset=active_set,
        eval_dataset=test_set,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    logs = {}
    logs["epoch"] = 0

    # In this case, NLP data is fast to process and we do not need to use a smaller batch_size
    active_loop = ActiveLearningLoop(
        active_set,
        model.predict_on_dataset,
        heuristic,
        hyperparams.get("n_data_to_label", 1),
        iterations=hyperparams["iterations"],
    )

    for epoch in tqdm(range(args.epoch)):
        # we use the default setup of HuggingFace for training (ex: epoch=1).
        # The setup is adjustable when the BaalTransformersTrainer is defined.
        model.train()

        # Validation!
        eval_metrics = model.evaluate()

        # We reorder the unlabelled pool at the frequency of learning_epoch
        # This helps with speed while not changing the quality of uncertainty estimation.
        should_continue = active_loop.step()

        # We reset the model weights to relearn from the new trainset.
        model.load_state_dict(init_weights)
        model.lr_scheduler = None
        if not should_continue:
            break
        active_logs = {
            "epoch": epoch,
            "labeled_data": active_set._labelled,
            "Next Training set size": len(active_set),
        }

        logs = {**eval_metrics, **active_logs}
        print(logs)


if __name__ == "__main__":
    main()

However, now I get another error when executing should_continue = active_loop.step():

Traceback (most recent call last):
  File "nlp_ner_bert_mcdropout.py", line 245, in <module>
    main()
  File "nlp_ner_bert_mcdropout.py", line 227, in main
    should_continue = active_loop.step()
  File "/experiments/falva/tools/baal/baal/active/active_loop.py", line 72, in step
    probs = self.get_probabilities(pool, **self.kwargs)
  File "/experiments/falva/tools/baal/baal/transformers_trainer_wrapper.py", line 119, in predict_on_dataset
    return np.vstack(preds)
  File "<__array_function__ internals>", line 6, in vstack
  File "/home/falva/miniconda3/envs/baal/lib/python3.7/site-packages/numpy/core/shape_base.py", line 282, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 49 and the array at index 1 has size 51

I'll keep investigating and let you know what I discover. As always, any pointers are welcome.

@feralvam commented Nov 5, 2021

So, I don't get the error if I manually set max_seq_length (e.g. to 128). According to the HuggingFace documentation, if you don't specify this for the tokenizer, each batch is padded to the length of the longest sequence in that batch. However, I think BaaL assumes that all batches have the same max_seq_length?

While making that change avoids the exception, there's something I wanted to ask. In lines 117-118 of transformers_trainer_wrapper.py we have:

if len(preds) > 0 and not isinstance(preds[0], Sequence):
    # Is an Array or a Tensor
    return np.vstack(preds)

According to the documentation, the returned value should be [n_samples, n_outputs, ..., n_iterations], or, as @Dref360 mentioned before, [DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS]. However, it is actually [n_samples, max_seq_length, n_classes, n_iterations]. Would the fact that dimensions 1 and 2 are in a different order have an effect on the rest of the active learning process? For instance, on how the scores/uncertainties are computed or aggregated? Thanks!
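
To illustrate what I mean (hypothetical shapes; the swapaxes call is just what I would naively do, not necessarily what BaaL expects internally):

import numpy as np

# What predict_on_dataset seems to return for token classification:
preds = np.random.rand(1000, 128, 10, 20)  # [n_samples, max_seq_length, n_classes, n_iterations]

# The documented order puts the class axis second, so one could swap the axes:
preds = preds.swapaxes(1, 2)  # [n_samples, n_classes, max_seq_length, n_iterations]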

@parmidaatg (Collaborator)

Hey @feralvam,
I think it would be great if you submit a WIP PR.
To clarify a bit (since I had a similar problem recently): for now, I'd suggest making sure that after tokenization all the samples have the same length (i.e. using padding and max_length). You are correct about what happens when you do not specify max_length, but it also creates some inconsistency in the uncertainty calculation, which we prefer to avoid. On top of that, my experiments in a similar context did not produce reliable results, so it is not something I think we should encourage, especially in the context of uncertainty estimation. I am absolutely open to your thoughts on that if you have counterexamples.
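
Concretely, in your tokenize_and_align_labels that would mean fixing the length, e.g.:

tokenized_inputs = tokenizer(
    examples["tokens"],
    padding="max_length",   # pad every sample to the same length
    truncation=True,
    max_length=128,         # a fixed length instead of per-batch maxima
    is_split_into_words=True,
)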

As for your question, we provide the reduction argument in the heuristics to get rid of the extra dimension(s) that different tasks introduce. Generally speaking, for any heuristic we want the outcome probability distribution per sample per iteration; any other dimension is not used in the heuristic calculation. Depending on the task, you can then decide which type of reduction is most beneficial to reach the final shape the heuristics expect. For example, in segmentation we take the average over the pixels. I'd suggest starting with mean and then trying other reduction types to see which one works best.
This is where the reductions are defined, and the reduction can be set when you define your heuristic:
https://github.com/ElementAI/baal/blob/master/baal/active/heuristics/heuristics.py#L15
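
For instance, reusing the random tensor from the earlier example, you could compare the built-in reductions like this (a sketch; I'm assuming mean, max and sum are all registered):

import torch
from baal.active.heuristics import BALD

# [DATASET_LEN, NUM_CLASSES, NUM_TOKENS, NUM_ITERATIONS]
mc_sampling = torch.randn([1000, 10, 128, 20])

for reduction in ("mean", "max", "sum"):
    uncertainty = BALD(reduction=reduction).get_uncertainties(mc_sampling)
    print(reduction, uncertainty.shape)  # one score per sentence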

@parmidaatg (Collaborator)

@feralvam did you get any results on this?

@feralvam

Hi!
I managed to make the code work, but I didn't run any full/extensive experiments to verify whether I could reproduce results from previous work on NER. I'm working on another sequence labelling task at the moment, so I'll most likely get back to this in a few weeks. Thanks for all your help so far!

@shaked571

Hey, what did you do in the end regarding lines 117-118 of transformers_trainer_wrapper.py?

@feralvam

Hi @shaked571. I didn't change it in the end, since the first dimension was what I was expecting and everything else was aggregated. Having said that, I stopped working on this problem (for the foreseeable future, at least), so I didn't fully verify whether this affected the final result or not.

@glahoti6

Is this example available now as part of BaaL?

@Jonathanpro

I would be interested in BaaL for NER as well.

@Dref360 (Member) commented Oct 31, 2022

We've just moved to a new documentation system (mkdocs) that should help us better structure our tutorials.

If someone could run some experiments showing that BALD performs at least better than Random on NER, I would include it on our website.

Cheers,

@Dref360 (Member) commented Sep 22, 2023

I think we made good progress on this in #262. I'll close this one, but for now the code should work for y'all to run experiments with.

Dref360 closed this as completed Sep 22, 2023