An implementation of the NER (Named Entity Recognition) task in DyNet, using biLSTMs at both the character level and the word level to extract informative embeddings, and a linear-chain CRF to compute the loss and make predictions.
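To make the architecture concrete, here is a minimal sketch of such a character- and word-level biLSTM encoder in DyNet. All names and dimensions are illustrative; this is not the repository's actual code.

```python
import dynet as dy

# Illustrative sizes, not the repository's defaults.
CHAR_DIM, CHAR_HID, WORD_DIM, WORD_HID, N_CHARS = 100, 50, 300, 300, 128

pc = dy.ParameterCollection()
char_lookup = pc.add_lookup_parameters((N_CHARS, CHAR_DIM))
char_fwd = dy.LSTMBuilder(1, CHAR_DIM, CHAR_HID, pc)
char_bwd = dy.LSTMBuilder(1, CHAR_DIM, CHAR_HID, pc)
word_bilstm = dy.BiRNNBuilder(1, WORD_DIM + 2 * CHAR_HID, 2 * WORD_HID, pc, dy.LSTMBuilder)

def encode(word_vectors, char_id_seqs):
    """word_vectors: one pretrained word-embedding expression per token;
    char_id_seqs: one list of character ids per token.
    (dy.renew_cg() is assumed to have been called for the current sentence.)"""
    reps = []
    for wv, chars in zip(word_vectors, char_id_seqs):
        embs = [char_lookup[c] for c in chars]
        fwd = char_fwd.initial_state().transduce(embs)[-1]                  # left-to-right char summary
        bwd = char_bwd.initial_state().transduce(list(reversed(embs)))[-1]  # right-to-left char summary
        reps.append(dy.concatenate([wv, fwd, bwd]))                         # word + character features
    return word_bilstm.transduce(reps)                                      # one contextual vector per token
```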
In fact, this repository is a re-implementation of the model from Guillaume Genthial, who published a TensorFlow version of the code on his GitHub page, as well as a detailed, helpful, even enlightening tutorial on his blog for an overall understanding of the model. After reading his fantastic code, an idea came to me: what if I implemented the model using DyNet? As far as I know, DyNet is much younger and less mature than TensorFlow, but it is said to be powerful for NLP tasks, where data samples (typically sentences) do not have a fixed size and require special care (such as padding) in many other neural-network toolkits, because DyNet can perform minibatching automatically. So I tried it out and, luckily, it worked! Even though the result was not bad and auto-batching did help me a lot in handling those variable sizes, I also ran into plenty of pain with DyNet: many times I wanted a function or building block to simplify my code, told myself it surely must have been provided as a built-in utility, went over to the reference manual, and couldn't find anything. For example, since DyNet 2.0.3 didn't offer linear-chain CRF functionality, I had to implement it myself for this model. Anyway, DyNet is genuinely helpful in its own right and is undergoing rapid development with the help of the open-source community. I am still looking forward to its success!
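For a sense of what writing the CRF by hand involves, here is a minimal sketch of a linear-chain CRF negative log-likelihood in DyNet (gold-path score minus the log partition computed with the forward algorithm). The function and parameter names are hypothetical, and this is only an illustration, not the repository's actual implementation.

```python
import dynet as dy

def _log_sum_exp(vec, n):
    # logsumexp over the n elements of a vector expression
    return dy.logsumexp([dy.pick(vec, i) for i in range(n)])

def crf_neg_log_likelihood(emissions, tags, trans_param, n_tags):
    """emissions: list of (n_tags,) score expressions from the biLSTM;
    tags: gold tag ids; trans_param: an (n_tags, n_tags) parameter where
    row j, column i scores a transition from tag i to tag j.
    (Start/end transition scores are omitted for brevity.)"""
    trans = dy.parameter(trans_param)
    # score of the gold tag path
    gold = dy.pick(emissions[0], tags[0])
    for t in range(1, len(tags)):
        gold = gold + dy.pick(emissions[t], tags[t]) \
                    + dy.pick(dy.pick(trans, tags[t]), tags[t - 1])
    # log partition function via the forward algorithm
    alphas = emissions[0]
    for t in range(1, len(emissions)):
        alphas = dy.concatenate([
            dy.pick(emissions[t], j) + _log_sum_exp(alphas + dy.pick(trans, j), n_tags)
            for j in range(n_tags)])
    return _log_sum_exp(alphas, n_tags) - gold
```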
No more personal comments from now on. In order to run the code in this repository, you will want to:
- install DyNet 2.0.3 for Python following the instructions here.
- download the GloVe word vectors with the command `make glove` (see the `makefile` in this repository).
- download the dataset from the CoNLL-2003 shared task. (To make sure you get correctly formatted data, and to help you understand the corresponding code snippets, a small excerpt of several sentences from the dataset is provided in `data/ner.en/ex.XXX.txt`.)
- run with the command `make run`.
Notes:
- Please use DyNet 2.0.3; other versions are not guaranteed to run correctly.
- `make run` comprises three Python steps:
  - `python build_data.py`, which itself comprises two stages:
    - BUILD_VOCAB
    - TRIM_DATA
  - `python train.py`
  - `python test.py`

  You can decide to run only one or some of them, but make sure that all the preceding steps have been executed beforehand.
- All the configuration of the model is placed in `model/config.py`. If you want to tune some hyperparameters, revise certain file paths, or decide whether or not to use some building blocks, just go make the changes yourself. All the item names are either self-explanatory or come with elaborate comments. For instance (see the illustrative excerpt below):
  - `use_crf`: whether to use a linear-chain CRF for decoding
  - `batch_sz`: size of a minibatch
  - `nepochs`: number of epochs to run
  - `dropout`: dropout probability
  - `clip_thr`: gradient-clipping threshold
  - `dim_word`: dimension of the GloVe word embeddings
  - `dim_char`: dimension of the trainable character embeddings
  - `BUILD_VOCAB`: whether to perform vocabulary building in `build_data.py`
  - `TRIM_DATA`: whether to perform dataset trimming in `build_data.py`
  - and so on ...
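  For a rough idea of the style (the values below are placeholders, not the repository's defaults), such a configuration file might contain entries along these lines:

  ```python
  # model/config.py -- illustrative values only
  use_crf     = True    # use a linear-chain CRF for decoding
  batch_sz    = 20      # minibatch size
  nepochs     = 15      # number of training epochs
  dropout     = 0.5     # dropout probability
  clip_thr    = 5.0     # gradient-clipping threshold
  dim_word    = 300     # GloVe word-embedding dimension
  dim_char    = 100     # trainable character-embedding dimension
  BUILD_VOCAB = True    # build vocabularies in build_data.py
  TRIM_DATA   = True    # trim the dataset in build_data.py
  ```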
- By default, the code is required to run on a GPU device. To switch back to CPU, just delete the following line at the very beginning of `train.py` and `test.py` (see the sketch under the next note): `dynet_config.set_gpu()`
- By default, auto-batching is enabled. To turn it off, just delete the following line at the very beginning of `train.py` and `test.py`: `dynet_config.set(mem="1024", autobatch=True)`
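  Taken together, the very top of `train.py` and `test.py` presumably looks roughly like this (a sketch; the surrounding imports may differ), since the DyNet configuration has to happen before `dynet` itself is imported:

  ```python
  import dynet_config
  dynet_config.set(mem="1024", autobatch=True)  # delete to disable auto-batching
  dynet_config.set_gpu()                        # delete to run on CPU
  import dynet as dy                            # must come after the configuration above
  ```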
- Feel free to open pull requests and add comments!