Stanza v1.4.0 #1013

AngledLuffa · 2022-04-23T06:01:01Z

AngledLuffa
Apr 23, 2022
Maintainer

Stanza v1.4.0: Transformer integration to NER and conparse

Overview

In this Stanza release, we integrate transformer inputs into the NER and conparse (constituency parser) modules. We also add NER and conparse models for several more languages.

Pipeline interface improvements

Download resources.json and models into temp dirs first to avoid race conditions between multiple processors
    Using stanza.download in multiprocessing leads to JSON decode bugs #213
    Download files to a tempdir created underneath the expected destinati… #1001
Download models for Pipelines automatically, without needing to call stanza.download(...)
    Using models offline #486
    Download deps #943
Add ability to turn off downloads
    68455d8
Add a new interface where both processors and package can be set
    tokenize with spacy #917
    f370429
When using pretokenized tokens, get character offsets from text if available
    tokenize_pretokenized=True given fixed text and tokens #967
    Tokenize preexisting #975
If Bert or other transformers are used, cache the models rather than loading multiple times
    Cache loading bert/transformer models in the pipeline #980
Allow for disabling processors on individual runs of a pipeline
    Disable processors dynamically in the call function #945
    Allow for selectively using processors. Answers #945 #947
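
The first item above (downloading into a temp dir before moving files into place) can be sketched in a few lines. This is only an illustration of the general technique, not Stanza's actual code: the function name `download_to` and the `fetch` callback are hypothetical, and the sketch assumes the temp dir and the destination are on the same filesystem so that `os.replace` is an atomic rename.

```python
import os
import tempfile


def download_to(dest_path, fetch):
    """Write the bytes returned by fetch() to dest_path without exposing a partial file.

    `fetch` is a hypothetical zero-argument callable standing in for the
    real network download; it must return the file contents as bytes.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # The temp dir is created underneath the destination directory, so the
    # final rename stays on one filesystem.
    with tempfile.TemporaryDirectory(dir=dest_dir) as tmp_dir:
        tmp_path = os.path.join(tmp_dir, os.path.basename(dest_path))
        with open(tmp_path, "wb") as fout:
            fout.write(fetch())
        # Readers see either the old file or the complete new one,
        # never a half-written file.
        os.replace(tmp_path, dest_path)
    return dest_path
```

With this pattern, two processes that download the same file at once each write to their own temp dir, and whichever finishes last simply replaces the file with an identical complete copy.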

Other general improvements

Add # text and # sent_id to conll output
    output sentences to file #918
    Add a #text comment to each sentence in a doc if it doesn't already exist #983
    Process sentence ids from the corpus, if available. Change sentence.… #995
Add ner to the token conll output
    Ner column for Italian connlu format #993
    Token ner #996
Fix missing Slovak MWT model
    Slovak multiword doesn't work #971
    5aa19ec
Upgrades to EN, IT, and Indonesian models
    Lemmatization does not appear to be working for Indonesian (GSD). #1003
    En combined #1008
    IT improvements with the help of @attardi and @msimi
Fix improper tokenization of Chinese text with leading whitespace
    Strange sentences division in Chinese #920
    Fix charoffset in edge case when a token begins with whitespace and skip_newline is enabled #924
Check if a CoreNLP model exists before downloading it (thank you @Internull)
    Check corenlp model file existence before downloading #965
Convert the run_charlm script to python
    Run charlm #942
Typing and lint fixes (thank you @asears)
    fix args tagmethod error message, formats, types #833
    Add typing, remove unused imports, add missing re import, format #856
stanza-train examples now compatible with the python training scripts
    ValueError: Cannot find '# text' #896
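
To make the `# text` and `# sent_id` item above concrete, here is a minimal sketch of adding those comments only when a sentence does not already carry them. The dict layout (`text`, `comments`, `token_lines`) and the function name are hypothetical and do not mirror Stanza's internal classes; falling back to the sentence's position in the list as its id is also an assumption.

```python
def to_conll_with_comments(sentences):
    """Render sentences as CoNLL-U style blocks, filling in missing comments.

    Each sentence is a dict (a hypothetical layout for this sketch) with:
      "text":        the raw sentence text
      "token_lines": already formatted token rows
      "comments":    optional list of existing comment lines
    """
    blocks = []
    for idx, sent in enumerate(sentences):
        comments = list(sent.get("comments", []))
        # Keep ids that came from the corpus; otherwise fall back to the index.
        if not any(c.startswith("# sent_id") for c in comments):
            comments.append(f"# sent_id = {idx}")
        # Only add the text comment if it does not already exist.
        if not any(c.startswith("# text") for c in comments):
            comments.append(f"# text = {sent['text']}")
        blocks.append("\n".join(comments + sent["token_lines"]))
    # Sentences are separated, and the output terminated, by a blank line.
    return "\n\n".join(blocks) + "\n\n"
```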

NER features

Bert integration (not by default, thank you @vythaihn)
    Add bert embeddings to the bottom layer of the NER. #976
Swedish model (thank you @EmilStenstrom)
    Feature request: Support for Swedish NER #912
    Sv ner #857
Persian model
    Stanza with Persian Models #797
Danish model
    3783cc4
Norwegian model (both NB and NN)
    31fa23e
Use updated Ukrainian data (thank you @gawy)
    Split uk sentences #873
Myanmar model (thank you UCSY)
    My ner #845
Training improvements for finetuning models
    Is there an API to update existing NER models? #788
    Minor ner #791
Fix inconsistencies in B/S/I/E tags
    How can i run multiple stanza NER models parallel to eachother? #928 (comment)
    Fix a variety of tagging errors which can occur when tag sequences st… #961
Add an option for multiple NER models at the same time, merging the results together
    How can i run multiple stanza NER models parallel to eachother? #928
    Add a field for multiple NER annotations in a tuple #955
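
As an illustration of what "inconsistencies in B/S/I/E tags" can look like, the sketch below normalizes a tag sequence so that every entity is either a single `S-` tag or a `B-`, `I-`..., `E-` run. The repair policy is an assumption chosen for this example, not the logic from the linked PR; it also assumes every tag is either `O` or has the form `PREFIX-TYPE`.

```python
def repair_bioes(tags):
    """Return a copy of `tags` in which every entity span is well formed."""
    spans = []  # (start, end_exclusive, entity_type)
    start, current = None, None

    def close(end):
        nonlocal start, current
        if current is not None:
            spans.append((start, end, current))
        start, current = None, None

    for i, tag in enumerate(tags):
        if tag == "O":
            close(i)
            continue
        prefix, _, etype = tag.partition("-")
        # B and S always open a span; so does an I or E whose type differs
        # from the span currently being read (for example a stray I-PER).
        if prefix in ("B", "S") or etype != current:
            close(i)
            start, current = i, etype
        # E and S always end the span at this token.
        if prefix in ("E", "S"):
            close(i + 1)
    close(len(tags))

    # Re-emit each span with consistent prefixes.
    fixed = ["O"] * len(tags)
    for s, e, etype in spans:
        if e - s == 1:
            fixed[s] = "S-" + etype
        else:
            fixed[s] = "B-" + etype
            for j in range(s + 1, e - 1):
                fixed[j] = "I-" + etype
            fixed[e - 1] = "E-" + etype
    return fixed
```

Under this policy a sequence that starts with `I-PER I-PER` is rewritten as `B-PER E-PER`, and a lone `B-LOC` followed by `O` becomes `S-LOC`; already well-formed sequences pass through unchanged.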

Constituency parser

Dynamic oracle (improves accuracy a bit)
    Con oracle #866
Missing tags now okay in the parser
    KeyError: "Constituency parser not trained with tag 'GW'" #862
    04dbf4f
Fix ( and ) not being escaped when a tree is output
    eaf134c
charlm integration by default
    Con charlm #799
Bert integration (not the default model) (thank you @vythaihn and @hungbui0411)
    05a0b04
    0bbe8d1
Preemptive bugfix for incompatible devices from @zhaochaocs
    argument tensor problem (dev branch) #989
    fix: move pe to the appropriate device #1002
New models:
    DA, based on Arboretum
    IT, based on the Turin treebank
    JA, based on ALT
    PT, based on Cintil
    TR, based on Starlang
    ZH, based on CTB7
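
The parenthesis-escaping fix in the list above matters because a bracketed tree uses ( and ) as structure, so a literal parenthesis in the text makes the output ambiguous. A minimal sketch of the idea follows. It assumes the Penn Treebank convention of writing `-LRB-` and `-RRB-`, and the nested-tuple tree layout and function names are hypothetical rather than Stanza's tree class.

```python
# Penn Treebank style replacements for literal parentheses (an assumption
# for this sketch; the release note does not say which escapes are used).
ESCAPES = {"(": "-LRB-", ")": "-RRB-"}


def escape(text):
    """Replace literal parentheses so they cannot be read as tree brackets."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def format_tree(node):
    """Format a tree given as nested tuples: (label, child, child, ...).

    A leaf is a plain string holding the word.
    """
    if isinstance(node, str):
        return escape(node)
    label, *children = node
    parts = [escape(label)] + [format_tree(child) for child in children]
    return "(" + " ".join(parts) + ")"
```

After escaping, every ( and ) left in the output is structural, so the brackets stay balanced and the string can be parsed back into a tree.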

This discussion was created from the release Stanza v1.4.0.
