The Preprocessor
The Preprocessor consists of a general pipeline of processing steps used to perform various operations on text corpora in the Interaction XML format. Most commonly, the Preprocessor is used to import a corpus into the Interaction XML format and run syntactic parsing, named entity recognition and other supporting analyses required for event extraction. The preprocessed Interaction XML files can then be used for training a new TEES model, or events can be extracted from them by classification.
The Preprocessor is run automatically as part of the classify.py and train.py programs, so in basic event extraction use cases you don't need to interact with it directly. However, the Preprocessor can also be used on its own, for example to produce customized analyses for corpora to be used with TEES or with a completely different text mining tool. The Preprocessor is used via the 'preprocess.py' program, found at the top level of the TEES package.
To see a list of the available preprocessor steps, run the program without any arguments:
python preprocess.py
To use the preprocessor to process your data, you must define a pipeline of one or more of these steps, using the '--steps' argument:
python preprocess.py -i [INPUT] -o [OUTPUT] --steps A,B,C
Each step takes at least the arguments 'input' and 'output', along with the optional arguments shown in the step list. As a convenience, the 'input' argument of the first step and the 'output' argument of the last step can be defined separately with the '--input' (or '-i') and '--output' (or '-o') command line options. Therefore, the above example command is identical to the command:
python preprocess.py --steps "A(input='[INPUT]'),B,C(output='[OUTPUT]')"
Passing the input and output arguments directly to the relevant steps demonstrates how any step argument can be configured. The '--steps' list is evaluated as a Python expression, with step arguments behaving like function arguments. Any Python data structure that can be evaluated without additional libraries can thus be passed as a preprocessing step argument.
In the above example, the 'input' and 'output' arguments receive the string literals '[INPUT]' and '[OUTPUT]'. Note that not all Python syntax passes through command line shells unmodified, so you will usually need to enclose the list of steps in quotation marks.
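Since the step list is evaluated as Python, structured arguments such as lists or numbers can be passed directly. In the following sketch the step name 'EXAMPLE_STEP' and its 'idList' and 'limit' arguments are purely hypothetical, included only to illustrate the syntax:
python preprocess.py -o /tmp/out.xml --steps "LOAD(input='GE09'),EXAMPLE_STEP(idList=['a','b'],limit=10),SAVE"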
A pipeline consists of steps and, optionally, their arguments. The output of each preprocessing step is the input of the following step, so only steps producing and consuming compatible formats can be chained. In practice, most preprocessing steps use an Interaction XML file as both the input and output arguments.
As an example of a real preprocessing pipeline, let's analyze the contents of the GE09 corpus (if you have not yet installed the corpora, please use 'configure.py' to install this corpus). Run the following command:
python preprocess.py -i GE09 --steps LOAD,ANALYZE_STRUCTURE
This preprocessing pipeline loads (with the step 'LOAD') the BioNLP'09 Shared Task corpus (GE09), merging its train, devel and test sets into a single Interaction XML file. Next, the step 'ANALYZE_STRUCTURE' receives as its input the output of the 'LOAD' step (the Interaction XML) and shows a list of the annotation types within this corpus. Since 'ANALYZE_STRUCTURE' produces no output file, we don't need to define the '-o' option.
To see how arguments can be passed directly to the preprocessing steps, we can run the above example with the equivalent command:
python preprocess.py --steps "LOAD(input='GE09'),ANALYZE_STRUCTURE"
Most pipelines start with the 'LOAD' step, to either load a pre-existing corpus, or to convert a corpus (and optionally some parses) into the Interaction XML format. Depending on what you give 'LOAD' as the input argument, you will get the following results:
| Input | Example | Result |
|---|---|---|
| PubMed ID | 9668063 | Interaction XML with a single document for the PubMed abstract |
| Corpus identifier | GE09 | Installed corpus with all sets merged into a single Interaction XML file |
| Exact path to an Interaction XML file | ~/.tees/corpora/GE09-devel.xml | The Interaction XML file is loaded |
| Directory with BioNLP ST-format (txt, a1, a2) files | ~/corpus | XML with (optionally annotated) document elements |
| Directory with BioNLP ST-format (txt, a1, a2) and parse (ptb, sd, conll, etc.) files | ~/corpus | XML with (optionally annotated) sentence elements |
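For example, assuming a directory '~/corpus' containing BioNLP ST-format files as in the table above, the following command would convert it into a single Interaction XML file:
python preprocess.py -i ~/corpus -o /tmp/corpus.xml --steps LOAD,SAVE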
For the 'SAVE' step, the '*' wildcard in the output argument can be used to modify the behaviour in the following ways:
| Output | Example | Result |
|---|---|---|
| Filename | /tmp/corpus.xml | The Interaction XML is saved to a single file |
| Filename with wildcard | /tmp/corpus-*.xml | Subsets (usually train, devel and test) are saved to their own Interaction XML files |
| Path with file type | /tmp/corpus/*.txt | The corpus documents (txt, a1 or a2) or the parses (e.g. conll) are exported into one of the supported formats |
python preprocess.py -i GE09-devel.xml -o "/tmp/GE09-devel.xml" --steps LOAD,SAVE
Loads the development set of the GE09 corpus and saves it under the tmp directory.
python preprocess.py -i GE09 -o "/tmp/GE09-*.xml" --steps LOAD,SAVE
Loads the GE09 corpus and saves its subsets (train, devel and test) into Interaction XML files.
python preprocess.py -i GE09 -o "/tmp/GE09-corpus/*.st" --steps LOAD,SAVE
Loads the GE09 corpus and saves it in the BioNLP Shared Task format (txt, a1 and a2).
python preprocess.py -i GE09 -o "/tmp/GE09-corpus/*.conll" --steps LOAD,SAVE
Loads the GE09 corpus and exports its included parses in the CoNLL format.
The Interaction XML file format consists of 'document' elements (representing spans of text such as publication abstracts) which can contain 'sentence' elements. An unparsed corpus has no information about sentence boundaries, so such Interaction XML files contain only 'document' elements, with the 'entity' and 'interaction' elements stored directly under them. Once the corpus is processed with a sentence splitter (such as 'GENIA_SPLITTER', or 'IMPORT_PARSE' for an unparsed corpus), 'sentence' elements are added within the document elements and the 'entity' and 'interaction' elements are moved under them. To undo sentence splitting (i.e. remove the 'sentence' elements and move the 'entity' and 'interaction' elements back under the 'document' elements), use the 'MERGE_SENTENCES' preprocessing step.
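As a rough illustration of this structure, a sentence-split document might look like the sketch below. The element nesting follows the description above, but the attribute names and values shown here are simplified and purely illustrative:

    <corpus>
      <document id="d0" origId="9668063">
        <sentence id="d0.s0" text="...">
          <entity id="d0.s0.e0" type="Protein" text="..." />
          <interaction id="d0.s0.i0" type="Theme" e1="..." e2="..." />
          <analyses>...</analyses>
        </sentence>
      </document>
    </corpus>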
The TEES preprocessor contains a number of wrappers for different parsers. Most of these parsers can be installed using the 'configure.py' program. Using the parsers, it is possible to preprocess any Interaction XML file (or even plain txt documents) into a form usable for event extraction. As an example, let's download a PubMed abstract and prepare it for classification with TEES. Please note that you will need to have installed the GENIA Sentence Splitter, BANNER Named Entity Recognizer and the BLLIP and Stanford Parser tools with 'configure.py' before trying this example.
python preprocess.py -i 9668063 -o /tmp/PM-9668063.xml --steps LOAD,GENIA_SPLITTER,BANNER,BLLIP_BIO,STANFORD_CONVERT,SPLIT_NAMES,FIND_HEADS,SAVE
Here a PubMed abstract is downloaded by giving its ID as the input of the 'LOAD' step. Then, 'GENIA_SPLITTER' is used to divide the abstract into sentences. 'BANNER' will find named entities (proteins and genes) within these sentences, and the 'BLLIP_BIO' and 'STANFORD_CONVERT' steps will produce syntactic parses for them. The 'SPLIT_NAMES' and 'FIND_HEADS' steps finalize the parses into the form commonly used by TEES, and finally 'SAVE' writes out the Interaction XML file.
This preprocessing pipeline is the same as the one performed automatically by 'classify.py' when classifying unpreprocessed input.
Sometimes, you may wish to reparse an existing corpus (usually with a different parser). To reparse a corpus, the following pipeline can be used:
python preprocess.py -i GE09 -o "/tmp/GE09-*.xml" --steps LOAD,CLEAR_PARSE,GENIA_SPLITTER,BLLIP_BIO,STANFORD_CONVERT,SPLIT_NAMES,FIND_HEADS,SAVE
Here, all parsing-related information is first removed from the GE09 corpus with the step 'CLEAR_PARSE', which is an alias for the three consecutive steps 'REMOVE_ANALYSES' (removes 'analyses' elements), 'REMOVE_HEADS' (removes entity head offsets) and 'MERGE_SENTENCES' (removes sentence elements and moves entities and interactions back under the document elements). After all parse information has been removed, the default parsing pipeline is run using the same steps as in the previous example.
Similarly to reparsing, existing parses can be imported from common parser formats. Currently supported formats include the Penn Treebank (ptb), Stanford Dependencies (sd), CoNLL (conll, conllx, conllu and tt), CoreNLP (corenlp), EPE json (epe) and simple tokenized text with one sentence per line (tok). The parser output for each 'document' element must be saved to a file named using the document's 'origId' attribute as the file name and the parse type as the extension, for example "1527859.conll".
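Under this convention, a parse directory for documents with the origIds '1527859' and '9668063' would contain, for example:

    /tmp/GE09-devel-parse/
        1527859.conll
        9668063.conll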
As an example, we will first export the default parses of the GE09 corpus development set in the CoNLL format:
python preprocess.py -i GE09-devel.xml -o "/tmp/GE09-devel-parse/*.conll" --steps LOAD,SAVE
Then, we will remove all parsing and sentence splitting information from the same GE09 corpus file and re-insert the parses from the CoNLL format files exported in the previous step:
python preprocess.py -i GE09-devel.xml -o "/tmp/GE09-devel-imported-parse.xml" --steps "LOAD,CLEAR_PARSE,IMPORT_PARSE(parseDir='/tmp/GE09-devel-parse'),SPLIT_NAMES,FIND_HEADS,SAVE"
Here the directory of the parse files is defined as an argument of the 'IMPORT_PARSE' preprocessing step. Sentence splitting was removed with the 'MERGE_SENTENCES' step (included in the 'CLEAR_PARSE' alias), but the 'IMPORT_PARSE' step uses the sentence division from the CoNLL files to redivide the document elements into sentences.
Some of the tools used in the preprocessing pipeline can crash or enter an infinite loop when encountering unusual text structures (see Björne et al. 2010). The TEES Preprocessor was designed for batch processing of large (PubMed-scale) datasets and for minimizing the impact of such preprocessing errors. Therefore, the tools in the preprocessing pipeline are run through a wrapper that follows their progress and, when it detects an error, skips just the problematic sentence. Parsing errors in individual sentences thus will not halt the whole preprocessing run.
Because of this, please check the preprocessing log and output files to see the parsing coverage for your documents. By default, the Preprocessor log is stored in a file named '[OUTPUT]-log.txt', based on the output option. A filename for the log file can also be defined explicitly with the '--logPath' option.
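For example, to write the log of a simple pipeline to an explicitly chosen file (using the '--logPath' option described above):
python preprocess.py -i GE09 -o /tmp/GE09.xml --logPath /tmp/GE09-preprocess-log.txt --steps LOAD,SAVE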