Interaction XML
Interaction XML is the file format TEES uses for event and relation extraction. In a sense, interaction XML is the interface of TEES, as all TEES components read and write it. Customized processing pipelines can thus be made from components that are linked by only a shared file format, and TEES can be linked to larger NLP pipelines by converting to and from this interaction XML representation. Below is an example of an interaction XML file, containing one sentence from the BioNLP'11 GENIA development set.
<corpus source="GE">
<document id="GE.d6">
<sentence charOffset="0-33" id="GE.d6.s0" tail=" " text="BMP-6 induces upregulation of Id1">
<entity charOffset="0-5" headOffset="0-5" id="GE.d6.s0.e0" isName="True" origId="PMC-1134658-06-Results-05.T1" origOffset="0-5" text="BMP-6" type="Protein" />
<entity charOffset="30-33" headOffset="30-33" id="GE.d6.s0.e1" isName="True" origId="PMC-1134658-06-Results-05.T2" origOffset="30-33" text="Id1" type="Protein" />
<entity charOffset="6-13" headOffset="6-13" id="GE.d6.s0.e28" isName="False" origId="PMC-1134658-06-Results-05.T29" origOffset="6-13" text="induces" type="Positive_regulation" />
<entity charOffset="14-26" headOffset="14-26" id="GE.d6.s0.e29" isName="False" origId="PMC-1134658-06-Results-05.T30" origOffset="14-26" text="upregulation" type="Positive_regulation" />
<interaction directed="True" e1="GE.d6.s0.e28" e2="GE.d6.s0.e29" id="GE.d6.s0.i0" origId="PMC-1134658-06-Results-05.E1.0" type="Theme" />
<interaction directed="True" e1="GE.d6.s0.e28" e2="GE.d6.s0.e0" id="GE.d6.s0.i1" origId="PMC-1134658-06-Results-05.E1.1" type="Cause" />
<interaction directed="True" e1="GE.d6.s0.e29" e2="GE.d6.s0.e1" id="GE.d6.s0.i2" origId="PMC-1134658-06-Results-05.E2.0" type="Theme" />
<analyses>
<tokenization ProteinNameSplitter="True" source="BioNLP'11" tokenizer="McCC">
<token POS="NN" charOffset="0-5" headScore="1" id="st_0" text="BMP-6" />
<token POS="VBZ" charOffset="6-13" headScore="1" id="st_1" text="induces" />
<token POS="NN" charOffset="14-26" headScore="1" id="st_2" text="upregulation" />
<token POS="IN" charOffset="27-29" headScore="0" id="st_3" text="of" />
<token POS="NN" charOffset="30-33" headScore="1" id="st_4" text="Id1" />
</tokenization>
<parse ProteinNameSplitter="True" parser="McCC" pennstring="(S1 (S (NP (NN BMP-6)) (VP (VBZ induces) (NP (NP (NN upregulation)) (PP (IN of) (NP (NN Id1)))))))" source="BioNLP'11" stanford="ok" tokenizer="McCC">
<dependency id="sd_0" t1="st_1" t2="st_0" type="nsubj" />
<dependency id="sd_1" t1="st_1" t2="st_2" type="dobj" />
<dependency id="sd_2" t1="st_2" t2="st_4" type="prep_of" />
<phrase begin="0" charOffset="0-5" end="0" id="bp_0" type="NP" />
<phrase begin="0" charOffset="0-33" end="4" id="bp_1" type="S" />
<phrase begin="0" charOffset="0-33" end="4" id="bp_2" type="S1" />
<phrase begin="1" charOffset="6-33" end="4" id="bp_3" type="VP" />
<phrase begin="2" charOffset="14-26" end="2" id="bp_4" type="NP" />
<phrase begin="2" charOffset="14-33" end="4" id="bp_5" type="NP" />
<phrase begin="3" charOffset="27-33" end="4" id="bp_6" type="PP" />
<phrase begin="4" charOffset="30-33" end="4" id="bp_7" type="NP" />
</parse>
</analyses>
</sentence>
</document>
</corpus>
The root node of each interaction XML file is a corpus element. A corpus consists of documents, which can represent spans of text such as abstracts. The main annotation elements in the interaction XML format are entities and interactions. Entities are the named entities (such as proteins) and the interaction triggers (such as verbs), and represent a span of the text with a specific role. Interactions are the relations that exist between the entities, and they can have a direction and a type. The annotation in interaction XML format is a graph: the entities are the nodes, and the interactions are the edges.
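The graph view described above can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration, not TEES code: the `load_graph` helper is a hypothetical name, and the XML string is a trimmed copy of the example sentence with the analyses and `orig*` attributes left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example sentence (analyses and orig* attributes omitted).
XML = """<corpus source="GE">
<document id="GE.d6">
<sentence charOffset="0-33" id="GE.d6.s0" text="BMP-6 induces upregulation of Id1">
<entity charOffset="0-5" id="GE.d6.s0.e0" text="BMP-6" type="Protein" />
<entity charOffset="30-33" id="GE.d6.s0.e1" text="Id1" type="Protein" />
<entity charOffset="6-13" id="GE.d6.s0.e28" text="induces" type="Positive_regulation" />
<entity charOffset="14-26" id="GE.d6.s0.e29" text="upregulation" type="Positive_regulation" />
<interaction directed="True" e1="GE.d6.s0.e28" e2="GE.d6.s0.e29" id="GE.d6.s0.i0" type="Theme" />
<interaction directed="True" e1="GE.d6.s0.e28" e2="GE.d6.s0.e0" id="GE.d6.s0.i1" type="Cause" />
<interaction directed="True" e1="GE.d6.s0.e29" e2="GE.d6.s0.e1" id="GE.d6.s0.i2" type="Theme" />
</sentence>
</document>
</corpus>"""


def load_graph(xml_text):
    """Hypothetical helper: return (nodes, edges) for an interaction XML string.

    nodes maps entity id -> (type, text); edges is a list of (e1, type, e2).
    """
    root = ET.fromstring(xml_text)
    nodes = {}
    edges = []
    for document in root.iter("document"):
        # Entities are the nodes of the annotation graph.
        for entity in document.iter("entity"):
            nodes[entity.get("id")] = (entity.get("type"), entity.get("text"))
        # Interactions are the edges, running from e1 to e2.
        for interaction in document.iter("interaction"):
            edges.append((interaction.get("e1"), interaction.get("type"), interaction.get("e2")))
    return nodes, edges


nodes, edges = load_graph(XML)
for e1, edge_type, e2 in edges:
    print(nodes[e1][1], "-" + edge_type + "->", nodes[e2][1])
```

Keeping the nodes in a dictionary keyed by id makes it cheap to resolve the `e1` and `e2` references of each edge.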
Interactions can exist between words in a document, but cannot cross document boundaries. The text of the document can be divided into a number of sentences, which contain the entities and interactions that make up relations or events. Interactions can cross sentence boundaries, although at present TEES cannot detect such interactions. If an interaction connects two entities in different sentences, it must be placed in the same sentence as the entity its e1 attribute refers to.
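The placement rule for cross-sentence interactions can be checked mechanically. The following is a minimal sketch using `xml.etree.ElementTree`; the function name `misplaced_interactions` and the two-sentence document are made up for illustration and are not part of TEES.

```python
import xml.etree.ElementTree as ET


def misplaced_interactions(document):
    """Return ids of interactions not stored in the sentence that holds their e1 entity."""
    # Map each entity id to the id of the sentence element that contains it.
    sentence_of = {}
    for sentence in document.iter("sentence"):
        for entity in sentence.iter("entity"):
            sentence_of[entity.get("id")] = sentence.get("id")
    problems = []
    for sentence in document.iter("sentence"):
        for interaction in sentence.iter("interaction"):
            if sentence_of.get(interaction.get("e1")) != sentence.get("id"):
                problems.append(interaction.get("id"))
    return problems


# Hypothetical document: both interactions cross the sentence boundary.
# i0 sits with its e1 entity (correct); i1 sits with its e2 entity (incorrect).
DOC = ET.fromstring("""<document id="d0">
<sentence id="d0.s0"><entity id="d0.s0.e0" /><interaction id="d0.s0.i0" e1="d0.s0.e0" e2="d0.s1.e0" /></sentence>
<sentence id="d0.s1"><entity id="d0.s1.e0" /><interaction id="d0.s1.i1" e1="d0.s0.e0" e2="d0.s1.e0" /></sentence>
</document>""")

print(misplaced_interactions(DOC))
```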
TEES represents both binary relations and complex events with the same graph format. Events are defined implicitly: an entity and its set of outgoing interaction edges define the trigger and arguments of a single event. The example sentence above contains two events, each defined by a trigger entity and its outgoing edges. The first event consists of the entity GE.d6.s0.e28 and its outgoing edges GE.d6.s0.i0 and GE.d6.s0.i1, and the second of the entity GE.d6.s0.e29 and its outgoing edge GE.d6.s0.i2.
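To make the implicit event definition concrete, the sketch below groups the interactions of the example sentence by their `e1` entity. It is an illustration rather than TEES's own logic: `implicit_events` is a hypothetical helper, the sentence is trimmed to ids and types, and every outgoing edge is treated as an event argument.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the example sentence: ids and types only.
SENTENCE = ET.fromstring("""<sentence id="GE.d6.s0">
<entity id="GE.d6.s0.e0" type="Protein" />
<entity id="GE.d6.s0.e1" type="Protein" />
<entity id="GE.d6.s0.e28" type="Positive_regulation" />
<entity id="GE.d6.s0.e29" type="Positive_regulation" />
<interaction directed="True" e1="GE.d6.s0.e28" e2="GE.d6.s0.e29" id="GE.d6.s0.i0" type="Theme" />
<interaction directed="True" e1="GE.d6.s0.e28" e2="GE.d6.s0.e0" id="GE.d6.s0.i1" type="Cause" />
<interaction directed="True" e1="GE.d6.s0.e29" e2="GE.d6.s0.e1" id="GE.d6.s0.i2" type="Theme" />
</sentence>""")


def implicit_events(sentence):
    """Map each trigger entity id to its list of (argument type, target entity id)."""
    outgoing = defaultdict(list)
    for interaction in sentence.iter("interaction"):
        # Assumption: every outgoing edge counts as an argument of the e1 entity's event.
        outgoing[interaction.get("e1")].append((interaction.get("type"), interaction.get("e2")))
    return dict(outgoing)


events = implicit_events(SENTENCE)
for trigger, arguments in events.items():
    print(trigger, arguments)
```

Entities without outgoing edges, here the two proteins, never become dictionary keys, so only the two trigger entities are reported as events.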
Interaction elements that represent event arguments must have the attribute "event" set to "True". The "event" attribute can also be set for the trigger node, but there it is used only when converting the interaction XML to the BioNLP Shared Task file format. Interaction elements whose "event" attribute is not "True" are treated as pairwise relations. Event arguments are always directed, whereas relations can also be undirected, which is the case when their "directed" attribute is not "True".
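The attribute rules above amount to a small decision procedure. The sketch below shows one way to express it; the `classify` function, its labels, and the three interaction elements are hypothetical and serve only to illustrate the rules.

```python
import xml.etree.ElementTree as ET


def classify(interaction):
    """Return 'event argument', 'directed relation' or 'undirected relation'."""
    # Only the exact string "True" counts; a missing attribute means "not True".
    if interaction.get("event") == "True":
        return "event argument"
    if interaction.get("directed") == "True":
        return "directed relation"
    return "undirected relation"


# Hypothetical interaction elements, one per case.
argument = ET.fromstring('<interaction id="i0" e1="e2" e2="e0" type="Theme" directed="True" event="True" />')
directed = ET.fromstring('<interaction id="i1" e1="e0" e2="e1" type="Rel" directed="True" />')
undirected = ET.fromstring('<interaction id="i2" e1="e0" e2="e1" type="Rel" directed="False" />')

for element in (argument, directed, undirected):
    print(element.get("id"), classify(element))
```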
Each sentence can contain an analyses element, which stores supporting information for event and relation extraction, such as automatically generated parses. Most elements have an id attribute, which should be unique for that element within its scope. When TEES generates these ids automatically, it ensures uniqueness with a nested hierarchical scheme (in the example, GE.d6.s0.e28 is an entity of sentence GE.d6.s0, which belongs to document GE.d6), which also makes it possible to grep the XML document for a specific level of items.
For users already familiar with interaction XML, the main change in TEES 2.0 concerns character offsets: the end of a span is now the index immediately after its last character, a common convention in programming languages such as Python and Java. In earlier versions of TEES, the end of the span was the index of the last character in the span.
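Because the TEES 2.0 offsets are end-exclusive, they map directly onto Python slices. The helpers below are a hypothetical sketch, not TEES functions. They assume a single "begin-end" span per attribute and offsets that index into the sentence's text attribute, as they do in the example, where the sentence starts at offset 0.

```python
def parse_offset(value):
    """Split a 'begin-end' charOffset string into two integers."""
    begin, end = value.split("-")
    return int(begin), int(end)


def span_text(sentence_text, char_offset):
    """Return the text covered by an end-exclusive charOffset."""
    begin, end = parse_offset(char_offset)
    return sentence_text[begin:end]


def from_inclusive_end(char_offset):
    """Convert an offset whose end is the last character's index to the end-exclusive form."""
    begin, end = parse_offset(char_offset)
    return "%d-%d" % (begin, end + 1)


TEXT = "BMP-6 induces upregulation of Id1"
print(span_text(TEXT, "6-13"))     # induces
print(span_text(TEXT, "30-33"))    # Id1
print(from_inclusive_end("6-12"))  # 6-13
```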