jbjorne edited this page Feb 25, 2013 · 6 revisions

Interaction XML is the file format TEES uses for event and relation extraction. In a sense, interaction XML is the interface of TEES, as all TEES components read and write it. Customized processing pipelines can thus be built from components that are linked only by a shared file format, and TEES can be connected to larger NLP pipelines by converting to and from this interaction XML representation. Below is an example of an interaction XML file, containing one sentence from the BioNLP'11 GENIA development set.

<corpus source="GE11">
  <document id="GE11.d6">
    <sentence charOffset="0-33" id="GE11.d6.s0" tail="&#10;" text="BMP-6 induces upregulation of Id1">
      <entity charOffset="0-5" given="True" headOffset="0-5" id="GE11.d6.s0.e0" origId="PMC-1134658-06-Results-05.T1" origOffset="0-5" text="BMP-6" type="Protein" />
      <entity charOffset="30-33" given="True" headOffset="30-33" id="GE11.d6.s0.e1" origId="PMC-1134658-06-Results-05.T2" origOffset="30-33" text="Id1" type="Protein" />
      <entity charOffset="6-13" event="True" headOffset="6-13" id="GE11.d6.s0.e28" origId="PMC-1134658-06-Results-05.T29" origOffset="6-13" text="induces" type="Positive_regulation" />
      <entity charOffset="14-26" event="True" headOffset="14-26" id="GE11.d6.s0.e29" origId="PMC-1134658-06-Results-05.T30" origOffset="14-26" text="upregulation" type="Positive_regulation" />
      <interaction directed="True" e1="GE11.d6.s0.e28" e2="GE11.d6.s0.e29" event="True" id="GE11.d6.s0.i0" origId="PMC-1134658-06-Results-05.E1.0" type="Theme" />
      <interaction directed="True" e1="GE11.d6.s0.e28" e2="GE11.d6.s0.e0" event="True" id="GE11.d6.s0.i1" origId="PMC-1134658-06-Results-05.E1.1" type="Cause" />
      <interaction directed="True" e1="GE11.d6.s0.e29" e2="GE11.d6.s0.e1" event="True" id="GE11.d6.s0.i2" origId="PMC-1134658-06-Results-05.E2.0" type="Theme" />
      <analyses>
        <tokenization ProteinNameSplitter="True" source="BioNLP&apos;11" tokenizer="McCC">
          <token POS="NN" charOffset="0-5" headScore="1" id="bt_0" text="BMP-6" />
          <token POS="VBZ" charOffset="6-13" headScore="1" id="bt_1" text="induces" />
          <token POS="NN" charOffset="14-26" headScore="1" id="bt_2" text="upregulation" />
          <token POS="IN" charOffset="27-29" headScore="0" id="bt_3" text="of" />
          <token POS="NN" charOffset="30-33" headScore="1" id="bt_4" text="Id1" />
        </tokenization>
        <parse ProteinNameSplitter="True" parser="McCC" pennstring="(S1 (S (NP (NN BMP-6)) (VP (VBZ induces) (NP (NP (NN upregulation)) (PP (IN of) (NP (NN Id1)))))))" source="BioNLP&apos;11" stanford="ok" tokenizer="McCC">
          <dependency id="sd_0" t1="bt_1" t2="bt_0" type="nsubj" />
          <dependency id="sd_1" t1="bt_1" t2="bt_2" type="dobj" />
          <dependency id="sd_2" t1="bt_2" t2="bt_4" type="prep_of" />
          <phrase begin="0" charOffset="0-5" end="0" id="bp_0" type="NP" />
          <phrase begin="0" charOffset="0-33" end="4" id="bp_1" type="S" />
          <phrase begin="0" charOffset="0-33" end="4" id="bp_2" type="S1" />
          <phrase begin="1" charOffset="6-33" end="4" id="bp_3" type="VP" />
          <phrase begin="2" charOffset="14-26" end="2" id="bp_4" type="NP" />
          <phrase begin="2" charOffset="14-33" end="4" id="bp_5" type="NP" />
          <phrase begin="3" charOffset="27-33" end="4" id="bp_6" type="PP" />
          <phrase begin="4" charOffset="30-33" end="4" id="bp_7" type="NP" />
        </parse>
      </analyses>
    </sentence>
  </document>
</corpus>

The root node of each interaction XML file is a corpus element. A corpus consists of documents, which can represent spans of text such as abstracts. The main annotation elements in the interaction XML format are entities and interactions. Entities are the named entities (such as proteins) and the interaction triggers (such as verbs), and represent a span of the text with a specific role. Interactions are the relations that exist between the entities, and they can have a direction and a type. The annotation in interaction XML format is a graph: the entities are the nodes, and the interactions are the edges.
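
The graph view described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not TEES's own utilities; the function name `build_graph` is hypothetical, and the sample is a trimmed copy of the example file with the analyses and some attributes left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example above (analyses and some attributes omitted).
SAMPLE = """<corpus source="GE11">
  <document id="GE11.d6">
    <sentence charOffset="0-33" id="GE11.d6.s0" text="BMP-6 induces upregulation of Id1">
      <entity charOffset="0-5" given="True" id="GE11.d6.s0.e0" text="BMP-6" type="Protein" />
      <entity charOffset="30-33" given="True" id="GE11.d6.s0.e1" text="Id1" type="Protein" />
      <entity charOffset="6-13" event="True" id="GE11.d6.s0.e28" text="induces" type="Positive_regulation" />
      <entity charOffset="14-26" event="True" id="GE11.d6.s0.e29" text="upregulation" type="Positive_regulation" />
      <interaction directed="True" e1="GE11.d6.s0.e28" e2="GE11.d6.s0.e29" event="True" id="GE11.d6.s0.i0" type="Theme" />
      <interaction directed="True" e1="GE11.d6.s0.e28" e2="GE11.d6.s0.e0" event="True" id="GE11.d6.s0.i1" type="Cause" />
      <interaction directed="True" e1="GE11.d6.s0.e29" e2="GE11.d6.s0.e1" event="True" id="GE11.d6.s0.i2" type="Theme" />
    </sentence>
  </document>
</corpus>"""


def build_graph(xml_string):
    """Return (nodes, edges): entities keyed by id, interactions as triples."""
    root = ET.fromstring(xml_string)
    nodes = {}
    edges = []
    for document in root.iter("document"):
        # Entities are the nodes of the annotation graph.
        for entity in document.iter("entity"):
            nodes[entity.get("id")] = (entity.get("type"), entity.get("text"))
        # Interactions are the edges, stored as (e1, type, e2).
        for interaction in document.iter("interaction"):
            edges.append(
                (interaction.get("e1"), interaction.get("type"), interaction.get("e2"))
            )
    return nodes, edges


nodes, edges = build_graph(SAMPLE)
for e1, edge_type, e2 in edges:
    print(nodes[e1][1], "-[" + edge_type + "]->", nodes[e2][1])
```

Keeping the nodes in a dictionary keyed by id makes it cheap to resolve the e1 and e2 references of each edge.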

The "given" attribute in entities and interactions marks known data that TEES does not predict but can learn from. For example, entities produced by the BANNER named entity recognizer would be given entities for TEES. The "given" attribute was called "isName" in TEES versions prior to 2.1, where it applied only to entity elements. While it can currently also be defined for interactions, it is not yet supported in edge detection.

Interactions as relations and event arguments

Interactions can exist between words in a document, but cannot cross document boundaries. The text of the document can be divided into a number of sentences, which contain the entities and interactions that make up relations or events. Interactions can cross sentence boundaries, although at present TEES cannot detect such interactions. If an interaction connects two entities in different sentences, it must be placed in the same sentence as the entity its e1 attribute refers to.
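
The placement rule for cross-sentence interactions can be checked mechanically. The sketch below is not part of TEES; the function `misplaced_interactions` and the two-sentence document with its `X.d0` ids are hypothetical and exist only to illustrate the rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: both interactions cross the sentence boundary,
# but only the first one sits in the sentence of its e1 entity.
DOCUMENT = """<document id="X.d0">
  <sentence id="X.d0.s0" text="A binds.">
    <entity id="X.d0.s0.e0" text="A" type="Protein" />
    <interaction id="X.d0.s0.i0" e1="X.d0.s0.e0" e2="X.d0.s1.e0" type="Binding" directed="True" />
  </sentence>
  <sentence id="X.d0.s1" text="B too.">
    <entity id="X.d0.s1.e0" text="B" type="Protein" />
    <interaction id="X.d0.s1.i0" e1="X.d0.s0.e0" e2="X.d0.s1.e0" type="Binding" directed="True" />
  </sentence>
</document>"""


def misplaced_interactions(document):
    """Return ids of interactions not stored in the sentence of their e1 entity."""
    problems = []
    for sentence in document.findall("sentence"):
        local_ids = {entity.get("id") for entity in sentence.findall("entity")}
        for interaction in sentence.findall("interaction"):
            if interaction.get("e1") not in local_ids:
                problems.append(interaction.get("id"))
    return problems


problems = misplaced_interactions(ET.fromstring(DOCUMENT))
print(problems)
```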

TEES represents both binary relations and complex events with the same graph format. Events are defined implicitly: an entity and its set of outgoing interaction edges define the trigger and arguments of a single event. The example sentence above contains two events. The first consists of the trigger entity GE11.d6.s0.e28 and its outgoing edges GE11.d6.s0.i0 and GE11.d6.s0.i1, and the second of the trigger entity GE11.d6.s0.e29 and its outgoing edge GE11.d6.s0.i2.
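
As an illustration of this implicit definition, the following sketch groups the event-argument edges of a sentence by their e1 entity, so that each trigger ends up with its list of arguments. It is a standard-library approximation, not the TEES implementation; `collect_events` is a hypothetical name, and the sentence is a trimmed copy of the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the example sentence (offsets and analyses omitted).
SENTENCE = """<sentence id="GE11.d6.s0" text="BMP-6 induces upregulation of Id1">
  <entity given="True" id="GE11.d6.s0.e0" text="BMP-6" type="Protein" />
  <entity given="True" id="GE11.d6.s0.e1" text="Id1" type="Protein" />
  <entity event="True" id="GE11.d6.s0.e28" text="induces" type="Positive_regulation" />
  <entity event="True" id="GE11.d6.s0.e29" text="upregulation" type="Positive_regulation" />
  <interaction directed="True" e1="GE11.d6.s0.e28" e2="GE11.d6.s0.e29" event="True" id="GE11.d6.s0.i0" type="Theme" />
  <interaction directed="True" e1="GE11.d6.s0.e28" e2="GE11.d6.s0.e0" event="True" id="GE11.d6.s0.i1" type="Cause" />
  <interaction directed="True" e1="GE11.d6.s0.e29" e2="GE11.d6.s0.e1" event="True" id="GE11.d6.s0.i2" type="Theme" />
</sentence>"""


def collect_events(sentence):
    """Map each trigger entity id to its (argument type, target id) pairs."""
    outgoing = defaultdict(list)
    for interaction in sentence.findall("interaction"):
        # Only edges marked as event arguments contribute to an event.
        if interaction.get("event") == "True":
            outgoing[interaction.get("e1")].append(
                (interaction.get("type"), interaction.get("e2"))
            )
    return dict(outgoing)


events = collect_events(ET.fromstring(SENTENCE))
for trigger_id, arguments in events.items():
    print(trigger_id, arguments)
```

Note that the Theme of the first event points at another trigger (GE11.d6.s0.e29), which is how nested events appear in the graph.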

Interaction elements that represent event arguments must have the attribute "event" set to "True". The "event" attribute can also be set for the trigger node, but there it is used only when converting the interaction XML to the BioNLP Shared Task file format. Interaction elements whose "event" attribute is not "True" are treated as pairwise relations. Event arguments are always directed, whereas relations are undirected when their "directed" attribute is not "True".
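
These attribute rules amount to a small decision procedure, sketched below. The function `classify` is hypothetical and not a TEES API. The first element is taken from the example file, while the second (ids starting with `X.d0`, type `PPI`) is invented here to show an undirected relation.

```python
import xml.etree.ElementTree as ET


def classify(interaction):
    """Return (kind, directed) for one interaction element."""
    if interaction.get("event") == "True":
        # Event arguments are always directed.
        return ("event argument", True)
    # Anything else is a pairwise relation, directed only if marked so.
    return ("relation", interaction.get("directed") == "True")


# Event argument from the example file.
event_arg = ET.fromstring(
    '<interaction directed="True" e1="GE11.d6.s0.e28" e2="GE11.d6.s0.e0" '
    'event="True" id="GE11.d6.s0.i1" type="Cause" />'
)
# Hypothetical relation with neither "event" nor "directed" set.
relation = ET.fromstring(
    '<interaction e1="X.d0.s0.e0" e2="X.d0.s0.e1" id="X.d0.s0.i0" type="PPI" />'
)

print(classify(event_arg))
print(classify(relation))
```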

Supporting analyses

Each sentence can contain an analyses element, which stores supporting information for event and relation extraction, such as automatically generated parses. Most elements have an id attribute, which should be unique for that element within its scope. When TEES generates these ids automatically, it ensures uniqueness with a nested hierarchical id scheme (in GE11.d6.s0.e0, for example, the corpus, document, sentence and entity identifiers are nested), which also makes it possible to grep the XML document for items at a specific level.

For users already familiar with interaction XML, the main change in TEES 2.0 concerns character offsets: the end of a span is now the index immediately after its last character, a common convention in programming languages such as Python and Java. In earlier versions of TEES, the end of the span was the index of the last character in the span. In the example above, the offset 6-13 of the trigger "induces" covers characters 6 through 12.
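
Because the end offset is exclusive, a charOffset value maps directly onto a Python slice. The helpers below are a hypothetical sketch, not TEES functions, and they assume a single "begin-end" span per attribute. Slicing the sentence text works here because the example sentence starts at offset 0; the page does not say whether entity offsets are relative to the sentence or to the document.

```python
def parse_offset(char_offset):
    """Split a "begin-end" string into a pair of integers."""
    begin, end = char_offset.split("-")
    return int(begin), int(end)


def span_text(text, char_offset):
    """Return the text covered by an end-exclusive (TEES 2.0 style) offset."""
    begin, end = parse_offset(char_offset)
    return text[begin:end]


def from_inclusive(char_offset):
    """Convert an older end-inclusive offset to the end-exclusive form."""
    begin, end = parse_offset(char_offset)
    return "%d-%d" % (begin, end + 1)


sentence_text = "BMP-6 induces upregulation of Id1"
print(span_text(sentence_text, "6-13"))
print(from_inclusive("6-12"))
```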

The tokenization element defines the word tokens in a sentence and is usually produced by a parser. The token elements must have a continuous span. The tokenization used by TEES is defined by the chosen parse-element's "tokenizer" attribute.

The parse element contains both the Penn tree-style parse (e.g. from BLLIP) and the dependency parse (e.g. from the Stanford Parser). The Penn tree is stored in the "pennstring" attribute and is also expanded into a set of phrase elements. The dependency parse is represented by a set of dependency elements, which are directed edges connecting the tokens.
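
To show how the tokenization and the dependency elements fit together, the sketch below resolves each dependency to a pair of words. It is only an illustration with the standard library: `dependency_triples` is a hypothetical name, looking up the tokenization by matching its "tokenizer" attribute against the parse is an assumption based on the example, and reading t1 as the source and t2 as the target of each edge is an interpretation that agrees with the example (nsubj from "induces" to "BMP-6").

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the analyses element from the example (phrases omitted).
ANALYSES = """<analyses>
  <tokenization tokenizer="McCC">
    <token POS="NN" charOffset="0-5" id="bt_0" text="BMP-6" />
    <token POS="VBZ" charOffset="6-13" id="bt_1" text="induces" />
    <token POS="NN" charOffset="14-26" id="bt_2" text="upregulation" />
    <token POS="IN" charOffset="27-29" id="bt_3" text="of" />
    <token POS="NN" charOffset="30-33" id="bt_4" text="Id1" />
  </tokenization>
  <parse parser="McCC" tokenizer="McCC">
    <dependency id="sd_0" t1="bt_1" t2="bt_0" type="nsubj" />
    <dependency id="sd_1" t1="bt_1" t2="bt_2" type="dobj" />
    <dependency id="sd_2" t1="bt_2" t2="bt_4" type="prep_of" />
  </parse>
</analyses>"""


def dependency_triples(analyses, parser_name):
    """Return (t1 word, dependency type, t2 word) for the chosen parse."""
    parse = next(
        p for p in analyses.findall("parse") if p.get("parser") == parser_name
    )
    # The parse names the tokenization it was built on.
    tokenization = next(
        t for t in analyses.findall("tokenization")
        if t.get("tokenizer") == parse.get("tokenizer")
    )
    words = {token.get("id"): token.get("text") for token in tokenization.findall("token")}
    return [
        (words[dep.get("t1")], dep.get("type"), words[dep.get("t2")])
        for dep in parse.findall("dependency")
    ]


triples = dependency_triples(ET.fromstring(ANALYSES), "McCC")
for triple in triples:
    print(triple)
```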

The tokenization, Penn tree-style parse and dependency parse can be exported into flat files (in formats corresponding to those used in the BioNLP Shared Task) with the ExportParse.py program located in the Utils/InteractionXML directory of TEES. The parser tool wrappers in the Tools directory can insert parse information in these formats into an interaction XML file.
