Christian Lück edited this page Sep 21, 2022 · 8 revisions

Getting Started

Installation

  1. Install Stack, Haskell's build tool.

  2. Clone StandOff Tools.

  3. In a terminal, cd into the standoff-tools folder.

  4. Run stack build. This may take 15-30 minutes and downloads several hundred MB, most of which is the Haskell compiler.

  5. Run stack test to run the unit tests.

  6. Optionally, run stack install to install the standoff command on your system. You can skip this and use the sandbox build only.

If you skipped the last step, prefix all standoff commands in this guide with stack exec --. You also have to stay in the standoff-tools folder or one of its subfolders so that stack can find the sandboxed build.

Basics of the command line interface (CLI)

The CLI of StandOff Tools is the standoff command line program. It's a program with subcommands, like git. The basic structure of the CLI is:

standoff GLOBAL-PARAMETERS COMMAND LOCAL-PARAMETERS

The internal order of the global parameters and the local parameters does not matter, but the global parameters must precede the command.
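
The command structure can be captured in a small helper that assembles the argument list, for example when standoff is called from a script. This is only a sketch: standoff_argv and its parameters are hypothetical names, not part of StandOff Tools, and the options used are the ones that appear in this guide.

```python
def standoff_argv(global_params, command, local_params, sandboxed=False):
    """Assemble: standoff GLOBAL-PARAMETERS COMMAND LOCAL-PARAMETERS."""
    # Without `stack install`, commands are run through the sandbox build.
    prefix = ["stack", "exec", "--"] if sandboxed else []
    # Global parameters must precede the command; local ones follow it.
    return prefix + ["standoff"] + list(global_params) + [command] + list(local_params)


argv = standoff_argv(
    ["-i", "Trawr-Gesang.xml"],
    "internalize",
    ["--const", "seg", "--csv-start-end", "Trawr-Gesang.minimal.csv"],
    sandboxed=True,
)
print(" ".join(argv))
```

If standoff is available, the resulting list could be handed to subprocess.run(argv).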

Getting help

There is help via -h or --help on each level:

standoff -h

will give you help on global parameters and will print a list of commands.

standoff COMMAND -h

will give you help on the command and its local parameters.

-h and --help can be combined with other parameters. So if you have started entering a command and forgot a certain option, just append -h to get help.

Input and output files

Use the global parameter -i INPUT_FILE or its long form --input INPUT_FILE to pass an input file named INPUT_FILE to standoff.

standoff -i INPUT_FILE command [LOCAL-OPTIONS]

See more in file IO.

Commands

Let's explore StandOff Tools' commands by using them with simple hand-crafted annotations. To that end, we write CSV files by hand. This way, we quickly learn the basic structure of annotations that the tools require.

We also work on the simple TEI files provided in doc/getting-started/.

The following sections assume that we've cd'ed into this folder.

internalize

Let's start with the internalizer and hand-crafted annotations.

Get offsets from a hex dump

For hand-crafted annotations, it's nice to have simple access to each character's offset. Most editors provide simple functions for getting the character offset (aka character index). But these are interactive and thus not so suitable for this introduction.

Our XML source document in doc/getting-started/Trawr-Gesang.xml is encoded in UTF-8, so we can use a hex editor or a hex dump in canonical form to look up offsets. But please note: a hex dump gives byte offsets, not character offsets. The two coincide only as long as no multi-byte character precedes the position in question. In our document the only multi-byte character is the "ß" (bytes c3 9f at 0x35f); up to that position there's no difference, and after it the character offset is one less than the byte offset.

The output of hexdump -C Trawr-Gesang.xml is:

00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  54 45 49 20 78 6d 6c 6e  |F-8"?>.<TEI xmln|
00000030  73 3d 22 68 74 74 70 3a  2f 2f 77 77 77 2e 74 65  |s="http://www.te|
00000040  69 2d 63 2e 6f 72 67 2f  6e 73 2f 31 2e 30 22 20  |i-c.org/ns/1.0" |
00000050  78 6d 6c 3a 6c 61 6e 67  3d 22 64 65 22 3e 0a 20  |xml:lang="de">. |
00000060  20 20 3c 74 65 69 48 65  61 64 65 72 3e 0a 20 20  |  <teiHeader>.  |
00000070  20 20 20 20 3c 66 69 6c  65 44 65 73 63 3e 0a 20  |    <fileDesc>. |
00000080  20 20 20 20 20 20 20 20  3c 74 69 74 6c 65 53 74  |        <titleSt|
00000090  6d 74 3e 0a 20 20 20 20  20 20 20 20 20 20 20 20  |mt>.            |
000000a0  3c 74 69 74 6c 65 3e 54  72 61 77 72 2d 47 65 73  |<title>Trawr-Ges|
000000b0  61 6e 74 20 76 6f 6e 20  64 65 72 20 6e 6f 74 68  |ant von der noth|
000000c0  20 43 68 72 69 73 74 69  20 61 6d 20 4f 65 6c 62  | Christi am Oelb|
000000d0  65 72 67 20 69 6e 20 64  65 6d 20 47 61 72 74 65  |erg in dem Garte|
000000e0  6e 3c 2f 74 69 74 6c 65  3e 0a 20 20 20 20 20 20  |n</title>.      |
000000f0  20 20 20 20 20 20 3c 61  75 74 68 6f 72 3e 46 72  |      <author>Fr|
00000100  69 65 64 72 69 63 68 20  53 70 65 65 3c 2f 61 75  |iedrich Spee</au|
00000110  74 68 6f 72 3e 0a 20 20  20 20 20 20 20 20 20 3c  |thor>.         <|
00000120  2f 74 69 74 6c 65 53 74  6d 74 3e 0a 20 20 20 20  |/titleStmt>.    |
00000130  20 20 20 20 20 3c 70 75  62 6c 69 63 61 74 69 6f  |     <publicatio|
00000140  6e 53 74 6d 74 3e 0a 20  20 20 20 20 20 20 20 20  |nStmt>.         |
00000150  20 20 20 3c 70 3e 31 36  34 39 3c 2f 70 3e 0a 20  |   <p>1649</p>. |
00000160  20 20 20 20 20 20 20 20  3c 2f 70 75 62 6c 69 63  |        </public|
00000170  61 74 69 6f 6e 53 74 6d  74 3e 0a 20 20 20 20 20  |ationStmt>.     |
00000180  20 20 20 20 3c 73 6f 75  72 63 65 44 65 73 63 3e  |    <sourceDesc>|
00000190  0a 20 20 20 20 20 20 20  20 20 20 20 20 3c 70 3e  |.            <p>|
000001a0  74 61 6b 65 6e 20 66 72  6f 6d 20 3c 70 74 72 20  |taken from <ptr |
000001b0  74 61 72 67 65 74 3d 22  68 74 74 70 73 3a 2f 2f  |target="https://|
000001c0  64 65 2e 77 69 6b 69 70  65 64 69 61 2e 6f 72 67  |de.wikipedia.org|
000001d0  2f 77 69 6b 69 2f 5a 25  43 33 25 41 34 73 75 72  |/wiki/Z%C3%A4sur|
000001e0  22 2f 3e 3c 2f 70 3e 0a  20 20 20 20 20 20 20 20  |"/></p>.        |
000001f0  20 3c 2f 73 6f 75 72 63  65 44 65 73 63 3e 0a 20  | </sourceDesc>. |
00000200  20 20 20 20 20 3c 2f 66  69 6c 65 44 65 73 63 3e  |     </fileDesc>|
00000210  0a 20 20 20 3c 2f 74 65  69 48 65 61 64 65 72 3e  |.   </teiHeader>|
00000220  0a 20 20 20 3c 74 65 78  74 3e 0a 20 20 20 20 20  |.   <text>.     |
00000230  20 3c 62 6f 64 79 3e 0a  20 20 20 20 20 20 20 20  | <body>.        |
00000240  20 3c 6c 67 3e 0a 20 20  20 20 20 20 20 20 20 20  | <lg>.          |
00000250  20 20 3c 68 65 61 64 3e  54 72 61 77 72 2d 47 65  |  <head>Trawr-Ge|
00000260  73 61 6e 67 20 76 6f 6e  20 64 65 72 20 6e 6f 74  |sang von der not|
00000270  68 20 43 68 72 69 73 74  69 20 61 6d 20 26 23 78  |h Christi am &#x|
00000280  64 36 3b 6c 62 65 72 67  20 69 6e 20 64 65 6d 20  |d6;lberg in dem |
00000290  47 61 72 74 65 6e 3c 2f  68 65 61 64 3e 0a 20 20  |Garten</head>.  |
000002a0  20 20 20 20 20 20 20 20  20 20 3c 6c 67 3e 0a 20  |          <lg>. |
000002b0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 3c 6c  |              <l|
000002c0  3e 42 65 79 20 73 74 69  6c 6c 65 72 20 6e 61 63  |>Bey stiller nac|
000002d0  68 74 3c 63 61 65 73 75  72 61 2f 3e 20 7a 75 72  |ht<caesura/> zur|
000002e0  20 65 72 73 74 65 6e 20  77 61 63 68 74 3c 2f 6c  | ersten wacht</l|
000002f0  3e 0a 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |>.              |
00000300  20 3c 6c 3e 45 69 6e 20  3c 68 69 3e 73 74 69 6d  | <l>Ein <hi>stim|
00000310  6d 20 73 69 63 68 20 67  75 6e 64 3c 2f 68 69 3e  |m sich gund</hi>|
00000320  20 7a 75 20 6b 6c 61 67  65 6e 2e 3c 2f 6c 3e 0a  | zu klagen.</l>.|
00000330  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 3c  |               <|
00000340  6c 3e 4a 63 68 20 6e 61  6d 20 69 6e 20 61 63 68  |l>Jch nam in ach|
00000350  74 20 3c 63 61 65 73 75  72 61 2f 3e 20 77 61 c3  |t <caesura/> wa.|
00000360  9f 20 64 69 65 20 64 6f  63 68 20 73 61 67 74 3b  |. die doch sagt;|
00000370  3c 2f 6c 3e 0a 20 20 20  20 20 20 20 20 20 20 20  |</l>.           |
00000380  20 20 20 20 3c 6c 3e 54  68 61 74 20 68 69 6e 20  |    <l>That hin |
00000390  6d 69 74 20 61 75 67 65  6e 20 73 63 68 6c 61 67  |mit augen schlag|
000003a0  65 6e 2e 3c 2f 6c 3e 0a  20 20 20 20 20 20 20 20  |en.</l>.        |
000003b0  20 20 20 20 3c 2f 6c 67  3e 0a 20 20 20 20 20 20  |    </lg>.      |
000003c0  20 20 20 3c 2f 6c 67 3e  0a 20 20 20 20 20 20 3c  |   </lg>.      <|
000003d0  2f 62 6f 64 79 3e 0a 20  20 20 3c 2f 74 65 78 74  |/body>.   </text|
000003e0  3e 0a 3c 2f 54 45 49 3e  0a 3c 3f 41 53 4d 20 4d  |>.</TEI>.<?ASM M|
000003f0  56 20 44 52 37 20 41 52  30 20 3f 3e 0a 3c 21 2d  |V DR7 AR0 ?>.<!-|
00000400  2d 20 65 70 69 6c 6f 67  20 2d 2d 3e 0a           |- epilog -->.|
0000040d

The left-most column is the address of the first byte of a 16-byte block. The right-most column represents these bytes as characters, with non-ASCII bytes and some characters (e.g. newlines) shown as a dot. The columns in between give the hex representation of the 16 bytes. To get a character's offset, add its column number 0,...,9,a,b,c,d,e,f to the block's address. E.g. the address of the first capital letter "T" (from UTF-8) at the end of the second 16-byte block is 0x00000010 + 0xf = 0x1f.

You can use decimal and hexadecimal representations of character offsets in standoff.

If not otherwise defined by command line parameters, standoff uses zero-indexed character offsets. I.e. the first character's offset is 0.

In a hex dump, byte addresses are zero-indexed.
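
The address arithmetic, and the difference between byte and character offsets, can be checked with a few lines of Python. This is a sketch on a made-up sample line, not on the file itself; byte_to_char_offset is a hypothetical helper name.

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Count the UTF-8 characters that precede the byte at byte_offset.

    byte_offset must point at the first byte of a character.
    """
    return len(data[:byte_offset].decode("utf-8"))


# Address of a byte in a canonical hex dump: block address + column number.
address = 0x00000010 + 0xF

# A sample line with one two-byte character ("ß" is c3 9f in UTF-8).
sample = "<l>waß die doch sagt;</l>".encode("utf-8")
byte_pos = sample.index(b"die")                    # offset as seen in a hex dump
char_pos = byte_to_char_offset(sample, byte_pos)   # zero-indexed character offset

print(hex(address), byte_pos, char_pos)
```

In the sample, the word "die" starts at byte 8 but at character 7, because the "ß" before it occupies two bytes.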

Hand-crafted annotations

Let's craft our first annotation. At a minimum, we need a start and an end point. The annotation should reference the word "Christi", starting with a "C" at 0x272 and ending with an "i" at 0x278. We put this into Trawr-Gesang.minimal.csv:

start,end
0x272,0x278
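
Such a record can also be derived programmatically. The following sketch assumes that we know the offset at which a text fragment starts in the source document (here 0x252, the "<" of <head> in the hex dump); annotation_row is a hypothetical helper. Note that the end offset is inclusive: it points at the last character of the annotated span.

```python
import csv
import io


def annotation_row(fragment, fragment_offset, word):
    """Return start and inclusive end offset of the first occurrence of word."""
    start = fragment_offset + fragment.index(word)
    end = start + len(word) - 1  # inclusive: offset of the last character
    return {"start": hex(start), "end": hex(end)}


fragment = "<head>Trawr-Gesang von der noth Christi am"
row = annotation_row(fragment, 0x252, "Christi")

# Serialize as CSV with the same header as the hand-crafted file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["start", "end"], lineterminator="\n")
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue(), end="")
```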

Let's use the internalize command to wrap the "Christi" into a <seg>.

standoff -i Trawr-Gesang.xml internalize --const seg --csv-start-end Trawr-Gesang.minimal.csv

The global option -i passes the XML source document, and with --csv-start-end we declare that the annotations are given in CSV format with start and end character offsets. --const seg (short: -c seg) defines that we use a constant tag name for the internalized annotation: seg.

The result is:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
            <author>Friedrich Spee</author>
         </titleStmt>
         <publicationStmt>
            <p>1649</p>
         </publicationStmt>
         <sourceDesc>
            <p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <body>
         <lg>
            <head>Trawr-Gesang von der noth <seg>Christi</seg> am &#xd6;lberg in dem Garten</head>
            <lg>
               <l>Bey stiller nacht<caesura/> zur ersten wacht</l>
               <l>Ein <hi>stimm sich gund</hi> zu klagen.</l>
               <l>Jch nam in acht <caesura/> waß die doch sagt;</l>
               <l>That hin mit augen schlagen.</l>
            </lg>
         </lg>
      </body>
   </text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->

Annotations exceeding into tags

Annotations may extend into tags, character references or the like. You cannot produce non-well-formed XML this way: such annotations are split, and the part that extends into the markup is silently dropped.

Trawr-Gesang.restricted.csv

start,end
0x255,0x263
0x282,0x287

Is internalized to:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
            <author>Friedrich Spee</author>
         </titleStmt>
         <publicationStmt>
            <p>1649</p>
         </publicationStmt>
         <sourceDesc>
            <p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <body>
         <lg>
            <head><seg>Trawr-Gesang</seg> von der noth Christi am &#xd6;<seg>lberg</seg> in dem Garten</head>
            <lg>
               <l>Bey stiller nacht<caesura/> zur ersten wacht</l>
               <l>Ein <hi>stimm sich gund</hi> zu klagen.</l>
               <l>Jch nam in acht <caesura/> waß die doch sagt;</l>
               <l>That hin mit augen schlagen.</l>
            </lg>
         </lg>
      </body>
   </text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
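
The clipping seen in this output can be mimicked in a few lines. This is a simplified sketch and not standoff's actual algorithm: it finds tags and references with a regular expression (an assumption; standoff parses the XML), and it splits at every piece of markup inside the span, which is not necessarily how standoff treats markup that lies entirely within an annotation. text_parts is a hypothetical name; offsets are inclusive, as in the CSV files.

```python
import re

# Rough approximation of markup: tags and character/entity references.
MARKUP = re.compile(r"<[^>]*>|&[^;]*;")


def text_parts(source, start, end):
    """Split the inclusive span [start, end] into parts that cover no markup."""
    parts = []
    pos = start
    for match in MARKUP.finditer(source):
        first, last = match.start(), match.end() - 1
        if last < pos:
            continue  # markup lies before the (rest of the) annotation
        if first > end:
            break     # markup lies after the annotation
        if first > pos:
            parts.append((pos, first - 1))
        pos = last + 1  # skip the markup itself
    if pos <= end:
        parts.append((pos, end))
    return parts


# An annotation starting inside <head> is clipped to the text after the tag.
in_tag = text_parts("<head>Trawr-Gesang von", 3, 17)
# An annotation starting at the ";" of a character reference is clipped, too.
in_ref = text_parts("am &#xd6;lberg in", 8, 13)
# An annotation running across an opening tag is split into two parts.
across = text_parts("<l>Ein <hi>stimm sich", 3, 15)
print(in_tag, in_ref, across)
```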

Annotations exceeding into prolog or 'epilog'

Some annotations would not result in a well-formed document when internalized. Annotating the XML prolog would result in a forest, not a tree. The same holds for the 'epilog', i.e. the white space text nodes and PIs after the root element's closing tag. Annotations extending into the prolog or the 'epilog' make standoff fail with an error message and a non-zero exit code.

Trawr-Gesang.epilog.csv

start,end
0x255,0x263
0x282,0x287
0x387,0x40c

standoff -i Trawr-Gesang.xml internalize -c seg --csv-start-end Trawr-Gesang.epilog.csv

The resulting error message should be similar to:

standoff: user error (Annotation extends into restricted span: GenericMarkup {genmrkp_start = 903, genmrkp_end = 1036, genmrkp_features = fromList [("end","0x40c"),("start","0x387")], genmrkp_splitNum = Nothing})
echo $?
1

Getting annotation features into XML

Wrapping "Christi" into <seg> is not really what we want. We'd rather have <persName>, and we also want to provide a persistent identifier to distinguish this "Christ" from other ones.

To get there, we first change -c seg to -c persName. Then we provide an existing identifier for the person and a config file that maps annotation features to XML attributes.

Trawr-Gesang.pid.csv

start,end,wikidata
0x272,0x278,Q302

The config file is in YAML format, see features-map.jaml. We map the wikidata column of the CSV to an attribute named ref, and we also prefix the value from the column with a URI part. Prefixing the value is optional but useful in many cases.

wikidata:
  name: "ref"
  valuePrefix: "https://m.wikidata.org/wiki/"

Let's internalize:

standoff -i Trawr-Gesang.xml internalize -c persName --csv-start-end Trawr-Gesang.pid.csv -a features-map.jaml

The result is:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
            <author>Friedrich Spee</author>
         </titleStmt>
         <publicationStmt>
            <p>1649</p>
         </publicationStmt>
         <sourceDesc>
            <p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <body>
         <lg>
            <head>Trawr-Gesang von der noth <persName ref="https://m.wikidata.org/wiki/Q302">Christi</persName> am &#xd6;lberg in dem Garten</head>
            <lg>
               <l>Bey stiller nacht<caesura/> zur ersten wacht</l>
               <l>Ein <hi>stimm sich gund</hi> zu klagen.</l>
               <l>Jch nam in acht <caesura/> waß die doch sagt;</l>
               <l>That hin mit augen schlagen.</l>
            </lg>
         </lg>
      </body>
   </text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
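
What the feature map does to a CSV record can be illustrated as follows. This is a sketch of the mapping idea only, with the YAML config written as a Python dict; map_features is a hypothetical name, and dropping unmapped columns is an assumption based on the outputs shown in this guide.

```python
def map_features(row, feature_map):
    """Turn CSV columns into XML attributes according to a feature map."""
    attributes = {}
    for column, value in row.items():
        rule = feature_map.get(column)
        if rule is None:
            continue  # assumption: columns without a mapping yield no attribute
        # valuePrefix is optional; without it the value is used as it is.
        attributes[rule["name"]] = rule.get("valuePrefix", "") + value
    return attributes


feature_map = {
    "wikidata": {"name": "ref", "valuePrefix": "https://m.wikidata.org/wiki/"},
}
row = {"start": "0x272", "end": "0x278", "wikidata": "Q302"}
attributes = map_features(row, feature_map)
print(attributes)
```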

Splitting and Re-Aggregation

The internalize command splits annotations if they overlap each other or the internal markup in the XML source file.

We annotate the words "Ein stimm" as np. This annotation overlaps the internal <hi>.

Trawr-Gesang.overlap.csv:

start,end,phrase,id
0x304,0x310,np,a1

We also add the following to the feature mapping:

phrase:
  name: "type"

We internalize with the following command:

standoff -i Trawr-Gesang.xml internalize -c seg --csv-start-end Trawr-Gesang.overlap.csv -a feature-map.yaml

The result is:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
            <author>Friedrich Spee</author>
         </titleStmt>
         <publicationStmt>
            <p>1649</p>
         </publicationStmt>
         <sourceDesc>
            <p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <body>
         <lg>
            <head>Trawr-Gesang von der noth Christi am &#xd6;lberg in dem Garten</head>
            <lg>
               <l>Bey stiller nacht<caesura/> zur ersten wacht</l>
               <l><seg type="np">Ein </seg><hi><seg type="np">stimm</seg> sich gund</hi> zu klagen.</l>
               <l>Jch nam in acht <caesura/> waß die doch sagt;</l>
               <l>That hin mit augen schlagen.</l>
            </lg>
         </lg>
      </body>
   </text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->

Now, this is nicely well-formed, since the internalizer splits the annotation into two segments.

However, the information that "Ein stimm" is a single nominal phrase is lost. That's where special annotation features generated by standoff come into play. They enable us to re-aggregate segments that originate from the same annotation by means of TEI's aggregation mechanism.

The CSV record contains an id column. We use this column as a base identifier for the annotation and append standoff's special features to it. Here's what we append to the feature map:

__standoff_special__splitId:
  prefix: xml
  name: xml:id

__standoff_special__prevId:
  name: prev
  valuePrefix: "#"

That is: the generated feature __standoff_special__splitId is serialized as @xml:id, and the generated feature __standoff_special__prevId is serialized as @prev and prefixed with # to make it a same-document reference. Are you missing information here? Yes! It seems that using the id column for generating the special features is hard-wired into standoff. We definitely have to make this configurable (TODO).

Let's internalize again:

standoff -i Trawr-Gesang.xml internalize -c seg -a feature-map.yaml --csv-start-end Trawr-Gesang.overlap.csv

The result is:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
            <author>Friedrich Spee</author>
         </titleStmt>
         <publicationStmt>
            <p>1649</p>
         </publicationStmt>
         <sourceDesc>
            <p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <body>
         <lg>
            <head>Trawr-Gesang von der noth Christi am &#xd6;lberg in dem Garten</head>
            <lg>
               <l>Bey stiller nacht<caesura/> zur ersten wacht</l>
               <l><seg xml:id="a1" type="np">Ein </seg><hi><seg prev="#a1" xml:id="a1-1" type="np">stimm</seg> sich gund</hi> zu klagen.</l>
               <l>Jch nam in acht <caesura/> waß die doch sagt;</l>
               <l>That hin mit augen schlagen.</l>
            </lg>
         </lg>
      </body>
   </text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->

Note that there's @prev only, no @next. standoff produces a singly linked aggregation, not a doubly linked one. I (Christian) think that a doubly linked list only introduces redundancy, and generating the information required for it would need another iteration over the annotations in standoff's algorithm and would thus make it a bit slower. However, adding it to the codebase would be quite simple. If you definitely want doubly linked aggregation, don't hesitate to write a ticket.
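
On the consuming side, following the @prev links is enough to put the segments back together. The sketch below does this with Python's xml.etree.ElementTree on a fragment of the internalized line; it is an illustration, not part of StandOff Tools. The TEI namespace is left out of the fragment for brevity, and aggregate is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def aggregate(root):
    """Join the text of all <seg> parts that are linked through @prev."""
    segs = {el.get(XML_ID): el for el in root.iter("seg") if el.get(XML_ID)}
    chains = {}
    for seg_id, el in segs.items():  # dicts keep document order
        head = seg_id
        # Walk the @prev links back to the first part of the annotation.
        while segs[head].get("prev"):
            head = segs[head].get("prev").lstrip("#")
        chains.setdefault(head, []).append(el.text or "")
    return {head: "".join(parts) for head, parts in chains.items()}


fragment = (
    '<l><seg xml:id="a1" type="np">Ein </seg>'
    '<hi><seg prev="#a1" xml:id="a1-1" type="np">stimm</seg> sich gund</hi>'
    " zu klagen.</l>"
)
phrases = aggregate(ET.fromstring(fragment))
print(phrases)
```

Because every part points back to its predecessor, the single @prev link is sufficient to recover the whole phrase.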

Wellformed, but invalid

Currently, StandOff Tools are not schema-aware. They produce invalid markup in certain situations. A common situation is an annotation spanning white space text nodes between pLike elements.

E.g. we want to annotate the words "zur ersten wacht Ein stimm" because the iamb is continued. (We wouldn't annotate metre and cadences this way in the real world, but a somewhat motivated example is needed here.)

Trawr-Gesang.invalid.csv

start,end,metre
0x2dd,0x310,jc

standoff -i Trawr-Gesang.xml internalize -c seg --csv-start-end Trawr-Gesang.invalid.csv

The result is:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
            <author>Friedrich Spee</author>
         </titleStmt>
         <publicationStmt>
            <p>1649</p>
         </publicationStmt>
         <sourceDesc>
            <p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <body>
         <lg>
            <head>Trawr-Gesang von der noth Christi am &#xd6;lberg in dem Garten</head>
            <lg>
               <l>Bey stiller nacht<caesura/> <seg>zur ersten wacht</seg></l><seg>
               </seg><l><seg>Ein </seg><hi><seg>stimm</seg> sich gund</hi> zu klagen.</l>
               <l>Jch nam in acht <caesura/> waß die doch sagt;</l>
               <l>That hin mit augen schlagen.</l>
            </lg>
         </lg>
      </body>
   </text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
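
Until the tools are schema-aware, such cases can at least be detected after internalizing. The sketch below assumes, as in this example, that a <seg> directly inside <lg> is what makes the result invalid; it checks only this one pattern and is no substitute for schema validation. The TEI namespace is left out of the fragment for brevity, and misplaced_segs is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def misplaced_segs(root, parent_tag="lg", seg_tag="seg"):
    """Return all seg elements that are direct children of a parent_tag element."""
    return [
        child
        for parent in root.iter(parent_tag)
        for child in parent
        if child.tag == seg_tag
    ]


fragment = (
    "<lg><l>Bey stiller nacht<caesura/> <seg>zur ersten wacht</seg></l><seg>\n"
    "   </seg><l><seg>Ein </seg><hi><seg>stimm</seg> sich gund</hi> zu klagen.</l></lg>"
)
found = misplaced_segs(ET.fromstring(fragment))
for seg in found:
    print(repr(seg.text))
```

Here the only hit is the segment that wraps the white space between the two lines.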

equidist

The equidist command generates a certain flavour of plain text: equidistant text. Every character originating from text nodes has the same offset as in the XML source document.

standoff -i Trawr-Gesang.xml equidist

This produces

                                      
                                                       
              
                
                    
                   Trawr-Gesant von der noth Christi am Oelberg in dem Garten        
                    Friedrich Spee         
                     
                          
               1649    
                           
                     
               taken from                                                             
                      
                 
               
         
            
             
                  Trawr-Gesang von der noth Christi am       lberg in dem Garten       
                
                  Bey stiller nacht           zur ersten wacht    
                  Ein     stimm sich gund      zu klagen.    
                  Jch nam in acht            waß die doch sagt;    
                  That hin mit augen schlagen.    
                 
              
             
          
      
                   
               

This text can be sent to an NLP or NER tool etc. The character offsets that the tool returns directly reference character offsets in the XML source document. Thus, they can be fed to the internalizer.
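
The idea behind equidistant text fits into a few lines: every piece of markup is replaced by as many spaces as it has characters. This is a simplified sketch that approximates markup with a regular expression; it is not how standoff is implemented, and equidistant is a hypothetical name.

```python
import re

# Rough approximation of markup: tags and character/entity references.
MARKUP = re.compile(r"<[^>]*>|&[^;]*;")


def equidistant(source):
    """Blank out markup while keeping every text character at its offset."""
    return MARKUP.sub(lambda match: " " * len(match.group()), source)


source = "<l>Ein <hi>stimm sich gund</hi> zu klagen.</l>"
plain = equidistant(source)

# A word found in the plain text sits at the same offset in the XML source.
offset = plain.index("stimm")
print(repr(plain), offset)
```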

However, we clearly see the limits: What about the character reference? What about entity references? And the many spaces may be noise to the tagging tool.

Equidistant text is an introductory device. Shrinked text (see below) is the way to go in real-world scenarios.

shrink

The shrink command produces shrinked text. It is similar to equidistant text, however

  • character references and entity references are evaluated

  • tags are shrinked to the empty string or a configurable string

  • subtrees can be muted (TODO)

  • an offset mapping is generated
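
The following sketch illustrates the first two points and the offset mapping. It is not standoff's implementation: markup is approximated with a regular expression, references are resolved with Python's html.unescape (which knows HTML entities, not entities declared in a DTD), and the mapping is kept in a Python list, since the format of the file written by --offsets is not described here. shrink is a hypothetical function name.

```python
import html
import re

# Rough approximation of markup: tags and character/entity references.
MARKUP = re.compile(r"<[^>]*>|&[^;]*;")


def shrink(source):
    """Return (text, offsets); offsets[i] is the source offset at which
    the i-th character of the shrinked text starts."""
    chars, offsets = [], []

    def copy_text(begin, stop):
        for i in range(begin, stop):
            chars.append(source[i])
            offsets.append(i)

    pos = 0
    for match in MARKUP.finditer(source):
        copy_text(pos, match.start())
        if match.group().startswith("&"):
            # References are evaluated; tags are shrunk to the empty string.
            resolved = html.unescape(match.group())
            chars.extend(resolved)
            offsets.extend([match.start()] * len(resolved))
        pos = match.end()
    copy_text(pos, len(source))
    return "".join(chars), offsets


source = "am &#xd6;lberg<caesura/> in"
text, offsets = shrink(source)

# Offsets found in the shrinked text are mapped back to the XML source.
start = text.index("Ölberg")
end = start + len("Ölberg") - 1
print(text, offsets[start], offsets[end])
```

A tagging tool would work on text; the mapping then translates its offsets back into offsets that the internalizer understands.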

We need a config file that contains at least the namespace definitions. shrink.yaml:

default-namespace: "http://www.tei-c.org/ns/1.0"
prefixes:
  tei: "http://www.tei-c.org/ns/1.0"

And we have to define a sink for the offset mapping. The -f FILE or its long form --offsets FILE is used to define it.

standoff -i Trawr-Gesang.xml shrink --config shrink.yaml --offsets /tmp/offsets.dat

This produces the following plain text:



   
      
         
            Trawr-Gesant von der noth Christi am Oelberg in dem Garten
            Friedrich Spee
         
         
            1649
         
         
            taken from 
         
      
   
   
      
         
            Trawr-Gesang von der noth Christi am Ölberg in dem Garten
            
               Bey stiller nacht zur ersten wacht
               Ein stimm sich gund zu klagen.
               Jch nam in acht  waß die doch sagt;
               That hin mit augen schlagen.
            
         
      
   


Replacement strings

We can use the config file to define replacement strings for the tags. Different strings for open, close and empty tags can be defined. For example, we define a string to replace the empty <caesura/>, and an extra newline after each <l>:

tags:

  caesura:
    empty: " || "

  l:
    close: "\n"

standoff -i Trawr-Gesang.xml shrink --config shrink.yaml --offsets /tmp/offsets.dat

results in:



   
      
         
            Trawr-Gesant von der noth Christi am Oelberg in dem Garten
            Friedrich Spee
         
         
            1649
         
         
            taken from 
         
      
   
   
      
         
            Trawr-Gesang von der noth Christi am Ölberg in dem Garten
            
               Bey stiller nacht ||  zur ersten wacht

               Ein stimm sich gund zu klagen.

               Jch nam in acht  ||  waß die doch sagt;

               That hin mit augen schlagen.

            
         
      
   


Is there something we can do about the leading spaces in the lines? Is there something like normalize-space() here?

TODO