Home
- Install Stack, Haskell's build tool.
- Clone StandOff Tools.
- In a terminal, cd into the standoff-tools folder.
- stack build -- this may take 15-30 minutes and downloads several hundred MB, most of which is the Haskell compiler.
- stack test to run the unit tests.
- Optionally: stack install will install the standoff command on your system. You can skip this and use the sandbox build only.
If you skipped the last step, please prefix all standoff commands in this guide with stack exec -- . You also have to stay in the standoff-tools folder or one of its subfolders, so that stack can find the sandboxed build.
The CLI of StandOff Tools is the standoff command line program. It's a program with subcommands, like git. The basic structure of the CLI is:
standoff GLOBAL-PARAMETERS COMMAND LOCAL-PARAMETERS
The internal order of the global parameters and the local parameters does not matter, but the global parameters must precede the command.
There is help via -h or --help on each level:
standoff -h
will give you help on global parameters and will print a list of commands.
standoff COMMAND -h
will give you help on the command and its local parameters.
-h and --help can be combined with other parameters. So if you have started entering a command and forgot a certain option, just append -h to get help.
Use the global parameter -i INPUT_FILE or its long form --input INPUT_FILE to pass an input file named INPUT_FILE to standoff.
standoff -i INPUT_FILE command [LOCAL-OPTIONS]
See more in file IO.
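When standoff is called from a script, the ordering rule (global parameters before the command, local parameters after it) can be captured in a small helper. The following Python sketch only assembles the argument list and does not run anything; the function name standoff_argv is hypothetical and not part of StandOff Tools.

```python
def standoff_argv(input_file, command, local_params=()):
    """Assemble an argument list: global parameters, then the command, then local parameters."""
    global_params = ["-i", input_file]      # global parameters must precede the command
    return ["standoff", *global_params, command, *local_params]


argv = standoff_argv("INPUT_FILE", "COMMAND", ["LOCAL-OPTION"])
print(" ".join(argv))
```

The resulting list can be handed to subprocess.run once standoff is installed. With the sandbox build, the list would have to be prefixed with stack, exec and -- as described above.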
Let's explore what StandOff Tools' commands do by using them with simple hand-crafted annotations. To that end, we're going to write CSV files. This way, we quickly learn about the basic structure of annotations that the tools require.
We also work on simple TEI files given in doc/getting-started/. In the following chapters, we assume that we've cd'ed into this folder.
Let's start with the internalizer and hand-crafted annotations.
For hand-crafted annotations, it's nice to have simple access to each character's offset. Most editors provide simple functions for getting the character offset (aka character index). But these are interactive and thus not so suitable for this introduction.
Since our XML source document in doc/getting-started/Trawr-Gesang.xml is encoded in UTF-8 and consists almost entirely of ASCII characters, we can use a hex editor or a hex dump in canonical form for accessing character offsets. But please note: Generally, a hex dump gives access to byte offsets as opposed to character offsets. The only multi-byte character in the XML source document is the "ß" in "waß" (bytes c3 9f at 0x35f and 0x360), so up to that point byte offsets and character offsets are identical; after it, each character offset is one less than the byte offset.
The output of hexdump -C Trawr-Gesang.xml
is:
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version="1|
00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 |.0" encoding="UT|
00000020 46 2d 38 22 3f 3e 0a 3c 54 45 49 20 78 6d 6c 6e |F-8"?>.<TEI xmln|
00000030 73 3d 22 68 74 74 70 3a 2f 2f 77 77 77 2e 74 65 |s="http://www.te|
00000040 69 2d 63 2e 6f 72 67 2f 6e 73 2f 31 2e 30 22 20 |i-c.org/ns/1.0" |
00000050 78 6d 6c 3a 6c 61 6e 67 3d 22 64 65 22 3e 0a 20 |xml:lang="de">. |
00000060 20 20 3c 74 65 69 48 65 61 64 65 72 3e 0a 20 20 | <teiHeader>. |
00000070 20 20 20 20 3c 66 69 6c 65 44 65 73 63 3e 0a 20 | <fileDesc>. |
00000080 20 20 20 20 20 20 20 20 3c 74 69 74 6c 65 53 74 | <titleSt|
00000090 6d 74 3e 0a 20 20 20 20 20 20 20 20 20 20 20 20 |mt>. |
000000a0 3c 74 69 74 6c 65 3e 54 72 61 77 72 2d 47 65 73 |<title>Trawr-Ges|
000000b0 61 6e 74 20 76 6f 6e 20 64 65 72 20 6e 6f 74 68 |ant von der noth|
000000c0 20 43 68 72 69 73 74 69 20 61 6d 20 4f 65 6c 62 | Christi am Oelb|
000000d0 65 72 67 20 69 6e 20 64 65 6d 20 47 61 72 74 65 |erg in dem Garte|
000000e0 6e 3c 2f 74 69 74 6c 65 3e 0a 20 20 20 20 20 20 |n</title>. |
000000f0 20 20 20 20 20 20 3c 61 75 74 68 6f 72 3e 46 72 | <author>Fr|
00000100 69 65 64 72 69 63 68 20 53 70 65 65 3c 2f 61 75 |iedrich Spee</au|
00000110 74 68 6f 72 3e 0a 20 20 20 20 20 20 20 20 20 3c |thor>. <|
00000120 2f 74 69 74 6c 65 53 74 6d 74 3e 0a 20 20 20 20 |/titleStmt>. |
00000130 20 20 20 20 20 3c 70 75 62 6c 69 63 61 74 69 6f | <publicatio|
00000140 6e 53 74 6d 74 3e 0a 20 20 20 20 20 20 20 20 20 |nStmt>. |
00000150 20 20 20 3c 70 3e 31 36 34 39 3c 2f 70 3e 0a 20 | <p>1649</p>. |
00000160 20 20 20 20 20 20 20 20 3c 2f 70 75 62 6c 69 63 | </public|
00000170 61 74 69 6f 6e 53 74 6d 74 3e 0a 20 20 20 20 20 |ationStmt>. |
00000180 20 20 20 20 3c 73 6f 75 72 63 65 44 65 73 63 3e | <sourceDesc>|
00000190 0a 20 20 20 20 20 20 20 20 20 20 20 20 3c 70 3e |. <p>|
000001a0 74 61 6b 65 6e 20 66 72 6f 6d 20 3c 70 74 72 20 |taken from <ptr |
000001b0 74 61 72 67 65 74 3d 22 68 74 74 70 73 3a 2f 2f |target="https://|
000001c0 64 65 2e 77 69 6b 69 70 65 64 69 61 2e 6f 72 67 |de.wikipedia.org|
000001d0 2f 77 69 6b 69 2f 5a 25 43 33 25 41 34 73 75 72 |/wiki/Z%C3%A4sur|
000001e0 22 2f 3e 3c 2f 70 3e 0a 20 20 20 20 20 20 20 20 |"/></p>. |
000001f0 20 3c 2f 73 6f 75 72 63 65 44 65 73 63 3e 0a 20 | </sourceDesc>. |
00000200 20 20 20 20 20 3c 2f 66 69 6c 65 44 65 73 63 3e | </fileDesc>|
00000210 0a 20 20 20 3c 2f 74 65 69 48 65 61 64 65 72 3e |. </teiHeader>|
00000220 0a 20 20 20 3c 74 65 78 74 3e 0a 20 20 20 20 20 |. <text>. |
00000230 20 3c 62 6f 64 79 3e 0a 20 20 20 20 20 20 20 20 | <body>. |
00000240 20 3c 6c 67 3e 0a 20 20 20 20 20 20 20 20 20 20 | <lg>. |
00000250 20 20 3c 68 65 61 64 3e 54 72 61 77 72 2d 47 65 | <head>Trawr-Ge|
00000260 73 61 6e 67 20 76 6f 6e 20 64 65 72 20 6e 6f 74 |sang von der not|
00000270 68 20 43 68 72 69 73 74 69 20 61 6d 20 26 23 78 |h Christi am &#x|
00000280 64 36 3b 6c 62 65 72 67 20 69 6e 20 64 65 6d 20 |d6;lberg in dem |
00000290 47 61 72 74 65 6e 3c 2f 68 65 61 64 3e 0a 20 20 |Garten</head>. |
000002a0 20 20 20 20 20 20 20 20 20 20 3c 6c 67 3e 0a 20 | <lg>. |
000002b0 20 20 20 20 20 20 20 20 20 20 20 20 20 20 3c 6c | <l|
000002c0 3e 42 65 79 20 73 74 69 6c 6c 65 72 20 6e 61 63 |>Bey stiller nac|
000002d0 68 74 3c 63 61 65 73 75 72 61 2f 3e 20 7a 75 72 |ht<caesura/> zur|
000002e0 20 65 72 73 74 65 6e 20 77 61 63 68 74 3c 2f 6c | ersten wacht</l|
000002f0 3e 0a 20 20 20 20 20 20 20 20 20 20 20 20 20 20 |>. |
00000300 20 3c 6c 3e 45 69 6e 20 3c 68 69 3e 73 74 69 6d | <l>Ein <hi>stim|
00000310 6d 20 73 69 63 68 20 67 75 6e 64 3c 2f 68 69 3e |m sich gund</hi>|
00000320 20 7a 75 20 6b 6c 61 67 65 6e 2e 3c 2f 6c 3e 0a | zu klagen.</l>.|
00000330 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 3c | <|
00000340 6c 3e 4a 63 68 20 6e 61 6d 20 69 6e 20 61 63 68 |l>Jch nam in ach|
00000350 74 20 3c 63 61 65 73 75 72 61 2f 3e 20 77 61 c3 |t <caesura/> wa.|
00000360 9f 20 64 69 65 20 64 6f 63 68 20 73 61 67 74 3b |. die doch sagt;|
00000370 3c 2f 6c 3e 0a 20 20 20 20 20 20 20 20 20 20 20 |</l>. |
00000380 20 20 20 20 3c 6c 3e 54 68 61 74 20 68 69 6e 20 | <l>That hin |
00000390 6d 69 74 20 61 75 67 65 6e 20 73 63 68 6c 61 67 |mit augen schlag|
000003a0 65 6e 2e 3c 2f 6c 3e 0a 20 20 20 20 20 20 20 20 |en.</l>. |
000003b0 20 20 20 20 3c 2f 6c 67 3e 0a 20 20 20 20 20 20 | </lg>. |
000003c0 20 20 20 3c 2f 6c 67 3e 0a 20 20 20 20 20 20 3c | </lg>. <|
000003d0 2f 62 6f 64 79 3e 0a 20 20 20 3c 2f 74 65 78 74 |/body>. </text|
000003e0 3e 0a 3c 2f 54 45 49 3e 0a 3c 3f 41 53 4d 20 4d |>.</TEI>.<?ASM M|
000003f0 56 20 44 52 37 20 41 52 30 20 3f 3e 0a 3c 21 2d |V DR7 AR0 ?>.<!-|
00000400 2d 20 65 70 69 6c 6f 67 20 2d 2d 3e 0a |- epilog -->.|
0000040d
The left-most column is the address of the beginning of a 16-byte block. The right-most column is a representation of these bytes as characters, where non-ASCII bytes and some characters (e.g. newlines) are represented by a dot. The columns in between give the hex representations of the 16 bytes. To get a character's offset, add its column number 0,...,9,a,b,c,d,e,f to the block's address. E.g. the address of the first capital letter "T" (from UTF-8) at the end of the second 16-byte block is 0x00000010 + 0xf = 0x1f.
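The address arithmetic is easy to script. The following Python sketch parses one line of hexdump -C output and returns the offset of the first byte with a given value. The function name and the parsing approach are illustrative assumptions and not part of StandOff Tools.

```python
def offset_in_dump_line(line, byte_value):
    """Return the absolute offset of the first occurrence of byte_value in one hexdump -C line."""
    hex_part = line.split("|")[0].split()       # address followed by up to 16 hex bytes
    block_address = int(hex_part[0], 16)
    columns = [int(b, 16) for b in hex_part[1:]]
    return block_address + columns.index(byte_value)


line = '00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 |.0" encoding="UT|'
print(hex(offset_in_dump_line(line, ord("T"))))  # block address 0x10 plus column 0xf
```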
You can use decimal and hexadecimal representations of character offsets in standoff.
If not otherwise defined by command line parameters, standoff uses zero-indexed character offsets, i.e. the first character's offset is 0.
In a hex dump, byte addresses are zero-indexed as well.
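Because a hex dump yields byte offsets while standoff expects character offsets, it can help to convert between the two as soon as a document contains multi-byte characters. The following Python sketch counts the UTF-8 characters that start before a given byte offset; the function name is hypothetical.

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Zero-indexed character offset of the UTF-8 character that starts at byte_offset."""
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte starts a character.
    return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)


sample = "waß die".encode("utf-8")         # ß occupies two bytes (c3 9f)
print(byte_to_char_offset(sample, 2))      # byte offset of the ß
print(byte_to_char_offset(sample, 4))      # byte offset of the space after the ß
```

For pure ASCII input the function returns its argument unchanged, which is why the distinction can be ignored for most of the example document.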
Let's craft our first annotation. At a minimum, we need a start and an end point. The annotation should reference the word "Christi", starting with a "C" at 0x272 and ending with an "i" at 0x278. We put this into Trawr-Gesang.minimal.csv:
start,end
0x272,0x278
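Such a file can also be generated. Below is a small Python sketch that locates a word in a string and writes its inclusive start and end offsets as hexadecimal numbers in the start,end layout shown above. The sample string and the helper names are made up for illustration; for the real document one would read Trawr-Gesang.xml instead.

```python
import csv
import io


def inclusive_span(text, word):
    """Zero-indexed offsets of the first and the last character of word in text."""
    start = text.index(word)
    return start, start + len(word) - 1


def annotations_csv(spans):
    """Serialize (start, end) pairs as CSV with a start,end header and hex offsets."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["start", "end"])
    for start, end in spans:
        writer.writerow([hex(start), hex(end)])
    return buffer.getvalue()


sample = "<head>von der noth Christi am</head>"
print(annotations_csv([inclusive_span(sample, "Christi")]))
```

Note that the end offset points at the last character of the annotated word, not at the character after it, matching the "i" at 0x278 above.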
Let's use the internalize command to wrap "Christi" into a <seg>.
standoff -i Trawr-Gesang.xml internalize --const seg --csv-start-end Trawr-Gesang.minimal.csv
The global option -i passes the XML source document, and with --csv-start-end we declare that the annotations are given in CSV format with start and end character offsets. --const seg (short: -c seg) specifies that we use a constant tag name for the internalized annotation: seg.
The result is:
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
<author>Friedrich Spee</author>
</titleStmt>
<publicationStmt>
<p>1649</p>
</publicationStmt>
<sourceDesc>
<p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<lg>
<head>Trawr-Gesang von der noth <seg>Christi</seg> am Ölberg in dem Garten</head>
<lg>
<l>Bey stiller nacht<caesura/> zur ersten wacht</l>
<l>Ein <hi>stimm sich gund</hi> zu klagen.</l>
<l>Jch nam in acht <caesura/> waß die doch sagt;</l>
<l>That hin mit augen schlagen.</l>
</lg>
</lg>
</body>
</text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
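Conceptually, internalizing a single annotation that lies entirely inside a text node means inserting an opening tag before the start offset and a closing tag after the inclusive end offset. The Python sketch below shows just that idea on a short string. It is not how standoff is implemented, and the function name is hypothetical.

```python
def wrap(text, start, end, tag):
    """Wrap text[start..end] (inclusive, zero-indexed) into <tag>...</tag>."""
    return f"{text[:start]}<{tag}>{text[start:end + 1]}</{tag}>{text[end + 1:]}"


print(wrap("noth Christi am", 5, 11, "seg"))
```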
Annotations may extend into tags, character references or the like. You cannot produce non-well-formed XML this way, since such annotations are split and the part that extends into the tag or reference is silently dropped.
Trawr-Gesang.restricted.csv
start,end
0x255,0x263
0x282,0x287
This is internalized to:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
<author>Friedrich Spee</author>
</titleStmt>
<publicationStmt>
<p>1649</p>
</publicationStmt>
<sourceDesc>
<p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<lg>
<head><seg>Trawr-Gesang</seg> von der noth Christi am Ö<seg>lberg</seg> in dem Garten</head>
<lg>
<l>Bey stiller nacht<caesura/> zur ersten wacht</l>
<l>Ein <hi>stimm sich gund</hi> zu klagen.</l>
<l>Jch nam in acht <caesura/> waß die doch sagt;</l>
<l>That hin mit augen schlagen.</l>
</lg>
</lg>
</body>
</text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
There are annotations that would not result in a well-formed document when internalized. Annotating the XML prolog would result in a forest, not in a tree. The same holds for the 'epilog', i.e. the white space text nodes and PIs after the root element's closing tag. Annotations extending into the prolog or the 'epilog' result in an error and a non-zero exit code.
Trawr-Gesang.epilog.csv
start,end
0x255,0x263
0x282,0x287
0x387,0x40c
standoff -i Trawr-Gesang.xml internalize -c seg --csv-start-end Trawr-Gesang.epilog.csv
The resulting error message should be similar to:
standoff: user error (Annotation extends into restricted span: GenericMarkup {genmrkp_start = 903, genmrkp_end = 1036, genmrkp_features = fromList [("end","0x40c"),("start","0x387")], genmrkp_splitNum = Nothing})
echo $?
1
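The error message reports the offsets in decimal, while the CSV file uses hexadecimal. A quick way to relate the two notations, sketched in Python (how standoff itself parses offsets is not shown here, and the function name is hypothetical):

```python
def parse_offset(value):
    """Parse an offset written in decimal ("903") or hexadecimal ("0x387")."""
    return int(value, 0)       # base 0 lets Python infer the base from the prefix


for written in ("0x387", "0x40c", "903"):
    print(written, "=", parse_offset(written))
```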
Wrapping "Christi" into <seg> is not really what we want. We rather want <persName>, and we also want to provide a persistent identifier to distinguish this "Christ" from other ones.
To get there, we first change -c seg to -c persName. Then we provide an existing identifier for the person and also a config file, which maps annotation features to XML attributes.
Trawr-Gesang.pid.csv
start,end,wikidata
0x272,0x278,Q302
The config file is in YAML format, see features-map.jaml. We map the wikidata column of the CSV to an attribute with the name ref, and we also prefix the value from the column with a URI part. Prefixing the value is optional but may be useful in many cases.
wikidata:
  name: "ref"
  valuePrefix: "https://m.wikidata.org/wiki/"
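To make the mapping concrete, here is a rough Python model of what it expresses: each CSV column named in the map becomes an attribute, optionally with a prefix in front of its value. The dictionary mirrors the YAML above. The function name and the treatment of unmapped columns (they are skipped) are assumptions of this sketch, not documented behaviour of standoff.

```python
FEATURE_MAP = {
    "wikidata": {"name": "ref", "valuePrefix": "https://m.wikidata.org/wiki/"},
}


def to_attributes(record, feature_map):
    """Turn a CSV record into XML attributes according to the feature map."""
    attributes = {}
    for column, value in record.items():
        rule = feature_map.get(column)
        if rule is None:
            continue                       # assumption: unmapped columns are skipped
        attributes[rule["name"]] = rule.get("valuePrefix", "") + value
    return attributes


record = {"start": "0x272", "end": "0x278", "wikidata": "Q302"}
print(to_attributes(record, FEATURE_MAP))
```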
Let's internalize:
standoff -i Trawr-Gesang.xml internalize -c persName --csv-start-end Trawr-Gesang.pid.csv -a features-map.jaml
The result is:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
<author>Friedrich Spee</author>
</titleStmt>
<publicationStmt>
<p>1649</p>
</publicationStmt>
<sourceDesc>
<p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<lg>
<head>Trawr-Gesang von der noth <persName ref="https://m.wikidata.org/wiki/Q302">Christi</persName> am Ölberg in dem Garten</head>
<lg>
<l>Bey stiller nacht<caesura/> zur ersten wacht</l>
<l>Ein <hi>stimm sich gund</hi> zu klagen.</l>
<l>Jch nam in acht <caesura/> waß die doch sagt;</l>
<l>That hin mit augen schlagen.</l>
</lg>
</lg>
</body>
</text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
The internalize command splits annotations if they overlap each other or the internal markup in the XML source file.
We annotate the words "Ein stimm" as np. This annotation overlaps the internal <hi>.
Trawr-Gesang.overlap.csv:
start,end,phrase,id
0x304,0x310,np,a1
We also add the following to the feature mapping:
phrase:
  name: "type"
Here is the command, followed by the resulting document:
standoff -i Trawr-Gesang.xml internalize -c seg --csv-start-end Trawr-Gesang.overlap.csv -a feature-map.yaml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
<author>Friedrich Spee</author>
</titleStmt>
<publicationStmt>
<p>1649</p>
</publicationStmt>
<sourceDesc>
<p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<lg>
<head>Trawr-Gesang von der noth Christi am Ölberg in dem Garten</head>
<lg>
<l>Bey stiller nacht<caesura/> zur ersten wacht</l>
<l><seg type="np">Ein </seg><hi><seg type="np">stimm</seg> sich gund</hi> zu klagen.</l>
<l>Jch nam in acht <caesura/> waß die doch sagt;</l>
<l>That hin mit augen schlagen.</l>
</lg>
</lg>
</body>
</text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
Now, this is nicely well-formed, since the internalizer splits the annotation into two segments.
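The splitting can be pictured with a deliberately simplified model: cut the annotated span at every tag and keep only the pieces that lie in text. The Python sketch below does this with a regular expression. It is not standoff's algorithm (a real implementation only needs to split where nesting would otherwise break, and it has to handle character references too), and the function name is hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")


def split_at_tags(xml, start, end):
    """Return the inclusive (start, end) pieces of the span that lie in text runs."""
    pieces = []
    cursor = 0
    for match in list(TAG.finditer(xml)) + [None]:
        run_start = cursor
        run_end = (match.start() if match else len(xml)) - 1   # last character of the text run
        lo, hi = max(start, run_start), min(end, run_end)
        if lo <= hi:                                           # the span touches this text run
            pieces.append((lo, hi))
        if match:
            cursor = match.end()
    return pieces


line = "Ein <hi>stimm sich gund</hi> zu klagen."
for lo, hi in split_at_tags(line, 0, 12):
    print(repr(line[lo:hi + 1]))
```

Parts of a span that fall inside a tag belong to no text run and therefore disappear, which matches the silent dropping described earlier.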
However, the information that "Ein stimm" is a single nominal phrase is lost. That's where special annotation features generated by standoff come into play. They enable us to re-aggregate segments that originate from the same annotation by means of TEI's aggregation mechanism.
The CSV record contains an id column. We use this column as a base identifier for the annotation and append standoff's special features to it. Here's what we append to the feature map:
__standoff_special__splitId:
  prefix: xml
  name: xml:id
__standoff_special__prevId:
  name: prev
  valuePrefix: "#"
That is: The generated feature __standoff_special__splitId is serialized as @xml:id, and the generated feature __standoff_special__prevId is serialized as @prev and prefixed with # to make it a same-document reference.
Are you missing information here? Yes! It seems that using the id column for generating the special features is hard-wired into standoff. We definitely have to make this configurable (TODO).
Here's the result of
standoff -i Trawr-Gesang.xml internalize -c seg -a feature-map.yaml --csv-start-end Trawr-Gesang.overlap.csv
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
<author>Friedrich Spee</author>
</titleStmt>
<publicationStmt>
<p>1649</p>
</publicationStmt>
<sourceDesc>
<p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<lg>
<head>Trawr-Gesang von der noth Christi am Ölberg in dem Garten</head>
<lg>
<l>Bey stiller nacht<caesura/> zur ersten wacht</l>
<l><seg xml:id="a1" type="np">Ein </seg><hi><seg prev="#a1" xml:id="a1-1" type="np">stimm</seg> sich gund</hi> zu klagen.</l>
<l>Jch nam in acht <caesura/> waß die doch sagt;</l>
<l>That hin mit augen schlagen.</l>
</lg>
</lg>
</body>
</text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
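A consumer of this document can re-aggregate the segments by following the @prev references back to the first segment of each chain. The following Python sketch works on plain dictionaries instead of parsed XML to stay short; the data structure and the function name are assumptions made for illustration, and the segments are expected in document order.

```python
def aggregate(segments):
    """Join the texts of segments that belong to the same chain of prev references."""
    by_id = {seg["xml:id"]: seg for seg in segments}
    chains = {}
    for seg in segments:
        head = seg
        while "prev" in head:                        # walk back to the first segment
            head = by_id[head["prev"].lstrip("#")]
        chains.setdefault(head["xml:id"], []).append(seg["text"])
    return {head_id: "".join(parts) for head_id, parts in chains.items()}


segments = [
    {"xml:id": "a1", "text": "Ein "},
    {"xml:id": "a1-1", "prev": "#a1", "text": "stimm"},
]
print(aggregate(segments))
```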
Note that there's @prev only, no @next. standoff produces a singly linked aggregation, not a doubly linked one. I (Christian) think that a doubly linked list only introduces redundancy, and generating the information required for it would need another iteration over the annotations in standoff's algorithm and thus would make it a bit slower. However, adding it to the codebase would be quite simple. If you definitely want doubly linked aggregation, don't hesitate to open a ticket.
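The redundancy argument can be illustrated from the consumer's side: the forward links are fully determined by the backward links, so they can be derived in one pass when needed. A minimal Python sketch, again on plain dictionaries with a hypothetical function name:

```python
def derive_next(segments):
    """Map each segment id to the id of its successor, using only the prev references."""
    next_of = {}
    for seg in segments:
        prev = seg.get("prev")
        if prev:
            next_of[prev.lstrip("#")] = seg["xml:id"]
    return next_of


segments = [
    {"xml:id": "a1"},
    {"xml:id": "a1-1", "prev": "#a1"},
]
print(derive_next(segments))
```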
Currently, StandOff Tools are not schema-aware. They produce invalid markup in certain situations. A common situation is an annotation spanning white space text nodes between pLike elements.
E.g. we want to annotate the words "zur ersten wacht Ein stimm" because the iambic metre is continued. (We wouldn't annotate metre and cadences this way in the real world, but a somewhat motivated example is needed here.)
Trawr-Gesang.invalid.csv
start,end,metre
0x2dd,0x310,jc
standoff -i Trawr-Gesang.xml internalize -c seg --csv-start-end Trawr-Gesang.invalid.csv
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="de">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Trawr-Gesant von der noth Christi am Oelberg in dem Garten</title>
<author>Friedrich Spee</author>
</titleStmt>
<publicationStmt>
<p>1649</p>
</publicationStmt>
<sourceDesc>
<p>taken from <ptr target="https://de.wikipedia.org/wiki/Z%C3%A4sur"/></p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<body>
<lg>
<head>Trawr-Gesang von der noth Christi am Ölberg in dem Garten</head>
<lg>
<l>Bey stiller nacht<caesura/> <seg>zur ersten wacht</seg></l><seg>
</seg><l><seg>Ein </seg><hi><seg>stimm</seg> sich gund</hi> zu klagen.</l>
<l>Jch nam in acht <caesura/> waß die doch sagt;</l>
<l>That hin mit augen schlagen.</l>
</lg>
</lg>
</body>
</text>
</TEI>
<?ASM MV DR7 AR0 ?>
<!-- epilog -->
The equidist command generates a certain flavour of plain text: equidistant text. Every character originating from text nodes has the same offset as in the XML source document.
standoff -i Trawr-Gesang.xml equidist
This produces
Trawr-Gesant von der noth Christi am Oelberg in dem Garten
Friedrich Spee
1649
taken from
Trawr-Gesang von der noth Christi am lberg in dem Garten
Bey stiller nacht zur ersten wacht
Ein stimm sich gund zu klagen.
Jch nam in acht waß die doch sagt;
That hin mit augen schlagen.
This text can be sent to an NLP or NER tool etc. The character offsets that such a tool returns directly reference character offsets in the XML source document. Thus, they can be fed to the internalizer.
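The idea behind equidistant text can be sketched in a few lines of Python: everything that is not text is overwritten with spaces of the same length, so no character moves. Judging from the output above, tags and character references are blanked out; that reading, the regular expression (which ignores corner cases such as a ">" inside attribute values or comments) and the function name are assumptions of this sketch, not a description of standoff's implementation.

```python
import re

MARKUP = re.compile(r"<[^>]*>|&[^;\s]+;")


def equidistant(xml):
    """Replace every tag and every reference by a run of spaces of the same length."""
    return MARKUP.sub(lambda m: " " * len(m.group()), xml)


line = "<l>Ein <hi>stimm sich gund</hi> zu klagen.</l>"
plain = equidistant(line)
print(repr(plain))
print(plain.index("stimm") == line.index("stimm"))
```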
However, we clearly see the limits: What about the character reference? What about entity references? And the many spaces may be noise to the tagging tool.
Equidistant text is an introductory device. Shrinked text (see below) is the thing to go with in real world scenarios.
The shrink command produces shrinked text. It is similar to equidistant text, however
- character references and entity references are evaluated
- tags are shrunk to the empty string or a configurable string
- subtrees can be muted (TODO)
- an offset mapping is generated
We need a config file which contains at least the namespace definitions. shrink.yaml:
default-namespace: "http://www.tei-c.org/ns/1.0"
prefixes:
  tei: "http://www.tei-c.org/ns/1.0"
And we have to define a sink for the offset mapping. The option -f FILE or its long form --offsets FILE is used to define it.
standoff -i Trawr-Gesang.xml shrink --config shrink.yaml --offsets /tmp/offsets.dat
This produces the following plain text:
Trawr-Gesant von der noth Christi am Oelberg in dem Garten
Friedrich Spee
1649
taken from
Trawr-Gesang von der noth Christi am Ölberg in dem Garten
Bey stiller nacht zur ersten wacht
Ein stimm sich gund zu klagen.
Jch nam in acht waß die doch sagt;
That hin mit augen schlagen.
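What an offset mapping has to record can be shown with a toy version of shrinking: tags are dropped, numeric character references are resolved, and for every character of the output the span it came from in the source is remembered. The format of the file written via --offsets is not described here, so this Python sketch simply uses a list; the function names are hypothetical, and entity references and tag replacements are left out.

```python
import html
import re

MARKUP = re.compile(r"<[^>]*>|&#x[0-9a-fA-F]+;|&#[0-9]+;")


def shrink(xml):
    """Return (text, mapping); mapping[i] is the inclusive source span of text[i]."""
    text, mapping, pos = [], [], 0
    for match in MARKUP.finditer(xml):
        for i in range(pos, match.start()):          # copy plain characters one by one
            text.append(xml[i])
            mapping.append((i, i))
        token = match.group()
        if token.startswith("&#"):                   # character reference: one output character
            text.append(html.unescape(token))
            mapping.append((match.start(), match.end() - 1))
        pos = match.end()                            # tags contribute no output
    for i in range(pos, len(xml)):
        text.append(xml[i])
        mapping.append((i, i))
    return "".join(text), mapping


def to_source_span(mapping, start, end):
    """Translate an inclusive span in the shrunk text into an inclusive source span."""
    return mapping[start][0], mapping[end][1]


source = "<head>am &#xd6;lberg</head>"
text, mapping = shrink(source)
start = text.index("Ölberg")
print(text, to_source_span(mapping, start, start + len("Ölberg") - 1))
```

A span found by an external tool in the shrunk text can thus be translated back into source offsets before it is passed to the internalizer.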
We can use the config file to define replacement strings for the tags. Different strings for open, close and empty tags can be defined. For example, we define a string to replace the empty <caesura>, and an extra new line after each <l>:
tags:
  caesura:
    empty: " || "
  l:
    close: "\n"
standoff -i Trawr-Gesang.xml shrink --config shrink.yaml --offsets /tmp/offsets.dat
results in:
Trawr-Gesant von der noth Christi am Oelberg in dem Garten
Friedrich Spee
1649
taken from
Trawr-Gesang von der noth Christi am Ölberg in dem Garten
Bey stiller nacht || zur ersten wacht
Ein stimm sich gund zu klagen.
Jch nam in acht || waß die doch sagt;
That hin mit augen schlagen.
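The effect of the replacement strings can be modelled in Python as a lookup by tag name and tag kind (open, close or empty). The dictionary below mirrors the tags section of the config file; the regular expression, the function name and the omission of namespaces are simplifying assumptions of this sketch, not standoff's behaviour.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>")

REPLACEMENTS = {
    "caesura": {"empty": " || "},
    "l": {"close": "\n"},
}


def replace_tags(xml, replacements):
    """Replace each tag by its configured string, or by the empty string if none is configured."""
    def substitute(match):
        closing, name, empty = match.groups()
        kind = "close" if closing else "empty" if empty else "open"
        return replacements.get(name, {}).get(kind, "")
    return TAG.sub(substitute, xml)


print(replace_tags("<l>Bey stiller nacht<caesura/> zur ersten wacht</l>", REPLACEMENTS))
```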
Is there something we can do about the leading spaces in the lines? Is there something like normalize-space() here?
TODO