Add --auto-textsigle <start-sigle> option
Also allows for processing plain TEI P5 files without any IDs.

Change-Id: Ic16b089c916d2e50458aa1aa6cb80ce4d37d97ba
kupietz authored and Akron committed Nov 15, 2024
1 parent 6b1f26b commit fc3a0ee
Showing 4 changed files with 43 additions and 3 deletions.
1 change: 1 addition & 0 deletions Changes
@@ -1,6 +1,7 @@
2.6.0 2024-09-19
- Add -o parameter.
- Add support for inline dependency relations.
- Add support for --auto-textsigle.

2.5.0 2024-01-24
- Upgrade minimal Perl version to 5.36 to improve
11 changes: 11 additions & 0 deletions Readme.pod
@@ -165,6 +165,17 @@ C<--no-skip-inline-token-annotations>.
Expects a comma-separated list of tags to be ignored when the structure
is parsed. Content of these tags however will be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (separated by B<@> between the
14 changes: 13 additions & 1 deletion lib/KorAP/XML/TEI.pm
@@ -4,7 +4,7 @@ use strict;
use warnings;

use Exporter 'import';
-our @EXPORT_OK = qw(remove_xml_comments escape_xml escape_xml_minimal replace_entities);
+our @EXPORT_OK = qw(remove_xml_comments escape_xml escape_xml_minimal replace_entities increase_auto_textsigle);

# convert '&', '<' and '>' into their corresponding sgml-entities
my %ent_without_quot = (
@@ -180,4 +180,16 @@ sub replace_entities {
return($_);
};

sub increase_auto_textsigle {
  my $sigle = shift;

  if ($sigle =~ /(\d+)$/) {
    my $number = $1;
    my $length = length($number);
    $number++;
    my $new_number = sprintf("%0${length}d", $number);
    $sigle =~ s/\d+$/$new_number/;
  }
  return $sigle;
}
1;
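
The behaviour of increase_auto_textsigle can be checked in isolation. The following is a minimal sketch, not part of the commit: it copies the function body from this diff so that it runs without KorAP::XML::TEI installed, and the sample sigles are made up for illustration. It shows that the zero-padded width is kept, that a counter outgrowing its width is widened rather than truncated, and that a sigle without trailing digits is returned unchanged.

```perl
use strict;
use warnings;

# Copy of the function added in this commit, reproduced here so the
# sketch runs standalone.
sub increase_auto_textsigle {
  my $sigle = shift;

  # Only a trailing run of digits is incremented
  if ($sigle =~ /(\d+)$/) {
    my $number = $1;
    my $length = length($number);
    $number++;
    # Re-pad to the original width; a longer result is not truncated
    my $new_number = sprintf("%0${length}d", $number);
    $sigle =~ s/\d+$/$new_number/;
  }
  return $sigle;
}

# Illustrative sigles (assumptions, not taken from a real corpus)
for my $sigle ('ICC/GER.00001', 'ICC/GER.00099', 'ICC/GER.99999', 'ICC/GER.abc') {
  printf "%s -> %s\n", $sigle, increase_auto_textsigle($sigle);
}
# ICC/GER.00001 -> ICC/GER.00002
# ICC/GER.00099 -> ICC/GER.00100
# ICC/GER.99999 -> ICC/GER.100000
# ICC/GER.abc -> ICC/GER.abc
```
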
20 changes: 18 additions & 2 deletions script/tei2korapxml
@@ -6,6 +6,7 @@ use Log::Any '$log';
use Log::Any::Adapter;
use Pod::Usage;
use Getopt::Long qw(GetOptions :config no_auto_abbrev);
use KorAP::XML::TEI qw(increase_auto_textsigle);

use File::Basename qw(dirname);

@@ -45,6 +46,7 @@ my $inline_deps_exclusive = 0;

# Parse options from the command line
GetOptions(
'auto-textsigle|A=s' => \(my $auto_textsigle = ''),
'root|r=s' => \(my $root_dir = '.'),
'input|i=s' => \(my $input_fname = ''),
'output|o=s' => \(my $output_fname = ''),
@@ -460,8 +462,11 @@ MAIN: while (<$input_fh>) {
};

# Parse header
-my $header = KorAP::XML::TEI::Header->new($content, $input_enc, $text_id_esc)->parse($input_fh);
-
+my $header = KorAP::XML::TEI::Header->new($content, $input_enc, $text_id_esc // $auto_textsigle)->parse($input_fh);
+if ($auto_textsigle) {
+  $auto_textsigle = increase_auto_textsigle($auto_textsigle);
+  $log->debug("Auto-incremented text sigle to $auto_textsigle");
+};
# Header was parseable
if ($header) {

@@ -666,6 +671,17 @@ C<--no-skip-inline-token-annotations>.
Expects a comma-separated list of tags to be ignored when the structure
is parsed. Content of these tags however will be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>
Expects a regular replacement expression (separated by B<@> between the
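
The fallback in script/tei2korapxml combines Perl's defined-or operator (`//`) with an increment per processed text. The sketch below imitates that loop outside the script. It is only an illustration: the list of text IDs is hypothetical, `undef` is assumed to stand for a text without an ID of its own, and the helper is a shortened stand-in for increase_auto_textsigle rather than the committed code.

```perl
use strict;
use warnings;

# Shortened stand-in for KorAP::XML::TEI::increase_auto_textsigle
sub increase_auto_textsigle {
  my $sigle = shift;
  if ($sigle =~ /(\d+)$/) {
    my $width = length($1);
    my $next  = sprintf("%0${width}d", $1 + 1);
    $sigle =~ s/\d+$/$next/;
  }
  return $sigle;
}

# Hypothetical input: undef marks a text that carries no ID of its own
my @text_ids       = (undef, 'ABC_DEF.00007', undef);
my $auto_textsigle = 'ICC/GER.00001';

my @used;
for my $text_id (@text_ids) {
  # Defined-or: the auto sigle is used only when no ID is defined
  push @used, $text_id // $auto_textsigle;

  # As in the diff, the counter advances once per text whenever
  # --auto-textsigle is set, even if the text had its own ID
  $auto_textsigle = increase_auto_textsigle($auto_textsigle) if $auto_textsigle;
}

print "$_\n" for @used;
# ICC/GER.00001
# ABC_DEF.00007
# ICC/GER.00003
```

Because the counter advances for every text, auto-generated sigles in this sketch can skip numbers when the input mixes texts with and without IDs. Note also that `//` falls back only on `undef`, not on an empty string.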
