Teprolin

Teprolin is a Python platform for text pre-processing that has been developed in the ReTeRom project. It is described in the following paper, available in the conference proceedings:

Ion, Radu. (2018). TEPROLIN: An Extensible, Online Text Preprocessing Platform for Romanian. In Proceedings of the International Conference on Linguistic Resources and Tools for Processing Romanian Language (ConsILR 2018), November 22-23, 2018, Iași, România.

Installation

Teprolin requires Python 3; it has been tested with versions 3.6, 3.7, and 3.8 on both Windows 10 and Ubuntu Linux 20.04. Teprolin includes the TTL text pre-processor, which runs in Perl. On Windows we used Strawberry Perl; on Ubuntu, the default Perl installation.
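As a quick sanity check of these prerequisites (the exact version strings will differ on your machine), you can run:

python3 --version

perl -v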

TTL

To make sure TTL works, issue the following commands in a perl-enabled command prompt (perl has to be in PATH):

cpan install Unicode::String

cpan install Algorithm::Diff

cpan install BerkeleyDB

cpan install File::Which

cpan install File::HomeDir

Check that the script named TeproTTL.pl compiles OK by executing perl -c TeproTTL.pl.
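Optionally, you can also check that all five modules load with the same perl you will be using for TTL; the command below exits quietly (status 0) when everything is in place:

perl -MUnicode::String -MAlgorithm::Diff -MBerkeleyDB -MFile::Which -MFile::HomeDir -e 1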

NLP-Cube and UD-Pipe

NLP-Cube and UD-Pipe 1 have their own repositories on GitHub.

TTS Frontend

SSLA is a Text-To-Speech library developed by Tiberiu Boroș et al. Read about it on arXiv. The source code can be found on GitHub at SSLA. MLPLA is the text preprocessing front-end for SSLA and it is used in TEPROLIN for:

  • word hyphenation
  • word stress identification
  • phonetic transcription

Additionally, we ported some code from our ROBIN Dialog Manager project to do numeral rewriting, also for the benefit of TTS tools. In order to run MLPLA, you need the Java Runtime Environment (JRE) 15 installed and available in PATH.
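To confirm that the right Java runtime is picked up from PATH, check the reported version (it should be a 15.x runtime):

java -version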

If you want to build the MLPLAServer yourself, install the MLPLA text preprocessing library in your local Maven repository by running this command:

mvn install:install-file -Dfile=ttsops/MLPLAServer/lib/MLPLA.jar -DgroupId=ro.racai -DartifactId=mlpla -Dversion=1.0.0 -Dpackaging=jar -DgeneratePom=true

Then run the following mvn command to generate the jar with all dependencies:

mvn clean compile test assembly:single antrun:run@copy-uber-jar

Teprolin resource files

The resource files (models, lexicons, mapping files, etc.) are loaded by all the NLP apps of Teprolin. They sit in the .teprolin folder under your home folder: %USERPROFILE% on Windows 10 and ~ on Linux. These files are now automatically installed by TEPROLIN.
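To confirm that the resources are in place, simply list the folder (Linux path shown below; on Windows, look under %USERPROFILE%\.teprolin):

ls ~/.teprolin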

Python 3 dependencies

To install all the required Python 3 packages using a virtual environment, do this:

python3 -m venv /path/to/new/virtual/environment

then activate the new environment by running source /path/to/new/virtual/environment/bin/activate (on Windows, run the Scripts\activate.bat script from the same folder instead). Finally, run

pip3 install -r requirements.txt

Testing

For a quick test session with small texts (say up to 1000 characters), head to RELATE's test page. If you want to compare different algorithms (e.g. UD-Pipe vs. NLP-Cube), you can access this link.

If you want to test the installation, issue pytest -v tests from the root of this repository. Please be patient; it will take a while.

Running the REST web service

To quickly test the REST service, with logging to the console, run the following command from the root of this repository:

python3 TeproREST.py

This starts the server in the foreground, as a single process, in development mode.
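Once the server is up, you can send it a request, e.g. with curl. Note that the port, the /process endpoint and the text/exec parameter names below are assumptions and should be checked against TeproREST.py:

curl -X POST --data-urlencode "text=Ana are mere." --data-urlencode "exec=tokenization" http://127.0.0.1:5000/process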

Only on Linux: to start/stop the server in production mode using uwsgi for the RELATE platform, do this:

pip3 install uwsgi

start-ws.sh

stop-ws.sh

To start the server on three different ports for faster, multi-threaded processing, do this:

start-ws-mt.sh

stop-ws-mt.sh

Docker container

The easiest way to use the Teprolin text processing platform is to pull the pre-built Docker image:

docker pull raduion/teprolin:1.1

from Docker Hub.
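To run the pulled image, map the port on which the REST service listens inside the container onto a local port; the 5000 below is an assumption, so check the Dockerfile for the actual exposed port:

docker run -p 5000:5000 raduion/teprolin:1.1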

If you want to build the image yourself, just issue:

docker build --pull --rm -f "Dockerfile" -t teprolin:1.1 "."

or use the Visual Studio Code Docker extension along with Docker Desktop.