This repository contains the code for a web crawler that downloads parallel texts in standard German and plain or easy-to-understand German. For each source, we aimed at downloading the complete relevant content of the website, removing parts such as navigation, advertisements, contact data, and other boilerplate. For some web pages, documents for which no parallel counterpart could be found are downloaded as well.
Furthermore, the web crawler recognizes paragraph boundaries and additionally saves the text with the paragraph marker "SEPL|||SEPR" (see the text_par directories).
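For illustration, the original paragraphs can be recovered by splitting on this marker; a minimal sketch with a made-up example string:

```python
# Minimal sketch: recover paragraphs by splitting on the marker.
text = "Das ist der erste Absatz. SEPL|||SEPR Das ist der zweite Absatz."
paragraphs = [p.strip() for p in text.split("SEPL|||SEPR") if p.strip()]
print(paragraphs)  # ['Das ist der erste Absatz.', 'Das ist der zweite Absatz.']
```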
The resulting parallel documents can be used for document-level simplification.
The documents can also be split into sentences (see sentence_split.py) and labeled with their complexity level, which makes the data usable for text leveling, too.
The current list of supported web pages can be found in the table below.
The data folder contains a preview of the full corpus. Please check the copyright of each website yourself to make sure you are allowed to use the data (the terms for academic and commercial use might differ).
The parallel documents can also be uploaded into the text simplification annotation tool TS-ANNO (Stodden and Kallmeyer, 2022) for further annotation, e.g., sentence-wise alignment. The output format of this code is identical to the input format of TS-ANNO.
The web crawler was already used to download the data for the DEplain-web corpus. See Stodden et al. (2023) for more information regarding the corpus and the web harvester.
This table summarizes the web pages (including metadata) that can be extracted with the web crawler. The data providers of the documents marked with ‡ explicitly state that their documents are professionally simplified and reviewed by the target group.
- Please install Python 3.
- Get your own copy of the code with `git clone`.
- Install the required packages (see requirements.txt), for example as shown below.
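Example setup commands (the repository URL and directory name are placeholders, replace them with the actual ones):

```
git clone <repository-url>   # placeholder: use the actual clone URL
cd <repository-directory>
pip install -r requirements.txt
```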
- `python get_urls_list.py`: get all URLs of the parallel documents and save the HTML content
- `python extract_text_data.py`: save the plain text (and the plain text with the paragraph marker "SEPL|||SEPR") of the parallel documents
- `python -m spacy download de_dep_news_trf`: download the spaCy NLP pipeline
- `python sentence_split.py`: split the parallel documents (and the simple-only documents) into sentences and label them with their complexity level
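The sentence splitting step can be illustrated with a minimal sketch; it assumes the two-line file format described below and derives the complexity label from the file name prefix, which is a simplification of what sentence_split.py actually does:

```python
# Minimal sketch of sentence splitting with the spaCy pipeline downloaded above.
# The two-line file format and the label derivation are assumptions here.
import spacy

nlp = spacy.load("de_dep_news_trf")

def split_into_labeled_sentences(path, label):
    """Return (sentence, complexity_label) pairs for one parallel file."""
    with open(path, encoding="utf-8") as f:
        _metadata, text = f.read().splitlines()[:2]
    return [(sent.text.strip(), label) for sent in nlp(text).sents]

# Hypothetical file name following the naming scheme described below.
pairs = split_into_labeled_sentences("simple_111.txt", "simple")
```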
Each plain text file follows the same data format. Parallel simple and complex files are named with the same identifier (e.g., simple_111.txt and complex_111.txt). An overview of all parallel files and all metadata is provided in [url_overview.tsv] and [url_overview_text.tsv]. The first line of each parallel file contains metadata and the second line contains the plain text (without line breaks). The metadata in the first line has the following format: `# © Origin: source_of_data [last accessed: YYYY-MM-DD]\ttitle_of_document`
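A minimal sketch of reading such a file, assuming the metadata line follows the pattern above exactly (with a literal tab before the title):

```python
# Minimal sketch: parse the two-line file format described above.
import re

METADATA_PATTERN = re.compile(
    r"# © Origin: (?P<source>.+) \[last accessed: (?P<date>\d{4}-\d{2}-\d{2})\]\t(?P<title>.+)"
)

def read_parallel_file(path):
    """Return (metadata dict, plain text) for one parallel file."""
    with open(path, encoding="utf-8") as f:
        metadata_line, text = f.read().splitlines()[:2]
    match = METADATA_PATTERN.match(metadata_line)
    return (match.groupdict() if match else {"raw": metadata_line}), text

metadata, text = read_parallel_file("complex_111.txt")
```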
The documents are aligned with three strategies in the following order:
- automatic alignment via an explicit reference to the simple document within the complex document,
- automatically matching the titles of the documents on the website (see the sketch after this list), and
- aligning the documents manually (see links/).
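A rough sketch of the title-matching strategy (hypothetical helper functions, not the crawler's exact logic):

```python
# Hypothetical sketch of title-based document alignment: normalize the
# titles and pair documents whose normalized titles are identical.
def normalize_title(title: str) -> str:
    return " ".join(title.lower().split())

def match_by_title(complex_docs: dict, simple_docs: dict) -> list:
    """complex_docs/simple_docs map document titles to URLs (assumed shape)."""
    simple_index = {normalize_title(t): url for t, url in simple_docs.items()}
    return [
        (complex_url, simple_index[normalize_title(t)])
        for t, complex_url in complex_docs.items()
        if normalize_title(t) in simple_index
    ]
```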
All books in the fiction domain were manually aligned at the document level, as the complex data is provided on a different web page (i.e., Projekt Gutenberg) than the simplified data (i.e., Spaß am Lesen Verlag, Passanten Verlag, or NDR).
Web content can change very frequently, so the web crawler may no longer support all web pages named above. Some web pages might have changed their URLs or their HTML structure in the meantime. In the main function of get_urls_list.py, you can disable web pages for which the crawler currently does not work (or web pages you are not interested in), as sketched below. We plan to overcome this issue in the future by providing links to archived versions of the web pages.
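The exact structure of get_urls_list.py may differ, but disabling a source could look like commenting out an entry in a list of crawled web pages (the names below are placeholders, not the actual source identifiers):

```python
# Hypothetical illustration only; names and structure are placeholders.
ENABLED_SOURCES = [
    "source_a",
    # "source_b",  # disabled: URL or HTML structure changed
    "source_c",
]
```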
Feel free to add more web pages or to contribute code for crawling them.
This code is licensed under the GPL-3.0 license.
If you use part of this work, please cite our paper:
@inproceedings{stodden-etal-2023-deplain,
title = "{DE}-plain: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification",
author = "Stodden, Regina and
Momen, Omar and
Kallmeyer, Laura",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
notes = "preprint: https://arxiv.org/abs/2305.18939",
}