Add some Readme's and update main one

soduco · May 17, 2022 · e04359c
1 parent 86b5e34

Showing 5 changed files with 86 additions and 15 deletions.
diff --git a/README.md b/README.md
@@ -1,19 +1,14 @@
 # paper-ner-bench-das22
 Sources (latex and code?) for our DAS 2022 paper (NER benchmark)
 
-## Writing the paper
-Official instructions for authors: <https://das2022.univ-lr.fr/index.php/author-instructions/>
+## Easy-to-download items
+
+- Dataset: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6394464.svg)](https://doi.org/10.5281/zenodo.6394464)
+- Paper's PDF: HAL | arXiv | GitHub Release
+- Supplementary material: GitHub Release
+
 
-Key info:
-- Springer LNCS format
-  - template documentation available as [llncsdoc.pdf](llncsdoc.pdf)
-- Up to 15 pages
-- Deadline: 4 Jan. 2022 (expect 1 week extension)
 
-Structure:
-- latex sources are in `src-latex`
-- main latex file for the paper is `src-latex/main-paper.tex`
-- sub-parts are under `src-latex/parts/` to limit edition conflicts ← edit `.tex` files here
 
 ## Code
 Code is in `src/` for the most part. I guess some of the code is still on python notebooks ?
@@ -33,10 +28,13 @@ The SpaCy best model is not shared. It's 630+Mb large so storing it in this repo
 Where do we share it (if so) ?
 
 
-## Official assets
-Maybe the current repo should remain public, and we copy/paste the relevant content on a public repo upon publication.
-Or we just set it public after.
-No strong opinion here yet.
+## Latex sources
+
+Structure:
+- latex sources are in `src-latex`
+- main latex file for the paper is `src-latex/main-paper.tex`
+- sub-parts are under `src-latex/parts/`
+- some supplementary material is available in `src-latex/main-supplementary-material.tex`
 
 ## Interesting related work
 - http://spacetime.nypl.org/city-directory-meetup
diff --git a/dataset/Readme.md b/dataset/Readme.md
@@ -0,0 +1,60 @@
+# Internal dataset we used for this paper
+
+**Please do not use this dataset except to check our results: use the official, cleaned and well-documented dataset shared on Zenodo instead:**  
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6394464.svg)](https://doi.org/10.5281/zenodo.6394464)
+
+
+## Copyright and License
+The images were extracted from the original source https://gallica.bnf.fr, owned by the *Bibliothèque nationale de France* (French national library).
+Original contents from the *Bibliothèque nationale de France* can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.  
+**Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.**  
+*Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.*
+
+The original contents were significantly transformed before being included in this dataset.
+All **derived content** is licensed under the permissive **Creative Commons Attribution 4.0 International** license.
+
+
+## Content of this folder
+For completeness, here is a content summary of the current folder:
+
+- `supervised/00-dataset_article_das22.pdf/`:
+  Raw JSON objects exported from our storage system after human annotation.
+  There are some buggy regions which were later filtered.
+- `supervised/00-deskewed-page-images/`:
+  Images exactly as they were presented to users when annotated.
+  Their color and skew differ from the original content.
+- `supervised/00-ocr-pero-raw-json/`:
+  Raw OCR output for each entry from PERO OCR engine.
+- `supervised/00-ocr-tess-raw-json/`:
+  *Same for Tesseract v4 engine.*
+- `supervised/01-ocr-kraken-raw-json/`:
+  *Same for Kraken engine.*
+- `supervised/10-ref-ocr-ner-json/`:
+  Cleaned-up human annotations: for each entry we provide the bounding box, whether the box is valid, the directory, the page, the human-labeled text, and the human-labeled entities.
+- `supervised/21-ocr-pero-final/`:
+  Normalized OCR output for PERO OCR engine.
+- `supervised/22-ocr-tess-final/`:
+  *Same for Tesseract v4 engine.*
+- `supervised/23-ocr-krak-final/`:
+  *Same for Kraken engine.*
+- `supervised/31-ner_align_pero/`:
+  Reference entities from human labels projected onto PERO OCR predictions.
+- `supervised/32-ner_align_tess/`:
+  *Same for Tesseract v4 engine.*
+- `supervised/33-ner_align_krak/`:
+  *Same for Kraken engine.*
+- `supervised/40-ner_aligned_valid_subset/`:
+  Selection of NER targets valid for reference, PERO OCR and Tesseract v4.
+- `supervised/41-ner_aligned_valid_subset_with_kraken/`:
+  Selection of NER targets valid for reference, PERO OCR, Tesseract v4 **and Kraken**.
+- `supervised/80-ocr-text-files/`:
+  OCR predictions as a separate file for each entry, to test our Python wrapper for UNLV-ISRI OCR evaluation tools.
+- `supervised/81-eval-ocr-files/`:
+  Evaluation outputs from UNLV-ISRI OCR evaluation tools for PERO and Tesseract systems.
+- `supervised/annotation_table.csv`:
+  Our internal annotation tracking table, which contains extra comments about each page (even those not included in the final dataset).
+- `unsupervised_pretraining/00-raw_json.tar.gz`:
+  Raw entries detected from our platform, with PERO OCR predictions, without human correction, for approx. 7000 pages.
+- `unsupervised_pretraining/10-normalized/`:
+  Normalized and cleaned file from `unsupervised_pretraining/00-raw_json.tar.gz`.
+
diff --git a/llncsdoc.pdf b/llncsdoc.pdf
diff --git a/material/Readme.md b/material/Readme.md
@@ -0,0 +1,6 @@
+# Raw experimental material for NER experiments
+This folder contains detailed results for each NER run.
+
+It contains the code to generate the "performance vs training set size" figure from raw data.
+
+It also contains per-entity results which are not shown in the paper, because we did not have enough space.
diff --git a/preparation/Readme.md b/preparation/Readme.md
@@ -0,0 +1,7 @@
+# Content used for paper preparation
+
+There is nothing very interesting here, mostly some tests and tools used for early work.
+
+We keep it as a personal backup, to avoid fragmenting our files.
+
+Please do not waste your time here.