Add some Readme's and update main one
jchazalon committed May 17, 2022
1 parent 86b5e34 commit e04359c
Showing 5 changed files with 86 additions and 15 deletions.
28 changes: 13 additions & 15 deletions README.md
@@ -1,19 +1,14 @@
# paper-ner-bench-das22
Sources (latex and code?) for our DAS 2022 paper (NER benchmark)

## Writing the paper
Official instructions for authors: <https://das2022.univ-lr.fr/index.php/author-instructions/>
## Easy-to-download items

- Dataset: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6394464.svg)](https://doi.org/10.5281/zenodo.6394464)
- Paper's PDF: HAL | arXiv | GitHub Release
- Supplementary material: GitHub Release


Key info:
- Springer LNCS format
- template documentation available as [llncsdoc.pdf](llncsdoc.pdf)
- Up to 15 pages
- Deadline: 4 Jan. 2022 (expect 1 week extension)

Structure:
- latex sources are in `src-latex`
- main latex file for the paper is `src-latex/main-paper.tex`
- sub-parts are under `src-latex/parts/` to limit edition conflicts ← edit `.tex` files here

## Code
Code is in `src/` for the most part. I guess some of the code is still in Python notebooks?
@@ -33,10 +28,13 @@ The SpaCy best model is not shared. It's 630+Mb large so storing it in this repo
Where do we share it (if so) ?


## Official assets
Maybe the current repo should remain public, and we copy/paste the relevant content on a public repo upon publication.
Or we just set it public after.
No strong opinion here yet.
## Latex sources

Structure:
- latex sources are in `src-latex`
- main latex file for the paper is `src-latex/main-paper.tex`
- sub-parts are under `src-latex/parts/`
- some supplementary material is available in `src-latex/main-supplementary-material.tex`

## Interesting related work
- http://spacetime.nypl.org/city-directory-meetup
60 changes: 60 additions & 0 deletions dataset/Readme.md
@@ -0,0 +1,60 @@
# Internal dataset we used for this paper

**Please do not use this dataset except to check our results: use the official, cleaned and well-documented dataset shared on Zenodo instead:**
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6394464.svg)](https://doi.org/10.5281/zenodo.6394464)


## Copyright and License
The images were extracted from the original source https://gallica.bnf.fr, owned by the *Bibliothèque nationale de France* (French national library).
Original contents from the *Bibliothèque nationale de France* can be reused non-commercially, provided the mention "Source gallica.bnf.fr / Bibliothèque nationale de France" is kept.
**Researchers do not have to pay any fee for reusing the original contents in research publications or academic works.**
*Original copyright mentions extracted from https://gallica.bnf.fr/edit/und/conditions-dutilisation-des-contenus-de-gallica on March 29, 2022.*

The original contents were significantly transformed before being included in this dataset.
All **derived content** is licensed under the permissive **Creative Commons Attribution 4.0 International** license.


## Content of this folder
For completeness, here is a content summary of the current folder:

- `supervised/00-dataset_article_das22.pdf/`:
Raw JSON objects exported from our storage system after human annotation.
There are some buggy regions which were later filtered.
- `supervised/00-deskewed-page-images/`:
Images exactly as they were presented to users when annotated.
Their color and skew differ from the original content.
- `supervised/00-ocr-pero-raw-json/`:
Raw OCR output for each entry from PERO OCR engine.
- `supervised/00-ocr-tess-raw-json/`:
*Same for Tesseract v4 engine.*
- `supervised/01-ocr-kraken-raw-json/`:
*Same for Kraken engine.*
- `supervised/10-ref-ocr-ner-json/`:
Cleaned-up human annotations: for each entry we provide the bounding box, whether the box is valid, the directory, the page, the human-labeled text and the human-labeled entities.
- `supervised/21-ocr-pero-final/`:
Normalized OCR output for PERO OCR engine.
- `supervised/22-ocr-tess-final/`:
*Same for Tesseract v4 engine.*
- `supervised/23-ocr-krak-final/`:
*Same for Kraken engine.*
- `supervised/31-ner_align_pero/`:
Reference entities from human labels projected onto PERO OCR predictions.
- `supervised/32-ner_align_tess/`:
*Same for Tesseract v4 engine.*
- `supervised/33-ner_align_krak/`:
*Same for Kraken engine.*
- `supervised/40-ner_aligned_valid_subset/`:
Selection of NER targets valid for reference, PERO OCR and Tesseract v4.
- `supervised/41-ner_aligned_valid_subset_with_kraken/`:
Selection of NER targets valid for reference, PERO OCR, Tesseract v4 **and Kraken**.
- `supervised/80-ocr-text-files/`:
OCR predictions as a separate file for each entry, to test our Python wrapper for UNLV-ISRI OCR evaluation tools.
- `supervised/81-eval-ocr-files/`:
Evaluation outputs from UNLV-ISRI OCR evaluation tools for PERO and Tesseract systems.
- `supervised/annotation_table.csv`:
Our internal annotation tracking table, which contains extra comments about each page (even those not included in the final dataset).
- `unsupervised_pretraining/00-raw_json.tar.gz`:
Raw entries detected from our platform, with PERO OCR predictions, without human correction, for approx. 7000 pages.
- `unsupervised_pretraining/10-normalized/`:
Normalized and cleaned files derived from `unsupervised_pretraining/00-raw_json.tar.gz`.
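
To inspect an archive such as `unsupervised_pretraining/00-raw_json.tar.gz` without unpacking it to disk, something like the following sketch may help. It assumes the archive holds individual `.json` files, which the folder name suggests but this Readme does not state; the function name `iter_json_entries` is ours and is not part of the repository code.

```python
import json
import tarfile
from typing import Any, Iterator, Tuple


def iter_json_entries(archive_path: str) -> Iterator[Tuple[str, Any]]:
    """Yield (member_name, parsed_json) for every .json file in a .tar.gz archive.

    Assumption: entries are stored as separate .json members; other members are skipped.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a JSON file.
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with handle:
                # json.load accepts a binary file object (UTF-8 encoded content).
                yield member.name, json.load(handle)
```

Iterating with `for name, entry in iter_json_entries(path): ...` streams one entry at a time, which keeps memory use low for an archive covering several thousand pages.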

Binary file removed llncsdoc.pdf
6 changes: 6 additions & 0 deletions material/Readme.md
@@ -0,0 +1,6 @@
# Raw experimental material for NER experiments
This folder contains detailed results for each NER run.

It contains the code to generate the "performance vs training set size" figure from raw data.

It also contains per-entity results which are not shown in the paper, because we did not have enough space.
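
As a rough illustration of the aggregation step that a "performance vs training set size" figure typically needs, here is a minimal sketch. It is not the code from this folder: it assumes each run can be reduced to a hypothetical `(train_size, score)` pair and that runs sharing a training set size are averaged, neither of which is confirmed here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_score_by_train_size(runs: Iterable[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Group (train_size, score) pairs by train_size and average the scores.

    Returns the points sorted by increasing training set size, ready for plotting.
    """
    grouped: Dict[int, List[float]] = defaultdict(list)
    for train_size, score in runs:
        grouped[train_size].append(score)
    return [(size, mean(scores)) for size, scores in sorted(grouped.items())]
```

The returned list of points can be passed to any plotting library, with the training set size on the x axis and the averaged score on the y axis.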
7 changes: 7 additions & 0 deletions preparation/Readme.md
@@ -0,0 +1,7 @@
# Content used for paper preparation

There is nothing very interesting here, mostly some tests and tools used for early work.

We keep it as a personal backup, to avoid fragmenting our files.

Please do not waste your time here.
