mzhadigerov/checkify

Installation

  1. Go to your project directory (where this repository is cloned).
  2. Go to the checkify package (where the ocr and data packages are located).
  3. Download roberta-base.zip from here and extract it into the models package.
  4. Download Tesseract from https://github.com/UB-Mannheim/tesseract/wiki
  5. In .\checkify\ocr\ocr.py, replace the path in tess.pytesseract.tesseract_cmd = r"C:\\Program Files\\Tesseract-OCR\\tesseract.exe" with path\to\your\tesseract.exe.
  6. Install all the necessary libraries listed in the toml file.
  7. Run preparation.sh (make sure that you have nltk downloaded).
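
Step 5 hard-codes a Windows path. As a minimal sketch of an alternative, the helper below looks the executable up at runtime instead. The function name `resolve_tesseract_cmd` and the environment variable `TESSERACT_CMD` are assumptions made for illustration and are not part of this repository.

```python
import os
import shutil

# Default taken from step 5 above; adjust it for your machine.
DEFAULT_TESSERACT_CMD = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(default=DEFAULT_TESSERACT_CMD, env_var="TESSERACT_CMD"):
    """Return the Tesseract executable path to use.

    Lookup order: environment variable (hypothetical name),
    then a search of PATH, then the given default.
    """
    from_env = os.environ.get(env_var)
    if from_env:
        return from_env
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    return default


# In checkify/ocr/ocr.py the hard-coded assignment could then become
# (assuming pytesseract is imported there as `tess`):
# tess.pytesseract.tesseract_cmd = resolve_tesseract_cmd()
print(resolve_tesseract_cmd())
```

With this in place, the path can be changed per machine by setting the environment variable rather than editing the source file.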

Test

python .\main.py check-contract --path=test_file.pdf

Description

Diagram

This program adds an OCR layer on top of the roberta-base model by TheAtticusProject. The model was fine-tuned on contract documents manually annotated by law students. A detailed description of the CUAD dataset and the annotation process can be found here.

For further fine-tuning, new data can be annotated in the SQuAD format. Code for training can be found in the original repository.
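
For orientation, a SQuAD 2.0-style record nests questions and answers under a context paragraph, and each answer stores its character offset in that context. The sketch below builds one such record. The helper name `make_squad_example`, the sample clause, the question wording, and the id are made up for illustration, and the exact fields expected by the training code should be checked against the original repository.

```python
import json


def make_squad_example(title, context, question, answer_text, qa_id):
    """Build one SQuAD 2.0-style article holding a single question."""
    # answer_start is the character offset of the answer inside the context.
    start = context.find(answer_text) if answer_text else -1
    answers = []
    if start >= 0:
        answers.append({"text": answer_text, "answer_start": start})
    return {
        "title": title,
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": qa_id,
                "question": question,
                "answers": answers,
                # No answer found in the context marks the question unanswerable.
                "is_impossible": not answers,
            }],
        }],
    }


# Illustrative contract clause, not taken from any dataset.
context = "This Agreement shall be governed by the laws of the State of New York."
example = make_squad_example(
    title="sample_contract",
    context=context,
    question="Which law governs this contract?",
    answer_text="the laws of the State of New York",
    qa_id="sample_contract__governing_law",
)

dataset = {"version": "custom-v1", "data": [example]}
print(json.dumps(dataset, indent=2))
```

Computing `answer_start` from the context, rather than typing it by hand, keeps the offset consistent with the annotated text.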

eBrevia can be used for data annotation, as stated here under the Annotations section.

Code for prediction was taken from this repository.

About

Contract document checker
