- Go to your project directory (where this repository is cloned).
- Go to the `checkify` package (where the `ocr` and `data` packages are located).
- Download `roberta-base.zip` from here and extract it to the `models` package.
- Download `Tesseract` from [here](https://github.com/UB-Mannheim/tesseract/wiki).
- In `.\checkify\ocr\ocr.py`, replace the path in `tess.pytesseract.tesseract_cmd = r"C:\\Program Files\\Tesseract-OCR\\tesseract.exe"` with `path\to\your\tesseract.exe`.
- Install all the necessary libraries listed in the `toml` file.
- Run `preparation.sh` (make sure that you have `nltk` downloaded).
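Instead of hard-coding the Tesseract path, the executable could be located at runtime. The following is only a minimal sketch, not the code used in `ocr.py`: the environment variable name `TESSERACT_CMD` and the helper `resolve_tesseract_cmd` are hypothetical, and the fallback is the Windows path quoted in the step above.

```python
import os
import shutil
from pathlib import Path
from typing import Optional

# Path quoted in the setup step above; adjust it for your own installation.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(env_var: str = "TESSERACT_CMD") -> Optional[str]:
    """Return the first existing Tesseract executable, or None if none is found.

    `env_var` is a hypothetical override name chosen for this sketch.
    """
    candidates = [
        os.environ.get(env_var),    # explicit override set by the user
        shutil.which("tesseract"),  # executable found on PATH
        DEFAULT_WINDOWS_PATH,       # fallback from the setup step
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return candidate
    return None


# Possible use inside ocr.py, assuming `tess` is the imported pytesseract module:
#   tess.pytesseract.tesseract_cmd = resolve_tesseract_cmd()
```

If the function returns `None`, Tesseract is probably not installed or its location needs to be set by hand as described in the step above.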
To check a contract, run `python .\main.py check-contract --path=test_file.pdf`.
This program adds an OCR layer on top of the roberta-base model by TheAtticusProject. The model was fine-tuned on contract documents manually annotated by law students. A detailed description of the CUAD dataset and the annotation process can be found here.
For further fine-tuning, new data can be annotated in the SQuAD format. The training code can be found in the original repository.
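As a rough illustration of what a SQuAD-style annotation looks like, the sketch below builds one entry in Python. It assumes the SQuAD 2.0 layout (`data` → `paragraphs` → `qas` → `answers` with `answer_start` character offsets); the helper `make_squad_example`, the sample contract sentence, the question, and the id are all invented for this example, so check the exact fields against the training code in the original repository.

```python
import json
from typing import Any, Dict


def make_squad_example(title: str, context: str, question: str,
                       answer_text: str, qa_id: str) -> Dict[str, Any]:
    """Build one SQuAD 2.0-style document entry holding a single question."""
    # answer_start is the character offset of the answer inside the context.
    start = context.find(answer_text) if answer_text else -1
    has_answer = start >= 0
    answers = [{"text": answer_text, "answer_start": start}] if has_answer else []
    return {
        "title": title,
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": qa_id,
                "question": question,
                "answers": answers,
                # SQuAD 2.0 marks questions with no answer in the context.
                "is_impossible": not has_answer,
            }],
        }],
    }


# Invented sample text, used only to show the structure.
context = "This Agreement shall be governed by the laws of the State of New York."
entry = make_squad_example(
    title="sample_contract",
    context=context,
    question="Which state's law governs the agreement?",
    answer_text="the State of New York",
    qa_id="sample_contract__governing_law",
)
dataset = {"version": "illustrative", "data": [entry]}
print(json.dumps(dataset, indent=2))
```

Computing `answer_start` from the context, rather than typing it by hand, keeps the offset consistent with the answer text, which matters because extractive QA training reads the answer span by position.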
eBrevia can be used for data annotation, as stated here under the Annotations section.
The prediction code was taken from this repository.