Commit

Add streamlit app

VikParuchuri committed Feb 10, 2024
1 parent 9d3e906 commit 3d0e487
Showing 13 changed files with 544 additions and 77 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ wandb
notebooks
results
data
slices

# Byte-compiled / optimized / DLL files
__pycache__/
56 changes: 28 additions & 28 deletions README.md
@@ -32,7 +32,7 @@ Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who
| Presentation | [Image](static/images/pres.png) | [Image](static/images/pres_text.png) |
| Scientific Paper | [Image](static/images/paper.png) | [Image](static/images/paper_text.png) |
| Scanned Document | [Image](static/images/scanned.png) | [Image](static/images/scanned_text.png) |
| Scanned Form | [Image](static/images/funsd.png) | |
| Scanned Old Form | [Image](static/images/funsd.png) | [Image](static/images/funsd_text.jpg) |

# Installation

@@ -51,6 +51,15 @@ Model weights will automatically download the first time you run surya. Note th
- Inspect the settings in `surya/settings.py`. You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`. For text detection, the `mps` device has a bug (on the [Apple side](https://github.com/pytorch/pytorch/issues/84936)) that may prevent it from working properly.

## Interactive App

I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:

```
pip install streamlit
surya_gui
```

## OCR (text recognition)

You can detect text in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected text and bboxes, and optionally save images of the reconstructed page.
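
The command itself is collapsed out of this hunk; a typical invocation presumably looks like the following (the `surya_ocr` entry point and flag names are assumptions based on the `surya_detect` example further down):

```
surya_ocr DATA_PATH --images --langs en
```

Here `DATA_PATH` is an image, pdf, or folder, `--langs` lists the languages to recognize, and `--images` saves the reconstructed page images alongside `results.json`.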
@@ -78,10 +87,7 @@ The `results.json` file will contain these keys for each page of the input docum

**Performance tips**

Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default is a batch size `256`, which will use about 10GB of VRAM.

Depending on your CPU core count, `RECOGNITION_BATCH_SIZE` might make a difference there too - the default CPU batch size is `32`.

Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default is a batch size `256`, which will use about 10GB of VRAM. Depending on your CPU core count, it may help, too - the default CPU batch size is `32`.
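
For example, to run recognition with a larger batch on a roomy GPU (assuming the same `surya_ocr` entry point as above):

```
RECOGNITION_BATCH_SIZE=512 surya_ocr DATA_PATH
```

At `40MB` per batch item, a batch size of `512` needs roughly 20GB of VRAM.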

### From python

@@ -94,20 +100,15 @@ from surya.model.recognition.processor import load_processor as load_rec_process
```python
image = Image.open(IMAGE_PATH)
langs = ["en"] # Replace with your languages
det_processor = load_det_processor()
det_model = load_det_model()
rec_model = load_rec_model()
rec_processor = load_rec_processor()
det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()
predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)
```
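
For reference, here is a self-contained version of the snippet above. The import lines sit in the collapsed part of the hunk, so the aliases are assumptions modeled on the imports in `ocr_app.py` from this same commit:

```python
from PIL import Image

from surya.ocr import run_ocr
from surya.model.detection.segformer import load_model as load_det_model, load_processor as load_det_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor

IMAGE_PATH = "page.png"  # any image of a document page
image = Image.open(IMAGE_PATH)
langs = ["en"]  # languages expected in this image

det_processor, det_model = load_det_processor(), load_det_model()
rec_model, rec_processor = load_rec_model(), load_rec_processor()

# One language list per input image; each result dict carries "text_lines" and "bboxes".
predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)
print(predictions[0]["text_lines"])
```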


## Text line detection

You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes, and optionally save images of the pages with the bboxes.
You can detect text lines in an image, pdf, or folder of images/pdfs with the following command. This will write out a json file with the detected bboxes.

```
surya_detect DATA_PATH --images
```
@@ -128,12 +129,7 @@ The `results.json` file will contain these keys for each page of the input docum

**Performance tips**

Setting the `DETECTOR_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `280MB` of VRAM, so very high batch sizes are possible. The default is a batch size `32`, which will use about 9GB of VRAM.

Depending on your CPU core count, `DETECTOR_BATCH_SIZE` might make a difference there too - the default CPU batch size is `2`.

You can adjust `DETECTOR_NMS_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. Try lowering them to detect more text, and vice versa.

Setting the `DETECTOR_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `280MB` of VRAM, so very high batch sizes are possible. The default is a batch size `32`, which will use about 9GB of VRAM. Depending on your CPU core count, it might help, too - the default CPU batch size is `2`.

### From python

@@ -149,9 +145,20 @@ model, processor = load_model(), load_processor()
```python
predictions = batch_detection([image], model, processor)
```
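
As above, a runnable version with the collapsed imports filled in (module paths taken from `ocr_app.py` in this commit):

```python
from PIL import Image

from surya.detection import batch_detection
from surya.model.detection.segformer import load_model, load_processor

image = Image.open("page.png")  # any image of a document page
model, processor = load_model(), load_processor()

# One result dict per input image; the demo app draws result["polygons"] on the page.
predictions = batch_detection([image], model, processor)
print(predictions[0]["polygons"])
```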

## Table and chart detection
# Limitations

- This is specialized for document OCR. It will likely not work on photos or other images.
- It is for printed text, not handwriting (though it may work on some handwriting).
- The model has trained itself to ignore advertisements.
- You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.

## Troubleshooting

If OCR isn't working properly:

- If the lines aren't detected properly, try increasing the resolution of the image if its width is below `896px`, and decreasing it if the width is much higher. Very wide images don't work well with the detector.
- You can adjust `DETECTOR_BLANK_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. `DETECTOR_BLANK_THRESHOLD` controls the space between lines - any prediction below this number will be considered blank space. `DETECTOR_TEXT_THRESHOLD` controls how text is joined - any number above this is considered text. `DETECTOR_TEXT_THRESHOLD` should always be higher than `DETECTOR_BLANK_THRESHOLD`, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).
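
For example, a hypothetical override to try while debugging faint detections (the values are illustrative; like the other settings, these should be overridable via environment variables):

```
DETECTOR_BLANK_THRESHOLD=0.25 DETECTOR_TEXT_THRESHOLD=0.4 surya_detect DATA_PATH --images
```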

Coming soon.

# Manual install

@@ -162,13 +169,6 @@ If you want to develop surya, you can install it manually:
- `poetry install` - installs main and dev dependencies
- `poetry shell` - activates the virtual environment

# Limitations

- This is specialized for document OCR. It will likely not work on photos or other images.
- It is for printed text, not handwriting (though it may work on some handwriting).
- The model has trained itself to ignore advertisements.
- You can find language support for OCR in `surya/languages.py`. Text detection should work with any language.

# Benchmarks

## OCR
38 changes: 0 additions & 38 deletions demo_app.py

This file was deleted.

119 changes: 119 additions & 0 deletions ocr_app.py
@@ -0,0 +1,119 @@
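# Streamlit demo for Surya: upload an image or PDF page, then run text detection or full OCR on it.
# Run with `streamlit run ocr_app.py`; the `surya_gui` command mentioned in the README presumably wraps this script.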
import io

import pypdfium2
import streamlit as st
from surya.detection import batch_detection
from surya.model.detection.segformer import load_model, load_processor
from surya.model.recognition.model import load_model as load_rec_model
from surya.model.recognition.processor import load_processor as load_rec_processor
from surya.postprocessing.heatmap import draw_polys_on_image
from surya.ocr import run_ocr
from surya.postprocessing.text import draw_text_on_image
from PIL import Image
from surya.languages import CODE_TO_LANGUAGE
from surya.input.langs import replace_lang_with_code


@st.cache_resource()
def load_det_cached():
    return load_model(), load_processor()


@st.cache_resource()
def load_rec_cached():
    return load_rec_model(), load_rec_processor()


def text_detection(img):
    preds = batch_detection([img], det_model, det_processor)[0]
    det_img = draw_polys_on_image(preds["polygons"], img.copy())
    return det_img, preds


# Function for OCR
def ocr(img, langs):
    replace_lang_with_code(langs)
    pred = run_ocr([img], [langs], det_model, det_processor, rec_model, rec_processor)[0]
    rec_img = draw_text_on_image(pred["bboxes"], pred["text_lines"], img.size)
    return rec_img, pred


def open_pdf(pdf_file):
    stream = io.BytesIO(pdf_file.getvalue())
    return pypdfium2.PdfDocument(stream)


@st.cache_data()
def get_page_image(pdf_file, page_num, dpi=96):
    doc = open_pdf(pdf_file)
    # pdfium render scale is relative to 72 dpi, so scale = dpi / 72 gives the requested resolution
    renderer = doc.render(
        pypdfium2.PdfBitmap.to_pil,
        page_indices=[page_num - 1],
        scale=dpi / 72,
    )
    png = list(renderer)[0]
    png_image = png.convert("RGB")
    return png_image


@st.cache_data()
def page_count(pdf_file):
    doc = open_pdf(pdf_file)
    return len(doc)


st.set_page_config(layout="wide")
col1, col2 = st.columns([.5, .5])

det_model, det_processor = load_det_cached()
rec_model, rec_processor = load_rec_cached()


st.markdown("""
# Surya OCR Demo
This app will let you try surya, a multilingual OCR model. It supports text detection in any language, and text recognition in 90+ languages.
Notes:
- This works best on documents with printed text.
- Try to keep the image width around 896, especially if you have large text.
- This supports 90+ languages, see [here](https://github.com/VikParuchuri/surya/tree/master/surya/languages.py) for a full list of codes.
Find the project [here](https://github.com/VikParuchuri/surya).
""")

in_file = st.sidebar.file_uploader("PDF file or image:", type=["pdf", "png", "jpg", "jpeg", "gif", "webp"])
languages = st.sidebar.multiselect("Languages", sorted(list(CODE_TO_LANGUAGE.values())), default=["English"], max_selections=4)

if in_file is None:
    st.stop()

filetype = in_file.type
whole_image = False
if "pdf" in filetype:
    page_count = page_count(in_file)
    page_number = st.sidebar.number_input(f"Page number out of {page_count}:", min_value=1, value=1, max_value=page_count)

    pil_image = get_page_image(in_file, page_number)
else:
    pil_image = Image.open(in_file).convert("RGB")

text_det = st.sidebar.button("Run Text Detection")
text_rec = st.sidebar.button("Run OCR")

# Run Text Detection
if text_det and pil_image is not None:
    det_img, preds = text_detection(pil_image)
    with col1:
        st.image(det_img, caption="Detected Text", use_column_width=True)
        st.json(preds)

# Run OCR
if text_rec and pil_image is not None:
    rec_img, pred = ocr(pil_image, languages)
    with col1:
        st.image(rec_img, caption="OCR Result", use_column_width=True)
        st.json(pred)

with col2:
    st.image(pil_image, caption="Uploaded Image", use_column_width=True)
