This project provides an automated pipeline for processing Piping & Instrumentation Diagram (P&ID) PDFs. It converts each PDF into high-resolution images, extracts text boxes using OCR, compares the extracted numerical values (e.g., pressure in PSIG and temperature) against the limits defined in a Standard Operating Procedure (SOP) document, and annotates the images to highlight discrepancies.
- OCR-Based Text Extraction: Uses Tesseract OCR to identify and extract text boxes from images.
- Discrepancy Checking: Compares extracted numerical values against predefined limits.
- Annotated Results: Draws color-coded rectangles on images — green for values within limits and red for values exceeding limits.
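The discrepancy check and color coding described above can be sketched roughly as follows. This is a minimal illustration, not the code in `check_discrepancy.py`: the `LIMITS` values, the regular expression, and the `check_text` helper are assumptions made for the example, and a value equal to its limit is treated as within limits.

```python
import re

# Hypothetical limits; in the project these are read from the SOP document.
LIMITS = {"PSIG": 150.0, "F": 200.0}

# Matches a number followed by a unit, e.g. "120 PSIG" or "185.5 F".
VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(PSIG|F)\b", re.IGNORECASE)

GREEN = (0, 255, 0)  # BGR order, as OpenCV expects
RED = (0, 0, 255)


def check_text(text, limits=LIMITS):
    """Return (value, unit, within_limit, color) for each reading found in text."""
    results = []
    for number, unit in VALUE_RE.findall(text):
        value = float(number)
        unit = unit.upper()
        within = value <= limits[unit]
        results.append((value, unit, within, GREEN if within else RED))
    return results
```

The returned color could then be passed to a drawing call such as OpenCV's `cv2.rectangle` when annotating the image.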
- Python 3.12
- Tesseract-OCR: Download and install Tesseract OCR. Make sure it’s in your system PATH or configure the path in your code.
- Poppler: Required for PDF-to-image conversion.
- Ubuntu:
sudo apt-get install poppler-utils
- Windows/Mac: Download Poppler and add to your PATH.
Install the required Python packages via pip:
pip install opencv-python pytesseract pdf2image python-docx
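Before running the pipeline, it can help to confirm that the external tools are reachable. The snippet below is a small sketch using only the standard library; it assumes the executables are named `tesseract` and `pdftoppm` (the latter ships with Poppler), and `find_tools` is a hypothetical helper, not part of this project.

```python
import shutil

# Executables the pipeline relies on: Tesseract for OCR, and pdftoppm,
# which is part of Poppler and is used for PDF-to-image conversion.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If Tesseract is installed but not on your PATH, pytesseract lets you point to it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.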
diagram-analysis/
├── data/
│ ├── p&id/ # Directory for PDF diagram files
│ └── sop/ # Directory for SOP DOCX files
├── check_discrepancy.py
├── config.py
├── extract_text_box.py
├── load_sop.py
├── main.py
├── pdf_to_image.py
└── README.md
- Prepare Input Files:
  - Place your PDF diagram file(s) in the `data/p&id/` directory.
  - Place your SOP DOCX file(s) in the `data/sop/` directory.
- Run the Program:
  Execute the main script by running:
python3 main.py "./data/p&id/diagram.pdf" "./data/sop/sop.docx"
- Output:
  - Annotated images are saved in an output directory named after the PDF file name.
  - The console also displays the extracted limits and the comparison results for each processed image.
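As a rough illustration of the output layout, the sketch below derives the output directory from the PDF file name and renders each page into it with `pdf2image`. The `page_<n>.png` naming scheme, the 300 DPI default, and the choice to create the directory relative to the current working directory are assumptions for the example; the actual `pdf_to_image.py` may differ.

```python
from pathlib import Path


def output_dir_for(pdf_path):
    """Directory named after the PDF file, e.g. 'diagram.pdf' -> 'diagram'."""
    return Path(Path(pdf_path).stem)


def page_image_path(pdf_path, page_number):
    """Hypothetical naming scheme for the per-page images."""
    return output_dir_for(pdf_path) / f"page_{page_number}.png"


def convert_pdf(pdf_path, dpi=300):
    """Render each page to a PNG in the output directory and return the paths."""
    # Imported here so the path helpers above work without Poppler installed.
    from pdf2image import convert_from_path

    out_dir = output_dir_for(pdf_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        path = page_image_path(pdf_path, number)
        page.save(path)  # PIL infers the PNG format from the extension
        paths.append(path)
    return paths
```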
- OCR Settings: Modify `first_config` and `second_config` in `config.py` to adjust Tesseract OCR performance based on your diagram quality.
- Merging Parameters: Change parameters such as `MIN_BLOCK_DISTANCE` and `BOX_PADDING` in `config.py` to fine-tune how text boxes are merged.
- Image Scaling: Change the `SCALE_FACTOR` parameter in `config.py` to suit different image resolutions.
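To make the role of these settings more concrete, here is a sketch of what such a configuration and a simple merging step might look like. All values are placeholders rather than the project's defaults, and `merge_boxes` is a hypothetical helper written for illustration; the merging logic in `extract_text_box.py` may work differently. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels.

```python
# Placeholder values; the real config.py may differ.
first_config = "--oem 3 --psm 11"  # Tesseract: sparse text
second_config = "--oem 3 --psm 6"  # Tesseract: single uniform block of text
MIN_BLOCK_DISTANCE = 20  # largest gap (pixels) between boxes that get merged
BOX_PADDING = 5          # pixels added around each merged box
SCALE_FACTOR = 2.0       # image upscaling factor applied before OCR


def gap(a, b):
    """Largest axis-aligned gap between two boxes (0 if they overlap)."""
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return max(dx, dy)


def merge_boxes(boxes, min_distance=MIN_BLOCK_DISTANCE, padding=BOX_PADDING):
    """Repeatedly merge boxes closer than min_distance, then pad the results."""
    merged = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if gap(a, b) <= min_distance:
                    # Replace the pair with its bounding box and start over.
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    # Pad each box, clamping the top-left corner to the image origin.
    return [(max(0, x1 - padding), max(0, y1 - padding), x2 + padding, y2 + padding)
            for x1, y1, x2, y2 in merged]
```

In this sketch, a larger `MIN_BLOCK_DISTANCE` groups more distant text fragments into one box, while `BOX_PADDING` only affects the size of the final rectangle.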
- Better Layout Analysis: Components are currently located by grouping Tesseract's text boxes; this could be improved by training a YOLO model.
- Graph Construction: The idea is to connect modules based on horizontal/vertical alignment, once component recognition quality is high enough.
- Limit Mapping: The mapping logic can be improved once higher component recognition quality is achieved.
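One possible shape for the alignment-based graph construction is sketched below. It is only an illustration of the idea, not planned code: `alignment_edges`, the pixel `tolerance`, and the use of box centers are all assumptions, and boxes are taken to be `(x1, y1, x2, y2)` tuples.

```python
def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def alignment_edges(boxes, tolerance=10):
    """Return (i, j, direction) for box pairs whose centers line up.

    Centers sharing a column (similar x) give a "vertical" edge;
    centers sharing a row (similar y) give a "horizontal" edge.
    """
    edges = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            (xi, yi), (xj, yj) = center(boxes[i]), center(boxes[j])
            if abs(xi - xj) <= tolerance:
                edges.append((i, j, "vertical"))
            elif abs(yi - yj) <= tolerance:
                edges.append((i, j, "horizontal"))
    return edges
```

A real implementation would likely also need to check that a line actually runs between the two components, which alignment alone cannot confirm.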