An advanced document processing system for commercial real estate underwriting, capable of extracting and analyzing information from various document types including rent rolls, P&L statements, operating statements, and lease documents.
- Multi-Format Document Support
  - PDF (with OCR capabilities)
  - Excel spreadsheets
  - Word documents
- Specialized Document Extractors
  - Rent Roll Extractor
    - Tenant information
    - Lease terms
    - Occupancy data
    - Rent analysis
  - P&L Statement Extractor
    - Revenue items
    - Expense categories
    - NOI calculations
    - Historical comparisons
  - Operating Statement Extractor
    - Combined financial metrics
    - Budget variance analysis
    - Period comparisons
  - Lease Document Extractor
    - Key lease terms
    - Financial obligations
    - Special provisions
    - Key dates
- Advanced Analysis
  - Confidence scoring for extracted data
  - Automated data validation
  - Risk factor identification
  - Financial metrics calculation
- API Features
  - Document upload and processing
  - Extraction results retrieval
  - Status monitoring
  - Statistical analysis
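To make two of the financial metrics named above concrete, here is a minimal sketch of net operating income (total revenue minus total operating expenses) and budget variance. The function names and the line-item keys are illustrative assumptions, not code taken from the project's extractors.

```python
from typing import Dict, Mapping, Optional


def net_operating_income(revenue: Mapping[str, float],
                         expenses: Mapping[str, float]) -> float:
    """NOI = total revenue - total operating expenses."""
    return sum(revenue.values()) - sum(expenses.values())


def budget_variance(actual: float, budget: float) -> Dict[str, Optional[float]]:
    """Return the absolute and percentage difference of actual vs. budget."""
    difference = actual - budget
    # A zero budget has no meaningful percentage variance.
    percent = difference * 100.0 / budget if budget else None
    return {"difference": difference, "percent": percent}


# Hypothetical line items, for illustration only.
noi = net_operating_income(
    revenue={"base_rent": 100000.0, "parking": 5000.0},
    expenses={"taxes": 20000.0, "insurance": 5000.0},
)
variance = budget_variance(actual=150.0, budget=100.0)
```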
Prerequisites:
- Python 3.9+
- MongoDB 4.4+
- Tesseract OCR
- Poppler (for PDF processing)
- Azure OpenAI API access
- Clone the repository:
    git clone https://github.com/yourusername/ai-underwriting.git
    cd ai-underwriting
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate   # Linux/Mac
    # or
    venv\Scripts\activate      # Windows
- Install dependencies:
    pip install -r requirements.txt
- Install Tesseract OCR:
  - Windows: Download and install from GitHub
  - Mac:
      brew install tesseract
  - Linux:
      sudo apt-get install tesseract-ocr
- Install Poppler:
  - Windows: Download from Poppler Releases
  - Mac:
      brew install poppler
  - Linux:
      sudo apt-get install poppler-utils
- Configure environment variables:
    cp backend/.env.example backend/.env
  Edit backend/.env with your settings.
Key environment variables:
# MongoDB Configuration
MONGODB_URL=mongodb://localhost:27017
MONGODB_DB_NAME=ai_underwriting
# OCR Configuration
POPPLER_PATH=/path/to/poppler
TESSERACT_PATH=/path/to/tesseract
# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your-endpoint
AZURE_OPENAI_API_KEY=your-key
AZURE_OPENAI_DEPLOYMENT_NAME=your-deployment
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=your-embedding-deployment
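As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment into a small dataclass and fails early if an Azure OpenAI value is missing. The `Settings` class and `load_settings` function are hypothetical and are not taken from `backend/config/settings.py`. Treating the OCR paths as optional and using the sample MongoDB values as defaults are also assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: the Azure OpenAI values have no sensible default and must be set.
REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    mongodb_url: str
    mongodb_db_name: str
    poppler_path: Optional[str]
    tesseract_path: Optional[str]
    azure_openai_endpoint: str
    azure_openai_api_key: str
    azure_openai_deployment_name: str
    azure_openai_embedding_deployment_name: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        mongodb_url=env.get("MONGODB_URL", "mongodb://localhost:27017"),
        mongodb_db_name=env.get("MONGODB_DB_NAME", "ai_underwriting"),
        poppler_path=env.get("POPPLER_PATH"),      # None: rely on the system PATH
        tesseract_path=env.get("TESSERACT_PATH"),  # None: rely on the system PATH
        azure_openai_endpoint=env["AZURE_OPENAI_ENDPOINT"],
        azure_openai_api_key=env["AZURE_OPENAI_API_KEY"],
        azure_openai_deployment_name=env["AZURE_OPENAI_DEPLOYMENT_NAME"],
        azure_openai_embedding_deployment_name=env[
            "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"
        ],
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the loader easy to exercise in tests.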
- Start the server:
    cd backend
    uvicorn main:app --reload
- Access the API documentation:
    http://localhost:8000/api/v1/docs
- POST /api/v1/documents/upload
  - Upload and process a document
  - Supports PDF, DOCX, XLSX
- GET /api/v1/documents/{document_id}/status
  - Check processing status
  - Get confidence scores
- GET /api/v1/documents/{document_id}/content
  - Get extracted content
  - Access all extractions
- GET /api/v1/documents/{document_id}/extraction/{extractor_type}
  - Get specific extraction results
  - Filter by extractor type
- GET /api/v1/documents/types
  - List supported document types
  - Get extractor descriptions
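The per-document GET endpoints above can be called from any HTTP client. The following is a minimal standard-library sketch, assuming the server runs at the local address used in this README and returns JSON. The helper names `document_url` and `get_json` are hypothetical, and the response fields are not documented here, so the sketch does not interpret them.

```python
import json
from typing import Any, Optional
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000/api/v1"  # local development server


def document_url(document_id: str, resource: str,
                 extractor_type: Optional[str] = None,
                 base_url: str = BASE_URL) -> str:
    """Build the URL for the status, content, or extraction endpoint."""
    if resource not in ("status", "content", "extraction"):
        raise ValueError(f"unknown resource: {resource!r}")
    url = f"{base_url}/documents/{quote(document_id, safe='')}/{resource}"
    if resource == "extraction":
        if not extractor_type:
            raise ValueError("extractor_type is required for extraction results")
        url += "/" + quote(extractor_type, safe="")
    return url


def get_json(url: str, timeout: float = 30.0) -> Any:
    """GET a URL and decode the body (assumes the API responds with JSON)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage against a running server (requires a real document_id):
#   status = get_json(document_url(document_id, "status"))
#   types = get_json(f"{BASE_URL}/documents/types")
```

The valid `extractor_type` values are not listed in this README; `GET /api/v1/documents/types` is the place to look them up.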
backend/
├── api/
│   └── documents.py
├── services/
│   ├── extractors/
│   │   ├── base.py
│   │   ├── rent_roll.py
│   │   ├── pl_statement.py
│   │   ├── operating_statement.py
│   │   └── lease.py
│   └── ocr.py
├── db/
│   └── mongodb.py
├── config/
│   └── settings.py
└── main.py
- Create a new extractor class in services/extractors/
- Inherit from BaseExtractor
- Implement required methods:
  - can_handle()
  - extract()
  - validate()
Example:
    from typing import Any, Dict

    from .base import BaseExtractor


    class NewDocumentExtractor(BaseExtractor):
        def can_handle(self, content: str, filename: str) -> bool:
            # Implement document type detection
            raise NotImplementedError

        def extract(self, content: str) -> Dict[str, Any]:
            # Implement data extraction
            raise NotImplementedError

        def validate(self) -> bool:
            # Implement validation rules
            raise NotImplementedError
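For a filled-in version of the template above, here is a toy extractor that pulls a tenant name and monthly rent out of plain text. It is only a sketch: the `BaseExtractor` defined here is a stand-in so the example runs on its own, and the real class in `services/extractors/base.py` may require more. `TenantSummaryExtractor`, its regular expressions, and the choice to keep results on `self.data` (so that the argument-free `validate()` has something to check) are all assumptions.

```python
import re
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseExtractor(ABC):
    """Stand-in for services/extractors/base.py; the real class may differ."""

    @abstractmethod
    def can_handle(self, content: str, filename: str) -> bool: ...

    @abstractmethod
    def extract(self, content: str) -> Dict[str, Any]: ...

    @abstractmethod
    def validate(self) -> bool: ...


class TenantSummaryExtractor(BaseExtractor):
    """Toy extractor: reads 'Tenant:' and 'Monthly Rent:' lines."""

    TENANT_PATTERN = re.compile(r"Tenant:\s*(.+)", re.IGNORECASE)
    RENT_PATTERN = re.compile(r"Monthly Rent:\s*\$?([\d,]+(?:\.\d+)?)", re.IGNORECASE)

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def can_handle(self, content: str, filename: str) -> bool:
        # Accept on a filename hint, or when both labels appear in the text.
        name_hint = "tenant" in filename.lower()
        has_labels = bool(
            self.TENANT_PATTERN.search(content) and self.RENT_PATTERN.search(content)
        )
        return name_hint or has_labels

    def extract(self, content: str) -> Dict[str, Any]:
        tenant = self.TENANT_PATTERN.search(content)
        rent = self.RENT_PATTERN.search(content)
        self.data = {
            "tenant": tenant.group(1).strip() if tenant else None,
            "monthly_rent": float(rent.group(1).replace(",", "")) if rent else None,
        }
        return self.data

    def validate(self) -> bool:
        # Valid only if a tenant was found and the rent is a positive number.
        rent = self.data.get("monthly_rent")
        return bool(self.data.get("tenant")) and rent is not None and rent > 0


sample = "Tenant: Acme Corp\nMonthly Rent: $12,500.00\n"
extractor = TenantSummaryExtractor()
if extractor.can_handle(sample, "summary.txt"):
    result = extractor.extract(sample)
    is_valid = extractor.validate()
```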
Run tests:
pytest
- Fork the repository
- Create a feature branch
- Commit changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.