How is the selected Classification indicated? #205

Open
mophilly opened this issue Jan 18, 2025 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@mophilly

mophilly commented Jan 18, 2025

I have set up a set of classifications; an abbreviated example below.
The result returned includes the document title and confidence.
In my test case, there are four major document layouts. Consider the document type "revenue". The source documents have different titles. For example, a revenue document may be titled "Monthly Income", "Revenue August 2024", "Rent Statement", "Gross Collections", etc. In practice there could be dozens of unique titles in the source document set. However, all of them need to be classified as "revenue statements".

What I expected in the return value of extractor.classify was an indication of the matching classification, such as the classification name or description, along with other interesting information.

What I get is the actual document title, which is fine and likely useful. I don't see an indication for the matching classification.

How does ExtractThinker indicate which classification was selected?

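# Imports assumed for this abbreviated example; DocumentType is the author's own enum (not shown)
import os
from pydantic import Field
from extract_thinker import Classification, Contract, DocumentLoaderTesseract, Extractor

# Level 1 primary document types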
class RevenueStatementContract0101(Contract):
    document_type: str = DocumentType.REVENUE.value   # 'Revenue_Statement'
    statement_type: str = Field(description='a part of the document name or title, often found in the upper left corner of the document',
                                examples=["Revenue Statement"])
    statement_payor: str 
    statement_payee: str
    check_number: int
    check_total: float
    check_date: str

class DocumentClassifier:
    def classify_document(self, source_path):
        # Initialize the extractor and load the document loader
        tesseract_path = os.getenv("TESSERACT_PATH")
        extractor = Extractor()
        extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))
        extractor.load_llm("gpt-4o-mini")

        # Define classifications
        classifications = [
            Classification(
                name=DocumentType.TITLE.value,
                description="Title document T0101",
                contract=TitleDocumentContract,
                extractor=extractor,
                image="../models/images/document class T0101.png",
            ),
            Classification(
                name=DocumentType.LEASE.value,
                description="Lease document L0101",
                contract=LeaseDocumentContract,
                extractor=extractor,
                image="../models/images/document class L0101.png",
            ),
            Classification(
                name=DocumentType.REVENUE.value,
                description="Revenue statement C0101",
                contract=RevenueStatementContract0101,
                extractor=extractor,
                image="../models/images/document class C0101.png",
            ),
        ]

        # Classify the document directly using the extractor
        result = extractor.classify(
            source_path,    # Can be a file path or IO stream
            classifications,
            vision=False    # Set to True for image-based classification
        )
        
        return result
@enoch3712
Owner

Hello @mophilly!

Yes, this is a great question, and the answer is simple: the result is kept bare-bones by design.

For now, you should match the name in the result against the name of your classifications, in this case "DocumentType.TITLE.value", so the name works like an id.

I think this should be changed so that, at least for now, the result includes the entire selected classification.

I will add this in the next release.
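
A minimal sketch of the name-matching workaround described above. It does not import ExtractThinker: `ClassificationStub` and `ResultStub` are hypothetical stand-ins, and the sketch assumes the value returned by `extractor.classify` exposes a `name` attribute equal to the `name` of the selected `Classification`. The names "Title_Document" and "Lease_Document" are illustrative; only 'Revenue_Statement' appears in the original example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ClassificationStub:
    """Hypothetical stand-in for an ExtractThinker Classification."""
    name: str
    description: str


@dataclass
class ResultStub:
    """Hypothetical stand-in for the value returned by extractor.classify."""
    name: str
    confidence: float


def find_selected(
    result: ResultStub, classifications: Sequence[ClassificationStub]
) -> Optional[ClassificationStub]:
    """Return the classification whose name equals result.name, or None."""
    # Treat the name as an id: index the candidates by name, then look it up.
    by_name = {c.name: c for c in classifications}
    return by_name.get(result.name)


classifications = [
    ClassificationStub("Title_Document", "Title document T0101"),
    ClassificationStub("Lease_Document", "Lease document L0101"),
    ClassificationStub("Revenue_Statement", "Revenue statement C0101"),
]

# Assumed shape of a classify() result for a revenue document.
result = ResultStub(name="Revenue_Statement", confidence=9.0)
selected = find_selected(result, classifications)
print(selected.description if selected else "no matching classification")
```

Because the lookup depends only on `name`, each classification in the list needs a unique name for this to work as an id.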

@enoch3712 enoch3712 self-assigned this Jan 19, 2025
@enoch3712 enoch3712 added the enhancement New feature or request label Jan 19, 2025
@mophilly
Author

Thank you! If it doesn’t complicate matters, I am happy to test a modification. I understand that such a mod can be overwritten by the next release.
