How is the selected Classification indicated? #205

mophilly · 2025-01-18T18:43:30Z

I have set up a set of classifications; an abbreviated example is below.
The result returned includes the document title and confidence.
In my test case, there are four major document layouts. Consider the document type "revenue". The source documents have different titles. For example, a revenue document may be titled "Monthly Income", "Revenue August 2024", "Rent Statement", "Gross Collections", etc. In practice there could be dozens of unique titles in the source document set. However, all of them need to be classified as "revenue statements".

What I expected in the return value of extractor.classify was an indication of the matching classification, such as the classification name or description, along with other interesting information.

What I get is the actual document title, which is fine and likely useful. I don't see an indication for the matching classification.

How does ExtractThinker indicate which classification was selected?

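import os

from pydantic import Field
from extract_thinker import Classification, Contract, DocumentLoaderTesseract, Extractor

# DocumentType, TitleDocumentContract and LeaseDocumentContract are defined
# elsewhere and omitted from this abbreviated example.

# Level 1 primary document types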
class RevenueStatementContract0101(Contract):
    document_type: str = DocumentType.REVENUE.value   # 'Revenue_Statement'
    statement_type: str = Field(
        description='a part of the document name or title, often found in the upper left corner of the document',
        examples=["Revenue Statement"],
    )
    statement_payor: str
    statement_payee: str
    check_number: int
    check_total: float
    check_date: str

class DocumentClassifier:
    def classify_document(self, source_path):
        # Initialize the extractor and load the document loader
        tesseract_path = os.getenv("TESSERACT_PATH")
        extractor = Extractor()
        extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))
        extractor.load_llm("gpt-4o-mini")

        # Define classifications
        classifications = [
            Classification(
                name=DocumentType.TITLE.value,
                description="Title document T0101",
                contract=TitleDocumentContract,
                extractor=extractor,
                image="../models/images/document class T0101.png",
            ),
            Classification(
                name=DocumentType.LEASE.value,
                description="Lease document L0101",
                contract=LeaseDocumentContract,
                extractor=extractor,
                image="../models/images/document class L0101.png",
            ),
            Classification(
                name=DocumentType.REVENUE.value,
                description="Revenue statement C0101",
                contract=RevenueStatementContract0101,
                extractor=extractor,
                image="../models/images/document class C0101.png",
            ),
        ]

        # Classify the document directly using the extractor
        result = extractor.classify(
            source_path,    # Can be a file path or IO stream
            classifications,
            vision=False    # Set to True for image-based classification
        )
        
        return result


enoch3712 · 2025-01-19T09:37:56Z

Hello @mophilly!

Yes, this is a great question, and the answer is simple: the result is kept bare-bones on purpose.

You are expected to match the returned name against your list of classifications afterwards, in this case "DocumentType.TITLE.value", so the name works like an ID.

I think this should be changed, though; at least for now, the result should include the entire selected classification.

I will add this in the next release.
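
Until then, the name-matching step described above could look roughly like the sketch below. It is only an illustration: `StubClassification` and `StubResult` are hypothetical stand-ins for ExtractThinker's `Classification` and the object returned by `extractor.classify`, so that the snippet runs on its own. It assumes the result exposes a `name` attribute holding the classification name, plus a `confidence` value. The `Title_Document` and `Lease_Document` names are made up; only `Revenue_Statement` comes from the example above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StubClassification:
    """Hypothetical stand-in for extract_thinker's Classification."""
    name: str
    description: str


@dataclass
class StubResult:
    """Hypothetical stand-in for the value returned by extractor.classify()."""
    name: str        # assumed to hold the selected classification's name
    confidence: int  # assumed field; the exact type may differ


def find_selected(
    result: StubResult, classifications: Sequence[StubClassification]
) -> Optional[StubClassification]:
    """Return the classification whose name equals result.name, or None."""
    by_name = {c.name: c for c in classifications}
    return by_name.get(result.name)


classifications = [
    StubClassification("Title_Document", "Title document T0101"),
    StubClassification("Lease_Document", "Lease document L0101"),
    StubClassification("Revenue_Statement", "Revenue statement C0101"),
]

# Pretend the classifier picked the revenue statement.
result = StubResult(name="Revenue_Statement", confidence=9)

selected = find_selected(result, classifications)
print(selected.description if selected else "no matching classification")
```

With the real library, the same lookup would be run over the `classifications` list passed to `extractor.classify`. If the returned name turns out to be the document's own title rather than the classification name, the lookup returns `None`, which makes the mismatch easy to spot.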

mophilly · 2025-01-19T16:14:01Z

Thank you! If it doesn’t complicate matters, I am happy to test a modification. I understand that such a mod can be overwritten by the next release.

enoch3712 self-assigned this Jan 19, 2025

enoch3712 added the enhancement New feature or request label Jan 19, 2025
