I have set up a set of classifications; an abbreviated example is below.
The result returned includes the document title and confidence.
In my test case, there are four major document layouts. Consider the document type "revenue". The source documents have different titles. For example, a revenue document may be titled "Monthly Income", "Revenue August 2024", "Rent Statement", "Gross Collections", etc. In practice there could be dozens of unique titles in the source document set. However, all of them need to be classified as "revenue statements".
What I expected in the return value of extractor.classify was an indication of the matching classification, such as the classification name or description, along with other interesting information.
What I get is the actual document title, which is fine and likely useful. I don't see an indication for the matching classification.
How does ExtractThinker indicate which classification was selected?
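import os

from pydantic import Field
from extract_thinker import Classification, Contract, DocumentLoaderTesseract, Extractor

# DocumentType, TitleDocumentContract and LeaseDocumentContract are omitted from this abbreviated example.

# Level 1 primary document types
class RevenueStatementContract0101(Contract):
    document_type: str = DocumentType.REVENUE.value  # 'Revenue_Statement'
    statement_type: str = Field(
        description='a part of the document name or title, often found in the upper left corner of the document',
        examples=["Revenue Statement"],
    )
    statement_payor: str
    statement_payee: str
    check_number: int
    check_total: float
    check_date: str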
class DocumentClassifier:
    def classify_document(self, source_path):
        # Initialize the extractor and load the document loader
        tesseract_path = os.getenv("TESSERACT_PATH")
        extractor = Extractor()
        extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))
        extractor.load_llm("gpt-4o-mini")

        # Define classifications
        classifications = [
            Classification(
                name=DocumentType.TITLE.value,
                description="Title document T0101",
                contract=TitleDocumentContract,
                extractor=extractor,
                image="../models/images/document class T0101.png",
            ),
            Classification(
                name=DocumentType.LEASE.value,
                description="Lease document L0101",
                contract=LeaseDocumentContract,
                extractor=extractor,
                image="../models/images/document class L0101.png",
            ),
            Classification(
                name=DocumentType.REVENUE.value,
                description="Revenue statement C0101",
                contract=RevenueStatementContract0101,
                extractor=extractor,
                image="../models/images/document class C0101.png",
            ),
        ]

        # Classify the document directly using the extractor
        result = extractor.classify(
            source_path,  # Can be a file path or IO stream
            classifications,
            vision=False,  # Set to True for image-based classification
        )
        return result
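To make the expectation concrete, here is a minimal sketch of the check I would like to be able to run on the return value. It does not call ExtractThinker at all: `StubResult` is a hypothetical stand-in for whatever `extractor.classify` returns, and the assumption that the result exposes a `name` attribute is mine, not confirmed library behavior. Apart from 'Revenue_Statement', the classification names in the list are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StubResult:
    """Hypothetical stand-in for the object returned by extractor.classify."""
    name: str                         # assumed attribute, not confirmed API
    confidence: Optional[int] = None  # assumed attribute, not confirmed API


def matched_classification(result, classification_names: Iterable[str]) -> Optional[str]:
    """Return the configured classification name that the result refers to,
    or None when the returned name is not one of the configured names
    (for example, when it is the document's own title)."""
    returned = getattr(result, "name", None)
    if returned is None:
        return None
    # Case-insensitive lookup back to the name as it was configured.
    lookup = {n.casefold(): n for n in classification_names}
    return lookup.get(returned.strip().casefold())


# 'Revenue_Statement' comes from the contract above; the other two are illustrative.
names = ["Title_Document", "Lease_Document", "Revenue_Statement"]

# What I expected: the result names one of my classifications.
print(matched_classification(StubResult("Revenue_Statement", 9), names))

# What I seem to get: the document's own title, which matches nothing.
print(matched_classification(StubResult("Gross Collections", 9), names))
```

With a check like this, a `None` would flag exactly the situation described above, where the returned value is a document title rather than one of the configured classification names.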