feat: (experimental) introduce new document format #21

Merged
merged 43 commits into from
Oct 1, 2024

Conversation

cau-git
Contributor

@cau-git cau-git commented Sep 17, 2024

No description provided.

@cau-git cau-git force-pushed the cau/new-format-dev branch 2 times, most recently from 334e1b7 to 5df614b, on September 17, 2024 13:23
@cau-git cau-git changed the title from "Draft new docling document format, pydantic model and tests" to "feat: Draft new docling document format, pydantic model and tests" on Sep 17, 2024
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
cau-git and others added 4 commits September 27, 2024 15:17
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
)
from docling_core.types.experimental.labels import DocItemLabel, GroupLabel

def test_docitems():
Contributor

@cau-git I added a unit test so that we can detect when new DocItem subclasses appear and check that their serialisation has been added.
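A minimal sketch of the kind of check this comment describes, assuming the test compares the live subclasses of a base class against a list of types whose serialisation is covered. `DocItem`, `TextItem`, `PictureItem` and the `SERIALISABLE` tuple below are local stand-ins for illustration, not the definitions from docling_core, and the actual test in the PR may work differently.

```python
class DocItem:
    """Stand-in base class; the real one lives in docling_core."""


class TextItem(DocItem):
    pass


class PictureItem(DocItem):
    pass


# Hypothetical registry of the subclasses whose serialisation is covered.
SERIALISABLE = (TextItem, PictureItem)


def all_subclasses(cls):
    """Return every direct and indirect subclass of cls."""
    found = set()
    for sub in cls.__subclasses__():
        found.add(sub)
        found |= all_subclasses(sub)
    return found


def unregistered_subclasses(base, registry):
    """Names of subclasses of base that are missing from registry, sorted."""
    return sorted(c.__name__ for c in all_subclasses(base) - set(registry))


def test_docitems():
    # Fails as soon as a DocItem subclass is added without being registered.
    assert unregistered_subclasses(DocItem, SERIALISABLE) == []
```

With this shape, adding a new subclass without updating the registry makes the test fail and names the missing class.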


class GroupItem(NodeItem): # Container type, can't be a leaf node
    """GroupItem."""

Contributor

Add a provenance field to GroupItem or NodeItem to support provenance for entire lists or slides (which are collections of elements).
For lists, bounding boxes of the individual items may be impossible or hard to obtain, while the bounding box of the entire list is readily available. In any case, we would like to keep the bounding boxes of lists, so Group should support provenance.
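One way this suggestion could look, sketched with standard-library dataclasses rather than the pydantic models used in the PR. The field name `prov`, the shape of `ProvenanceItem` (a page number plus one bounding box), and the choice of a single optional record instead of a list are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProvenanceItem:
    """Hypothetical provenance record: page number plus a bounding box."""
    page_no: int
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass
class NodeItem:
    self_ref: str
    children: List[str] = field(default_factory=list)


@dataclass
class GroupItem(NodeItem):
    name: str = "group"
    # Optional, so groups without layout information remain valid.
    prov: Optional[ProvenanceItem] = None


# A list whose items have no boxes of their own, but whose overall box is known.
list_group = GroupItem(
    self_ref="#/groups/0",
    children=["#/texts/4", "#/texts/5"],
    name="list",
    prov=ProvenanceItem(page_no=1, bbox=(72.0, 300.0, 540.0, 420.0)),
)
```

Keeping the field optional means existing groups that carry no layout information would not need to change.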

PeterStaar-IBM
PeterStaar-IBM previously approved these changes Sep 30, 2024
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

Let's merge this!

@cau-git cau-git dismissed PeterStaar-IBM’s stale review September 30, 2024 12:55

Still missing work.

Comment on lines 21 to 42
  name: "_root_"
  self_ref: "#/body"
  parent: null # Only root elements have no parent.
  children: # only the first-level children appear here, as references (RefItem)
    - $ref: "/texts/1"
    - $ref: "/pictures/0"
    - $ref: "/texts/3"
    - $ref: "/tables/0"

# All groups of items nested deeper in body or furniture roots, type List[GroupItem]
groups: [] # The parent + children relations capture nesting and reading-order.

# All elements that have a text-string representation, type TextItem.
# This is a flat list of all elements without implied order.
texts:
  - orig: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    text: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    self_ref: "#/texts/0"
    label: "page_header"
    parent:
      $ref: "#/furniture"
    children: []
Collaborator

@cau-git I see something that may be an inconsistency:

  • the parent has the # prefix (fragment separator for URI fragment): $ref: "#/furniture", while
  • the children do not have this prefix, e.g. - $ref: "/texts/1"

Can you clarify?

Contributor Author

Well spotted. The children must have it too.

Contributor

@vagenas Can you do the update?

Contributor Author

The only fix needed is to correct this file manually, since it was hand-crafted.
We should also consider validating the ref property to ensure it always follows the #{}/{} pattern.
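A rough sketch of such a validation, using only the standard library. The regular expression is an assumption about what the pattern should accept (`#/<collection>`, optionally followed by `/<index>`), and the `RefItem` dataclass with its `cref` field is a stand-in rather than the pydantic model from the PR.

```python
import re
from dataclasses import dataclass

# Assumed pattern: "#/<collection>" optionally followed by "/<index>".
_REF_PATTERN = re.compile(r"^#/[A-Za-z_]+(/\d+)?$")


@dataclass(frozen=True)
class RefItem:
    """Stand-in for the reference type; rejects malformed values on creation."""
    cref: str

    def __post_init__(self):
        if not _REF_PATTERN.match(self.cref):
            raise ValueError(f"invalid reference: {self.cref!r}")


def is_valid_ref(value: str) -> bool:
    """True if value can be used to build a RefItem."""
    try:
        RefItem(value)
    except ValueError:
        return False
    return True
```

Under these assumptions, `"#/texts/1"` and `"#/furniture"` are accepted, while the prefix-less `"/texts/1"` from the hand-crafted file is rejected.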

Collaborator

@PeterStaar-IBM @cau-git

  1. validation added (which, on its own, causes the tests to fail as expected because of the invalid data), and
  2. test data fixed
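When correcting hand-crafted test data like this, a small scan for offending references can help. The helper below is a hypothetical illustration, not part of the PR: it assumes the document has been loaded into plain dicts and lists, and it reports every `$ref` value that lacks the `#/` prefix.

```python
def find_bad_refs(node, path=""):
    """Yield (path, value) for every "$ref" string that does not start with "#/"."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}/{key}"
            if key == "$ref" and isinstance(value, str):
                if not value.startswith("#/"):
                    yield here, value
            else:
                yield from find_bad_refs(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_bad_refs(item, f"{path}/{index}")


# Tiny example document mirroring the inconsistency discussed in this thread.
doc = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "/texts/1"}, {"$ref": "#/pictures/0"}],
    },
    "texts": [{"self_ref": "#/texts/0", "parent": {"$ref": "#/furniture"}}],
}

bad = list(find_bad_refs(doc))
```

Here `bad` contains a single entry pointing at the first child of `body`, the reference that is missing its prefix.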

fig_caption = doc.add_text(
    label=DocItemLabel.CAPTION, text="This is the caption of figure 1."
)
fig_item = doc.add_picture(data=BasePictureData(), caption=fig_caption)
Collaborator

@cau-git fig_item here is not used.

@PeterStaar-IBM PeterStaar-IBM changed the title from "feat: Draft new docling document format, pydantic model and tests" to "feat: (experimental) introduce new document format" on Oct 1, 2024
@cau-git cau-git merged commit 688789e into main Oct 1, 2024
5 checks passed
@cau-git cau-git deleted the cau/new-format-dev branch October 1, 2024 07:41
7 participants