feat: (experimental) introduce new document format #21
334e1b7
to
5df614b
Compare
Signed-off-by: Christoph Auer <[email protected]>
5df614b
to
a90cc19
Compare
Signed-off-by: Christoph Auer <[email protected]>
7cd81c0
to
9264b1b
Compare
Signed-off-by: Christoph Auer <[email protected]>
9264b1b
to
7dcbde7
Compare
Signed-off-by: Christoph Auer <[email protected]>
e0073e8
to
0685709
Compare
Signed-off-by: Christoph Auer <[email protected]>
425858a
to
f791f74
Compare
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
1ae674a
to
0a1e6ce
Compare
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
)
from docling_core.types.experimental.labels import DocItemLabel, GroupLabel


def test_docitems():
@cau-git I added a unit test so that we can detect when new classes appear or when we add serialisation for a DocItem subclass.
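One way such a check could work is sketched below. This is a minimal illustration, not the test from the PR: the item classes, the `COVERED` set, and the helper names are hypothetical stand-ins, and the real docling_core models may look different.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the real item classes (not the docling_core models).
@dataclass
class DocItem:
    label: str


@dataclass
class TextItem(DocItem):
    text: str = ""


@dataclass
class PictureItem(DocItem):
    pass


def all_subclasses(cls):
    """Collect direct and indirect subclasses of cls."""
    found = set()
    for sub in cls.__subclasses__():
        found.add(sub)
        found |= all_subclasses(sub)
    return found


# Names of subclasses that the (assumed) serialisation test already covers.
COVERED = {"TextItem", "PictureItem"}


def uncovered_docitem_subclasses():
    """Return the names of DocItem subclasses missing from COVERED."""
    return sorted(
        c.__name__ for c in all_subclasses(DocItem) if c.__name__ not in COVERED
    )
```

A test can then assert that `uncovered_docitem_subclasses()` is empty, so adding a new subclass without updating the covered set makes the test fail.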
Signed-off-by: Christoph Auer <[email protected]>
…to cau/new-format-dev
Signed-off-by: Christoph Auer <[email protected]>
class GroupItem(NodeItem):  # Container type, can't be a leaf node
    """GroupItem."""
Add a provenance field to GroupItem or NodeItem to support provenance of entire lists or slides (which are collections of elements).
For lists, individual bounding boxes may be impossible or hard to obtain, while the bounding box of the entire list is readily available. In any case, we would like to keep the bounding boxes of lists, so Group should support provenance.
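The suggestion could look roughly like the sketch below. It uses plain dataclasses rather than the project's actual models, and the `ProvenanceItem` fields, the `prov` attribute name, and the bounding-box layout are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ProvenanceItem:
    """Hypothetical provenance record: a page number and a bounding box."""

    page_no: int
    bbox: Tuple[float, float, float, float]  # assumed (left, top, right, bottom)


@dataclass
class NodeItem:
    """Simplified stand-in for the tree node base class."""

    self_ref: str
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


@dataclass
class GroupItem(NodeItem):
    """Container node that can also carry provenance for the whole group."""

    name: str = "group"
    prov: List[ProvenanceItem] = field(default_factory=list)


# A list group whose items have no individual boxes, but whose overall box is known.
list_group = GroupItem(
    self_ref="#/groups/0",
    parent="#/body",
    children=["#/texts/4", "#/texts/5"],
    name="list",
    prov=[ProvenanceItem(page_no=1, bbox=(72.0, 200.0, 540.0, 320.0))],
)
```

Keeping `prov` optional (an empty list by default) would leave groups without known bounding boxes unaffected.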
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
…endency Signed-off-by: Cesar Berrospi Ramis <[email protected]>
Let's merge this!
Signed-off-by: Cesar Berrospi Ramis <[email protected]>
  name: "_root_"
  self_ref: "#/body"
  parent: null  # Only root elements have no parent.
  children:  # only the first-level children appear here, as references (RefItem)
    - $ref: "/texts/1"
    - $ref: "/pictures/0"
    - $ref: "/texts/3"
    - $ref: "/tables/0"

# All groups of items nested deeper in body or furniture roots, type List[GroupItem]
groups: []  # The parent + children relations capture nesting and reading-order.

# All elements that have a text-string representation, type TextItem.
# This is a flat list of all elements without implied order.
texts:
  - orig: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    text: "arXiv:2206.01062v1 [cs.CV] 2 Jun 2022"
    self_ref: "#/texts/0"
    label: "page_header"
    parent:
      $ref: "#/furniture"
    children: []
@cau-git I see something that may be an inconsistency:
- the parent has the # prefix (fragment separator for URI fragment): $ref: "#/furniture", while
- the children do not have this prefix, e.g. - $ref: "/texts/1"
Can you clarify?
Well spotted. The children must have it too.
@vagenas Can you do the update?
The only fix needed is to manually correct this file, since it was hand-crafted.
We should also consider validating the ref property to ensure it always follows the #{}/{} pattern.
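A validation of this kind might be sketched as follows. The regular expression is an assumption inferred from the references visible in the test data ("#/body", "#/furniture", "#/texts/0"); the pattern actually adopted in the PR may differ, and `validate_ref` is a hypothetical helper name.

```python
import re

# Assumed shape: '#/<collection>' optionally followed by '/<index>'.
REF_PATTERN = re.compile(r"#/[a-z_]+(/\d+)?")


def validate_ref(ref: str) -> str:
    """Return ref unchanged if it matches the assumed pattern, else raise ValueError."""
    if not REF_PATTERN.fullmatch(ref):
        raise ValueError(f"invalid reference: {ref!r}")
    return ref
```

With this check, the hand-crafted value "/texts/1" is rejected because the leading # is missing, while "#/texts/1" passes.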
- validation added (which, on its own, causes the tests to fail as expected because of the invalid data) and
- test data fixed
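To illustrate why consistent references matter, here is a small sketch of how the parent and children references in a document like the one discussed in this thread could be followed in reading order. It works on plain dictionaries, and both `resolve_ref` and `iter_reading_order` are hypothetical helpers, not functions from docling_core.

```python
def resolve_ref(doc: dict, ref: str):
    """Resolve a '#/key' or '#/key/index' reference against a document dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"reference must start with '#/': {ref!r}")
    node = doc
    for part in ref[2:].split("/"):
        # List segments are numeric indices; dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def iter_reading_order(doc: dict, ref: str = "#/body"):
    """Yield child references depth-first, following the order of 'children'."""
    node = resolve_ref(doc, ref)
    for child in node.get("children", []):
        yield child["$ref"]
        yield from iter_reading_order(doc, child["$ref"])


# Tiny hand-made document with corrected '#'-prefixed references.
doc = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/1"}, {"$ref": "#/groups/0"}],
    },
    "groups": [{"self_ref": "#/groups/0", "children": [{"$ref": "#/texts/2"}]}],
    "texts": [
        {"self_ref": "#/texts/0", "children": []},
        {"self_ref": "#/texts/1", "children": []},
        {"self_ref": "#/texts/2", "children": []},
    ],
}

reading_order = list(iter_reading_order(doc))
```

A reference without the # prefix raises in `resolve_ref`, which is the kind of failure the added validation catches earlier.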
fig_caption = doc.add_text(
    label=DocItemLabel.CAPTION, text="This is the caption of figure 1."
)
fig_item = doc.add_picture(data=BasePictureData(), caption=fig_caption)
@cau-git fig_item here is not used.
Signed-off-by: Panos Vagenas <[email protected]>
No description provided.