Comparing changes

base repository: DS4SD/docling-core
base: v2.13.1
head repository: DS4SD/docling-core
compare: main
Showing 70 changed files with 22,363 additions and 6,129 deletions.
  1. +10 −1 .github/workflows/checks.yml
  2. +122 −0 CHANGELOG.md
  3. +1 −1 README.md
  4. +1 −1 docling_core/transforms/chunker/base.py
  5. +5 −2 docling_core/transforms/chunker/hierarchical_chunker.py
  6. +82 −109 docling_core/transforms/chunker/hybrid_chunker.py
  7. +1 −0 docling_core/types/doc/__init__.py
  8. +261 −45 docling_core/types/doc/base.py
  9. +1,012 −453 docling_core/types/doc/document.py
  10. +136 −0 docling_core/types/doc/labels.py
  11. +21 −98 docling_core/types/doc/tokens.py
  12. +27 −0 docling_core/types/doc/utils.py
  13. +3 −3 docling_core/utils/legacy.py
  14. +470 −5 docs/DoclingDocument.json
  15. +550 −688 poetry.lock
  16. +13 −2 pyproject.toml
  17. +112 −504 test/data/chunker/0_out_chunks.json
  18. +112 −504 test/data/chunker/1_out_chunks.json
  19. +28 −1 test/data/chunker/2a_out_chunks.json
  20. +28 −1 test/data/chunker/2b_out_chunks.json
  21. +15 −1 test/data/chunker/2c_out_chunks.json
  22. +154 −0 test/data/chunker/2d_out_ser_chunks.json
  23. +1 −0 test/data/doc/2206.01062-1.0.0.json
  24. +15,577 −2,958 test/data/doc/2206.01062.yaml
  25. +160 −217 test/data/doc/2206.01062.yaml.dt
  26. +88 −87 test/data/doc/2206.01062.yaml.et
  27. +59 −23 test/data/doc/2206.01062.yaml.html
  28. +74 −55 test/data/doc/2206.01062.yaml.md
  29. +2 −3 test/data/doc/bad_doc.yaml.dt
  30. +1 −1 test/data/doc/bad_doc.yaml.et
  31. +15 −1 test/data/doc/bad_doc.yaml.html
  32. +1 −1 test/data/doc/bad_doc.yaml.md
  33. +32 −18 test/data/doc/constructed_doc.dt
  34. +32 −18 test/data/doc/constructed_doc.dt.gt
  35. +49 −2 test/data/doc/constructed_doc.embedded.html.gt
  36. +606 −55 test/data/doc/constructed_doc.embedded.json.gt
  37. +31 −5 test/data/doc/constructed_doc.embedded.md.gt
  38. +416 −50 test/data/doc/constructed_doc.embedded.yaml.gt
  39. +48 −1 test/data/doc/constructed_doc.placeholder.html.gt
  40. +30 −4 test/data/doc/constructed_doc.placeholder.md.gt
  41. +49 −2 test/data/doc/constructed_doc.referenced.html.gt
  42. +606 −55 test/data/doc/constructed_doc.referenced.json.gt
  43. +30 −4 test/data/doc/constructed_doc.referenced.md.gt
  44. +416 −50 test/data/doc/constructed_doc.referenced.yaml.gt
  45. +32 −18 test/data/doc/constructed_document.yaml.dt
  46. +49 −19 test/data/doc/constructed_document.yaml.et
  47. +49 −2 test/data/doc/constructed_document.yaml.html
  48. +34 −4 test/data/doc/constructed_document.yaml.md
  49. BIN ...structed_images/image_000001_797618e862d279d4e3e92f4b6313175f67e08fc36051dfda092bf63220568703.png
  50. BIN ...structed_images/image_000001_ccb4cbe7039fe17892f3d611cfb71eafff1d4d230b19b10779334cc4b63c98bc.png
  51. BIN ...structed_images/image_000001_f3cc103136423a57975750907ebc1d367e2985ac6338976d4d5a439f50323f4a.png
  52. +4 −10 test/data/doc/dummy_doc.yaml.dt
  53. +1 −1 test/data/doc/dummy_doc.yaml.et
  54. +15 −1 test/data/doc/dummy_doc.yaml.html
  55. +1 −1 test/data/doc/dummy_doc.yaml.md
  56. +5 −0 test/data/docling_document/export/formula_mathml.html
  57. +13 −0 test/data/docling_document/unit/CodeItem.yaml
  58. +1 −0 test/data/docling_document/unit/FloatingItem.yaml
  59. +31 −0 test/data/docling_document/unit/FormItem.yaml
  60. +27 −1 test/data/docling_document/unit/KeyValueItem.yaml
  61. +2 −1 test/data/docling_document/unit/ListItem.yaml
  62. +1 −0 test/data/docling_document/unit/PictureItem.yaml
  63. +1 −0 test/data/docling_document/unit/SectionHeaderItem.yaml
  64. +1 −0 test/data/docling_document/unit/TableItem.yaml
  65. +1 −0 test/data/docling_document/unit/TextItem.yaml
  66. +139 −1 test/data/legacy_doc/doc-export.docling.yaml.gt
  67. +9 −0 test/test_data_gen_flag.py
  68. +376 −19 test/test_docling_doc.py
  69. +21 −6 test/test_hierarchical_chunker.py
  70. +63 −16 test/test_hybrid_chunker.py
11 changes: 10 additions & 1 deletion .github/workflows/checks.yml
Original file line number Diff line number Diff line change
@@ -1,14 +1,23 @@
 on:
   workflow_call:

+env:
+  HF_HUB_DOWNLOAD_TIMEOUT: "60"
+  HF_HUB_ETAG_TIMEOUT: "60"
+
 jobs:
   run-checks:
     runs-on: ubuntu-latest
     strategy:
       matrix:
         python-version: ['3.9', '3.10', '3.11', '3.12', '3.13']
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
+      - name: Cache Hugging Face models
+        uses: actions/cache@v4
+        with:
+          path: ~/.cache/huggingface
+          key: huggingface-cache-py${{ matrix.python-version }}
       - uses: ./.github/actions/setup-poetry
         with:
           python-version: ${{ matrix.python-version }}
122 changes: 122 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,125 @@
## [v2.21.2](https://github.com/DS4SD/docling-core/releases/tag/v2.21.2) - 2025-03-06

### Fix

* Suppress warning for missing fallback case ([#184](https://github.com/DS4SD/docling-core/issues/184)) ([`ccde54a`](https://github.com/DS4SD/docling-core/commit/ccde54aa2926281644e5c1f0c96b79db18f6bbc7))
* **doctags:** Fix code export ([#181](https://github.com/DS4SD/docling-core/issues/181)) ([`53f6d09`](https://github.com/DS4SD/docling-core/commit/53f6d099b05f295fea546010dc2faadc5b2c7ee2))
* **markdown:** Fix escaping in case of nesting ([#180](https://github.com/DS4SD/docling-core/issues/180)) ([`834db4b`](https://github.com/DS4SD/docling-core/commit/834db4bc664e010e10c4503e60be576ed7819e2c))
* **HybridChunker:** Remove `max_length` from tokenization ([#178](https://github.com/DS4SD/docling-core/issues/178)) ([`419252c`](https://github.com/DS4SD/docling-core/commit/419252c39b856c45e50326b4eff3c4a183ac8437))

## [v2.21.1](https://github.com/DS4SD/docling-core/releases/tag/v2.21.1) - 2025-02-28

### Fix

* **markdown:** Fix handling of ordered lists ([#175](https://github.com/DS4SD/docling-core/issues/175)) ([`349f7da`](https://github.com/DS4SD/docling-core/commit/349f7daa0c20c861134cffb28177eaaf48b27ae5))

## [v2.21.0](https://github.com/DS4SD/docling-core/releases/tag/v2.21.0) - 2025-02-27

### Feature

* Add inline groups, revamp Markdown export incl. list groups ([#156](https://github.com/DS4SD/docling-core/issues/156)) ([`2abaf9b`](https://github.com/DS4SD/docling-core/commit/2abaf9b53736187adec0266c5ed8b9acff008f6e))

### Fix

* **markdown:** Fix case of leading list ([#174](https://github.com/DS4SD/docling-core/issues/174)) ([`c77c59b`](https://github.com/DS4SD/docling-core/commit/c77c59bec09d4b8093771935393f558cf319ec29))
* Properly handle missing page image case for export_to_html ([#166](https://github.com/DS4SD/docling-core/issues/166)) ([`4708f93`](https://github.com/DS4SD/docling-core/commit/4708f933a7ef87e4637f5bea07e6e4f296abc51a))

## [v2.20.0](https://github.com/DS4SD/docling-core/releases/tag/v2.20.0) - 2025-02-19

### Feature

* Introduce Key-Value and Forms items ([#158](https://github.com/DS4SD/docling-core/issues/158)) ([`d622800`](https://github.com/DS4SD/docling-core/commit/d6228007502fc1f27400059eae7bb768209c0a6f))

## [v2.19.1](https://github.com/DS4SD/docling-core/releases/tag/v2.19.1) - 2025-02-17

### Fix

* Expose included_content_layers arg in export/save methods for MD+HTML ([#164](https://github.com/DS4SD/docling-core/issues/164)) ([`c46995b`](https://github.com/DS4SD/docling-core/commit/c46995bca39fbaa2a9d1fb68c5c9cb5beb6d6722))

## [v2.19.0](https://github.com/DS4SD/docling-core/releases/tag/v2.19.0) - 2025-02-17

### Feature

* Redefine CodeItem as floating object with captions ([#160](https://github.com/DS4SD/docling-core/issues/160)) ([`916323f`](https://github.com/DS4SD/docling-core/commit/916323fb55274753aa1d6a4928388a35417f94b6))
* Implementation of doc tags ([#138](https://github.com/DS4SD/docling-core/issues/138)) ([`f751b45`](https://github.com/DS4SD/docling-core/commit/f751b45b62fb318929f8131ab82fa17db98e8e44))

### Fix

* Document Tokens (doc tags) clean up, fix iterate_items for content_layer ([#161](https://github.com/DS4SD/docling-core/issues/161)) ([`58ed6c8`](https://github.com/DS4SD/docling-core/commit/58ed6c8ab75ba179faf1598b9877662cdcc4c1d3))
* Fix inheritance of CodeItem for backward compatibility ([#162](https://github.com/DS4SD/docling-core/issues/162)) ([`7267c3f`](https://github.com/DS4SD/docling-core/commit/7267c3f5716d3f292592d3b11ddd2b0db4392c20))

## [v2.18.1](https://github.com/DS4SD/docling-core/releases/tag/v2.18.1) - 2025-02-13

### Fix

* Update Pillow constraints ([#157](https://github.com/DS4SD/docling-core/issues/157)) ([`a9afeda`](https://github.com/DS4SD/docling-core/commit/a9afeda6d1251900142571f7bff3d00d871d5915))

## [v2.18.0](https://github.com/DS4SD/docling-core/releases/tag/v2.18.0) - 2025-02-10

### Feature

* Add ContentLayer attribute to designate items to body or furniture ([#148](https://github.com/DS4SD/docling-core/issues/148)) ([`786f0c6`](https://github.com/DS4SD/docling-core/commit/786f0c68336a7b9cced5fb0cb66427b050955e32))

## [v2.17.2](https://github.com/DS4SD/docling-core/releases/tag/v2.17.2) - 2025-02-06

### Fix

* Define LTR/RTL text direction in HTML export ([#152](https://github.com/DS4SD/docling-core/issues/152)) ([`3cf31cb`](https://github.com/DS4SD/docling-core/commit/3cf31cbe384e3f77a375aa057ef61d156d990b23))

## [v2.17.1](https://github.com/DS4SD/docling-core/releases/tag/v2.17.1) - 2025-02-03

### Fix

* Image fallback for malformed equations ([#149](https://github.com/DS4SD/docling-core/issues/149)) ([`eb9b4b3`](https://github.com/DS4SD/docling-core/commit/eb9b4b39a1a2f81baf72d3fa3bbc7cd8ed594c1c))

## [v2.17.0](https://github.com/DS4SD/docling-core/releases/tag/v2.17.0) - 2025-02-03

### Feature

* **HTML:** Fallback showing formulas as images ([#146](https://github.com/DS4SD/docling-core/issues/146)) ([`23477f7`](https://github.com/DS4SD/docling-core/commit/23477f76741b3593734287776fdf5e0761558c2d))
* **HTML:** Export formulas with mathml ([#144](https://github.com/DS4SD/docling-core/issues/144)) ([`ed36437`](https://github.com/DS4SD/docling-core/commit/ed36437346177b9249c98df3eb5ddeadef004c59))

### Fix

* Add html escape in md export and fix formula escapes ([#143](https://github.com/DS4SD/docling-core/issues/143)) ([`c6590e8`](https://github.com/DS4SD/docling-core/commit/c6590e83e28626e4a6b62fdbd270cb794bf10918))

## [v2.16.1](https://github.com/DS4SD/docling-core/releases/tag/v2.16.1) - 2025-01-30

### Fix

* Add newline to md formula export ([#142](https://github.com/DS4SD/docling-core/issues/142)) ([`d07a87e`](https://github.com/DS4SD/docling-core/commit/d07a87e1fbc777cd6d01c7646d714a44a69bc123))

## [v2.16.0](https://github.com/DS4SD/docling-core/releases/tag/v2.16.0) - 2025-01-29

### Feature

* Escape underscores that are within latex equations ([#137](https://github.com/DS4SD/docling-core/issues/137)) ([`0d5cd11`](https://github.com/DS4SD/docling-core/commit/0d5cd11326d8521360add6ffaa3de845bf72abe2))
* Add escaping_underscores option to markdown export ([#135](https://github.com/DS4SD/docling-core/issues/135)) ([`c9739b2`](https://github.com/DS4SD/docling-core/commit/c9739b2c6cf0686747fbda5331e1fd1a174bb91f))
* Added the geometric operations to BoundingBox ([#136](https://github.com/DS4SD/docling-core/issues/136)) ([`f02bbae`](https://github.com/DS4SD/docling-core/commit/f02bbaea47ebbfe98265f530b0b62dd2a6ac1ecd))

## [v2.15.1](https://github.com/DS4SD/docling-core/releases/tag/v2.15.1) - 2025-01-21

### Fix

* Backward compatible add_text() ([#132](https://github.com/DS4SD/docling-core/issues/132)) ([`7e45817`](https://github.com/DS4SD/docling-core/commit/7e458179d8ec46017fd90114a55360daf419f926))

## [v2.15.0](https://github.com/DS4SD/docling-core/releases/tag/v2.15.0) - 2025-01-21

### Feature

* Add CodeItem as pydantic type, update export methods and APIs ([#129](https://github.com/DS4SD/docling-core/issues/129)) ([`c940aa5`](https://github.com/DS4SD/docling-core/commit/c940aa5ca9b345333e3e95d8c0ec32ddfa227385))

### Fix

* Fix hybrid chunker token constraint ([#131](https://github.com/DS4SD/docling-core/issues/131)) ([`b741eea`](https://github.com/DS4SD/docling-core/commit/b741eeaab437781e36f9d356478ef525ef54867b))
* Always return a new bbox when changing origin ([#128](https://github.com/DS4SD/docling-core/issues/128)) ([`841668f`](https://github.com/DS4SD/docling-core/commit/841668f416f2079afc6f8ab07e5507aacce59de3))

## [v2.14.0](https://github.com/DS4SD/docling-core/releases/tag/v2.14.0) - 2025-01-10

### Feature

* Dev/add labels for pictures-classes ([#127](https://github.com/DS4SD/docling-core/issues/127)) ([`078cd61`](https://github.com/DS4SD/docling-core/commit/078cd61b31c36bec553f64c411012e361683bd35))

## [v2.13.1](https://github.com/DS4SD/docling-core/releases/tag/v2.13.1) - 2025-01-08

### Fix
2 changes: 1 addition & 1 deletion README.md
@@ -23,7 +23,7 @@ pip install docling-core

To develop for Docling Core, you need Python 3.9 / 3.10 / 3.11 / 3.12 / 3.13 and Poetry. You can then install from your local clone's root dir:
```bash
-poetry install
+poetry install --all-extras
```

To run the pytest suite, execute:
2 changes: 1 addition & 1 deletion docling_core/transforms/chunker/base.py
@@ -51,7 +51,7 @@ class BaseChunker(BaseModel, ABC):
     delim: str = DFLT_DELIM

     @abstractmethod
-    def chunk(self, dl_doc: DLDocument, **kwargs) -> Iterator[BaseChunk]:
+    def chunk(self, dl_doc: DLDocument, **kwargs: Any) -> Iterator[BaseChunk]:
         """Chunk the provided document.
         Args:
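
The change to `chunk` only adds a type annotation. Under PEP 484, `**kwargs: Any` declares the type of each keyword value, so type checkers treat `kwargs` as a `dict[str, Any]`. A minimal sketch of the same pattern, using stand-in classes: `SketchChunker`, `LineChunker`, and the `prefix` option are hypothetical and not part of docling-core.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator


class SketchChunker(ABC):
    """Hypothetical stand-in for an abstract chunker base class."""

    @abstractmethod
    def chunk(self, text: str, **kwargs: Any) -> Iterator[str]:
        """Yield chunks of the provided text."""


class LineChunker(SketchChunker):
    """Hypothetical subclass: one chunk per non-empty line."""

    def chunk(self, text: str, **kwargs: Any) -> Iterator[str]:
        # `prefix` is an illustrative option read from the keyword arguments.
        prefix = kwargs.get("prefix", "")
        for line in text.splitlines():
            if line.strip():
                yield f"{prefix}{line.strip()}"


chunks = list(LineChunker().chunk("alpha\n\nbeta", prefix="> "))
print(chunks)  # ['> alpha', '> beta']
```

Keeping the override's signature identical to the abstract method, annotation included, lets a strict type checker accept subclasses without complaining about an untyped parameter.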
7 changes: 5 additions & 2 deletions docling_core/transforms/chunker/hierarchical_chunker.py
@@ -19,6 +19,7 @@
 from docling_core.transforms.chunker import BaseChunk, BaseChunker, BaseMeta
 from docling_core.types import DoclingDocument as DLDocument
 from docling_core.types.doc.document import (
+    CodeItem,
     DocItem,
     DocumentOrigin,
     LevelNumber,
@@ -199,8 +200,10 @@ def chunk(self, dl_doc: DLDocument, **kwargs: Any) -> Iterator[BaseChunk]:
                         heading_by_level.pop(k, None)
                     continue

-                if isinstance(item, TextItem) or (
-                    (not self.merge_list_items) and isinstance(item, ListItem)
+                if (
+                    isinstance(item, TextItem)
+                    or ((not self.merge_list_items) and isinstance(item, ListItem))
+                    or isinstance(item, CodeItem)
                 ):
                     text = item.text
                 elif isinstance(item, TableItem):
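
The second hunk widens the branch that reads `item.text`, so that code items are chunked the same way as text items. The condition can be sketched in isolation as follows. The item classes here are simplified, hypothetical stand-ins (not the docling-core models, whose inheritance may differ), and `takes_text_branch` is an illustrative helper name.

```python
from dataclasses import dataclass


# Simplified stand-ins; assumed to be unrelated classes for this sketch.
@dataclass
class TextItem:
    text: str


@dataclass
class ListItem:
    text: str


@dataclass
class CodeItem:
    text: str


@dataclass
class TableItem:
    rows: list


def takes_text_branch(item: object, merge_list_items: bool) -> bool:
    """Mirror the widened condition from the hunk above."""
    return (
        isinstance(item, TextItem)
        or ((not merge_list_items) and isinstance(item, ListItem))
        or isinstance(item, CodeItem)
    )


print(takes_text_branch(CodeItem("x = 1"), merge_list_items=True))   # True
print(takes_text_branch(ListItem("a bullet"), merge_list_items=True))  # False
print(takes_text_branch(TableItem(rows=[]), merge_list_items=False))   # False
```

In this sketch a code item takes the text branch regardless of `merge_list_items`, while a list item takes it only when list merging is disabled.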