feat: Add ReadingOrder and Markdown text evaluation #8

Merged: 19 commits merged into `main` from `nli/reading_order` on Jan 17, 2025
Commits (19):
- `4a08527` chore: Add tqdm in the dependencies (nikos-livathinos, Jan 8, 2025)
- `3cbf954` chore: Move the DatasetStats outside of table_evaluator into the util… (nikos-livathinos, Jan 13, 2025)
- `586ae01` feat: ReadingOrderEvaluator: Full implementation with Average Relativ… (nikos-livathinos, Jan 7, 2025)
- `e53e1e5` chore: Code clean up (nikos-livathinos, Jan 13, 2025)
- `2f417ab` fix: Add reading_order in the visualise() method of main (nikos-livathinos, Jan 14, 2025)
- `36b0722` fix: utils/stats.py: Add the metric name as a parameter. Clean up code (nikos-livathinos, Jan 14, 2025)
- `96a7d88` chore: Add reading_order evaluation and visualization in the examples… (nikos-livathinos, Jan 14, 2025)
- `020594b` feat: MarkdownTextEvaluator: Introduce text evaluation based on markd… (nikos-livathinos, Jan 14, 2025)
- `8bb587f` fix: Dump json evaluations for the reading_order and markdown_text wi… (nikos-livathinos, Jan 14, 2025)
- `72e7080` feat: Add ReadingOrderVisualizer and use it in the main (nikos-livathinos, Jan 15, 2025)
- `8fffdd2` chore: Add pillow lib to the poetry (nikos-livathinos, Jan 15, 2025)
- `50b20c6` fix: ReadingOrderEvaluator: Convert the bboxes in bottom-left origin … (nikos-livathinos, Jan 15, 2025)
- `d85f7b7` Merge branch 'main' into nli/reading_order (nikos-livathinos, Jan 15, 2025)
- `3a619ef` chore: Update poetry lock (nikos-livathinos, Jan 15, 2025)
- `35d37f3` fix: Refactor to move the evaluator statistics in a separate file eva… (nikos-livathinos, Jan 15, 2025)
- `e9489bb` chore: Update Readme to include the evaluations and visualizations fo… (nikos-livathinos, Jan 15, 2025)
- `8a4f265` fix: Refactor the stats.py:save_historgram() to receive generic name … (nikos-livathinos, Jan 16, 2025)
- `c894662` feat: ReadingOrder: Implement weighted ARD where the weight is based … (nikos-livathinos, Jan 16, 2025)
- `db6d9e3` chore: Update Readme with ARD and weighted ARD and histograms (nikos-livathinos, Jan 16, 2025)
242 changes: 242 additions & 0 deletions README.md
@@ -103,6 +103,129 @@ The final result can be visualised as,
![DPBench_TEDS](./docs/evaluations/evaluation_DPBench_tableformer.png)
</details>


<details>
<summary><b>Reading order evaluations for DP-Bench</b></summary>
<br>

👉 Evaluate the dataset,

```sh
poetry run evaluate -t evaluate -m reading_order -b DPBench -i ./benchmarks/dpbench-layout -o ./benchmarks/dpbench-layout
```

👉 Visualise the reading order evaluations,

```sh
poetry run evaluate -t visualize -m reading_order -b DPBench -i ./benchmarks/dpbench-layout -o ./benchmarks/dpbench-layout
```

Reading order (Norm Average Relative Distance) [mean|median|std]: [0.98|1.00|0.05]

| x0<=ARD | ARD<=x1 | prob [%] | acc [%] | 1-acc [%] | total |
|-----------|-----------|------------|-----------|-------------|---------|
| 0 | 0.05 | 0 | 0 | 100 | 0 |
| 0.05 | 0.1 | 0 | 0 | 100 | 0 |
| 0.1 | 0.15 | 0 | 0 | 100 | 0 |
| 0.15 | 0.2 | 0 | 0 | 100 | 0 |
| 0.2 | 0.25 | 0 | 0 | 100 | 0 |
| 0.25 | 0.3 | 0 | 0 | 100 | 0 |
| 0.3 | 0.35 | 0 | 0 | 100 | 0 |
| 0.35 | 0.4 | 0 | 0 | 100 | 0 |
| 0.4 | 0.45 | 0 | 0 | 100 | 0 |
| 0.45 | 0.5 | 0 | 0 | 100 | 0 |
| 0.5 | 0.55 | 0 | 0 | 100 | 0 |
| 0.55 | 0.6 | 0 | 0 | 100 | 0 |
| 0.6 | 0.65 | 0 | 0 | 100 | 0 |
| 0.65 | 0.7 | 1 | 0 | 100 | 2 |
| 0.7 | 0.75 | 0.5 | 1 | 99 | 1 |
| 0.75 | 0.8 | 1 | 1.5 | 98.5 | 2 |
| 0.8 | 0.85 | 2.5 | 2.5 | 97.5 | 5 |
| 0.85 | 0.9 | 0.5 | 5 | 95 | 1 |
| 0.9 | 0.95 | 1.5 | 5.5 | 94.5 | 3 |
| 0.95 | 1 | 93 | 7 | 93 | 186 |

![DPBench_reading_order_ARD](./docs/evaluations/evaluation_DPBench_reading_order_ARD.png)
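
For orientation, the score rewards predicted reading orders whose elements stay close to their ground-truth positions. Below is a minimal, hypothetical sketch of a normalized ARD-style score, assuming it averages per-element rank displacement and maps the result so that 1.0 is a perfect order; the exact formula lives in `ReadingOrderEvaluator` and may differ.

```python
# Hypothetical sketch of a normalized ARD-style score, NOT the exact formula
# used by ReadingOrderEvaluator: per-element rank displacement between the
# ground-truth order and the predicted order, averaged and mapped so that
# 1.0 means a perfect reading order.
from typing import List


def norm_ard(true_order: List[str], pred_order: List[str]) -> float:
    """Return a score in [0, 1]; higher means the predicted order is closer."""
    n = len(true_order)
    if n == 0 or sorted(true_order) != sorted(pred_order):
        return 0.0
    pred_rank = {elem: i for i, elem in enumerate(pred_order)}
    # Average relative displacement of each element, normalized by n.
    ard = sum(abs(i - pred_rank[e]) for i, e in enumerate(true_order)) / (n * n)
    return 1.0 - ard


print(norm_ard(["title", "para1", "para2"], ["title", "para1", "para2"]))  # 1.0
print(norm_ard(["title", "para1", "para2"], ["para2", "para1", "title"]))  # lower
```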


Reading order (Weighted Normalized Average Relative Distance) [mean|median|std]: [1.00|1.00|0.00]

| x0<=Weighted ARD | Weighted ARD<=x1 | prob [%] | acc [%] | 1-acc [%] | total |
|--------------------|--------------------|------------|-----------|-------------|---------|
| 0 | 0.05 | 0 | 0 | 100 | 0 |
| 0.05 | 0.1 | 0 | 0 | 100 | 0 |
| 0.1 | 0.15 | 0 | 0 | 100 | 0 |
| 0.15 | 0.2 | 0 | 0 | 100 | 0 |
| 0.2 | 0.25 | 0 | 0 | 100 | 0 |
| 0.25 | 0.3 | 0 | 0 | 100 | 0 |
| 0.3 | 0.35 | 0 | 0 | 100 | 0 |
| 0.35 | 0.4 | 0 | 0 | 100 | 0 |
| 0.4 | 0.45 | 0 | 0 | 100 | 0 |
| 0.45 | 0.5 | 0 | 0 | 100 | 0 |
| 0.5 | 0.55 | 0 | 0 | 100 | 0 |
| 0.55 | 0.6 | 0 | 0 | 100 | 0 |
| 0.6 | 0.65 | 0 | 0 | 100 | 0 |
| 0.65 | 0.7 | 0 | 0 | 100 | 0 |
| 0.7 | 0.75 | 0 | 0 | 100 | 0 |
| 0.75 | 0.8 | 0 | 0 | 100 | 0 |
| 0.8 | 0.85 | 0 | 0 | 100 | 0 |
| 0.85 | 0.9 | 0 | 0 | 100 | 0 |
| 0.9 | 0.95 | 0 | 0 | 100 | 0 |
| 0.95 | 1 | 100 | 0 | 100 | 200 |

![DPBench_reading_order_weighted_ARD](./docs/evaluations/evaluation_DPBench_reading_order_weighted_ARD.png)


Additionally, images visualizing the actual reading order of each document are saved in: `benchmarks/dpbench-layout/reading_order_viz`
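
The overlays are produced with Pillow, which this PR adds to the project dependencies. A minimal sketch of how such an overlay could be drawn is shown below; the boxes, colors, and file name are illustrative only, and the actual `ReadingOrderVisualizer` may render the order differently.

```python
# Hypothetical sketch of a reading-order overlay drawn with Pillow; the
# ReadingOrderVisualizer introduced in this PR may draw the order differently.
from PIL import Image, ImageDraw

# Bboxes in reading order, top-left-origin pixel coordinates (illustrative values).
bboxes = [(50, 40, 300, 90), (50, 110, 300, 400), (320, 110, 560, 400)]

page = Image.new("RGB", (612, 792), "white")  # stand-in for a rendered page image
draw = ImageDraw.Draw(page)

prev_center = None
for idx, (x0, y0, x1, y1) in enumerate(bboxes, start=1):
    draw.rectangle((x0, y0, x1, y1), outline="blue", width=2)
    draw.text((x0 + 4, y0 + 4), str(idx), fill="red")  # reading-order index
    center = ((x0 + x1) // 2, (y0 + y1) // 2)
    if prev_center is not None:
        # Connect consecutive elements to make the order visible.
        draw.line([prev_center, center], fill="red", width=2)
    prev_center = center

page.save("reading_order_overlay.png")
```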
</details>


<details>
<summary><b>Markdown text evaluations for DP-Bench</b></summary>
<br>

👉 Evaluate the dataset,

```sh
poetry run evaluate -t evaluate -m markdown_text -b DPBench -i ./benchmarks/dpbench-layout -o ./benchmarks/dpbench-layout
```

👉 Visualise the markdown text evaluations,

```sh
poetry run evaluate -t visualize -m markdown_text -b DPBench -i ./benchmarks/dpbench-layout -o ./benchmarks/dpbench-layout
```

Markdown text (BLEU) [mean|median|std]: [0.81|0.87|0.20]

| x0<=BLEU | BLEU<=x1 | prob [%] | acc [%] | 1-acc [%] | total |
|------------|------------|------------|-----------|-------------|---------|
| 0 | 0.05 | 1 | 0 | 100 | 2 |
| 0.05 | 0.1 | 0.5 | 1 | 99 | 1 |
| 0.1 | 0.15 | 0.5 | 1.5 | 98.5 | 1 |
| 0.15 | 0.2 | 1.5 | 2 | 98 | 3 |
| 0.2 | 0.25 | 1 | 3.5 | 96.5 | 2 |
| 0.25 | 0.3 | 0 | 4.5 | 95.5 | 0 |
| 0.3 | 0.35 | 0.5 | 4.5 | 95.5 | 1 |
| 0.35 | 0.4 | 0 | 5 | 95 | 0 |
| 0.4 | 0.45 | 0.5 | 5 | 95 | 1 |
| 0.45 | 0.5 | 0.5 | 5.5 | 94.5 | 1 |
| 0.5 | 0.55 | 3.5 | 6 | 94 | 7 |
| 0.55 | 0.6 | 1 | 9.5 | 90.5 | 2 |
| 0.6 | 0.65 | 4 | 10.5 | 89.5 | 8 |
| 0.65 | 0.7 | 2 | 14.5 | 85.5 | 4 |
| 0.7 | 0.75 | 3.5 | 16.5 | 83.5 | 7 |
| 0.75 | 0.8 | 10 | 20 | 80 | 20 |
| 0.8 | 0.85 | 9.5 | 30 | 70 | 19 |
| 0.85 | 0.9 | 21 | 39.5 | 60.5 | 42 |
| 0.9 | 0.95 | 22.5 | 60.5 | 39.5 | 45 |
| 0.95 | 1 | 17 | 83 | 17 | 34 |

The above quantiles are also visualized as a histogram plot in: `benchmarks/dpbench-layout/evaluation_DPBench_markdown_text.png`
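
The score compares the markdown export of the predicted document against the ground-truth markdown. As a rough illustration, a per-document BLEU score can be computed with NLTK as sketched below; the whitespace tokenization and the smoothing function are assumptions here, and `MarkdownTextEvaluator` may use different choices.

```python
# Hypothetical sketch of a per-document BLEU score between ground-truth and
# predicted markdown; tokenization and smoothing are assumptions, not the
# exact setup of MarkdownTextEvaluator.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

true_md = "# Title\n\nThe quick brown fox jumps over the lazy dog."
pred_md = "# Title\n\nThe quick brown fox jumped over the lazy dog."

reference = true_md.split()
hypothesis = pred_md.split()

# sentence_bleu expects a list of reference token lists.
score = sentence_bleu(
    [reference], hypothesis, smoothing_function=SmoothingFunction().method1
)
print(f"BLEU: {score:.2f}")
```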

</details>


### OmniDocBench

Using a single command,
@@ -194,6 +317,125 @@ The final result can be visualised as,
| 0.95 | 1 | 16.97 | 83.03 | 16.97 | 56 |
</details>

<details>
<summary><b>Reading order evaluations for OmniDocBench</b></summary>
<br>

👉 Evaluate the dataset,

```sh
poetry run evaluate -t evaluate -m reading_order -b OmniDocBench -i ./benchmarks/omnidocbench-dataset/layout -o ./benchmarks/omnidocbench-dataset/layout
```

👉 Visualise the reading order evaluations,

```sh
poetry run evaluate -t visualize -m reading_order -b OmniDocBench -i ./benchmarks/omnidocbench-dataset/layout -o ./benchmarks/omnidocbench-dataset/layout
```

Reading order (Norm Average Relative Distance) [mean|median|std]: [0.84|0.84|0.12]

| x0<=ARD | ARD<=x1 | prob [%] | acc [%] | 1-acc [%] | total |
|-----------|-----------|------------|-----------|-------------|---------|
| 0 | 0.05 | 0 | 0 | 100 | 0 |
| 0.05 | 0.1 | 0 | 0 | 100 | 0 |
| 0.1 | 0.15 | 0 | 0 | 100 | 0 |
| 0.15 | 0.2 | 0 | 0 | 100 | 0 |
| 0.2 | 0.25 | 0 | 0 | 100 | 0 |
| 0.25 | 0.3 | 0 | 0 | 100 | 0 |
| 0.3 | 0.35 | 0 | 0 | 100 | 0 |
| 0.35 | 0.4 | 0 | 0 | 100 | 0 |
| 0.4 | 0.45 | 0 | 0 | 100 | 0 |
| 0.45 | 0.5 | 0 | 0 | 100 | 0 |
| 0.5 | 0.55 | 1.53 | 0 | 100 | 15 |
| 0.55 | 0.6 | 2.24 | 1.53 | 98.47 | 22 |
| 0.6 | 0.65 | 2.55 | 3.77 | 96.23 | 25 |
| 0.65 | 0.7 | 4.89 | 6.32 | 93.68 | 48 |
| 0.7 | 0.75 | 8.15 | 11.21 | 88.79 | 80 |
| 0.75 | 0.8 | 17.74 | 19.37 | 80.63 | 174 |
| 0.8 | 0.85 | 17.43 | 37.1 | 62.9 | 171 |
| 0.85 | 0.9 | 17.13 | 54.54 | 45.46 | 168 |
| 0.9 | 0.95 | 7.44 | 71.66 | 28.34 | 73 |
| 0.95 | 1 | 20.9 | 79.1 | 20.9 | 205 |

![OmniDocBench_reading_order_ARD](./docs/evaluations/evaluation_OmniDocBench_reading_order_ARD.png)


Reading order (Weighted Normalized Average Relative Distance) [mean|median|std]: [0.99|0.99|0.03]

| x0<=Weighted ARD | Weighted ARD<=x1 | prob [%] | acc [%] | 1-acc [%] | total |
|--------------------|--------------------|------------|-----------|-------------|---------|
| 0 | 0.05 | 0 | 0 | 100 | 0 |
| 0.05 | 0.1 | 0 | 0 | 100 | 0 |
| 0.1 | 0.15 | 0 | 0 | 100 | 0 |
| 0.15 | 0.2 | 0 | 0 | 100 | 0 |
| 0.2 | 0.25 | 0 | 0 | 100 | 0 |
| 0.25 | 0.3 | 0 | 0 | 100 | 0 |
| 0.3 | 0.35 | 0 | 0 | 100 | 0 |
| 0.35 | 0.4 | 0 | 0 | 100 | 0 |
| 0.4 | 0.45 | 0 | 0 | 100 | 0 |
| 0.45 | 0.5 | 0 | 0 | 100 | 0 |
| 0.5 | 0.55 | 0 | 0 | 100 | 0 |
| 0.55 | 0.6 | 0 | 0 | 100 | 0 |
| 0.6 | 0.65 | 0 | 0 | 100 | 0 |
| 0.65 | 0.7 | 0 | 0 | 100 | 0 |
| 0.7 | 0.75 | 0 | 0 | 100 | 0 |
| 0.75 | 0.8 | 0.61 | 0 | 100 | 6 |
| 0.8 | 0.85 | 0 | 0.61 | 99.39 | 0 |
| 0.85 | 0.9 | 1.83 | 0.61 | 99.39 | 18 |
| 0.9 | 0.95 | 4.28 | 2.45 | 97.55 | 42 |
| 0.95 | 1 | 93.27 | 6.73 | 93.27 | 915 |

![OmniDocBench_reading_order_weighted_ARD](./docs/evaluations/evaluation_OmniDocBench_reading_order_weighted_ARD.png)

</details>


<details>
<summary><b>Markdown text evaluations for OmniDocBench</b></summary>
<br>

👉 Evaluate the dataset,

```sh
poetry run evaluate -t evaluate -m markdown_text -b OmniDocBench -i ./benchmarks/omnidocbench-dataset/layout -o ./benchmarks/omnidocbench-dataset/layout
```

👉 Visualise the markdown text evaluations,

```sh
poetry run evaluate -t visualize -m markdown_text -b OmniDocBench -i ./benchmarks/omnidocbench-dataset/layout -o ./benchmarks/omnidocbench-dataset/layout
```

Markdown text (BLEU) [mean|median|std]: [0.30|0.11|0.33]

| x0<=BLEU | BLEU<=x1 | prob [%] | acc [%] | 1-acc [%] | total |
|------------|------------|------------|-----------|-------------|---------|
| 0 | 0.05 | 41.59 | 0 | 100 | 408 |
| 0.05 | 0.1 | 6.83 | 41.59 | 58.41 | 67 |
| 0.1 | 0.15 | 4.18 | 48.42 | 51.58 | 41 |
| 0.15 | 0.2 | 3.26 | 52.6 | 47.4 | 32 |
| 0.2 | 0.25 | 2.45 | 55.86 | 44.14 | 24 |
| 0.25 | 0.3 | 1.83 | 58.31 | 41.69 | 18 |
| 0.3 | 0.35 | 1.83 | 60.14 | 39.86 | 18 |
| 0.35 | 0.4 | 2.04 | 61.98 | 38.02 | 20 |
| 0.4 | 0.45 | 2.04 | 64.02 | 35.98 | 20 |
| 0.45 | 0.5 | 2.55 | 66.06 | 33.94 | 25 |
| 0.5 | 0.55 | 2.04 | 68.6 | 31.4 | 20 |
| 0.55 | 0.6 | 2.04 | 70.64 | 29.36 | 20 |
| 0.6 | 0.65 | 2.75 | 72.68 | 27.32 | 27 |
| 0.65 | 0.7 | 2.96 | 75.43 | 24.57 | 29 |
| 0.7 | 0.75 | 4.69 | 78.39 | 21.61 | 46 |
| 0.75 | 0.8 | 4.28 | 83.08 | 16.92 | 42 |
| 0.8 | 0.85 | 4.79 | 87.36 | 12.64 | 47 |
| 0.85 | 0.9 | 4.59 | 92.15 | 7.85 | 45 |
| 0.9 | 0.95 | 2.65 | 96.74 | 3.26 | 26 |
| 0.95 | 1 | 0.61 | 99.39 | 0.61 | 6 |

The above quantiles are also visualized as a histogram plot in: `benchmarks/omnidocbench-dataset/layout/evaluation_OmniDocBench_markdown_text.png`

</details>

### FinTabNet

Using a single command (loading the dataset from Huggingface: [FinTabNet_OTSL](https://huggingface.co/datasets/ds4sd/FinTabNet_OTSL)),
2 changes: 2 additions & 0 deletions docling_eval/benchmarks/constants.py
@@ -29,6 +29,8 @@ class EvaluationModality(str, Enum):
LAYOUT = "layout"
TABLEFORMER = "tableformer"
CODEFORMER = "codeformer"
READING_ORDER = "reading_order"
MARKDOWN_TEXT = "markdown_text"


class BenchMarkNames(str, Enum):
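
The two new modalities plug into the existing `EvaluationModality` enum shown above. Because the enum mixes in `str`, the `-m` values used in the README commands map directly onto members; the small sketch below re-declares the enum locally purely for illustration.

```python
from enum import Enum


class EvaluationModality(str, Enum):
    # Local re-declaration for illustration; the real enum lives in
    # docling_eval/benchmarks/constants.py.
    LAYOUT = "layout"
    TABLEFORMER = "tableformer"
    CODEFORMER = "codeformer"
    READING_ORDER = "reading_order"
    MARKDOWN_TEXT = "markdown_text"


# The CLI string "reading_order" resolves to the new member by value.
assert EvaluationModality("reading_order") is EvaluationModality.READING_ORDER
assert EvaluationModality.MARKDOWN_TEXT.value == "markdown_text"
```
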
102 changes: 0 additions & 102 deletions docling_eval/benchmarks/dpbench/create.py
@@ -446,108 +446,6 @@ def create_dpbench_tableformer_dataset(
)


def create_dpbench_readingorder_dataset(
dpbench_dir: Path, output_dir: Path, image_scale: float = 1.0
):
# Init the TableFormer model
tf_updater = TableFormerUpdater()

# load the groundtruth
with open(dpbench_dir / f"dataset/reference.json", "r") as fr:
gt = json.load(fr)

viz_dir = output_dir / "vizualisations"
os.makedirs(viz_dir, exist_ok=True)

records = []

for filename, annots in tqdm(
gt.items(),
desc="Processing files for DP-Bench with TableFormer",
total=len(gt),
ncols=128,
):

pdf_path = dpbench_dir / f"dataset/pdfs/{filename}"

# Create the groundtruth Document
true_doc = DoclingDocument(name=f"ground-truth {os.path.basename(pdf_path)}")
true_doc, true_page_images = add_pages_to_true_doc(
pdf_path=pdf_path, true_doc=true_doc, image_scale=image_scale
)

assert len(true_page_images) == 1, "len(true_page_images)==1"

page_width = true_doc.pages[1].size.width
page_height = true_doc.pages[1].size.height

for elem in annots["elements"]:
update(
true_doc,
elem,
page=true_doc.pages[1],
page_image=true_page_images[0],
page_width=page_width,
page_height=page_height,
)

# Create the updated Document
updated, pred_doc = tf_updater.replace_tabledata(
pdf_path=pdf_path, true_doc=true_doc
)

if updated:

if True:
save_comparison_html(
filename=viz_dir / f"{os.path.basename(pdf_path)}-comp.html",
true_doc=true_doc,
pred_doc=pred_doc,
page_image=true_page_images[0],
true_labels=TRUE_HTML_EXPORT_LABELS,
pred_labels=PRED_HTML_EXPORT_LABELS,
)

true_doc, true_pictures, true_page_images = extract_images(
document=true_doc,
pictures_column=BenchMarkColumns.GROUNDTRUTH_PICTURES.value, # pictures_column,
page_images_column=BenchMarkColumns.GROUNDTRUTH_PAGE_IMAGES.value, # page_images_column,
)

pred_doc, pred_pictures, pred_page_images = extract_images(
document=pred_doc,
pictures_column=BenchMarkColumns.PREDICTION_PICTURES.value, # pictures_column,
page_images_column=BenchMarkColumns.PREDICTION_PAGE_IMAGES.value, # page_images_column,
)

record = {
BenchMarkColumns.DOCLING_VERSION: docling_version(),
BenchMarkColumns.STATUS: "SUCCESS",
BenchMarkColumns.DOC_ID: str(os.path.basename(pdf_path)),
BenchMarkColumns.GROUNDTRUTH: json.dumps(true_doc.export_to_dict()),
BenchMarkColumns.PREDICTION: json.dumps(pred_doc.export_to_dict()),
BenchMarkColumns.ORIGINAL: get_binary(pdf_path),
BenchMarkColumns.MIMETYPE: "application/pdf",
BenchMarkColumns.PREDICTION_PAGE_IMAGES: pred_page_images,
BenchMarkColumns.PREDICTION_PICTURES: pred_pictures,
BenchMarkColumns.GROUNDTRUTH_PAGE_IMAGES: true_page_images,
BenchMarkColumns.GROUNDTRUTH_PICTURES: pred_pictures,
}
records.append(record)

test_dir = output_dir / "test"
os.makedirs(test_dir, exist_ok=True)

save_shard_to_disk(items=records, dataset_path=test_dir)

write_datasets_info(
name="DPBench: readingorder",
output_dir=output_dir,
num_train_rows=0,
num_test_rows=len(records),
)


def parse_arguments():
"""Parse arguments for DP-Bench parsing."""

1 change: 1 addition & 0 deletions docling_eval/benchmarks/omnidocbench/create.py
@@ -295,6 +295,7 @@ def create_omnidocbench_e2e_dataset(

assert len(true_page_images) == 1, "len(true_page_images)==1"

# The true_doc.pages is a dict with the page numbers as indices starting at 1
page_width = true_doc.pages[1].size.width
page_height = true_doc.pages[1].size.height
