[NIPS2024] UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
UDA (Unstructured Document Analysis) is a benchmark suite for Retrieval Augmented Generation (RAG) in real-world document analysis.
Each entry in the UDA dataset is organized as a document-question-answer triplet, where a question is raised from the document, accompanied by a corresponding ground-truth answer.
To reflect the complex nature of real-world scenarios, the documents are retained in their original file formats (such as PDF) without parsing or segmentation, and they consistently contain both textual and tabular data.
Begin by setting up the necessary libraries and source code:
git clone [email protected]:qinchuanhui/UDA-Benchmark.git
cd UDA-Benchmark
pip install -r requirements.txt
For a quick introduction to the functionalities of our benchmark suite, please refer to the basic_demo.ipynb notebook. It outlines a standard workflow for document analysis using our UDA-QA dataset.
The demonstration encompasses several essential steps (a minimal code sketch follows the list):
- Prepare the question-answer-document triplet
- Extract and segment the document content
- Build indexes and retrieve the data segments
- Generate the answering responses with LLMs
- Evaluate the accuracy of the responses using the specific metrics
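The sketch below walks through these steps end to end on a single document. It is a minimal illustration that relies on pypdf and scikit-learn rather than the repo's own utilities, and the file path, chunk size, and question are placeholder assumptions; see basic_demo.ipynb for the actual workflow.

```python
# Minimal sketch of the workflow: parse -> segment -> index -> retrieve -> prompt.
# The document path and question below are illustrative placeholders.
from pypdf import PdfReader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Extract raw text from one source document.
reader = PdfReader("dataset/src_doc_file_example/example_report.pdf")
full_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# 2. Segment the document into fixed-size word chunks.
words = full_text.split()
chunk_size = 200
chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# 3. Build a TF-IDF index over the chunks.
vectorizer = TfidfVectorizer()
chunk_matrix = vectorizer.fit_transform(chunks)

# 4. Retrieve the top-3 chunks most relevant to the question.
question = "What is the expected growth rate in amortization expense in 2010?"
scores = cosine_similarity(vectorizer.transform([question]), chunk_matrix)[0]
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:3]]

# 5. Assemble a prompt for the LLM of your choice; its response is then
#    evaluated against the ground-truth answer with the metrics described below.
prompt = "Context:\n" + "\n---\n".join(top_chunks) + f"\n\nQuestion: {question}\nAnswer:"
print(prompt[:500])
```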
Date: 12-26-2024
Fix the version issue of the Wikipedia pages, which are used in the FetaQA dataset.
Date: 09-28-2024
Our paper, "UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis", has been accepted by NeurIPS'24.
Date: 09-05-2024
Add more well-parsed annotated tables for the PaperTab dataset.
Each entry within the UDA dataset is organized as a document-question-answer triplet. A typical data point may look like:
{ 'doc_name': 'ADI_2009', # a financial report
'q_uid': 'ADI/2009/page_59.pdf-2', # unique question id
'question': 'What is the expected growth rate in amortization expense in 2010?',
'answer_1': '-27.0%',
'answer_2': ' -0.26689'}
The UDA dataset comprises six subsets spanning finance, academia, and knowledge bases, encompassing 2,965 documents and 29,590 expert-annotated Q&A pairs (more details on HuggingFace). The following table gives an overview of the sub-datasets in UDA and their statistics.
| Sub-Dataset (Source Domain) | Doc Format | Doc Num | Q&A Num | Avg #Words | Avg #Pages | Total Size | Q&A Types |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FinHybrid (Finance) | PDF | 788 | 8190 | 76.6k | 147.8 | 2.61 GB | arithmetic |
| TatHybrid (Finance) | PDF | 170 | 14703 | 77.5k | 148.5 | 0.58 GB | extractive, counting, arithmetic |
| PaperTab (Academia) | PDF | 307 | 393 | 6.1k | 11.0 | 0.22 GB | extractive, yes/no, free-form |
| PaperText (Academia) | PDF | 1087 | 2804 | 5.9k | 10.6 | 0.87 GB | extractive, yes/no, free-form |
| FetaTab (Wikipedia) | PDF & HTML | 878 | 1023 | 6.0k | 14.9 | 0.92 GB | free-form |
| NqText (Wikipedia) | PDF & HTML | 645 | 2477 | 6.1k | 14.9 | 0.68 GB | extractive |
The Q&A labels are accessible either through the csv files in the dataset/qa directory or by loading the dataset from the HuggingFace repository qinchuanhui/UDA-QA. The basic usage and format conversion can be found in basic_demo.ipynb.
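For example, the labels can be loaded as sketched below; the subset name and csv file name are assumptions for illustration, so check the dataset card or basic_demo.ipynb for the exact identifiers.

```python
# Two hedged ways to load the Q&A labels; the subset name "paper_tab" and the
# csv file name are illustrative placeholders, not guaranteed identifiers.
import pandas as pd
from datasets import load_dataset

# Option A: load from the HuggingFace Hub.
qa = load_dataset("qinchuanhui/UDA-QA", "paper_tab")
print(qa)

# Option B: read a local csv shipped in dataset/qa.
df = pd.read_csv("dataset/qa/paper_tab_qa.csv")
print(df.head())
```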
To access the complete source document files, you can download them from the HuggingFace Repo. After downloading, please extract the content into dataset/src_doc_files. For illustrative purposes, some examples of source documents can be found in dataset/src_doc_file_example.
We also include extended information related to the question-answering tasks, encompassing reasoning explanations, human-validated factual evidence, and well-structured contexts. These resources are pivotal for an in-depth analysis of the modular benchmark. You can obtain the full set from the HuggingFace Repo and place it in dataset/extended_qa_info. A sampled subset for evaluation purposes is also available in dataset/extended_qa_info_bench.
Our UDA benchmark focuses on several pivotal items:
- The effectiveness of various table-parsing approaches
- The performance of different indexing and retrieval strategies, and the influence of precise retrieval on LLM generation (a simplified sketch follows this list)
- The effectiveness of long-context LLMs compared to typical RAG pipelines
- Comparison of different LLM-based Q&A strategies
- End-to-end comparisons of various LLMs across diverse applications
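As a toy illustration of the retrieval-strategy comparison mentioned above, the sketch below contrasts a sparse (BM25) retriever with a dense (embedding-based) one. The libraries, model name, and sample chunks are assumptions chosen for demonstration; the actual configurations live in the experiment notebooks.

```python
# Illustrative sparse-vs-dense retrieval comparison over toy document chunks.
# rank_bm25 and sentence-transformers are stand-ins, not the benchmark's
# prescribed tooling.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Table 3 reports the F1 scores of the baseline models.",
    "We fine-tune the model on the training split for three epochs.",
    "The dataset contains questions annotated by domain experts.",
]
question = "How many epochs is the model fine-tuned for?"

# Sparse retrieval: rank chunks by BM25 over whitespace tokens.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
sparse_best = bm25.get_top_n(question.lower().split(), chunks, n=1)[0]

# Dense retrieval: rank chunks by cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_emb = model.encode(chunks, convert_to_tensor=True)
query_emb = model.encode(question, convert_to_tensor=True)
dense_best = chunks[int(util.cos_sim(query_emb, chunk_emb).argmax())]

print("BM25 pick: ", sparse_best)
print("Dense pick:", dense_best)
```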
To evaluate the quality of LLM-generated answers, we apply the widely accepted span-level F1-score to the PaperTab, PaperText, FetaTab, and NqText datasets, where ground-truth answers are in natural language and the source datasets also use this metric. We treat the prediction and ground truth as bags of words and calculate the F1-score to measure their overlap (see basic_eval).
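A minimal sketch of this bag-of-words F1 is shown below; the repo's own implementation may apply additional normalization (e.g. punctuation or article stripping), so treat it as illustrative.

```python
# Simplified span-level (bag-of-words) F1; the benchmark's evaluation code may
# normalize text more aggressively.
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the growth rate is 27 percent", "27 percent growth"))  # ~0.67
```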
In financial analysis, the assessment becomes more intricate due to numerical values. For the TatHybrid dataset, we adopt the numeracy-focused F1-score, which accounts for the scale and the sign of numerical values. For the FinHybrid dataset, where answers are always numerical or binary, we rely on the Exact-Match metric but allow a numerical tolerance of 1% to account for rounding discrepancies.
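For FinHybrid, the tolerant exact match can be sketched roughly as follows; the helper below is a simplified, assumption-based illustration rather than the benchmark's exact evaluation code.

```python
# Illustrative Exact-Match with a 1% relative numerical tolerance;
# non-numerical (e.g. yes/no) answers fall back to string equality.
def exact_match_with_tolerance(prediction: str, ground_truth: str,
                               rel_tol: float = 0.01) -> bool:
    def to_number(text: str):
        try:
            return float(text.strip().rstrip("%"))
        except ValueError:
            return None

    pred_num, gold_num = to_number(prediction), to_number(ground_truth)
    if pred_num is None or gold_num is None:
        return prediction.strip().lower() == ground_truth.strip().lower()
    if gold_num == 0:
        return pred_num == 0
    return abs(pred_num - gold_num) / abs(gold_num) <= rel_tol

print(exact_match_with_tolerance("-27.0", "-26.9"))  # True: within 1%
```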
For an in-depth exploration of our benchmark and experimental framework, please refer to the resources in the experiment directory. We have curated a collection of user-friendly Jupyter notebooks to facilitate the reproduction of our benchmarking procedures. Each directory/notebook corresponds to a specific benchmarking item, and the experimental results are stored in its own directory.
Additionally, more details on the implementation of our functionalities are available in the uda directory, including the data preprocessing, parsing strategies, retrieval procedures, LLM inference, and evaluation scripts.
Our UDA dataset is distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) License.
Please kindly cite our paper if it helps your research:
@article{hui2024uda,
title={UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis},
author={Hui, Yulong and Lu, Yao and Zhang, Huanchen},
journal={arXiv preprint arXiv:2406.15187},
year={2024}
}
or
@inproceedings{
hui2024uda,
title={{UDA}: A Benchmark Suite for Retrieval Augmented Generation in Real-World Document Analysis},
author={Yulong Hui and Yao Lu and Huanchen Zhang},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=MS4oxVfBHn}
}