📃 UDA-Benchmark

[NIPS2024] UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis


✨ Introduction

UDA (Unstructured Document Analysis) is a benchmark suite for Retrieval Augmented Generation (RAG) in real-world document analysis.

Each entry in the UDA dataset is organized as a document-question-answer triplet, where a question is raised from the document, accompanied by a corresponding ground-truth answer.

To reflect the complex nature of real-world scenarios, the documents are retained in their original file formats (such as PDF) without parsing or segmentation, and they consist of both textual and tabular data.

🚀 Quick Start

Begin by setting up the necessary libraries and source code:

git clone git@github.com:qinchuanhui/UDA-Benchmark.git
cd UDA-Benchmark
pip install -r requirements.txt

For a quick introduction to the functionalities of our benchmark suite, please refer to the basic_demo.ipynb notebook. It outlines a standard workflow for document analysis using our UDA-QA dataset.

The demonstration encompasses several essential steps, and a minimal end-to-end sketch follows this list:

  • Prepare the question-answer-document triplet
  • Extract and segment the document content
  • Build indexes and retrieve the data segments
  • Generate the answering responses with LLMs
  • Evaluate the accuracy of the responses using the specified metrics
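For orientation, here is a minimal, self-contained sketch of such a pipeline. It uses generic libraries (pdfplumber, scikit-learn, and the OpenAI client) rather than the helpers in the uda package, and the document path and model name are placeholders; see basic_demo.ipynb for the actual workflow.

import pdfplumber
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Prepare the question-answer-document triplet (placeholder path)
doc_path = "dataset/src_doc_file_example/ADI_2009.pdf"
question = "What is the expected growth rate in amortization expense in 2010?"

# 2. Extract and segment the document content into fixed-size chunks
with pdfplumber.open(doc_path) as pdf:
    full_text = "\n".join(page.extract_text() or "" for page in pdf.pages)
chunks = [full_text[i : i + 1000] for i in range(0, len(full_text), 1000)]

# 3. Build a simple TF-IDF index and retrieve the top-3 chunks
vectorizer = TfidfVectorizer()
chunk_vecs = vectorizer.fit_transform(chunks)
scores = cosine_similarity(vectorizer.transform([question]), chunk_vecs)[0]
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:3]]

# 4. Generate an answer with an LLM, conditioned on the retrieved context
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}\nAnswer:"
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
answer = response.choices[0].message.content

# 5. Evaluate the answer against the ground truth (see the metrics below)
print(answer)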

📢 Updates

Date: 12-26-2024

Fixed a version issue with the Wikipedia pages used in the FetaQA dataset.

Date: 09-28-2024

Our paper, "UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis", has been accepted by NeurIPS'24.

Date: 09-05-2024

Added more well-parsed annotated tables for the PaperTab dataset.

📖 Dataset: UDA-QA

Description

Each entry within the UDA dataset is organized as a document-question-answer triplet. A typical data point looks like:

{ 'doc_name': 'ADI_2009',  # a financial report
  'q_uid': 'ADI/2009/page_59.pdf-2',  # unique question id
  'question': 'What is the expected growth rate in amortization expense in 2010?',
  'answer_1': '-27.0%',
  'answer_2': '-0.26689'}

The UDA dataset comprises six subsets spanning finance, academia, and knowledge bases, encompassing 2965 documents and 29590 expert-annotated Q&A pairs (more details on HuggingFace). The following table gives an overview of the sub-datasets in UDA and their statistics.

| Sub-Dataset (Source Domain) | Doc Format | Doc Num | Q&A Num | Avg #Words | Avg #Pages | Total Size | Q&A Types |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FinHybrid (Finance) | PDF | 788 | 8190 | 76.6k | 147.8 | 2.61 GB | arithmetic |
| TatHybrid (Finance) | PDF | 170 | 14703 | 77.5k | 148.5 | 0.58 GB | extractive, counting, arithmetic |
| PaperTab (Academia) | PDF | 307 | 393 | 6.1k | 11.0 | 0.22 GB | extractive, yes/no, free-form |
| PaperText (Academia) | PDF | 1087 | 2804 | 5.9k | 10.6 | 0.87 GB | extractive, yes/no, free-form |
| FetaTab (Wikipedia) | PDF & HTML | 878 | 1023 | 6.0k | 14.9 | 0.92 GB | free-form |
| NqText (Wikipedia) | PDF & HTML | 645 | 2477 | 6.1k | 14.9 | 0.68 GB | extractive |

Dataset Usage

The Q&A labels are accessible either through the CSV files in the dataset/qa directory or by loading the dataset from the HuggingFace repository qinchuanhui/UDA-QA, as sketched below. Basic usage and format conversion are covered in basic_demo.ipynb.
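As a quick illustration, the snippet below shows both access paths; the CSV filename, the configuration name, and the split are assumptions made for this example, so check basic_demo.ipynb for the exact identifiers.

import pandas as pd
from datasets import load_dataset

# Option 1: read a CSV file shipped in dataset/qa (filename is a placeholder)
qa_df = pd.read_csv("dataset/qa/fin_qa.csv")

# Option 2: load the labels from the HuggingFace Hub
# (the configuration name "fin" and the split are assumptions)
qa_ds = load_dataset("qinchuanhui/UDA-QA", "fin", split="test")
print(qa_ds[0])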

To access the complete source document files, you can download them from the HuggingFace Repo. After downloading, please extract the content into dataset/src_doc_files. For illustrative purposes, some examples of source documents can be found in dataset/src_doc_file_example.

We also include extended information related to the question-answering tasks, encompassing reasoning explanations, human-validated factual evidence, and well-structured contexts. These resources are pivotal for an in-depth analysis of the modular benchmark. You can obtain the full set from the HuggingFace Repo and place it in dataset/extended_qa_info. A sampled subset for evaluation purposes is also available in dataset/extended_qa_info_bench.

⚙️ Benchmark and Experiments

Our UDA benchmark focuses on several pivotal items:

  • The effectiveness of various table-parsing approaches
  • The performance of different indexing and retrieval strategies, and the influence of precise retrieval on the LLM generation
  • The effectiveness of long-context LLMs compared to typical RAG pipelines
  • Comparison of different LLM-based Q&A strategies
  • End-to-end comparisons of various LLMs across diverse applications

Evaluation Metrics

To evaluate the quality of LLM-generated answers, we apply the widely accepted span-level F1-score to the PaperTab, PaperText, FetaTab, and NqText datasets, where ground-truth answers are in natural language and the source datasets also use this metric. We treat the prediction and the ground truth as bags of words and calculate the F1-score to measure their overlap (see basic_eval), as sketched below.
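The helper below is a minimal sketch of such a bag-of-words F1; the actual basic_eval implementation may apply additional normalization (e.g., punctuation or article removal).

from collections import Counter

def span_f1(prediction: str, ground_truth: str) -> float:
    """Bag-of-words overlap F1 between a prediction and a ground-truth answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(span_f1("the model uses attention", "attention is used in the model"))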

In financial analysis, the assessment becomes more intricate due to numerical values. For the TatHybrid dataset, we adopt the numeracy-focused F1-score, which accounts for the scale and sign of numerical values. In the FinHybrid dataset, where answers are always numerical or binary, we rely on the Exact-Match metric but allow a numerical tolerance of 1% to account for rounding discrepancies, as sketched below.
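As an illustration of the FinHybrid criterion, the hypothetical helper below accepts a prediction within 1% relative tolerance of the numerical ground truth and falls back to string equality for binary (yes/no) answers; the official evaluation script may handle scales and percent signs more carefully.

def finhybrid_match(prediction: str, ground_truth: str, tol: float = 0.01) -> bool:
    """Exact-Match with a 1% relative numerical tolerance."""
    try:
        pred = float(prediction.strip().rstrip("%"))
        gold = float(ground_truth.strip().rstrip("%"))
    except ValueError:
        # Binary (yes/no) answers fall back to plain string comparison
        return prediction.strip().lower() == ground_truth.strip().lower()
    if gold == 0.0:
        return pred == 0.0
    return abs(pred - gold) / abs(gold) <= tol

print(finhybrid_match("10.3%", "10.25%"))  # True: within the 1% tolerance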

Resources and Reproduction

For an in-depth exploration of our benchmark and experimental framework, please refer to the resources in the experiment directory. We have curated a collection of user-friendly Jupyter notebooks to facilitate the reproduction of our benchmarking procedures. Each directory or notebook corresponds to a specific benchmarking item, and the experimental results are stored in their respective directories.

Additionally, more implementation details are available in the uda directory, including data preprocessing, parsing strategies, retrieval procedures, LLM inference, and evaluation scripts.

🔖 Licenses


Our UDA dataset is distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) License.


🌟 Citation

Please kindly cite our paper if it helps your research:

@article{hui2024uda,
  title={UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis},
  author={Hui, Yulong and Lu, Yao and Zhang, Huanchen},
  journal={arXiv preprint arXiv:2406.15187},
  year={2024}
}

or

@inproceedings{hui2024uda,
  title={{UDA}: A Benchmark Suite for Retrieval Augmented Generation in Real-World Document Analysis},
  author={Yulong Hui and Yao Lu and Huanchen Zhang},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=MS4oxVfBHn}
}
