BRIGHT is the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. The queries are collected from diverse domains (StackExchange, LeetCode, and math competitions), all sourced from realistic human data. Experiments show that existing retrieval models perform poorly on BRIGHT: the highest score is only 21 nDCG@10. BRIGHT provides a testbed for future retrieval research in more realistic and challenging settings.
- Hongjin Su, HKU: Owner
- Howard Yen, Princeton: Owner
- Mengzhou Xia, Princeton: Owner
- Weijia Shi, UW: Contributor
- Niklas Muennighoff: Contributor
- Han-yu Wang, HKU: Contributor
- Haisu Liu, HKU: Contributor
- Quan Shi, Princeton: Contributor
- Zachary S. Siegel, Princeton: Contributor
- Michael Tang, Princeton: Contributor
- Ruoxi Sun, Google: Contributor
- Jinsung Yoon, Google: Contributor
- Sercan Ö. Arik, Google: Contributor
- Danqi Chen, Princeton: Contributor
- Tao Yu, HKU: Contributor
The University of Hong Kong
- Academic - Tech
- Publishing POC: N/A
- Affiliation: N/A
- Contact: N/A
- Mailing List: N/A
- Website: N/A
Hongjin Su, Howard Yen and Mengzhou Xia
- Dataset Owner(s): Hongjin Su, Howard Yen and Mengzhou Xia
- Affiliation: The University of Hong Kong and Princeton University
- Contact: [email protected], {hyen,mengzhou}@cs.princeton.edu
- Group Email: N/A
- Website: N/A
- Hongjin Su, PhD student, The University of Hong Kong
- Howard Yen, Master's student, Princeton University
- Mengzhou Xia, PhD student, Princeton University
- Princeton University
- Google Cloud AI Research
- Non-Sensitive Data about people
- Public data accessible to everyone
Category | Data |
---|---|
Size of Dataset | 607 MB |
Number of Instances | 1322 |
Number of Fields | 6 |
Domains | 11 |
Above: We collect 1322 diverse queries from realistic human data. Each example is annotated with the gold documents and the reasoning traces to find them.
The datasets are collected from StackExchange, TheoremQA, LeetCode and Math competition.
Dataset | # Q | # D | # D+ | Q.L. | D.L. |
---|---|---|---|---|---|
Biology | 103 | 57,364 | 3.6 | 83.6 | 115.2 |
Earth Science | 118 | 122,388 | 7.7 | 132.4 | 113.3 |
Economics | 103 | 50,221 | 8.0 | 120.2 | 181.5 |
Psychology | 101 | 52,841 | 7.3 | 118.2 | 149.6 |
Robotics | 101 | 62,198 | 5.5 | 120.6 | 818.9 |
Stack Overflow | 117 | 101,100 | 7.0 | 704.5 | 478.3 |
Sustainable Living | 108 | 60,732 | 5.6 | 108.0 | 148.5 |
LeetCode | 142 | 413,932 | 1.8 | 483.1 | 497.5 |
Pony | 112 | 7,894 | 22.5 | 98.3 | 102.6 |
AoPS | 111 | 188,177 | 4.7 | 89.0 | 250.5 |
TheoremQA | 206 | 188,177 | 3.2 | 117.1 | 250.5 |
Data statistics of BRIGHT. For each dataset, we show the number of queries (# Q) and documents (# D), the average number of positive documents (# D+) per example, and the average length of queries (Q.L.) and documents (D.L.), measured with the GPT-2 tokenizer.
- None
- None
Intentionally Collected Sensitive Data
- None
Unintentionally Collected Sensitive Data
- None
We select academia-oriented domains and remove all user information in StackExchange data.
- No Known Risks
- None
- N/A
Actively Maintained - No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
Current Version: 1.0
Last Updated: 06/2024
Release Date: 06/2024
We will mainly use GitHub issues and the Hugging Face community page to address any issues users encounter when using the BRIGHT data.
Versioning: If a new version is released, it will be numbered 1.1 or 2.0 depending on the update.
Updates: There may be updates in the future.
Errors: We will address the errors that users encounter.
Feedback: We welcome all feedback, whether by email, GitHub issue, or the Hugging Face community page, to make this benchmark better.
Version affected: N/A
Next data update: N/A
Next version: N/A
Next version update: N/A
Updates to Data: N/A
Updates to Dataset: N/A
- Text Data
Example of a typical data point:
{
"query": "Claim in article about why insects are attracted to light\nIn this article they are addressing the reason insects are attracted to light when they say\nHeat radiation as an attractive component is refuted by the effect of LED lighting, which supplies negligible infrared radiation yet still entraps vast numbers of insects.\nI don't see why attraction to LEDs shows they're not seeking heat. Could they for example be evolutionarily programmed to associate light with heat? So that even though they don't encounter heat near/on the LEDs they still \"expect\" to?",
"reasoning": "The question probes why insects are drawn to low-heat LED lights, challenging the idea that their attraction to light is heat-based. The document helps distinguish between heat attraction and evolved behaviors, shedding light on why insects might be attracted to LEDs despite their minimal heat.",
"id": "0",
"excluded_ids": [
"N/A"
],
"gold_ids_long": [
"insects_attracted_to_light/Proximate_and_ultimate_causation.txt",
"insects_attracted_to_light/Phototaxis.txt"
],
"gold_ids": [
"insects_attracted_to_light/Phototaxis_3.txt",
"insects_attracted_to_light/Proximate_and_ultimate_causation_0.txt",
"insects_attracted_to_light/Phototaxis_4.txt",
"insects_attracted_to_light/Proximate_and_ultimate_causation_1.txt",
"insects_attracted_to_light/Phototaxis_0.txt"
]
}
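In the example above, each entry in `gold_ids` looks like a passage-level ID formed by adding a numeric suffix to a `gold_ids_long` document ID. The following is a minimal sketch, assuming that naming scheme holds across the dataset (this card does not state it); `long_doc_id` is a hypothetical helper, not part of the BRIGHT code.

```python
import re

def long_doc_id(passage_id: str) -> str:
    """Map a passage ID to its parent document ID.

    Assumption: a passage ID is the document ID with `_<number>` inserted
    before the file extension, as in the example data point.
    """
    return re.sub(r"_\d+(\.[^./]+)$", r"\1", passage_id)

# Values copied from the example data point above.
example = {
    "gold_ids_long": [
        "insects_attracted_to_light/Proximate_and_ultimate_causation.txt",
        "insects_attracted_to_light/Phototaxis.txt",
    ],
    "gold_ids": [
        "insects_attracted_to_light/Phototaxis_3.txt",
        "insects_attracted_to_light/Proximate_and_ultimate_causation_0.txt",
        "insects_attracted_to_light/Phototaxis_4.txt",
        "insects_attracted_to_light/Proximate_and_ultimate_causation_1.txt",
        "insects_attracted_to_light/Phototaxis_0.txt",
    ],
}

# Collapse the passage-level labels to document-level labels.
derived_long_ids = {long_doc_id(pid) for pid in example["gold_ids"]}
print(derived_long_ids == set(example["gold_ids_long"]))
```

For this one example the derived set matches `gold_ids_long`; whether it does for every instance would need to be checked against the data.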
- Research
retrieval
- Existing retrieval benchmarks can be solved by lexical or semantic matching
- Many realistic scenarios cannot be solved by such simple matching
- To bridge this gap, we introduce BRIGHT to evaluate retrieval models in realistic settings where intensive reasoning is required
- Evaluate retrieval systems in realistic scenarios
Suitable Use Case: Evaluate retrieval models
Unsuitable Use Case: Train retrieval models
We investigate new directions of retrieval, where the relevance between queries and documents goes beyond lexical and semantic similarities.
Guidelines & Steps: Include citation when using BRIGHT
BibTeX:
@inproceedings{BRIGHT,
title={BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval},
author={Su, Hongjin and Yen, Howard and Xia, Mengzhou and Shi, Weijia and Muennighoff, Niklas and Wang, Han-yu and Liu, Haisu and Shi, Quan and Siegel, Zachary S and Tang, Michael and Sun, Ruoxi and Yoon, Jinsung and Arik, Sercan O and Chen, Danqi and Yu, Tao},
year={2024},
}
- External - Open Access
- Dataset Website URL: https://huggingface.co/datasets/xlangai/BRIGHT
- GitHub URL: https://github.com/xlang-ai/BRIGHT
N/A
- Direct download URL: https://huggingface.co/datasets/xlangai/BRIGHT
Code to download data:
from datasets import load_dataset
# Load the 'examples' configuration and select the 'biology' split
data = load_dataset('xlangai/BRIGHT', 'examples')['biology']
N/A
Free retention
We are not currently considering wiping out or deleting the data
- Data are collected by authors
Collection Type
Source: StackExchange, TheoremQA, LeetCode and Math competition.
Platform: N/A
Is this source considered sensitive or high-risk? No
Dates of Collection: 2024.03~2024.05
Primary modality of collection data:
- Text Data
Update Frequency for collected data:
- Static
Additional Links for this collection:
N/A
- Source: StackExchange is a popular question-answering platform where users ask questions and receive answers from the community. One example is:
How good is it to reuse water from plant pots?
I'm living in an apartment, and after I water my plants the water goes to plates below the pots. The pots are in a metallic structure above the plates, so I can take the plates to reuse the water (throwing it at the plants again).
This reuse seems beneficial, because I think I can get rid of mosquitoes that would reproduce in the stagnated water. And also some nutrients of the soil (as well as earthworms) can return to the vase.
Is there some negative points in doing that?
EDIT: I think I must add that I'm at 3 degrees of latitude, in a hot and humid tropical rainforest, where the precipitation used to be around 1700 mm. So I use lots of water everyday, more than once a day sometimes, so the reused water is a small fraction of the water used.
Tags: water, reuse, plants
i think you mean "pots" if they have dirt in them. "vases" hold water and cur flowers. –
Kate Gregory
Mar 17, 2016 at 14:53
Yes, @KateGregory, you're absolutely right. That's because in Portuguese we call them "vasos" :) –
Rodrigo
Mar 17, 2016 at 15:25
2 Answers
7
In my experience plants suffer in the long term from accumulation of salts in the soil, so fresh water would be better than reusing the water. Even better would be to get hold of fresh rain water (tricky in an apartment though, unless perhaps you have a balcony that gets rained on) for watering them, as that won't contain the salts that tap water does.
More detail here.
- Source: LeetCode is a popular coding platform for programmers to practice. One example is:
5. Longest Palindromic Substring
Medium
Given a string s, return the longest palindromic substring in s.
Example 1:
Input: s = "babad"
Output: "bab"
Explanation: "aba" is also a valid answer.
Example 2:
Input: s = "cbbd"
Output: "bb"
Constraints:
1 <= s.length <= 1000
s consist of only digits and English letters.
- Source: AoPS contains math competition questions. One example is:
Problem 1
What is the ones digit of \[222,222-22,222-2,222-222-22-2?\]
A. 0
B. 2
C. 4
D. 6
E. 8
Solution 1
We can rewrite the expression as\[222,222-(22,222+2,222+222+22+2).\]
We note that the units digit of the addition is $0$ because all the units digits of the five numbers are $2$ and $5*2=10$, which has a units digit of $0$.
Now, we have something with a units digit of $0$ subtracted from $222,222$. The units digit of this expression is obviously $2$, and we get $\boxed{B}$ as our answer.
Static: Data was collected once from single or multiple sources.
Source
Included Fields
Data fields that were collected and are included in the dataset.
Field Name | Description |
---|---|
Post | The content of the post in which a user asks a question |
Excluded Fields
Data fields that were collected but are excluded from the dataset.
Field Name | Description |
---|---|
Answer | Community answers |
Votes | The votes for the post or its answers |
All data collection and processing is done manually or with the help of Python scripts.
- StackExchange: We select posts whose answers contain links and are either accepted by the user or have more than 5 votes.
- Math and Code: We select questions that require theorems or syntax documentation.
- We include data from diverse domains including psychology, robotics, etc.
- We exclude examples that do not require reasoning in retrieval or do not use theorems.
- StackExchange: We use the post and linked web pages in answers
- Math & Code: We use the questions and tags on the websites.
- Using this method, we collect retrieval instances that require intensive reasoning to retrieve documents
- The judgement of relevance can be subjective, leading to non-perfect human performance.
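The StackExchange selection rule above (keep a post only when an answer that is accepted or has more than 5 votes contains a link) can be sketched as a simple filter. The `Answer` record, its field names, and the URL pattern are illustrative assumptions, not the authors' scripts.

```python
import re
from dataclasses import dataclass
from typing import List

# Simplified URL matcher; an assumption made for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")

@dataclass
class Answer:
    body: str        # answer text
    votes: int       # community votes on the answer
    accepted: bool   # True if the asker accepted this answer

def qualifying_links(answers: List[Answer], min_votes: int = 5) -> List[str]:
    """Collect links from answers that are accepted or have more than `min_votes` votes."""
    links: List[str] = []
    for answer in answers:
        if answer.accepted or answer.votes > min_votes:
            links.extend(URL_PATTERN.findall(answer.body))
    return links

def keep_post(answers: List[Answer]) -> bool:
    """A post is kept only if at least one qualifying answer contains a link."""
    return bool(qualifying_links(answers))

answers = [
    Answer(body="Salts build up over time, see https://example.org/salts", votes=7, accepted=False),
    Answer(body="Just use tap water", votes=12, accepted=True),
    Answer(body="Read https://example.org/ignored", votes=2, accepted=False),
]
print(keep_post(answers), qualifying_links(answers))
```

Each surviving link is only a candidate document; per the annotation instructions, it is still discarded if the annotator cannot explain why it is relevant.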
- Release date: 06/2024
- Link to dataset: BRIGHT 1.0: https://huggingface.co/datasets/xlangai/BRIGHT
- Status: Actively Maintained
- Size of Dataset: 607 MB
- Number of Instances: 1322
None
- Daily
- We have not updated the datasets since release.
N/A
None
Intentionally Collected Attributes
We only use human-labeled links or tags to find examples or documents, but do not directly include human labels.
Unintentionally Collected Attributes
None
We follow links or tags to find relevant documents or examples
None
We follow links or tags to find relevant documents or examples
N/A
[query, gold_ids, gold_ids_long]
Description: The documents corresponding to gold_ids or gold_ids_long are relevant to queries.
Impact on dataset use: It helps evaluate retrieval models in realistic settings.
Human Attribute
None
- Safe to use with other data
The data in the BRIGHT benchmark focus on academia-oriented domains and should be safe to use.
Evaluate retrieval systems on BRIGHT.
None
The judgement of relevance between queries and documents can be subjective, so marginal differences between model scores can be ignored, while significant differences give a good signal of model capabilities.
- Safe to form and/or sample
- Cluster Sampling
- Haphazard Sampling
- Multi-stage sampling
- Random Sampling
- Retrospective Sampling
- Systematic Sampling
- Weighted Sampling
- Unsampled
Although sampling is possible, we recommend against it because BRIGHT is not very large.
N/A
N/A
- Evaluation
The intensive reasoning required to retrieve documents.
Usage Guidelines: Follow the tutorial to evaluate retrieval systems.
Approval Steps: Steps are here.
Reviewer: The authors review the dataset for publication.
The BRIGHT benchmark is intended for evaluation, i.e., all data are in the test set.
query, gold_ids, gold_ids_long
Description: The documents corresponding to gold_ids or gold_ids_long are relevant to queries.
Impact on dataset use: It can help evaluate retrieval systems in more realistic scenarios.
Risks from correlation: The judgement of correlation is made by real users, and can be subjective.
Dataset | # Q | # D | # D+ | Q.L. | D.L. |
---|---|---|---|---|---|
Biology | 103 | 57,364 | 3.6 | 83.6 | 115.2 |
Earth Science | 118 | 122,388 | 7.7 | 132.4 | 113.3 |
Economics | 103 | 50,221 | 8.0 | 120.2 | 181.5 |
Psychology | 101 | 52,841 | 7.3 | 118.2 | 149.6 |
Robotics | 101 | 62,198 | 5.5 | 120.6 | 818.9 |
Stack Overflow | 117 | 101,100 | 7.0 | 704.5 | 478.3 |
Sustainable Living | 108 | 60,732 | 5.6 | 108.0 | 148.5 |
LeetCode | 142 | 413,932 | 1.8 | 483.1 | 497.5 |
Pony | 112 | 7,894 | 22.5 | 98.3 | 102.6 |
AoPS | 111 | 188,177 | 4.7 | 89.0 | 250.5 |
TheoremQA | 206 | 188,177 | 3.2 | 117.1 | 250.5 |
Data statistics of BRIGHT. For each dataset, we show the number of queries (# Q) and documents (# D), the average number of positive documents (# D+) per example, and the average length of queries (Q.L.) and documents (D.L.), measured with the GPT-2 tokenizer.
- Data Aggregation
Transformation Type
Field Name | Source & Target |
---|---|
gold_ids | links: gold_ids |
gold_ids_long | links: gold_ids_long |
Transformation Type
Method: We follow the links or tags to find relevant documents.
Platforms, tools, or libraries: We do not leverage other platforms or tools in transformation
Transformation Results: We collect 1322 examples that can be used for evaluating retrievers.
We find documents for all instances following the procedure above
The risk is that the relevance judgement is subjective.
We require human annotators to write down the judgement for relevance and reasoning steps.
None
We select high-quality data instances from websites, so there is no further cleaning.
We follow links or tags in the websites.
We do not use incorrect or mismatched values.
N/A
The data and notes written down by annotators are reviewed
None
We select data from websites, so no anomaly or outlier is excluded.
N/A
Platforms, tools, or libraries N/A
N/A
N/A
The data and notes written by annotators are reviewed.
N/A
N/A
N/A
Platforms, tools, or libraries N/A
N/A
N/A
N/A
N/A
We use StackExchange, LeetCode, TheoremQA and math competitions
They are independent splits, so no join is performed
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
N/A
- Human Annotations (Expert)
- Human Annotations (Non-Expert)
Annotation Type | Number |
---|---|
Total number of annotations | 1322 |
Description: We follow links/tags to find relevant documents.
Link: N/A
Platforms, tools, or libraries: N/A
Dataset | # Q |
---|---|
Biology | 103 |
Earth Science | 118 |
Economics | 103 |
Psychology | 101 |
Robotics | 101 |
Stack Overflow | 117 |
Sustainable Living | 108 |
LeetCode | 142 |
Pony | 112 |
AoPS | 111 |
TheoremQA | 206 |
Distribution of data splits in each domain
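As a quick consistency check, the per-domain counts in the table above can be summed and compared with the totals reported elsewhere in this card (1322 instances across 11 domains). The dictionary below simply transcribes the table.

```python
# Query counts per domain, transcribed from the table above.
queries_per_domain = {
    "Biology": 103,
    "Earth Science": 118,
    "Economics": 103,
    "Psychology": 101,
    "Robotics": 101,
    "Stack Overflow": 117,
    "Sustainable Living": 108,
    "LeetCode": 142,
    "Pony": 112,
    "AoPS": 111,
    "TheoremQA": 206,
}

num_domains = len(queries_per_domain)
total_queries = sum(queries_per_domain.values())
print(num_domains, total_queries)  # 11 1322
```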
(Task Type)
Task description & instructions: In this section, we describe the instructions for annotators to collect data in BRIGHT.
StackExchange
- Browse posts from the newest to the oldest.
- Discard posts without an answer that is accepted by the user or has more than 5 votes.
- Discard answers without URL links.
- For each link in the answer, write down the answers to: (1) why the document and the post are relevant; (2) what reasoning is required to understand the relevance between the post and the document. If these questions cannot be answered, discard the link.
- Use LLMs (e.g., ChatGPT, Claude, etc.) to generate post keywords, or use the post title, to search Google for web pages with large keyword or semantic overlap. Search for at most 5 negative web pages per query.
- Split every web page into small passages either by two newline symbols, "#" in Markdown files, or fixed-length tokens.
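The three passage-splitting strategies in the last step can be sketched as follows. This is an illustration rather than the authors' script: the function names are hypothetical, the chunk size of 128 is an arbitrary choice, and whitespace-separated words stand in for tokenizer tokens.

```python
import re
from typing import List

def split_on_blank_lines(text: str) -> List[str]:
    """Split at two (or more) consecutive newlines, dropping empty passages."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_on_markdown_headings(text: str) -> List[str]:
    """Start a new passage at every line that begins with '#'."""
    parts = re.split(r"^(?=#)", text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

def split_fixed_length(text: str, max_tokens: int = 128) -> List[str]:
    """Split into chunks of at most `max_tokens` whitespace-separated words.

    Assumption: words approximate tokens; a real tokenizer could be swapped in.
    """
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

page = "# Phototaxis\nMovement toward light.\n\nSome insects show it.\n# History\nEarly studies."
print(split_on_blank_lines(page))
print(split_on_markdown_headings(page))
print(split_fixed_length(page, max_tokens=4))
```

Which strategy applies to a given page presumably depends on its format, for example heading-based splitting for Markdown files and fixed-length chunks when a page has no usable structure.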
TheoremQA
In TheoremQA, the main task for the annotator is to check if the GPT-4 rewritten questions are valid. The specific instructions are as follows:
- Read the rewritten question and determine if it is solvable.
- If it is solvable, read the original question and solution, and determine if the rewritten question is consistent with the original question. That is, the same reasoning steps and the final answer should hold.
- If it is also consistent, mark the question as valid, and make any minor edits to the problem statement (e.g., to improve grammar or fluency) as you see fit.
- If it is not solvable or not consistent, read the original question and solution, and correct the rewritten question if possible. If not, then discard the problem.
AoPS
In AoPS, annotators are tasked with finding questions from the AoPS Wiki and recording the problems:
- Browse through the AoPS Wiki and find topic/category pages (example 1, example 2).
- Look through each page and find pages on specific theorems or techniques that can be used to solve problems. The page should link to at least two competition problems (example 1, example 2).
- Record the links of both the theorem/technique as well as the problem pages. The annotators are assigned a category to look for theorems in to avoid overlaps, and the categories are {algebra, geometry, calculus, probability, number theory, other}. After all links are collected, we use a web scraper to collect the problem statement and solutions, and we manually check the quality of the scraped data.
LeetCode
In LeetCode, annotators determine whether a question is grounded in real-world concepts. We give the annotators instructions similar to those given to GPT-4:
- Read the problem statement carefully.
- Categorize the question into one of three categories:
  - 0: The question is not grounded in any real-world concepts. The description only uses coding-specific terms, such as "linked list", "binary search", "palindrome", "sorting", etc.
  - 1: The question is not grounded in any real-world concepts, or is grounded only in real-world concepts that are commonly used in the context of coding, such as needle in a haystack, strings/words, or a spiral matrix.
  - 2: The question is grounded in real-world concepts that are not commonly used in the context of coding, such as building height, planting trees, or games. It may still use some code-specific terms to specify the data structure involved.
Methods used: Basically we follow links/tags to find documents
Inter-rater adjudication policy: Reviewers flag cases where the pairing of a query and a document is not convincing.
Golden questions: N/A
(Annotation Type)
Task type: Annotate StackExchange data
Number of unique annotators: 3
Expertise of annotators: Both experts and non-experts
Description of annotators: PhD students in computer science, biology, environment, etc.
Language distribution of annotators: They all speak fluent English
Geographic distribution of annotators: They come from Asia
Summary of annotation instructions: Follow links to find documents with filtering
Summary of gold questions: N/A
Annotation platforms: Google sheets
Additional Notes: N/A
(Task Type)
Task description: Annotate math and code data
Task instructions: Follow tags to find similar problems/questions
Methods used: Follow tags annotated by websites
Inter-rater adjudication policy: The data is reviewed
Golden questions: N/A
Additional notes: N/A
(Annotation Type)
- 100% English
(Annotation Type)
- Asia [50 %]
- US [50 %]
(Annotation Type)
- Male [80 %]
- Female [20 %]
- Code/cross-reference Validation
(Validation Type)
Number of Data Points Validated: 1322
Fields Validated
All fields in data are validated
(Validation Type)
Method: We require annotators to write down the logic used to determine the relevance between queries and documents. The reviewers check not only the data, but also the annotators' notes.
Validation Results:
Over 90% of annotations pass peer review, and we discard the rest.
(Validation Type)
- Unique validators: 8
- Number of examples per validator: 300
- Average cost/task/validator: N/A
- Training provided: N
- Expertise required: N
(Validation Type)
Validator description: Validators are domain experts, e.g., PhD students from the corresponding domains.
Training provided: We do not provide training, but verify that the annotators and reviewers are qualified
Validator selection criteria: We have a test containing verified examples. An annotator is qualified if they can work out these examples.
Training provided: N/A
(Validation Type)
- English [100 %]
(Validation Type)
- Asia [60 %]
- US [40 %]
(Validation Type)
- Male [70 %]
- Female [30 %]
- Unsampled
N/A
N/A
Retrieval evaluation
SFR-Embedding-Mistral 17.8
Model Card: https://huggingface.co/Salesforce/SFR-Embedding-Mistral/tree/main
Evaluation Results
- nDCG@10: 17.8
We write python scripts to run retrieval models on BRIGHT.
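For reference, nDCG@10 with binary relevance can be computed as in the sketch below, which treats a document as relevant exactly when its ID appears in the query's `gold_ids`. This is a minimal illustration assuming unique IDs in the ranking; the official evaluation scripts may differ in details, and the scores in this card appear to be reported on a 0-100 scale.

```python
import math
from typing import Iterable, List

def ndcg_at_k(ranked_ids: List[str], gold_ids: Iterable[str], k: int = 10) -> float:
    """nDCG@k with binary relevance.

    ranked_ids: document IDs returned by the retriever, best first (assumed unique).
    gold_ids:   IDs of the documents annotated as relevant to the query.
    """
    gold = set(gold_ids)
    if not gold:
        return 0.0
    # Discounted gain of each relevant document found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in gold
    )
    # Ideal ranking: all relevant documents placed first.
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg

# Hypothetical ranking for one query; the IDs are made up for illustration.
ranking = ["doc_7", "doc_2", "doc_9", "doc_4"]
score = ndcg_at_k(ranking, gold_ids=["doc_2", "doc_4"])
print(round(100 * score, 1))
```

A benchmark-level score would then be the mean of the per-query values within each dataset.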
SFR-Embedding-Mistral
Model Card: https://huggingface.co/Salesforce/SFR-Embedding-Mistral/tree/main
Model Description: A best-in-class retrieval model trained from Mistral-7B
- Model Size: 7.11B
- Model Weights: 7.11B
- Model Layers: 32
- Latency: 2s
Claude-3 + BM25
Expected Performance: surpasses results obtained without using LLMs
Known Caveats: The inference of LLMs can be expensive
Definition: The name of this benchmark
Source: https://huggingface.co/datasets/xlangai/BRIGHT
Interpretation: N/A
We believe that BRIGHT paves the way for future research on retrieval systems in more realistic and challenging settings.