C-STS

This repository contains the dataset and code for the paper C-STS: Conditional Semantic Textual Similarity. [ArXiv]

Data

To avoid the intentional/unintentional scraping of the C-STS dataset for pre-training LLMs, which could cause training data contamination and impact their evaluation, we adopt the following approach for our dataset release.

The dataset for C-STS is stored in an encrypted file named csts.tar.enc. To access the dataset, follow these steps:

1. Request Access: Submit a request to obtain the decryption password by clicking here. You will receive an email response with the password immediately.
2. Decrypt the Dataset: Once you have received the password via email, you can decrypt the csts.tar.enc file using the provided extract.sh script. Follow the instructions below:
- Open a terminal and navigate to the data directory.
- Run the following command, replacing <password> with the decryption password obtained via email:
```
bash extract.sh csts.tar.enc <password>
```
If the password is correct, this step generates three files containing the unencrypted dataset splits: csts_train.csv, csts_validation.csv, and csts_test.csv.

You can load the data with the Hugging Face datasets library as follows:

from datasets import load_dataset

dataset = load_dataset(
    'csv',
    data_files={
        'train': 'data/csts_train.csv',
        'validation': 'data/csts_validation.csv',
        'test': 'data/csts_test.csv',
    },
)
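
Before loading the splits, it can help to confirm that extraction produced all three files and to see which columns they contain. The following is a minimal sketch using only the standard library. The helper name `summarize_splits` is hypothetical and not part of this repository, and the code makes no assumption about the column names.

```python
import csv
from pathlib import Path

# File names as listed in the extraction step above.
SPLIT_FILES = {
    "train": "csts_train.csv",
    "validation": "csts_validation.csv",
    "test": "csts_test.csv",
}


def summarize_splits(data_dir="data"):
    """Return {split: (column_names, row_count)} for each split file.

    Hypothetical helper: raises FileNotFoundError if a split is missing,
    which usually means the decryption step has not been run yet.
    """
    summary = {}
    for split, filename in SPLIT_FILES.items():
        path = Path(data_dir) / filename
        if not path.is_file():
            raise FileNotFoundError(f"Missing {path}; run extract.sh first.")
        with path.open(newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, [])           # first row holds the column names
            row_count = sum(1 for _ in reader)  # remaining rows are examples
        summary[split] = (header, row_count)
    return summary
```

Calling `summarize_splits("data")` from the repository root returns the header and the number of examples for each split, which is a quick way to check that decryption succeeded.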

Important: By using this dataset, you agree to not publicly share its unencrypted contents or decryption password.

Code

We provide the basic training scripts and utilities for finetuning and evaluating the models in the paper. The code is adapted from the HuggingFace Transformers library. Refer to the documentation for more details.

Fine-tuning

The current code supports finetuning any encoder-only model, using the cross_encoder, bi_encoder, or tri_encoder settings described in the paper. You can finetune the models described in the paper using the run_sts.sh script. For example, to finetune the princeton-nlp/sup-simcse-roberta-base model on the C-STS dataset, run the following command:

MODEL=princeton-nlp/sup-simcse-roberta-base \
ENCODER_TYPE=bi_encoder \
LR=1e-5 \
WD=0.1 \
TRANSFORM=False \
OBJECTIVE=mse \
OUTPUT_DIR=output \
TRAIN_FILE=data/csts_train.csv \
EVAL_FILE=data/csts_validation.csv \
TEST_FILE=data/csts_test.csv \
bash run_sts.sh

See run_sts.sh for a full description of the available options and default values.
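
To compare the three encoder settings, the same command can be repeated with a different ENCODER_TYPE each time. The sketch below builds the environment for each run from Python. The variable names and values are copied from the example above, while the helper names and the per-setting output directory layout are assumptions made for illustration. Reusing one set of hyperparameters across all settings is also a simplification and may not match the configurations used in the paper.

```python
import os
import subprocess

ENCODER_TYPES = ("cross_encoder", "bi_encoder", "tri_encoder")


def build_env(encoder_type, model="princeton-nlp/sup-simcse-roberta-base",
              output_root="output"):
    """Build the environment for one run_sts.sh invocation.

    Variable names mirror the example command; writing each setting to
    its own subdirectory of `output_root` is an illustrative assumption.
    """
    if encoder_type not in ENCODER_TYPES:
        raise ValueError(f"Unknown encoder type: {encoder_type}")
    env = dict(os.environ)  # keep PATH and other inherited variables
    env.update({
        "MODEL": model,
        "ENCODER_TYPE": encoder_type,
        "LR": "1e-5",
        "WD": "0.1",
        "TRANSFORM": "False",
        "OBJECTIVE": "mse",
        "OUTPUT_DIR": os.path.join(output_root, encoder_type),
        "TRAIN_FILE": "data/csts_train.csv",
        "EVAL_FILE": "data/csts_validation.csv",
        "TEST_FILE": "data/csts_test.csv",
    })
    return env


def run_all():
    """Fine-tune once per encoder setting; stops at the first failure."""
    for encoder_type in ENCODER_TYPES:
        subprocess.run(["bash", "run_sts.sh"],
                       env=build_env(encoder_type), check=True)
```

Calling `run_all()` from the repository root launches the three runs one after another.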

Few-shot Evaluation

The script run_sts_fewshot.py can be used to evaluate large language models in a few-shot setting, with or without instructions. For example, to evaluate the google/flan-t5-xxl model on the C-STS dataset, run the following command:

python run_sts_fewshot.py \
--model_name_or_path google/flan-t5-xxl \
--k_shot 2 \
--prompt_name long \
--train_file data/csts_train.csv \
--validation_file data/csts_validation.csv \
--test_file data/csts_test.csv \
--output_dir output/flan-t5-xxl/k2_long \
--dtype tf32 \
--batch_size 4

To accommodate large models, run_sts_fewshot.py uses all visible GPUs and loads the model in model-parallel mode. For smaller models, set CUDA_VISIBLE_DEVICES to the desired GPU IDs.
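
One way to restrict the visible GPUs is to set CUDA_VISIBLE_DEVICES only for the evaluation process rather than for the whole shell. The following sketch does this from Python. The helper names are hypothetical, only flags that appear in the example command are used, and `--dtype` is omitted so that the script's default applies.

```python
import os
import subprocess
import sys


def fewshot_command(model, k_shot, prompt_name, output_dir, batch_size=4):
    """Return the argv list for run_sts_fewshot.py, using flags shown above."""
    return [
        sys.executable, "run_sts_fewshot.py",
        "--model_name_or_path", model,
        "--k_shot", str(k_shot),
        "--prompt_name", prompt_name,
        "--train_file", "data/csts_train.csv",
        "--validation_file", "data/csts_validation.csv",
        "--test_file", "data/csts_test.csv",
        "--output_dir", output_dir,
        "--batch_size", str(batch_size),
    ]


def gpu_env(gpu_ids):
    """Copy the current environment and expose only the given GPU IDs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_on_gpus(gpu_ids, **kwargs):
    """Launch the evaluation so that it sees only `gpu_ids`."""
    subprocess.run(fewshot_command(**kwargs), env=gpu_env(gpu_ids), check=True)
```

For example, `run_on_gpus([0], model="google/flan-t5-small", k_shot=2, prompt_name="long", output_dir="output/flan-t5-small/k2_long")` would run a small model on GPU 0 only. The model choice and output path here are illustrative and not taken from the paper.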

Run python run_sts_fewshot.py --help for a full description of additional options and default values.

Submitting Test Results

You can obtain scores for your model on the test set by submitting your predictions using the make_test_submission.py script as follows:

python make_test_submission.py <your_email> /path/to/your/predictions.json

This script expects the test predictions file to be in the format generated automatically by the scripts above; i.e.

{
  "0": 1.0,
  "1": 0.0,
  ...
  "4731": 0.5
}

After submission, your results will be emailed to the address you provided, with the relevant filename in the subject line.
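
If your predictions come from code other than the provided scripts, they need to be written in the format described above. The following is a minimal sketch of a writer. The function name `write_predictions` is hypothetical, and the sketch assumes that the keys are zero-based row indices of the test set in file order, as the example suggests.

```python
import json
import math


def write_predictions(scores, path):
    """Write scores as {"0": s0, "1": s1, ...}, the layout shown above.

    `scores` is any sequence of numbers ordered like the test set rows
    (an assumption; check it against the output of the provided scripts).
    """
    predictions = {}
    for index, score in enumerate(scores):
        value = float(score)
        if not math.isfinite(value):
            # NaN or infinity would not be valid JSON numbers.
            raise ValueError(f"Prediction {index} is not finite: {score!r}")
        predictions[str(index)] = value
    with open(path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, indent=2)
    return predictions
```

The resulting file can then be passed to make_test_submission.py as the predictions path.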

Citation

@misc{deshpande2023csts,
      title={C-STS: Conditional Semantic Textual Similarity},
      author={Ameet Deshpande and Carlos E. Jimenez and Howard Chen and Vishvak Murahari and Victoria Graf and Tanmay Rajpurohit and Ashwin Kalyan and Danqi Chen and Karthik Narasimhan},
      year={2023},
      eprint={2305.15093},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
