This repository accompanies our paper 'Utilizing Weak Supervision to Create S3D: A Sarcasm Annotated Dataset', submitted to the EMNLP NLP+CSS 2022 workshop. It includes our SAD dataset along with versions 1 and 2 of our S3D dataset. Both Twitter datasets can be used to train sarcasm detection models.
SAD - We provide the Tweet IDs and sarcasm labels of 2,340 manually annotated tweets, which were collected by monitoring the #sarcasm hashtag. Available on HuggingFace
S3D-v1 - We provide the Tweet IDs of 100,000 tweets along with their labels, which were predicted by a BERTweet model fine-tuned on our 'Combined dataset', a corpus of over a million tweets and Reddit comments labelled for sarcasm in previous work. Available on HuggingFace
S3D-v2 - We provide the Tweet IDs of 100,000 tweets along with their labels, which were predicted by an ensemble of our three 'best' fine-tuned sarcasm detection models. Available on HuggingFace
We provide a notebook that shows the labelling process for our datasets. You can reproduce the experiments that created S3D-v1 and S3D-v2 with our Python notebooks, which use HuggingFace to load the relevant models and label the data.
Models | Fine-tuned Models | Description |
---|---|---|
BERTweet | BERTweet-base-finetuned-SARC-combined-DS | BERTweet model fine-tuned on our combined dataset |
BERTweet | BERTweet-base-finetuned-SARC-DS | BERTweet model fine-tuned on the SARC dataset |
RoBERTa-large | roberta-large-finetuned-SARC-combined-DS | RoBERTa-large model fine-tuned on our combined dataset |
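
To illustrate how an ensemble of three classifiers can produce a single label per tweet, here is a minimal sketch. It assumes majority voting and a binary encoding (1 = sarcastic, 0 = not sarcastic); this README does not state how the S3D-v2 ensemble combines its models, so the notebooks remain the authoritative reference. The names `majority_vote` and `ensemble_label` and the three toy classifiers are hypothetical stand-ins, not the fine-tuned models listed above.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical type: any function mapping tweet text to a label.
# The encoding (1 = sarcastic, 0 = not sarcastic) is an assumption.
Classifier = Callable[[str], int]


def majority_vote(votes: Sequence[int]) -> int:
    """Return the most common label among the votes."""
    # With an odd number of binary voters a tie cannot occur.
    return Counter(votes).most_common(1)[0][0]


def ensemble_label(texts: Sequence[str], classifiers: Sequence[Classifier]) -> List[int]:
    """Label each text with the majority vote of all classifiers."""
    return [majority_vote([clf(text) for clf in classifiers]) for text in texts]


# Toy rule-based stand-ins for three fine-tuned models (illustration only).
def contains_hashtag(text: str) -> int:
    return int("#sarcasm" in text.lower())


def contains_oh_great(text: str) -> int:
    return int("oh great" in text.lower())


def always_negative(text: str) -> int:
    return 0


tweets = ["Oh great, another Monday #sarcasm", "Lovely weather today"]
labels = ensemble_label(tweets, [contains_hashtag, contains_oh_great, always_negative])
print(labels)  # [1, 0]
```

In practice, each toy function would be replaced by a wrapper around a loaded model that maps a tweet's text to its predicted label; the voting logic stays the same.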