The first unsupervised multimodal clustering method for semantics discovery in multimodal utterances.
This repository contains the official PyTorch implementation of the research paper Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances (Accepted by ACL 2024 Main Conference, Long Paper).
We use Anaconda to create the Python environment and install the required libraries:
conda create --name umc python=3.8
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
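After installation, you can run a quick sanity check to confirm that the pinned PyTorch build is in place and can see the GPU. This is a minimal sketch assuming a CUDA 11.1 driver setup; adjust to your environment.

```python
# check_env.py -- minimal sanity check for the pinned PyTorch build (assumes CUDA 11.1 drivers)
import torch
import torchvision
import torchaudio

print(f"torch      : {torch.__version__}")        # expected: 1.8.1+cu111
print(f"torchvision: {torchvision.__version__}")  # expected: 0.9.1+cu111
print(f"torchaudio : {torchaudio.__version__}")   # expected: 0.8.1
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```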
- MIntRec: The first multimodal intent recognition dataset (Paper, Resource)
- MELD-DA: A multimodal multi-party dataset for emotion recognition in conversation (Paper, Resource)
- IEMOCAP-DA: The Interactive Emotional Dyadic Motion Capture database (Paper, Resource)
For MELD-DA and IEMOCAP-DA, we use the well-annotated dialogue act (DA) labels from the EMOTyDA dataset (Paper, Resource).
You can download the multimodal features from Baidu Cloud (code: swqe) or Google Drive.
An example of the data structure of one dataset is as follows:
Datasets/
├── MIntRec/
│   ├── train.tsv
│   ├── dev.tsv
│   ├── test.tsv
│   ├── video_data/
│   │   └── swin_feats.pkl
│   └── audio_data/
│       └── wavlm_feats.pkl
├── MELD-DA/
│   ├── ...
└── IEMOCAP-DA/
    ├── ...
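The video and audio features are stored as pickled files. The exact layout inside each pickle is not documented here, so the loading sketch below assumes each file maps utterance identifiers to feature arrays; treat the key/value structure as an assumption and inspect the loaded object first.

```python
import pickle

# Paths follow the directory layout above; adjust if your Datasets/ root differs.
VIDEO_FEATS = "Datasets/MIntRec/video_data/swin_feats.pkl"
AUDIO_FEATS = "Datasets/MIntRec/audio_data/wavlm_feats.pkl"

def load_feats(path):
    """Load a pickled feature file. The assumption that it maps utterance
    IDs to feature arrays is illustrative, not documented."""
    with open(path, "rb") as f:
        return pickle.load(f)

video_feats = load_feats(VIDEO_FEATS)
audio_feats = load_feats(AUDIO_FEATS)
print(type(video_feats))  # inspect the container type and keys before use
```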
The pre-trained BERT model can be downloaded from Baidu Cloud (code: v8tk).
In this work, we propose UMC, a novel unsupervised multimodal clustering method. It introduces (1) a unique approach to constructing augmentation views for multimodal data, (2) an innovative strategy to dynamically select high-quality samples as guidance for representation learning, and (3) a combined learning approach that uses both high- and low-quality samples to learn clustering-friendly representations. The model architecture is as follows:
The high-quality sampling strategy is illustrated as follows:
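For intuition only, the sketch below illustrates the general idea of keeping samples close to their cluster centroids as reliable guidance, using scikit-learn KMeans. This is not the UMC selection criterion (UMC chooses the retention threshold dynamically, as described in the paper); the fixed `keep_ratio` and all names here are placeholders.

```python
# Conceptual illustration of high-quality sample selection (NOT the exact UMC criterion).
import numpy as np
from sklearn.cluster import KMeans

def select_high_quality(features, n_clusters, keep_ratio=0.2):
    """Cluster the representations and, per cluster, keep the `keep_ratio`
    fraction of samples closest to the centroid as 'high-quality' guidance.
    `keep_ratio` is a placeholder; UMC selects this threshold dynamically."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    dists = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
    high_quality_idx = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        k = max(1, int(len(idx) * keep_ratio))
        high_quality_idx.extend(idx[np.argsort(dists[idx])[:k]])
    return np.array(high_quality_idx), km.labels_

# Example with random features standing in for fused multimodal embeddings.
feats = np.random.randn(200, 768).astype(np.float32)
hq_idx, labels = select_high_quality(feats, n_clusters=20)
print(hq_idx.shape)
```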
Clone the repository:
git clone [email protected]:thuiar/UMC.git
Run the experiments:
sh examples/run_umc.sh
The following example demonstrates a complete quickstart process using the MIntRec dataset.
Step 1: Download the dataset and the pre-trained BERT model from the links above, and place them in the UMC/ directory.
Step 2: Modify the parameters in configs/umc_MIntRec.py to include the pre-training process:
'pretrain': [True],
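For orientation only, a hedged sketch of how this entry might sit inside configs/umc_MIntRec.py is shown below. Apart from the 'pretrain' key above, the dictionary name and the list-of-candidate-values convention are assumptions, not the file's actual contents.

```python
# Illustrative excerpt only -- not the actual contents of configs/umc_MIntRec.py.
# Apart from 'pretrain', the dictionary name and other keys are placeholders.
hyper_parameters = {
    'pretrain': [True],  # run the pre-training stage before clustering
    # ... other dataset-specific hyperparameters ...
}
```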
Step 3: You can modify the parameters in the examples/run_umc.sh file to suit your needs as follows:
--data_path 'Datasets' \ # Change dataset address/path
--train \ # Include the training process
--save_model \ # Specify to save the model
--output_path "outputs" # Store both pre-trained and final models
Step 4: Run the experiments:
sh examples/run_umc.sh
| Datasets | Methods | NMI | ARI | ACC | FMI | Avg. |
|---|---|---|---|---|---|---|
| MIntRec | SCCL | 45.33 | 14.60 | 36.86 | 24.89 | 30.42 |
| | CC | 47.45 | 22.04 | 41.57 | 26.91 | 34.49 |
| | USNID | 47.91 | 21.52 | 40.32 | 26.58 | 34.08 |
| | MCN | 18.24 | 1.70 | 16.76 | 10.32 | 11.76 |
| | UMC (Text) | 47.15 | 22.05 | 42.46 | 26.93 | 34.65 |
| | UMC | 49.26 | 24.67 | 43.73 | 29.39 | 36.76 |
| MELD-DA | SCCL | 22.42 | 14.48 | 32.09 | 27.51 | 24.13 |
| | CC | 23.03 | 13.53 | 25.13 | 24.86 | 21.64 |
| | USNID | 20.80 | 12.16 | 24.07 | 23.28 | 20.08 |
| | MCN | 8.34 | 1.57 | 18.10 | 15.31 | 10.83 |
| | UMC (Text) | 19.57 | 16.29 | 33.40 | 30.81 | 25.02 |
| | UMC | 23.22 | 20.59 | 35.31 | 33.88 | 28.25 |
| IEMOCAP-DA | SCCL | 21.90 | 10.90 | 26.80 | 24.14 | 20.94 |
| | CC | 23.59 | 12.99 | 25.86 | 24.42 | 21.72 |
| | USNID | 22.19 | 11.92 | 27.35 | 23.86 | 21.33 |
| | MCN | 8.12 | 1.81 | 16.16 | 14.34 | 10.11 |
| | UMC (Text) | 20.01 | 18.15 | 32.76 | 31.10 | 25.64 |
| | UMC | 24.16 | 20.31 | 33.87 | 32.49 | 27.71 |
If you are interested in this work and want to use the code or results in this repository, please star the repository and cite the following works:
@article{zhang2024unsupervised,
title={Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances},
author={Hanlei Zhang and Hua Xu and Fei Long and Xin Wang and Kai Gao},
year={2024},
journal = {arXiv preprint arXiv:2405.12775},
}
@inproceedings{10.1145/3503161.3547906,
author = {Zhang, Hanlei and Xu, Hua and Wang, Xin and Zhou, Qianrui and Zhao, Shaojie and Teng, Jiayan},
title = {MIntRec: A New Dataset for Multimodal Intent Recognition},
year = {2022},
doi = {10.1145/3503161.3547906},
booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
pages = {1688–1697},
numpages = {10},
}