Make the model decide when to speak in the conversation, which can make the interaction more engaging.
Model architecture:
- GCN for predicting the timing of speaking (a minimal sketch follows this list)
- Dialogue-sequence: sequence of the dialogue history
- User-sequence: sequence of user utterances
- PMI (pointwise mutual information): context relationships
- Seq2Seq / HRED for language generation
- Multi-head attention over the dialogue context (using the GCN hidden states)
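A minimal sketch of how these pieces could fit together for the timing decision, assuming PyG's `GCNConv` and PyTorch's `nn.MultiheadAttention`; the module name, dimensions, and the sigmoid head are illustrative assumptions rather than this repo's exact code:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv  # PyG

class TimingDecider(nn.Module):
    """Hypothetical module: GCN over the utterance graph, multi-head
    attention over the GCN hidden states, binary 'speak now?' head."""
    def __init__(self, in_dim=300, hid_dim=256, heads=8):
        super().__init__()
        self.gcn = GCNConv(in_dim, hid_dim)
        self.attn = nn.MultiheadAttention(hid_dim, heads)
        self.clf = nn.Linear(hid_dim, 1)

    def forward(self, x, edge_index):
        # x: [num_utterances, in_dim] node features; edge_index: [2, num_edges]
        h = torch.relu(self.gcn(x, edge_index))  # [N, hid_dim]
        h = h.unsqueeze(1)                       # [N, 1, hid_dim] = (seq, batch, dim)
        ctx, _ = self.attn(h, h, h)              # attend over the dialogue context
        # decide from the most recent utterance's representation
        return torch.sigmoid(self.clf(ctx[-1])).squeeze(-1)
```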
Requirements:
- PyTorch 1.2
- PyG (PyTorch Geometric)
- numpy
- tqdm
- nltk: word and sentence tokenization (see the example after this list)
- BERTScore 0.2.1
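For reference, a minimal example of the nltk tokenization used in preprocessing (standard nltk calls; the sample text is made up):

```python
# requires running nltk.download('punkt') once beforehand
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello there. Are you free to talk?"
sents = sent_tokenize(text)      # ['Hello there.', 'Are you free to talk?']
words = word_tokenize(sents[0])  # ['Hello', 'there', '.']
```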
Format:
- The corpus folder contains many subfolders, each named after the number of turns in its conversations.
- Each subfolder contains many files, each holding one conversation.
- Each conversation file is in TSV format; each line has four elements (an example follows this list):
    - time
    - poster
    - reader
    - utterance
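A hypothetical conversation file in this format (tab-separated; the timestamps, user names, and utterances below are made up for illustration):

```
1562004001	alice	bob	Hi, are you there?
1562004015	bob	alice	Yes, what's up?
1562004033	alice	bob	I can't mount the USB drive.
```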
Create the dataset
# dataset: ubuntu / cornell; mode: cf / ncf. The ubuntu-corpus folder will be created.
# ubuntu-corpus has two subfolders (cf / ncf), one for each mode.
./data/run.sh ubuntu cf
Metrics:
- Language model: BLEU-4, PPL, Distinct-1, Distinct-2 (see the Distinct-n sketch after this list)
- Talk timing: F1, Acc
- Human evaluation: engagingness
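Since Distinct-1/Distinct-2 appear throughout the results, here is a minimal sketch of the Distinct-n computation (unique n-grams divided by total n-grams over all generated responses); the repo's implementation may differ in tokenization and normalization:

```python
def distinct_n(sentences, n):
    """sentences: list of token lists; returns #unique n-grams / #total n-grams."""
    ngrams = []
    for tokens in sentences:
        ngrams.extend(zip(*[tokens[i:] for i in range(n)]))
    return len(set(ngrams)) / max(len(ngrams), 1)

# e.g. distinct_n([["i", "am", "fine"], ["i", "am", "here"]], 2) -> 3 / 4 = 0.75
```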
Baselines and ablations:
- Seq2Seq
- HRED / HRED + CF
- w/o BERT embedding cosine similarity
- w/o User-sequence
- w/o Dialogue-sequence
Generate the graph of the context
# generate the graph information of the train/test/dev datasets
./run.sh graph cornell when2talk 0
Analyze the graph context coverage information
# The average context coverage in the graph: 0.7935/0.7949/0.7794 for the train/test/dev datasets
./run.sh stat cornell 0 0
Generate the vocab of the dataset
./run.sh vocab ubuntu 0 0
Train the model (seq2seq / seq2seq-cf / hred / hred-cf):
# train the hred model on the 4th GPU
./run.sh train ubuntu hred 4
Translate the test dataset with the trained model
# translate the test dataset with the hred model on the 4th GPU
./run.sh translate ubuntu hred 4
Evaluate the translated utterances
# evaluate the model's translated results on the 4th GPU (BERTScore needs it)
./run.sh eval ubuntu hred 4
Generate the performance curve
./run.sh curve dailydialog hred-cf 0
Chat with the model
./run.sh chat dailydialog GatedGCN 0
To do:
1. add GatedGCN to all the graph-based methods
2. add BiGRU to all the graph-based methods
3. follow DialogueGCN to construct the graph (see the sketch after this list):
    * a complete graph within the window of size **p**
    * one long edge beyond the window to reach long-range context sentences
    * user embeddings as nodes during processing
4. analyze the number of GatedGCN layers in this repo and its effect on multi-turn modeling
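A sketch of the graph construction planned in item 3, assuming the DialogueGCN-style scheme above: utterances are fully connected within a window of size p, with one extra long-range edge just outside the window; the function name and the exact long-edge choice are assumptions:

```python
import torch

def build_edge_index(num_utterances, p=3):
    """Hypothetical edge builder: complete graph within the p-window,
    plus one long edge per node reaching outside the window."""
    edges = set()
    for i in range(num_utterances):
        # complete graph inside the window of size p around utterance i
        for j in range(max(0, i - p), min(num_utterances, i + p + 1)):
            if i != j:
                edges.add((i, j))
        # one long edge out of the window to reach distant context
        if i - p - 1 >= 0:
            edges.add((i, i - p - 1))
    # PyG expects edge_index with shape [2, num_edges]
    return torch.tensor(sorted(edges), dtype=torch.long).t()
```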
Methods
- Seq2Seq: seq2seq with attention
- HRED: hierarchical context modeling
- HRED-CF: HRED with a classifier for talk timing
- When2Talk: GCN context modeling first, RNN context modeling later
- W2T_RNN_First: BiRNN context modeling first, GCN context modeling later
- GCNRNN: combines the gated GCN context and the RNN context (?)
- GatedGCN: combines the gated GCN context and the RNN context (see the fusion sketch after this list)
    - BiRNN for background modeling
    - gated GCN for context modeling
    - combines the GCN embedding and the BiRNN embedding into the final embedding
    - low-turn examples are trained without GCNConv (only the BiRNN is used)
    - separating the decision module from the generation module works better
- W2T_GCNRNN: RNN + GCN combined with RNN (W2T_RNN_First + GCNRNN)
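A minimal sketch of the GatedGCN fusion described above, assuming a learned sigmoid gate that mixes the BiRNN (background) embedding with the gated-GCN (context) embedding; module names and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GatedFusion(nn.Module):
    """Hypothetical fusion: BiRNN for background, GCN for context,
    a sigmoid gate produces the final context embedding."""
    def __init__(self, in_dim=300, hid_dim=256):
        super().__init__()
        self.birnn = nn.GRU(in_dim, hid_dim // 2, bidirectional=True)
        self.gcn = GCNConv(in_dim, hid_dim)
        self.gate = nn.Linear(2 * hid_dim, hid_dim)

    def forward(self, x, edge_index):
        # x: [num_utterances, in_dim]; the utterances also form a sequence
        rnn_out, _ = self.birnn(x.unsqueeze(1))        # [N, 1, hid_dim]
        rnn_out = rnn_out.squeeze(1)                   # [N, hid_dim]
        gcn_out = torch.relu(self.gcn(x, edge_index))  # [N, hid_dim]
        g = torch.sigmoid(self.gate(torch.cat([rnn_out, gcn_out], dim=-1)))
        return g * gcn_out + (1 - g) * rnn_out         # final embedding
```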
Automatic evaluation
Compare the PPL, BLEU-4, Distinct-1, and Distinct-2 scores of all the models.
The proposed classification-based methods need to be cascaded (decision first, then generation) to compute BLEU-4 and BERTScore in the same format as the traditional models' results.
Dailydialog:

| Model | BLEU | Dist-1 | Dist-2 | PPL |
|---|---|---|---|---|
| Seq2Seq | 0.1038 | 0.0178 | 0.072 | 29.0640 |
| HRED | 0.1175 | 0.0176 | 0.0571 | 29.7402 |
| HRED-CF | 0.1268 | 0.0435 | 0.1567 | 29.0111 |
| When2Talk | 0.1226 | 0.0211 | 0.0608 | 24.0131 |
| W2T_RNN_First | 0.1244 | 0.0268 | 0.0787 | 24.5056 |
| GCNRNN | 0.1250 | 0.0214 | 0.0624 | 25.8213 |
| W2T_GCNRNN | 0.1246 | 0.0152 | 0.0400 | 23.4434 |
| GatedGCN | 0.1231 | 0.0423 | 0.1609 | 27.1615 |

Cornell:

| Model | BLEU | Dist-1 | Dist-2 | PPL |
|---|---|---|---|---|
| Seq2Seq | 0.0843 | 0.0052 | 0.0164 | 45.1504 |
| HRED | 0.0823 | 0.0227 | 0.0524 | 39.9009 |
| HRED-CF | 0.1132 | 0.0221 | 0.0691 | 38.5633 |
| When2Talk | 0.0996 | 0.0036 | 0.0073 | 32.9503 |
| W2T_RNN_First | 0.1118 | 0.0065 | 0.0147 | 33.754 |
| GCNRNN | 0.1072 | 0.0077 | 0.0188 | 33.9572 |
| W2T_GCNRNN | 0.1107 | 0.0063 | 0.0142 | 34.4256 |
| GatedGCN | 0.1157 | 0.0261 | 0.0873 | 34.4256 |
F1 is used to measure the accuracy of the speaking-timing prediction, and applies only to the classification-based methods (hred-cf, ...). The dataset statistics show that the number of negative labels is about half the number of positive labels, so reporting both F1 and Acc may be more suitable than F1 alone. In this setting, we care more about the precision component of the F1 metric; a small metric sketch follows the table below.
| Model | Acc (Dailydialog) | F1 (Dailydialog) | Acc (Cornell) | F1 (Cornell) |
|---|---|---|---|---|
| HRED-CF | 0.8272 | 0.8666 | 0.7708 | 0.8427 |
| When2Talk | 0.7992 | 0.8507 | 0.7616 | 0.8388 |
| W2T_RNN_First | 0.8144 | 0.8584 | 0.7481 | 0.8312 |
| GCNRNN | 0.8176 | 0.8635 | 0.7598 | 0.8445 |
| W2T_GCNRNN | 0.7565 | 0.8434 | 0.7853 | 0.8466 |
| GatedGCN | 0.8226 | 0.8663 | 0.738 | 0.8181 |
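A sketch of how the timing metrics could be computed, assuming scikit-learn is available (it is not in the requirements list above); precision is reported separately since, as noted, it matters most in this setting:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

def timing_metrics(y_true, y_pred):
    """y_true / y_pred: binary 'speak now?' labels and predictions."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
    }
```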
Human judgments (engagingness, ...)
Invite volunteers to chat with these models (seq2seq, hred, seq2seq-cf, hred-cf, ...) and score each model's performance according to engagingness, fluency, ...
Dailydialog dataset

| When2Talk vs. | kappa | win (%) | loss (%) | tie (%) |
|---------------|-------|---------|----------|---------|
| Seq2Seq       |       |         |          |         |
| HRED          |       |         |          |         |
| HRED-CF       |       |         |          |         |
Cornell dataset

| When2Talk vs. | kappa | win (%) | loss (%) | tie (%) |
|---------------|-------|---------|----------|---------|
| Seq2Seq       |       |         |          |         |
| HRED          |       |         |          |         |
| HRED-CF       |       |         |          |         |
Graph ablation learning
- F1 and Acc for predicting the speaking timing (hred-cf, ...)
- BLEU-4, BERTScore, Distinct-1, Distinct-2