## Arguments
The general command line to run any method is:

SemOpinionS.py [-h] --method {DohareEtAl2018,DohareEtAl2018_TF,LiuEtAl2015,LiaoEtAl2018,machine_learning,machine_learning_clustering,score_optimization} [--corpus CORPUS] --alignment ALIGNMENT --alignment_format {giza,jamr} [--gold GOLD] [--openie OPENIE] [--tfidf TFIDF] [--training TRAINING] [--target TARGET] [--model MODEL] [--loss {perceptron,ramp}] [--sentlex SENTLEX] [--similarity {lcs,smatch,concept_coverage}] [--machine_learning {decision_tree,random_forest,svm,mlp}] [--levi] [--aspects ASPECTS] --output OUTPUT
- Method argument
- Corpus argument
- Alignment arguments
- Gold argument
- OpenIE argument
- TF-IDF argument
- Training and Target arguments
- Model argument
- Loss function argument
- Sentlex argument
- Similarity argument
- Machine Learning argument
- Levi argument
- Aspects argument
- Output argument
### Method argument

The `--method` argument selects which summarization method will be executed. Each method uses a specific set of arguments that must be provided on the command line. The methods implemented in SemOpinionS are:
- DohareEtAl2018
- DohareEtAl2018_TF
- LiuEtAl2015
- LiaoEtAl2018
- machine_learning
- machine_learning_clustering
- score_optimization
### Corpus argument

`CORPUS` is the path to an AMR file containing the sentences to be summarized. This file has the following format:
# ::id O-Apanhador-no-Campo-de-Centeio.Documento_117.2
# ::snt O livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador .
(e / e
:op1 (i2 / idiota)
:op2 (r / repetitivo)
:op3 (c / cansativo)
:op4 (i3 / irritante)
:domain (e2 / e
:op1 (l / livro)
:op2 (a / autor
:ARG0-of (e3 / escrever-01
:ARG1 l))))
The file may contain other metadata (in lines starting with `# ::`), however these two fields (`id` and `snt`) are mandatory.
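For illustration, a file like this can be read into `(id, sentence, graph)` triples as in the minimal sketch below; it assumes entries are separated by blank lines, as in standard AMR corpora, and is not the parser used internally by SemOpinionS.

```python
# Minimal sketch: read a CORPUS file into (id, snt, graph) triples.
# Assumes entries are separated by blank lines, as in standard AMR corpora.
def read_amr_corpus(path):
    entries, meta, graph_lines = [], {}, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():                      # blank line closes an entry
                if meta or graph_lines:
                    entries.append((meta.get("id"), meta.get("snt"),
                                    "\n".join(graph_lines)))
                meta, graph_lines = {}, []
            elif line.startswith("# ::"):             # metadata, e.g. "# ::id ..."
                key, _, value = line[4:].partition(" ")
                meta[key] = value
            else:                                     # part of the AMR graph
                graph_lines.append(line)
    if meta or graph_lines:                           # last entry without trailing blank line
        entries.append((meta.get("id"), meta.get("snt"), "\n".join(graph_lines)))
    return entries
```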
### Alignment arguments

`ALIGNMENT` is the path to an AMR alignment file for the original graphs in the `CORPUS` file. Two formats are supported, selected with the `--alignment_format {giza,jamr}` argument. The alignment file may contain additional sentences, but it must contain all sentences from the `CORPUS` file.
The `giza` format is mainly used by aligners that map concepts to single words (one-to-one). The format is as follows:
# o_0 livro_1 é_2 idiota_3 ,_4 repetitivo_5 ,_6 cansativo_7 e_8 irritante_9 ,_10 tanto_11 quanto_12 seu_13 narrador_14 ._15
(e / e~e.8 :op1 (i2 / idiota~e.3) :op2 (r / repetitivo~e.5) :op3 (c / cansativo~e.7) :op4 (i3 / irritante~e.9) :domain (e2 / e~e.8 :op1 (l / livro~e.1) :op2 (a / autor~e.12 :arg0-of (e3 / escrever-01~e.14 :arg1 l))))
Each sentence is represented by a line starting with `#` followed by its numbered tokens. The next line contains the linearised AMR graph, with the alignment of each concept indicated by `~e.n`, where `n` is the index of the token aligned to that node. Pay attention to the sentence tokens: they must match exactly the tokens in the original `CORPUS` file.
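The `~e.n` alignments can be recovered from the linearised line with a simple regular expression, as in the hedged sketch below; it assumes every alignment is written immediately after the concept, and it is not the loader used by the tool.

```python
import re

# Sketch: extract (variable, concept, token_index) triples from a linearised
# AMR line with ~e.n alignments, e.g. "(l / livro~e.1)".
ALIGN_RE = re.compile(r"\((\S+) / ([^\s()~]+)~e\.(\d+)")

def parse_giza_alignments(graph_line):
    return [(var, concept, int(token))
            for var, concept, token in ALIGN_RE.findall(graph_line)]

# parse_giza_alignments("(l / livro~e.1)") -> [("l", "livro", 1)]
```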
The `jamr` format is used by aligners that map concepts to word spans (one-to-many). It looks like this:
# ::snt O livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador .
# ::tok O livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador .
# ::alignments 8-9|0 1-2|0.4.0 9-10|0.3 7-8|0.2 5-6|0.1 3-4|0.0 ::annotator Aligner v.03 ::date 2020-07-04T11:48:49.113
# ::node 0 e 8-9
# ::node 0.0 idiota 3-4
# ::node 0.1 repetitivo 5-6
# ::node 0.2 cansativo 7-8
# ::node 0.3 irritante 9-10
# ::node 0.4 e
# ::node 0.4.0 livro 1-2
# ::node 0.4.1 autor
# ::node 0.4.1.0 escrever-01
# ::root 0 e
# ::edge autor ARG0-of escrever-01 0.4.1 0.4.1.0
# ::edge e domain e 0 0.4
# ::edge e op1 idiota 0 0.0
# ::edge e op1 livro 0.4 0.4.0
# ::edge e op2 autor 0.4 0.4.1
# ::edge e op2 repetitivo 0 0.1
# ::edge e op3 cansativo 0 0.2
# ::edge e op4 irritante 0 0.3
# ::edge escrever-01 ARG1 livro 0.4.1.0 0.4.0
(e / e :op1 (i2 / idiota) :op2 (r / repetitivo) :op3 (c / cansativo) :op4 (i3 / irritante) :domain (e2 / e :op1 (l / livro) :op2 (a / autor :ARG0-of (e3 / escrever-01 :ARG1 l))))
The `snt` metadata must match the `snt` field in the `CORPUS` file. Only the `node` alignments are taken into consideration, i.e. all `edge` alignments are ignored. Each node alignment is a line starting with `# ::node`, followed by an id, the node label and the word span to which it is aligned, all separated by tabs (`\t`).
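A sketch of collecting only the `# ::node` information (the part the tool actually uses) is shown below; it assumes the tab-separated layout described above, and nodes without a word span (such as `autor` in the example) simply get no alignment.

```python
# Sketch: collect node alignments from JAMR-style metadata lines.
# A relevant line looks like "# ::node<TAB>0.0<TAB>idiota<TAB>3-4" (the span may be absent).
def parse_jamr_nodes(lines):
    alignments = {}
    for line in lines:
        if not line.startswith("# ::node"):
            continue
        parts = line.rstrip("\n").split("\t")
        node_id, label = parts[1], parts[2]
        span = parts[3] if len(parts) > 3 else None   # e.g. "3-4" or None
        alignments[node_id] = (label, span)
    return alignments
```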
### Gold argument

`GOLD` is the path to a directory containing the gold summary texts in multiple files, used when one wants to create a merged AMR graph and aligned BOW texts from them. These summaries must follow the format below:
O livro é idiota, repetitivo, cansativo e irritante, tanto quanto seu narrador. <O-Apanhador-no-Campo-de-Centeio.Documento_117.2>
Each file contains one or more lines with (or without) the sentence text, followed by the sentence ID between angle brackets (`<id>`). The ID is the essential part of the line and must match an ID from the `CORPUS` file. From these IDs, the AMR graphs of the sentences are retrieved from the `CORPUS` file (if they exist) and a single summary AMR graph is created by merging all the sentence AMRs.
These sentences must also be present in the `ALIGNMENT` file, so that BOW pseudotexts can be created for each summary.
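For illustration, the IDs can be pulled from each line with a pattern that captures whatever sits between the final angle brackets; this is a minimal sketch, not the actual loader.

```python
import re

# Sketch: extract the sentence IDs (the "<...>" part) from a gold summary file.
ID_RE = re.compile(r"<([^<>]+)>\s*$")

def read_gold_ids(path):
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            match = ID_RE.search(line)
            if match:
                ids.append(match.group(1))
    return ids
```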
### OpenIE argument

`OPENIE` is the path to a CSV file with the output of an Open Information Extraction (OpenIE) tool, used by some of the implemented methods. The file has the following format:
"ID SENTENÇA";"SENTENÇA";"ID EXTRAÇÃO";"ARG1";"REL";"ARG2";"COERÊNCIA";"MINIMALIDADE";"MÓDULO SUJEITO";"MÓDULO RELAÇÃO"
"O-Apanhador-no-Campo-de-Centeio.Documento_117.2";"o livro é idiota , repetitivo , cansativo e irritante , tanto quanto seu narrador . ";"1.0";"o livro ";" é idiota";"repetitivo , cansativo e irritante , tanto quanto seu narrador ";;;1;1
;;"2.0";"o livro ";" é idiota";"tanto quanto seu narrador ";;;1;1
The file contains 10 columns, separated by semicolons (`;`), but only columns 1, 3, 4, 5 and 6 (`ID SENTENÇA`, `ID EXTRAÇÃO`, `ARG1`, `REL` and `ARG2`) are used. The first column (`ID SENTENÇA`) must match the IDs from the `CORPUS` file. If a line does not contain an ID, it is assumed to use the last ID seen (in file order). Note that the columns do not need to have the same names as shown here; only their positions matter. The first line of the file must always contain the column names.
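A hedged sketch of reading such a file, keeping only the columns mentioned above and forward-filling missing sentence IDs, could look like this (the column positions are assumed fixed, as stated, and the header row is skipped):

```python
import csv

# Sketch: read the OpenIE CSV, keep columns 1, 3, 4, 5 and 6, and
# reuse the last seen sentence ID when a row leaves it empty.
def read_openie(path):
    extractions, last_id = [], None
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter=";", quotechar='"')
        next(reader)                                  # skip the header row
        for row in reader:
            sent_id = row[0].strip() or last_id       # forward-fill the ID
            last_id = sent_id
            ext_id, arg1, rel, arg2 = row[2], row[3], row[4], row[5]
            extractions.append((sent_id, ext_id, arg1.strip(),
                                rel.strip(), arg2.strip()))
    return extractions
```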
### TF-IDF argument

`TFIDF` points to a directory of text files used to calculate TF-IDF scores. The TF part is calculated from the sentences in the `CORPUS` file, while the DF counts are obtained from the files in the `TFIDF` directory, where each file is treated as a document.
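A rough sketch of that split is shown below: term frequencies come from the `CORPUS` sentences and document frequencies from the files in the `TFIDF` directory (one document per file). The whitespace tokenisation and the smoothed IDF formula are assumptions for illustration, not necessarily what SemOpinionS uses.

```python
import math
import os

# Sketch: TF from the corpus sentences, DF from the documents in `tfidf_dir`.
def tfidf_scores(corpus_sentences, tfidf_dir):
    tf = {}
    for sentence in corpus_sentences:                 # term frequencies
        for token in sentence.lower().split():
            tf[token] = tf.get(token, 0) + 1

    file_names = os.listdir(tfidf_dir)
    df = {}
    for name in file_names:                           # document frequencies
        with open(os.path.join(tfidf_dir, name), encoding="utf-8") as f:
            for token in set(f.read().lower().split()):
                df[token] = df.get(token, 0) + 1

    n_docs = len(file_names)
    return {token: count * math.log((1 + n_docs) / (1 + df.get(token, 0)))
            for token, count in tf.items()}           # smoothed IDF
```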
### Training and Target arguments

`TRAINING` and `TARGET` are used specifically by methods requiring some kind of supervised training, namely:
- LiuEtAl2015
- LiaoEtAl2018
- machine_learning
- machine_learning_clustering
- score_optimization
These arguments should be paths to directories containing parallel documents, i.e. files with the same name in both directories form a training-target pair, as in the example directory layout below:
D:.
├───target
│ 1984_1.txt
│ 1984_2.txt
│ 1984_3.txt
│ 1984_4.txt
│ 1984_5.txt
│ Capitaes-da-Areia_1.txt
│ Capitaes-da-Areia_2.txt
│ Capitaes-da-Areia_3.txt
│ ...
│
└───training
1984_1.txt
1984_2.txt
1984_3.txt
1984_4.txt
1984_5.txt
Capitaes-da-Areia_1.txt
Capitaes-da-Areia_2.txt
Capitaes-da-Areia_3.txt
...
Each file follows the same structure as the `CORPUS` file. Training files contain all sentences to be summarized, while the corresponding target files contain the gold summary sentences.
### Model argument

`MODEL` is the path to a pretrained model file. It cannot be used together with the training and target arguments. Each method requires a specific format:
- LiuEtAl2015: CSV
- LiaoEtAl2018: CSV
- machine_learning: joblib
- machine_learning_clustering: joblib
- score_optimization: CSV
The CSV format is used by the methods that optimize weights for score calculation. The file has two columns: the feature name and the corresponding optimized weight. For example:
...
e_freq_0,1.0
e_freq_1,1.0
e_freq_2,1.0
e_freq_5,1.0
e_freq_10,1.0
e_fmst_pos_5,0.7642977396044841
e_fmst_pos_6,0.7642977396044841
e_fmst_pos_7,0.7642977396044841
e_fmst_pos_10,0.7642977396044841
e_fmst_pos_15,0.7418011102528389
e_avg_pos_5,1.0
e_avg_pos_6,1.0
e_avg_pos_7,1.0
e_avg_pos_10,1.0
e_avg_pos_15,0.9775033706483548
node1_n_freq_0,1.0
...
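For illustration, loading such a file into a feature-to-weight dictionary and applying it to a feature vector could look like the sketch below; the feature names above are treated as opaque strings.

```python
import csv

# Sketch: load the optimized weights (feature name, weight) into a dict
# and score a feature vector as a weighted sum.
def load_weights(path):
    with open(path, encoding="utf-8", newline="") as f:
        return {row[0]: float(row[1]) for row in csv.reader(f) if row}

def weighted_score(features, weights):
    # `features` maps feature names to values (e.g. 0/1 indicators or counts).
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```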
The joblib format is used by all machine learning methods based on the scikit-learn library. Joblib is a binary file format for saving pretrained scikit-learn models. These files should be created by calling the `dump` function from the joblib library on the trained model, as in the example below.
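The save/load round trip is the standard joblib usage; the estimator and file name below are only placeholders.

```python
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Train any scikit-learn estimator (toy data as a placeholder)...
model = RandomForestClassifier()
model.fit([[0, 1], [1, 0]], [0, 1])

# ...persist it with joblib so it can later be passed via --model...
dump(model, "model.joblib")

# ...and load it back.
model = load("model.joblib")
```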
### Loss function argument

This argument is used exclusively by the `LiuEtAl2015` and `LiaoEtAl2018` methods during AdaGrad optimization. Two loss functions are implemented:

- Perceptron loss
- Ramp loss

In both losses, the argmax represents the optimal graph obtained through the ILP method using the current weights. For more details about these functions, please refer to Liu et al. (2015).
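For reference, the two losses have the usual structured-prediction form sketched below, where θ are the feature weights, f(·) the feature vector of a candidate summary graph y, y* the gold graph and cost(y, y*) a cost function; this is only the generic formulation, so refer to Liu et al. (2015) for the exact definitions used here.

```latex
% Generic forms, not necessarily the exact ones implemented:
L_{\mathrm{perceptron}}(\theta) = \max_{y} \theta^{\top} f(y) \;-\; \theta^{\top} f(y^{*})

L_{\mathrm{ramp}}(\theta) = \max_{y} \big(\theta^{\top} f(y) + \mathrm{cost}(y, y^{*})\big)
                          \;-\; \max_{y} \big(\theta^{\top} f(y) - \mathrm{cost}(y, y^{*})\big)
```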
### Sentlex argument

`SENTLEX` is the path to a sentiment lexicon in the OpLexicon format. This is a CSV file with four columns: the word, its morphological category, its sentiment (-1, 0 or 1), and whether the annotation is manual (M) or automatic (A). Only columns 1 and 3 are used. The lexicon contains all inflections of each word, so no lemmatization or stemming is applied.
...
desatencioso,adj,-1,A
desatender,vb,-1,A
desatenta,adj,-1,M
desatentas,adj,-1,M
desatento,adj,-1,M
desatentos,adj,-1,M
desaterrar,vb,0,A
desatestar,vb,1,A
desatinada,adj,-1,A
desatinadas,adj,-1,A
desatinado,adj,-1,A
desatinados,adj,-1,A
desatinar,vb,1,A
desativar,vb,1,A
desatracar,vb,1,A
desatracar-se,vb,1,A
...
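Loading the lexicon into a word-to-polarity map, using only columns 1 and 3 as stated above, could look like this sketch:

```python
import csv

# Sketch: load an OpLexicon-style lexicon, keeping only the word (column 1)
# and its sentiment polarity (column 3).
def load_sentlex(path):
    polarity = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            if len(row) >= 3:
                polarity[row[0]] = int(row[2])        # e.g. "desatenta" -> -1
    return polarity
```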
### Similarity argument

This argument is used specifically by the `LiaoEtAl2018` and `machine_learning_clustering` methods for Spectral Clustering of sentences. The implemented similarity scores are:

- `lcs`: longest common subsequence, i.e. the number of overlapping words between the two sentences (a sketch is shown below).
- `smatch`: Smatch similarity score between the AMR graphs.
- `concept_coverage`: number of matching AMR concepts between the two sentence graphs.

The default value is `lcs`.
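Since the description of `lcs` is short, here is a hedged sketch of an LCS-based similarity between two tokenised sentences (the standard dynamic-programming LCS over words); the actual tokenisation and normalisation used by the tool may differ.

```python
# Sketch: longest common subsequence (in words) between two sentences,
# computed with the standard dynamic-programming table.
def lcs_similarity(sentence_a, sentence_b):
    a, b = sentence_a.lower().split(), sentence_b.lower().split()
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, 1):
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]                      # number of shared words in order
```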
### Machine Learning argument

This argument is used specifically by the `machine_learning` and `machine_learning_clustering` methods. It determines which scikit-learn algorithm is used. The implemented options are:

- `decision_tree`
- `random_forest`
- `svm`
- `mlp`

The default value is `decision_tree`.
### Levi argument

`--levi` is a flag used specifically by the `machine_learning` and `machine_learning_clustering` methods. If it is set, the edges of the AMR graphs are first turned into nodes (a Levi graph transformation), so that the ML classifier can classify them as well.
![](https://user-images.githubusercontent.com/25110651/106650620-e1446d80-6571-11eb-8f11-d09b423c052b.png)
![](https://user-images.githubusercontent.com/25110651/106650822-1f419180-6572-11eb-94b7-9a0ee3e7db3b.png)
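Conceptually, the transformation replaces every labelled edge with an intermediate node, as in the sketch below, which works on a list of `(source, relation, target)` triples; this illustrates the idea only and is not the tool's internal graph representation.

```python
# Sketch of a Levi-style transformation: each labelled edge
# (source, relation, target) becomes its own node connected to both endpoints.
def to_levi_graph(triples):
    nodes, edges = set(), []
    for i, (source, relation, target) in enumerate(triples):
        edge_node = f"{relation}#{i}"                 # one fresh node per edge
        nodes.update([source, edge_node, target])
        edges.append((source, edge_node))
        edges.append((edge_node, target))
    return nodes, edges

# Example: the ":domain" edge between "e" and "e2" becomes a "domain#0" node.
nodes, edges = to_levi_graph([("e", "domain", "e2"), ("e2", "op1", "livro")])
```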
### Aspects argument

`ASPECTS` is the path to a JSON file containing the aspect annotation for the sentences in `CORPUS`, `TRAINING` or `TARGET` (when used). This argument is used specifically by the `machine_learning` and `machine_learning_clustering` methods.

The first level of keys corresponds to the name (not the full path) of the annotated file (the `CORPUS` file or the files in `TRAINING` or `TARGET`). The second level of keys contains the sentence IDs within that file; these IDs must match those of the original AMR file (`CORPUS`, `TRAINING` or `TARGET`). Each sentence ID then maps to the list of aspects in that sentence. An example:
{
"Galaxy-SIII_1.txt": {
"D0_S1": [
"Galaxy SIII"
],
"D0_S2": [
"modelo"
],
"D0_S3": [
"aparelho",
"bateria",
"desingn"
],
"D0_S4": [],
...
"D9_S5": [
"IPHONE 5"
]
},
"LG-Smart-TV_1.txt": {
"D0_S1": [],
"D0_S2": [],
"D0_S3": [
"Design",
"imagem"
],
"D0_S4": [],
...
"D9_S8": [
"TV"
]
},
...
}
This is an optional argument: if it is not given, the method runs without any aspect feature, while still including all other features.
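Reading the annotation back is plain JSON handling; the sketch below loads the file and fetches the aspects of one sentence using the file-name and sentence-ID keys described above (the file name `aspects.json` is only a placeholder).

```python
import json

# Sketch: load the aspects file and fetch the aspect list of one sentence.
def load_aspects(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

aspects = load_aspects("aspects.json")                # placeholder file name
print(aspects.get("Galaxy-SIII_1.txt", {}).get("D0_S3", []))
# -> ['aparelho', 'bateria', 'desingn']
```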
### Output argument

`OUTPUT` indicates the directory to which all output files are saved (AMR graphs, BOW pseudotexts, training weights, etc.).