scikit
Transitioning from Orange to scikit-learn
Rough work plan
- noop
  - model persistence (we get this for free; pickle) (switched to joblib)
- replace
  - core table management (rows, columns)
  - reading table from file - svmlight format only (see the sketch after this list)
  - naive bayes classifier
- break (mark for future repair)
  - meta-features (the EDUs file)
  - last decoder (needs meta-features)
  - maxent (hopefully simpler to wrap) (it's just logistic regression!)
  - orange format files (irit-stac will have to stay pinned to master) <== WE ARE HERE 2015-01-23
  - perceptron (Pascal knows what he's doing)
- ditch (mark for future replacement)
  - classifiers: svm, majority (added back)
- refactor - some things that sklearn does for us that we could maybe nuke outright
  - enfolding
  - cross validation
  - scoring, confidence intervals, etc.
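To make the sklearn side of this concrete, here is a minimal sketch of the pieces named above: joblib for model persistence, load_svmlight_file for reading feature tables, and a naive bayes learner as a stand-in classifier. The file names are made up.

```python
# Minimal sketch of the sklearn/joblib pieces named in the plan above.
# The file names ("pairs.svmlight", "attach.joblib") are placeholders, not attelo paths.
import joblib
from sklearn.datasets import load_svmlight_file
from sklearn.naive_bayes import BernoulliNB

# reading a feature table from file: svmlight/libsvm format only
X, y = load_svmlight_file("pairs.svmlight")

# naive bayes classifier standing in for the old Orange learner
model = BernoulliNB()
model.fit(X, y)

# model persistence: joblib instead of plain pickle
joblib.dump(model, "attach.joblib")
model = joblib.load("attach.joblib")
```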
- inert/meta features in libsvm? (some features needed for decoding but not learning: edu ids, spans)
Attelo decoders need some information which has hitherto been implemented as meta-features (edu ids, edu spans), but which strictly speaking has nothing to do with classification and which may not be representable in the libsvm format (unknown).
This could be handled by:
- YOLO'ing them in as continuous features in the first instance (it would be somebody else's problem to map from numbers to edu ids)
- in the medium term, working out how to ignore these features so they don't skew our models (see the sketch after this list)
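A rough sketch of that medium-term option, assuming we know (or have recorded somewhere) which column indices the meta features ended up in; the indices below are made up.

```python
# Sketch of masking out meta/inert feature columns before learning.
# META_COLUMNS is made up: it stands for wherever edu ids/spans land as numbers.
import numpy as np
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("pairs.svmlight")

META_COLUMNS = [0, 1, 2]
keep = np.setdiff1d(np.arange(X.shape[1]), META_COLUMNS)

X_meta = X[:, META_COLUMNS]   # kept around for the decoder
X_learn = X[:, keep]          # what the classifier actually sees, meta columns dropped
```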
In the longer term, it may be good to have a reasonable story for how to use them. One idea would be to separate the feature files into two inputs: features and EDUs. The features file would be a bog-standard svmlight file: one row per edu pair, sparse format, numbers for everything, no metadata (?). All meta features would then live in the EDU file (CONLL-style format). It would be identical to the output format, except that it would have one unlabelled column containing, for each EDU, the list of its possible parents in the dependency graph. Its columns would be:
- global id: used by your application, arbitrary string?
- text: good for debugging
- grouping: e.g. file, dialogue
- span start: (int)
- span end: (int)
- possible parents (single column, space-delimited, 0 as the distinguished name for the root)
d1_492 anybody want sheep for wood? dialogue_1 0 27 0 d1_493 d1_494
d1_493 nope, not me dialogue_1 28 40 0 d1_492 d1_494
d1_494 not me either dialogue_1 41 54 0 d1_491 d1_492 d1_493
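A sketch of what reading the proposed EDU file could look like, assuming tab-separated columns in the order listed above (the example rows are shown space-separated here, so the exact delimiter is still to be decided). The record type and function name are just for illustration.

```python
# Sketch of a reader for the proposed EDU file (column order as listed above).
# Assumes tab-separated columns so that the text column may contain spaces.
from collections import namedtuple

EDU = namedtuple("EDU", "id text grouping start end parents")

def read_edus(path):
    edus = []
    with open(path) as stream:
        for line in stream:
            edu_id, text, grouping, start, end, parents = \
                line.rstrip("\n").split("\t")
            edus.append(EDU(edu_id, text, grouping,
                            int(start), int(end),
                            parents.split()))  # '0' is the distinguished root
    return edus
```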
The alignment of the two files would be made by enumerating each EDU exhaustively, i.e. the first row would pair the first EDU with its first candidate parent, then with its second, and so forth until we run out of possible parents. From there, the next row would pair the second EDU with its first parent, and so forth:
d1_492 0
d1_492 d1_493
d1_492 d1_494
d1_493 0
d1_493 d1_492
As a sanity check, each row of the svmlight file could be annotated with the edu pair it is meant to correspond to, although this would be ignored by attelo.
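The same enumeration as a sketch, reusing the EDU records from the reader above; row i of the svmlight file would then line up with pair i of this list.

```python
# Sketch of the alignment rule: enumerate (edu, candidate parent) pairs
# exhaustively, in EDU-file order, so that row i of the svmlight file
# corresponds to pair i of this list.
def aligned_pairs(edus):
    pairs = []
    for edu in edus:
        for parent in edu.parents:   # '0' stands for the root
            pairs.append((edu.id, parent))
    return pairs

# For the three EDUs above this yields
# [('d1_492', '0'), ('d1_492', 'd1_493'), ('d1_492', 'd1_494'),
#  ('d1_493', '0'), ('d1_493', 'd1_492'), ...]
```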
If these aren't already one-hot-encoded by feature extraction, maybe one way to tie them back together would be to annotate the files with space-delimited lists giving the string labels that go with the values of such features. In this scheme,
- SET n lists the string values associated with discrete feature n; values are numbered from 1 (0 is used for 'not present')
- for relations, we would just use SET 0
# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3
0 1:1
3 1:2
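A sketch of reading those SET headers back and mapping integer values to their string labels, assuming the headers sit in comment lines at the top of the feature file as in the example above.

```python
# Sketch of recovering string labels from '# SET <n> <label> <label> ...' comments.
# Values count from 1; 0 means 'not present'.
def read_label_sets(path):
    label_sets = {}
    with open(path) as stream:
        for line in stream:
            if line.startswith("# SET"):
                _, _, feature, *labels = line.split()
                label_sets[int(feature)] = labels
    return label_sets

def decode(label_sets, feature, value):
    return None if value == 0 else label_sets[feature][value - 1]

# e.g. sets = read_label_sets("pairs.svmlight")
#      decode(sets, 1, 3) -> 'z'; decode(sets, 0, 2) -> 'narration'
```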
Maybe we should merge the two into just a single relations feature file. If we really wanted distinct feature sets for attachment and labelling prediction, we could use the attelo configuration file mechanism to say e.g. "features 1 8 7 9 are only used for attachment"…