Eric Kow edited this page Jan 23, 2015 · 15 revisions

Transitioning from Orange to scikit-learn

Plan

Rough work plan

  1. noop
  1. model persistence (we get this for free via pickle; since switched to joblib)
  2. replace
    1. core table management (rows, columns)
    2. reading table from file - svmlight format only
    3. naive bayes classifier
  3. break (mark for future repair)
    1. meta-features (the EDUS file)
    2. last decoder (needs metafeatures)
    3. maxent (hopefully wraps simply; it's just logistic regression!)
    4. orange format files (irit-stac will have to stay pinned to master) <== WE ARE HERE 2015-01-23
    5. perceptron (Pascal knows what he's doing)
  4. ditch (mark for future replacement)
    1. classifiers: svm, majority (added back)
  5. refactor - some things that sklearn does for us that we could maybe nuke outright
    1. enfolding
    2. cross validation
    3. scoring, confidence intervals, etc
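
Two items in the plan above (reading svmlight-format tables, and model persistence via joblib) map directly onto standard scikit-learn facilities. A minimal sketch, assuming scikit-learn and joblib are installed; the file names and toy data here are hypothetical:

```python
# Sketch: svmlight I/O and joblib model persistence in scikit-learn.
import numpy as np
import joblib
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.naive_bayes import MultinomialNB

# Toy feature table: one row per instance (e.g. per EDU pair).
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
y = np.array([0, 1, 0])

# Write and re-read in svmlight format; load_svmlight_file gives a
# scipy.sparse matrix back.
dump_svmlight_file(X, y, "pairs.svmlight")
X2, y2 = load_svmlight_file("pairs.svmlight")

# Train a naive bayes classifier and persist it with joblib.
clf = MultinomialNB().fit(X2, y2)
joblib.dump(clf, "model.joblib")
clf2 = joblib.load("model.joblib")
print(clf2.predict(X2))
```

Note that joblib is generally preferred over bare pickle for scikit-learn models because it handles large numpy arrays more efficiently.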

Issues

  • inert/meta features in libsvm? (some features needed for decoding but not learning: edu ids, spans)

EDU meta features for decoding

Attelo decoders need some information which has hitherto been implemented as meta-features (EDU ids, EDU spans), but which strictly speaking has nothing to do with classification and which may have no way of being represented in the libsvm format (unknown).

This could be handled by

  • YOLO'ing them in as continuous features in the first instance (it would be somebody else's problem to map from numbers to EDU ids)
  • in the medium term, working out how to ignore these features so they don't skew our models

In the longer term, it may be good to have a reasonable story for how to use them. One idea would be to separate the feature files into two inputs: features and EDUs. The features file would be a bog-standard svmlight file, one row per EDU pair, sparse format, numbers for everything, no metadata (?). All meta features would then appear in the EDU file (CONLL format). It should be identical to the output format, except we would have one unlabelled column containing a list of possible parents in its dependency graph.

  • global id: used by your application, arbitrary string?
  • text: good for debugging
  • grouping: eg. file, dialogue
  • span start: (int)
  • span end: (int)
  • possible parents (single column, space delimited, 0 as distinguished name for root)
d1_492	anybody want sheep for wood?	dialogue_1	0	27	0 d1_493 d1_494
d1_493	nope, not me	dialogue_1	28	40	0 d1_492 d1_494
d1_494	not me either	dialogue_1	41	54	0 d1_491 d1_492 d1_493
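
A minimal sketch of reading this proposed tab-separated EDU file; the format is the one proposed on this page (not an existing attelo API), and the column order follows the bullet list above:

```python
# Sketch: parsing one line of the hypothetical EDU metadata file.
from collections import namedtuple

EDU = namedtuple("EDU", "id text grouping start end parents")

def parse_edu_line(line):
    # Columns: global id, text, grouping, span start, span end,
    # space-delimited candidate parents (0 = root).
    gid, text, grouping, start, end, parents = line.rstrip("\n").split("\t")
    return EDU(gid, text, grouping, int(start), int(end), parents.split())

line = "d1_492\tanybody want sheep for wood?\tdialogue_1\t0\t27\t0 d1_493 d1_494"
edu = parse_edu_line(line)
print(edu.parents)  # ['0', 'd1_493', 'd1_494']
```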

The alignment of the two files would be made by exhausting each EDU in turn, i.e. the first row being the first EDU and its first candidate parent, then its second, and so forth until we run out of possible parents. From there the next row would consist of the second EDU with its first parent, and so forth:

d1_492 0
d1_492 d1_493
d1_492 d1_494
d1_493 0
d1_493 d1_492

As a sanity check, each row of the svmlight file could be annotated with the edu pair it is meant to correspond to, although this would be ignored by attelo.
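
The enumeration convention above can be sketched as a simple nested loop; the data here is the three-EDU example from this page:

```python
# Sketch: the row-alignment convention between the svmlight file and
# the EDU file - one row per (EDU, candidate parent), parents in order.
edus = [("d1_492", ["0", "d1_493", "d1_494"]),
        ("d1_493", ["0", "d1_492", "d1_494"]),
        ("d1_494", ["0", "d1_491", "d1_492", "d1_493"])]

def aligned_pairs(edus):
    for edu_id, parents in edus:
        for parent in parents:
            yield (edu_id, parent)

for child, parent in aligned_pairs(edus):
    print(child, parent)
```

The first rows printed match the alignment listing above (d1_492 0, d1_492 d1_493, …), so row i of the svmlight file corresponds to the i-th pair yielded here.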

Set value features?

If these aren't already one-hot-encoded by feature extraction, maybe one way to tie them back together would be to annotate the files with space-delimited lists indicating the string labels that go with the values associated with such features. In this scheme,

  • SET identifies the value associated with discrete features, starting from 1 (0 used for 'not present')
  • For relations, we would just use SET 0
# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3 
0 1:1
3 1:2
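
A minimal sketch of decoding these SET comment lines back into string labels; this is the hypothetical scheme proposed above, not an existing attelo format:

```python
# Sketch: mapping integer feature values back to string labels via the
# proposed "# SET n label1 label2 ..." comment lines.
def parse_set_comment(line):
    # e.g. "# SET 0 elaboration narration continuation"
    parts = line.lstrip("# ").split()
    assert parts[0] == "SET"
    return int(parts[1]), parts[2:]

def decode(value, labels):
    # Values start from 1; 0 is the distinguished 'not present' value.
    return None if value == 0 else labels[value - 1]

idx, labels = parse_set_comment("# SET 0 elaboration narration continuation")
print(decode(1, labels))  # elaboration
```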

Killing the attachment/relations input file distinction

Maybe we should merge the two into a single relations feature file. If we really wanted to have distinct feature sets for attachment and labelling prediction, we could use the attelo configuration file mechanism to say e.g. "features 1 8 7 9 are only used for attachment"…
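
Restricting a shared feature matrix to an attachment-only subset would be a simple column selection on the sparse matrix; a sketch, where the "features 1 8 7 9" config line is the hypothetical example above and feature numbers are assumed 1-based as in svmlight files:

```python
# Sketch: selecting an attachment-only subset of columns from a shared
# sparse feature matrix.
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.arange(30).reshape(3, 10))  # toy 3x10 feature matrix

attach_features = [1, 8, 7, 9]            # from the (hypothetical) config
cols = [f - 1 for f in attach_features]   # svmlight features are 1-based
X_attach = X[:, cols]                     # columns, in the order given
print(X_attach.toarray())
```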

See also

https://github.com/kowey/attelo/issues/11
