scikit

Transitioning from Orange to scikit-learn

Plan

Rough work plan

~~noop~~
1. ~~model persistence (we get this for free; pickle)~~ (switched to joblib)
~~replace~~
1. ~~core table management (rows, columns)~~
2. ~~reading table from file - svmlight format only~~
3. ~~naive bayes classifier~~
break (mark for future repair)
1. ~~meta-features~~ (the EDUS file)
2. ~~last decoder (needs metafeatures)~~
3. ~~maxent (hopefully we can simply wrap it)~~ (it's just logistic regression!)
4. orange format files (irit-stac will have to stay pinned to master) <== WE ARE HERE 2015-01-23
5. perceptron (Pascal knows what he's doing)
ditch (mark for future replacement)
1. ~~classifiers: svm, majority~~ (added back)
refactor - some things that sklearn does for us that we could maybe nuke outright
1. enfolding
2. cross validation
3. scoring, confidence intervals, etc
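
A rough sketch of what the plan above leans on scikit-learn for: reading svmlight data, logistic regression standing in for maxent, cross validation, and joblib persistence. Everything here is an assumption for illustration: the sample data is made up, the helper names (`load_sample`, `cross_validate`, `roundtrip`) are hypothetical and not attelo code, and the imports follow current scikit-learn (`sklearn.model_selection`), which may differ from the version in use when these notes were written.

```python
import io
import os
import tempfile

import joblib
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Made-up svmlight sample: "<label> <index>:<value> ..." per line
SAMPLE = b"""\
1 1:0.9 2:0.1
1 1:0.8 2:0.2
1 1:0.7 2:0.1
0 1:0.1 2:0.9
0 1:0.2 2:0.8
0 1:0.1 2:0.7
"""


def load_sample():
    """Read the in-memory sample; returns a sparse matrix and a label array."""
    return load_svmlight_file(io.BytesIO(SAMPLE))


def cross_validate(X, y, n_splits=3):
    """Let sklearn do the fold assignment and scoring."""
    folds = StratifiedKFold(n_splits=n_splits)
    return cross_val_score(LogisticRegression(), X, y, cv=folds)


def roundtrip(model):
    """Save a fitted model with joblib and load it straight back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "model.joblib")
        joblib.dump(model, path)
        return joblib.load(path)


if __name__ == "__main__":
    X, y = load_sample()
    print(cross_validate(X, y))
    model = LogisticRegression().fit(X, y)
    restored = roundtrip(model)
    print(restored.predict(X))
```

If this holds up, the enfolding, cross validation and scoring items could shrink to thin wrappers around `cross_val_score` or go away entirely.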

Issues

inert/meta features in libsvm? (some features needed for decoding but not learning: edu ids, spans)

EDU meta features for decoding

Now in the scikit branch documentation

Set value features?

If these aren't already one-hot encoded by feature extraction, one way to tie them back together would be to annotate the files with space-delimited lists giving the string labels that go with the values of such features. In this scheme:

NB: just realised we also need a mechanism to indicate that some features are categorical, so this could double as that
SET identifies the value associated with discrete features, starting from 1 (0 used for 'not present')
For relations, we would just use SET 0
[MM]: see http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3 
0 1:1
3 1:2
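
A minimal sketch of how the `# SET` annotations above could be read back. It assumes one reading of the scheme: `SET n` names the string values for feature index `n`, `SET 0` covers the label column, and value 0 means 'not present'. The function names are hypothetical, not part of attelo or scikit-learn, and only the standard library is used.

```python
EXAMPLE = """\
# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3
0 1:1
3 1:2
""".splitlines()


def parse_set_headers(lines):
    """Collect '# SET <index> <label>...' comment lines into a dict."""
    sets = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "SET":
            sets[int(parts[2])] = parts[3:]
    return sets


def decode_value(sets, index, value):
    """Map a 1-based value to its string label; 0 means 'not present'."""
    if value == 0:
        return None
    return sets[index][value - 1]


def decode_rows(lines):
    """Return (label, {feature index: value}) pairs with SET values named."""
    sets = parse_set_headers(lines)
    rows = []
    for line in lines:
        body = line.split("#", 1)[0].split()  # drop comments
        if not body:
            continue
        label = decode_value(sets, 0, int(body[0]))
        feats = {}
        for pair in body[1:]:
            idx_str, val_str = pair.split(":")
            idx = int(idx_str)
            if idx in sets:
                # categorical feature: recover its string value
                feats[idx] = decode_value(sets, idx, int(val_str))
            else:
                # anything without a SET line is treated as numeric
                feats[idx] = float(val_str)
        rows.append((label, feats))
    return rows


if __name__ == "__main__":
    for row in decode_rows(EXAMPLE):
        print(row)
```

Under these assumptions the first data line decodes to `('elaboration', {1: 'x'})` and the second to `(None, {1: 'y'})`. The presence of a SET line would also double as the "this feature is categorical" marker mentioned in the NB.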

Killing the attachment/relations input file distinction (DONE)

In the future, if we really wanted to have distinct feature sets for attachment and labelling prediction, we could use the attelo configuration file mechanism to say eg. "features 1 8 7 9 are only used for attachment"…

attelo graph

attelo graph on a plain conll output should just show the edus and links between them
fancy feature: show the big-ball-of-yarn for input files
fancy feature: if you supply both, show the big-ball-of-yarn, but HIGHLIGHT THE CHOSEN LINKS
