Mathieu Morey edited this page Jan 24, 2015 · 15 revisions

Transitioning from Orange to scikit-learn

Plan

Rough work plan

  1. noop
    1. model persistence (we get this for free; pickle) (switched to joblib)
  2. replace
    1. core table management (rows, columns)
    2. reading table from file - svmlight format only
    3. naive bayes classifier
  3. break (mark for future repair)
    1. meta-features (the EDUS file)
    2. last decoder (needs metafeatures)
    3. maxent (hopefully simpler to wrap) (it's just logistic regression!)
    4. orange format files (irit-stac will have to stay pinned to master) <== WE ARE HERE 2015-01-23
    5. perceptron (Pascal knows what he's doing)
  4. ditch (mark for future replacement)
    1. classifiers: svm, majority (added back)
  5. refactor - some things that sklearn does for us that we could maybe nuke outright
    1. enfolding
    2. cross validation
    3. scoring, confidence intervals, etc
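
As a rough illustration of the "replace" and "refactor" items above, here is a minimal sketch of what scikit-learn and joblib give us out of the box: reading svmlight input, a naive Bayes classifier, maxent as logistic regression, cross validation, and model persistence. The toy data, variable names and file name are made up for the example, and the import paths are those of recent scikit-learn and joblib releases (older releases used different module paths).

```python
import io
import os
import tempfile

import joblib
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data in svmlight format: "<label> <index>:<value> ..."
SVMLIGHT_TEXT = b"""\
1 1:1 2:3
1 1:2 2:3
1 1:1 2:4
1 1:2 2:4
0 1:3 2:1
0 1:4 2:1
0 1:3 2:2
0 1:4 2:2
"""

# Reading the table: X is a sparse matrix, y an array of labels.
X, y = load_svmlight_file(io.BytesIO(SVMLIGHT_TEXT))

# Cross validation and scoring are handled by scikit-learn (2 folds here).
nb_scores = cross_val_score(MultinomialNB(), X, y, cv=2)

# "maxent" is just logistic regression.
maxent = LogisticRegression().fit(X, y)

# Model persistence with joblib: dump to disk, then load it back.
with tempfile.TemporaryDirectory() as tmpdir:
    model_path = os.path.join(tmpdir, "maxent.joblib")  # hypothetical file name
    joblib.dump(maxent, model_path)
    restored = joblib.load(model_path)
```

None of this says how attelo itself should wrap these calls; it only shows that the pieces exist.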

Issues

  • inert/meta features in libsvm? (some features needed for decoding but not learning: edu ids, spans)

EDU meta features for decoding

Now in the scikit branch documentation

Set value features?

If these aren't already one-hot-encoded by feature extraction, one way to tie them back together would be to annotate the files with space-delimited lists giving the string labels that go with the values of such features. In this scheme, a file might look like this:

# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3
0 1:1
3 1:2
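
A minimal sketch of how such annotations could be read back, assuming the `# SET <index> <labels...>` comment layout shown above. The function names are hypothetical, and so is the value convention: the sketch assumes values are 1-based positions in the label list, with anything out of range (such as 0) meaning "no label". The example does not settle that question.

```python
EXAMPLE = """\
# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3
0 1:1
3 1:2
"""


def parse_set_headers(lines):
    """Collect '# SET <index> <label> ...' comment lines into {index: [labels]}."""
    label_sets = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "SET":
            label_sets[int(parts[2])] = parts[3:]
    return label_sets


def decode_value(label_sets, set_index, value):
    """Map a numeric value back to its string label.

    Assumption: values are 1-based positions in the label list;
    out-of-range values (such as 0) have no label and yield None.
    """
    labels = label_sets.get(set_index, [])
    if 1 <= value <= len(labels):
        return labels[value - 1]
    return None


label_sets = parse_set_headers(EXAMPLE.splitlines())
```

With this convention, `decode_value(label_sets, 1, 2)` gives back `"y"`.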

Killing the attachment/relations input file distinction (DONE)

In the future, if we really wanted distinct feature sets for attachment and labelling prediction, we could use the attelo configuration file mechanism to say, e.g., "features 1 8 7 9 are only used for attachment"…
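
One way this could work, sketched with plain dictionaries rather than the real attelo configuration mechanism (whose format is not described here). The `ATTACH_ONLY` set, the task names and `select_features` are all hypothetical; rows are assumed to map feature index to value.

```python
# Hypothetical setting: features used for attachment but hidden from labelling.
ATTACH_ONLY = frozenset({1, 8, 7, 9})


def select_features(row, task, attach_only=ATTACH_ONLY):
    """Return the features of `row` (a {index: value} dict) visible to `task`.

    Attachment sees every feature; labelling sees everything except
    the attachment-only ones.
    """
    if task == "attach":
        return dict(row)
    if task == "label":
        return {i: v for i, v in row.items() if i not in attach_only}
    raise ValueError("unknown task: %r" % (task,))


row = {1: 0.5, 2: 1.0, 8: 3.0, 10: 2.0}
attach_view = select_features(row, "attach")
label_view = select_features(row, "label")
```

Both tasks would still read the same input file; only the view of each row differs.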

attelo graph

  • attelo graph on a plain conll output should just show the edus and links between them
  • fancy feature: show the big-ball-of-yarn for input files
  • fancy feature: if you supply both, show the big-ball-of-yarn, but HIGHLIGHT THE CHOSEN LINKS

See also

https://github.com/kowey/attelo/issues/11