scikit
Mathieu Morey edited this page Jan 24, 2015 · 15 revisions
Transitioning from Orange to scikit-learn
Rough work plan
- noop
  - model persistence (we get this for free; pickle) (switched to joblib)
- replace
  - core table management (rows, columns)
  - reading table from file - svmlight format only
  - naive bayes classifier
- break (mark for future repair)
  - meta-features (the EDUS file)
  - last decoder (needs meta-features)
  - maxent (hopefully wraps more simply) (it's just logistic regression!)
  - orange format files (irit-stac will have to stay pinned to master) <== WE ARE HERE 2015-01-23
  - perceptron (Pascal knows what he's doing)
- ditch (mark for future replacement)
  - classifiers: svm, majority (added back)
- refactor - some things that sklearn does for us that we could maybe nuke outright
  - enfolding
  - cross validation
  - scoring, confidence intervals, etc
- inert/meta features in libsvm? (some features needed for decoding but not learning: edu ids, spans)
Now in the scikit branch documentation
If these aren't already one-hot-encoded by feature extraction, one way to tie them back together would be to annotate the files with space-delimited lists giving the string labels that go with the values of such features. In this scheme:
- NB: just realised we also need a mechanism to indicate that some features are categorical, so this could double as that
- SET identifies the value associated with discrete features, starting from 1 (0 used for 'not present')
- For relations, we would just use SET 0
- [MM]: see http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3
0 1:1
3 1:2
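To make the annotation scheme concrete, here is a minimal sketch of a reader for it, using only the Python standard library. It rests on assumptions not confirmed by the notes above: that `SET n` names the labels for feature index `n` (with `SET 0` naming the relation labels for the target column), and that categorical values are integer-coded. The function names are hypothetical and not part of attelo.

```python
EXAMPLE = """\
# SET 0 elaboration narration continuation
# SET 1 x y z
1 1:1
0 1:2
2 1:3
0 1:1
3 1:2
"""


def parse_annotated_svmlight(lines):
    """Parse svmlight-style rows plus '# SET <n> <labels...>' comment lines.

    Returns (label_sets, rows), where label_sets maps a SET number to its
    list of string labels and rows is a list of
    (target, {feature_index: value}) pairs.
    """
    label_sets = {}
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line[1:].split()
            # Only '# SET <n> ...' comments carry label information.
            if len(fields) >= 2 and fields[0] == "SET":
                label_sets[int(fields[1])] = fields[2:]
            continue
        target, *pairs = line.split()
        features = {}
        for pair in pairs:
            index, value = pair.split(":")
            # Assumption: categorical values are integer-coded.
            features[int(index)] = int(value)
        rows.append((int(target), features))
    return label_sets, rows


def decode(value, labels):
    """Map a 1-based value to its string label; 0 means 'not present'."""
    return None if value == 0 else labels[value - 1]


def readable_rows(label_sets, rows):
    """Replace integer codes with string labels wherever a SET is declared."""
    out = []
    for target, features in rows:
        relation = decode(target, label_sets[0])
        named = {
            idx: decode(val, label_sets[idx])
            for idx, val in features.items()
            if idx in label_sets
        }
        out.append((relation, named))
    return out
```

Because the `SET` declarations live in comment lines, a loader that ignores `#` comments would see only the numeric rows. The declared sets also tell a reader which feature indices are categorical, which is the double duty mentioned in the NB above.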
In the future, if we really wanted to have distinct feature sets for attachment and labelling prediction, we could use the attelo configuration file mechanism to say, e.g., "features 1 8 7 9 are only used for attachment"…
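As a rough illustration of that idea, the sketch below reads a list of attachment-only feature indices from an INI-style snippet and drops them from a row before labelling. The section name, the key name, and both helper functions are hypothetical; nothing here reflects attelo's actual configuration format.

```python
import configparser

# Hypothetical configuration snippet; the section and key are made up
# for illustration only.
HYPOTHETICAL_CONFIG = """\
[features]
attachment_only = 1 8 7 9
"""


def attachment_only_features(config_text):
    """Return the set of feature indices reserved for attachment."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser["features"]["attachment_only"]
    return {int(token) for token in raw.split()}


def features_for_labelling(row, attachment_only):
    """Drop attachment-only features from a {feature_index: value} row."""
    return {idx: val for idx, val in row.items() if idx not in attachment_only}
```

With this in place, the same feature file could feed both tasks, with the labelling model simply seeing a filtered view of each row.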
- attelo graph on a plain conll output should just show the edus and links between them
  - fancy feature: show the big-ball-of-yarn for input files
  - fancy feature: if you supply both, show the big-ball-of-yarn, but HIGHLIGHT THE CHOSEN LINKS
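One possible shape for this, sketched under assumptions: emit Graphviz DOT text, drawing every candidate link faintly and the chosen links in a bold colour. The function name and its inputs (plain EDU id strings and `(source, target)` pairs) are hypothetical, and the sketch assumes ids contain no double quotes.

```python
def to_dot(edus, candidate_links=(), chosen_links=()):
    """Render EDUs and links as Graphviz DOT text.

    candidate_links is the "big ball of yarn" (all links in the input);
    chosen_links are the links kept in the output and are highlighted.
    """
    chosen_links = list(chosen_links)
    chosen = set(chosen_links)
    lines = ["digraph attelo {"]
    for edu in edus:
        lines.append('  "{}";'.format(edu))
    # Unchosen candidates are drawn faintly so the chosen links stand out.
    for src, tgt in candidate_links:
        if (src, tgt) not in chosen:
            lines.append(
                '  "{}" -> "{}" [color=gray, style=dashed];'.format(src, tgt)
            )
    for src, tgt in chosen_links:
        lines.append('  "{}" -> "{}" [color=red, penwidth=2];'.format(src, tgt))
    lines.append("}")
    return "\n".join(lines)
```

Calling `to_dot(edus, chosen_links=links)` with no candidates covers the plain case (just the edus and the links between them); passing both gives the highlighted view.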