Our word sense disambiguation algorithm will make use of the NLTK toolkit. It takes as input a list of word senses matched with a dictionary of features. The training set contains words tagged with word senses. The algorithm passes through the training set, extracts all of the features for each word into a dictionary, and adds the word and its dictionary to the list, which is then passed into a machine learning algorithm from the NLTK toolkit. The toolkit provides both the Decision List and Naive Bayes machine learning algorithms; we plan to use both in our experiments to determine which is more effective. If we have enough time, we will also implement the bootstrapping algorithm. After training the machine learning algorithm, the test file will be passed in and the senses of its words will be determined. Depending on how long the algorithm takes to execute, we may also attempt the all-words task, since the algorithm should be general enough to complete it.
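As a rough sketch of the training step described above, the following uses NLTK's Naive Bayes classifier on a list of (feature dictionary, sense) pairs. The toy feature names and sense tags here are invented for illustration; in the real system they would come from the tagged training set.

```python
# Minimal sketch of training and classifying with NLTK's Naive Bayes
# classifier. The feature dicts and sense labels below are toy data.
from nltk.classify import NaiveBayesClassifier

# Each training instance pairs a feature dictionary with a sense tag.
train = [
    ({"next_word": "account", "has(money)": True}, "bank/finance"),
    ({"next_word": "loan", "has(money)": True}, "bank/finance"),
    ({"next_word": "erosion", "has(river)": True}, "bank/shore"),
    ({"next_word": "fishing", "has(river)": True}, "bank/shore"),
]

classifier = NaiveBayesClassifier.train(train)

# Classify an unseen instance by its feature dictionary.
sense = classifier.classify({"next_word": "loan", "has(money)": True})
```

Swapping in `nltk.classify.DecisionTreeClassifier.train` in place of `NaiveBayesClassifier.train` would give the comparison point for the second algorithm.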
We plan to use collocation features, co-occurrence features, part-of-speech tagging, parser-based part-of-speech tagging, optimal word picking, and sentence length as features. We also plan to look for other promising features and try those out as well.
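A hypothetical extractor for some of these features might look like the following; the function name, feature-key format, and window size are our own choices, not fixed by the plan.

```python
# Sketch of a feature extractor producing collocation features (the
# exact word at fixed offsets from the target) and co-occurrence
# features (which words appear anywhere in a surrounding window),
# plus sentence length.

def extract_features(tokens, index, window=3):
    """Build a feature dict for the ambiguous word at tokens[index]."""
    features = {}
    # Collocation features: the exact word at each nearby offset.
    for offset in (-2, -1, 1, 2):
        pos = index + offset
        if 0 <= pos < len(tokens):
            features["word_at_%d" % offset] = tokens[pos]
    # Co-occurrence features: words appearing within the window.
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    for pos in range(lo, hi):
        if pos != index:
            features["contains(%s)" % tokens[pos].lower()] = True
    # Sentence length, another feature from the plan.
    features["sentence_length"] = len(tokens)
    return features

tokens = "He sat on the bank of the river".split()
feats = extract_features(tokens, tokens.index("bank"))
```

The part-of-speech features would be added the same way, with tags from an NLTK tagger or parser keyed by offset.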
We will test a variety of systems with different features turned on or off, using a stepwise feature combination approach to identify parsimonious but high-performing systems. A simple way to do this would be to run every system on every word and use backward stepwise regression, treating different words as replicates. This would require up to n*2^k runs, where n is the number of words and k is the number of features, which could take a while. To save time, we may instead use a forward stepwise approach. In the simplest forward approach, we would start by running k systems, each with a different single feature turned on. In each subsequent step, we would run up to k-1 new systems by taking the highest-performing system from the previous step and turning on one additional feature per new system.
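The forward stepwise procedure above can be sketched as the loop below. The `score` function here is a stand-in we invented so the loop is runnable; in the real experiments it would mean running a full system on the tagged words and measuring accuracy.

```python
# Sketch of forward stepwise feature selection: keep the best system
# so far and, at each step, try turning on each unused feature.

def score(feature_set):
    # Toy stand-in for evaluating a system: fixed per-feature weights.
    # A real score would come from running the system on the test data.
    weights = {"collocation": 0.30, "cooccurrence": 0.20,
               "pos": 0.25, "sentence_length": 0.05}
    return sum(weights.get(f, 0.0) for f in feature_set)

def forward_stepwise(all_features):
    selected, best_score = set(), 0.0
    improved = True
    while improved:
        improved = False
        # One candidate system per remaining feature.
        for f in all_features - selected:
            candidate = selected | {f}
            s = score(candidate)
            if s > best_score:
                best_score, best_add = s, f
                improved = True
        if improved:
            selected.add(best_add)
    return selected, best_score

features = {"collocation", "cooccurrence", "pos", "sentence_length"}
chosen, acc = forward_stepwise(features)
```

This runs at most k + (k-1) + ... + 1 systems instead of 2^k, stopping as soon as no added feature improves the score.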