kingrichard2005/cs578-project3

cs578-project3

This project implements the Naive Bayes and K-Nearest Neighbor machine learning classification algorithms. Each classifier is trained on an existing set of pre-labeled medical records and is then used to classify scrubbed medical patient records into one of three categories: Smokers, Non-Smokers, or Unknown.

Medical Record Format Schema:
<ROOT>
  <RECORD ID="1">
    <SMOKING STATUS="SMOKER"></SMOKING>
    <TEXT>Patient annotations</TEXT>
  </RECORD>
</ROOT>

Medical Record DTD:
<!DOCTYPE ROOT [
<!ELEMENT ROOT (RECORD*)>
<!ELEMENT RECORD (SMOKING,TEXT)>
<!ATTLIST RECORD ID CDATA #REQUIRED>
<!-- STATUS carries the class label of the record -->
<!ELEMENT SMOKING EMPTY>
<!ATTLIST SMOKING STATUS CDATA #IMPLIED>
<!ELEMENT TEXT (#PCDATA)>
]>
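
A minimal sketch of reading records in this format with Python's standard `xml.etree.ElementTree` module. The function name `load_records` and the inline `SAMPLE` string are illustrative assumptions; they are not taken from this project's code.

```python
import xml.etree.ElementTree as ET

# Inline copy of the example record set shown above (illustrative only).
SAMPLE = """<ROOT>
  <RECORD ID="1">
    <SMOKING STATUS="SMOKER"></SMOKING>
    <TEXT>Patient annotations</TEXT>
  </RECORD>
</ROOT>"""


def load_records(xml_text):
    """Return a list of (record_id, label, text) tuples from a record set."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter("RECORD"):
        smoking = record.find("SMOKING")
        text = record.find("TEXT")
        # A record without a SMOKING element is treated as unlabeled.
        label = smoking.get("STATUS") if smoking is not None else None
        body = (text.text or "").strip() if text is not None else ""
        records.append((record.get("ID"), label, body))
    return records


records = load_records(SAMPLE)
```

Each tuple pairs a record's ID with its STATUS label and annotation text, which is the shape a training routine would typically consume.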

##########
##########
usage: Naive Bayes Classifier [-t TRAININGSET] [-m MU] [-s]

Trains a Naive Bayes classifier to label patients as SMOKING, NON-SMOKING or
UNKNOWN based on available (scrubbed) medical record information.

optional arguments:
  -t TRAININGSET, --trainingSet TRAININGSET
                        The path to the labeled medical record training set
                        file
  -m MU, --mu MU        Tuning parameter for the Naive Bayes classifier,
                        default is the number of unique terms across all
                        documents in the training set
  -s, --BayesianSmoothing
                        Specifies the Bayesian estimate for parameter smoothing
                        in the Naive Bayes classifier, default is Dirichlet
                        smoothing which considers mu
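
To illustrate how mu enters the estimate, here is a minimal sketch of a multinomial Naive Bayes classifier with Dirichlet smoothing. It assumes the common formulation P(w|c) = (tf(w,c) + mu * P(w|C)) / (|c| + mu), where P(w|C) is the term's relative frequency in the whole training set. The function names `train` and `classify`, the whitespace tokenization, and the toy documents are assumptions for illustration and may differ from what this project actually does.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (label, tokens). Returns the counts the classifier needs."""
    class_docs = Counter()   # number of documents per label
    class_terms = {}         # label -> Counter of term frequencies
    collection = Counter()   # term frequencies over the whole training set
    for label, tokens in docs:
        class_docs[label] += 1
        class_terms.setdefault(label, Counter()).update(tokens)
        collection.update(tokens)
    return class_docs, class_terms, collection


def classify(tokens, model, mu=None):
    """Return the label with the highest Dirichlet-smoothed log posterior."""
    class_docs, class_terms, collection = model
    total_docs = sum(class_docs.values())
    total_terms = sum(collection.values())
    if mu is None:
        mu = len(collection)  # assumed default: number of unique terms
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        terms = class_terms[label]
        class_len = sum(terms.values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokens:
            if w not in collection:
                continue  # skip terms never seen in training
            p_coll = collection[w] / total_terms
            score += math.log((terms[w] + mu * p_coll) / (class_len + mu))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical, not from the project's data set).
docs = [
    ("SMOKING", "patient smokes one pack daily".split()),
    ("NON-SMOKING", "patient denies tobacco use".split()),
    ("UNKNOWN", "no social history recorded".split()),
]
model = train(docs)
label = classify("denies tobacco".split(), model)
```

A larger mu pulls every class's term probabilities toward the collection-wide frequencies, while a smaller mu trusts the per-class counts more.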
                        
##########
##########
usage: K-Nearest Neighbor Classifier [-h] [-t TRAININGSET] [-r TERMRANKINGS]
                                     [-a ASSOCFUNC] [-K KNEIGHBORS]
                                     [-s SIMILARITY_FUNC] [-S SAMPLETYPE]

Trains a K-Nearest Neighbor classifier to label patients as SMOKING,
NON-SMOKING or UNKNOWN based on available (scrubbed) medical record information.

optional arguments:
  -h, --help            show this help message and exit
  -t TRAININGSET, --trainingSet TRAININGSET
                        Path to the labeled medical record training set file,
                        e.g. ./path/to/training.txt
  -r TERMRANKINGS, --termrankings TERMRANKINGS
                        Path to term rankings pickle file, e.g.
                        ./path/to/termRankings.p
  -a ASSOCFUNC, --associationFunction ASSOCFUNC
                        The association function used to compare the relevancy
                        of a term to a specific class label
                        [default=chi-square|dice].
  -K KNEIGHBORS, --kNeighbors KNEIGHBORS
                        Total neighbors to sample.
  -s SIMILARITY_FUNC, --similarity SIMILARITY_FUNC
                        The similarity function used to compare an unlabeled
                        example to a labeled k-th neighbor
                        [default=euclidean|manhatten|minkowski].
  -S SAMPLETYPE, --sampleType SAMPLETYPE
                        The method used to sample 'K' records from each label
                        subset; the top 'K' are the records with the max
                        combined term relevance score [default=Krandom|topK].
                        This sample type doesn't apply when using the hamming
                        distance similarity.
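
The distance options above can be sketched as follows. This is a plain majority-vote K-Nearest Neighbor example over numeric feature vectors. It does not model the per-label sampling (Krandom or topK) or the term-ranking step described in the options, and the function names, the Minkowski order `p=3`, and the toy vectors are assumptions made for illustration.

```python
from collections import Counter


def minkowski(a, b, p=3):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Euclidean distance is the Minkowski distance with p = 2."""
    return minkowski(a, b, p=2)


def manhattan(a, b):
    """Manhattan distance is the Minkowski distance with p = 1."""
    return minkowski(a, b, p=1)


def knn_label(query, labeled, k=3, distance=euclidean):
    """labeled: list of (vector, label). Majority label among the k nearest."""
    nearest = sorted(labeled, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy feature vectors (hypothetical, e.g. weights of two ranked terms).
labeled = [
    ([3.0, 0.0], "SMOKING"),
    ([2.0, 1.0], "SMOKING"),
    ([0.0, 3.0], "NON-SMOKING"),
    ([0.0, 2.0], "NON-SMOKING"),
    ([0.0, 0.0], "UNKNOWN"),
]
label = knn_label([2.5, 0.5], labeled, k=3)
```

Passing `distance=manhattan` or `distance=minkowski` to `knn_label` swaps the similarity function without changing the voting logic.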
