Features:
binary: Set of words (vectorized)
binary + weighting: binary vector multiplied with weights
frequency: Bag of words (vectorized)
frequency + weight: some function, e.g. log_2(freq_in_body) + 10*log_2(freq_in_header)
possible weighting schemes:
Word is in title ( tags)
Word is contained in body
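As a rough illustration of the four feature variants above, here is a minimal Python sketch. The function names, the toy vocabulary, and the `header_weight` parameter are assumptions made for this example. It also uses log_2(1 + freq) rather than log_2(freq), which is an assumption added here so that words with a frequency of zero do not produce an undefined logarithm.

```python
import math
from collections import Counter


def binary_features(tokens, vocabulary):
    """Set of words: 1 if the word occurs at all, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def frequency_features(tokens, vocabulary):
    """Bag of words: raw occurrence count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def weighted_frequency_features(body_tokens, header_tokens, vocabulary,
                                header_weight=10.0):
    """log2(1 + freq_in_body) + header_weight * log2(1 + freq_in_header).

    The "+ 1" inside each logarithm is an assumption of this sketch,
    added so that absent words map to 0 instead of log2(0).
    """
    body = Counter(body_tokens)
    header = Counter(header_tokens)
    return [
        math.log2(1 + body[word]) + header_weight * math.log2(1 + header[word])
        for word in vocabulary
    ]


# Hypothetical toy data, for illustration only.
vocabulary = ["cheap", "flights", "news"]
body = ["cheap", "flights", "cheap", "cheap"]
header = ["flights"]

binary_vec = binary_features(body, vocabulary)
freq_vec = frequency_features(body, vocabulary)
weighted_vec = weighted_frequency_features(body, header, vocabulary)
```

The "binary + weighting" variant would follow the same pattern: multiply each entry of `binary_vec` by a per-word weight of our choosing.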
Unsupervised: clustering. We name the resulting clusters manually afterwards, by looking at the centre and the extreme points of each cluster. We can also vary the number of clusters we ask for, and see what is found when the number of clusters is not fixed in advance.
Supervised: we label about 100 URLs manually ourselves, then have about 1,000 to 5,000 more labelled by hand externally. We can then at least train on this labelled set and try to predict the labels of the remaining URLs.
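To make the idea of picking centre and extreme points concrete, here is a small sketch. It assumes the clustering itself has already been done by some algorithm and that we hold the URLs and feature vectors of one cluster. The function names and the example URLs are hypothetical, and Euclidean distance is an assumption of this sketch.

```python
import math


def centroid(vectors):
    """Component-wise mean of the vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def centre_and_extreme(urls, vectors):
    """Return the URL closest to the centroid and the URL farthest from it."""
    c = centroid(vectors)
    distances = [math.dist(v, c) for v in vectors]  # Euclidean distance
    centre_url = urls[distances.index(min(distances))]
    extreme_url = urls[distances.index(max(distances))]
    return centre_url, extreme_url


# Hypothetical members of a single cluster, for illustration only.
cluster_urls = ["a.example/1", "a.example/2", "a.example/3"]
cluster_vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]

centre_url, extreme_url = centre_and_extreme(cluster_urls, cluster_vectors)
```

Looking at the page behind `centre_url` gives a typical member to name the cluster after, while `extreme_url` shows how far the cluster stretches.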
Process:
Detect language
Remove stop words
Depending on the language, we may use stemming or other reduction schemes
Create the (weighted) sets and bags of words to learn on
Randomly select URLs to be manually labelled (for supervised only)
Run analysis on the dataset
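The preprocessing steps could be sketched roughly as follows. This is only an outline under stated assumptions: the stop word list is a tiny made-up sample, language detection and stemming are left out (the language is passed in by hand), and all function names are hypothetical.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop word list; a real one would be much longer
# and would exist per detected language.
STOP_WORDS = {"en": {"the", "a", "is", "of", "and"}}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def preprocess(text, language="en"):
    """Tokenize and drop stop words for the given language code."""
    stops = STOP_WORDS.get(language, set())
    return [token for token in tokenize(text) if token not in stops]


def pick_for_labelling(urls, k, seed=0):
    """Randomly select k URLs for manual labelling (supervised case only).

    A fixed seed keeps the selection reproducible.
    """
    rng = random.Random(seed)
    return rng.sample(urls, k)


tokens = preprocess("The price of the cheap flights is cheap")
word_set = set(tokens)      # binary features are built from this
word_bag = Counter(tokens)  # frequency features are built from this

# Hypothetical URL list, for illustration only.
all_urls = [f"site.example/page{i}" for i in range(20)]
to_label = pick_for_labelling(all_urls, k=5)
```

Seeding the random generator means the same URLs are selected on every run, which helps if the labelling is split across several people or sessions.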