These are the notebooks I used to look for patterns in harvard.edu domain names that might indicate whether a URL was valid (i.e. pointed to an actual website).
To get the data, I started with the list of 25K harvard.edu URLs (data/LF_Survey.csv) and engineered some features (e.g. parsed each URL into subdomains, fetched response status codes, assigned "KEEP/REMOVE/CHECK" values) to build the dataframe used for exploring domain names and associations. This is the LF_DomainList_FeatureEngineering.ipynb notebook.
Run the LF_DomainList_FeatureEngineering.ipynb notebook in binder.
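As a rough illustration of the URL-parsing feature, here is a minimal sketch using only the standard library. The function name `subdomain_labels` and the `base` parameter are made up for this example and are not taken from the notebook, which may do this differently. Status-code lookups and the KEEP/REMOVE/CHECK assignment are not shown.

```python
from urllib.parse import urlparse


def subdomain_labels(url, base="harvard.edu"):
    """Return the hostname labels to the left of `base`, left to right.

    Hypothetical helper for illustration only; not the notebook's code.
    """
    # urlparse only fills in the hostname when a scheme or leading "//" is present.
    if "//" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()

    suffix = "." + base
    # The bare base domain, or a host outside the base domain, has no subdomains.
    if not host.endswith(suffix):
        return []
    return host[: -len(suffix)].split(".")


if __name__ == "__main__":
    print(subdomain_labels("http://www.scholar.harvard.edu/some/page"))
    print(subdomain_labels("projects.iq.harvard.edu"))
```

Each label can then become its own column or list entry in the dataframe, which is the shape the later exploration steps need.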
Using the dataframe from that notebook, I looked at domain names and how they correlate with the "REMOVE" value. This is the domain_name_exploration.ipynb notebook.
Run the domain_name_exploration.ipynb notebook in binder.
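One simple way to look at the relationship between a single subdomain label and the "REMOVE" value is to compute, for each label, the fraction of URLs containing it that were marked REMOVE. The sketch below assumes the data has been reduced to `(labels, decision)` pairs; the function name and the sample rows are invented for illustration and are not data from the survey.

```python
from collections import Counter


def remove_rate_by_label(rows):
    """Map each subdomain label to the share of its URLs marked "REMOVE".

    `rows` is an iterable of (labels, decision) pairs, where `labels` is a
    list of subdomain labels and `decision` is "KEEP", "REMOVE" or "CHECK".
    """
    total = Counter()
    removed = Counter()
    for labels, decision in rows:
        # Count each label once per URL, even if it repeats in the hostname.
        for label in set(labels):
            total[label] += 1
            if decision == "REMOVE":
                removed[label] += 1
    return {label: removed[label] / total[label] for label in total}


if __name__ == "__main__":
    # Made-up rows, only to show the input shape.
    sample = [
        (["www", "test"], "REMOVE"),
        (["test"], "REMOVE"),
        (["www", "library"], "KEEP"),
        (["library"], "KEEP"),
    ]
    for label, rate in sorted(remove_rate_by_label(sample).items()):
        print(label, rate)
```

Labels with a rate close to 1 and a reasonable number of URLs behind them are the natural candidates for an exclusion list.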
And finally, I used the apriori method to look at associations between subdomains and the "REMOVE" value to see if there were patterns involving multiple subdomains. This is the Apriori-association-exploration.ipynb notebook.
Run the Apriori-association-exploration.ipynb notebook in binder.
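To make the idea concrete, here is a simplified, pure-Python sketch of apriori-style rule mining for rules of the form "set of subdomain labels -> REMOVE". It is not the notebook's implementation (which may rely on a library), and the function name, thresholds, and sample rows are assumptions for illustration. Support is the share of all URLs containing the label set; confidence is the share of those URLs marked REMOVE.

```python
from collections import Counter
from itertools import combinations


def remove_rules(rows, min_support=0.2, max_size=2):
    """Return (itemset, support, confidence) tuples for rules itemset -> REMOVE.

    `rows` is a list of (labels, decision) pairs. Simplified sketch only.
    """
    n = len(rows)
    rules = []
    allowed = None  # labels still worth combining; None means "all" at level 1
    for size in range(1, max_size + 1):
        seen, removed = Counter(), Counter()
        for labels, decision in rows:
            items = set(labels) if allowed is None else set(labels) & allowed
            for combo in combinations(sorted(items), size):
                seen[combo] += 1
                removed[combo] += decision == "REMOVE"
        frequent = [c for c in seen if seen[c] / n >= min_support]
        for combo in frequent:
            rules.append((combo, seen[combo] / n, removed[combo] / seen[combo]))
        # Apriori idea: a label that is in no frequent itemset at this level
        # cannot be part of a larger frequent itemset, so drop it.
        allowed = {label for combo in frequent for label in combo}
    return rules


if __name__ == "__main__":
    # Made-up rows, only to show the input shape.
    sample = [
        (["dev", "test"], "REMOVE"),
        (["dev", "test"], "REMOVE"),
        (["dev", "library"], "KEEP"),
        (["library"], "KEEP"),
        (["www"], "KEEP"),
    ]
    for itemset, support, confidence in remove_rules(sample, min_support=0.4):
        print(itemset, round(support, 2), round(confidence, 2))
```

A multi-label rule is only interesting if its confidence is noticeably higher than that of each single label it contains; otherwise the single-label view already captures the pattern.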
In the end, the apriori association rules were interesting but did not turn up anything beyond what looking at single domain names had already found. The final list of subdomains used to exclude a site is in the exclusion_list.csv file.