domain-exploration

These are the notebooks I used to look for patterns in harvard.edu domain names that might indicate whether a URL was valid (i.e. pointed to an actual website).

To get the data, I started with the list of 25K harvard.edu URLs (data/LF_Survey.csv) and engineered some features (e.g. parsed each URL into subdomains, got response status codes, assigned "KEEP/REMOVE/CHECK" values) to build the dataframe for exploring associations and domain names. This is the LF_DomainList_FeatureEngineering.ipynb notebook.

Run the LF_DomainList_FeatureEngineering.ipynb notebook in binder.
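
As a rough illustration of the URL-parsing step described above, here is a minimal sketch using only the Python standard library. The function name `subdomain_labels` and the `base` parameter are hypothetical and are not taken from the notebook, which may split URLs differently.

```python
from urllib.parse import urlparse


def subdomain_labels(url, base="harvard.edu"):
    """Return the host labels to the left of `base`, in left-to-right order.

    Hypothetical helper for illustration; not the notebook's own code.
    """
    # urlparse only populates .hostname when a scheme (or leading //) is present
    if "://" not in url:
        url = "http://" + url
    host = (urlparse(url).hostname or "").lower()
    if host == base or not host.endswith("." + base):
        return []  # bare base domain, or a host outside the base domain
    # Strip the trailing ".harvard.edu" and split what is left into labels
    return host[: -(len(base) + 1)].split(".")


print(subdomain_labels("https://news.seas.harvard.edu/some/page"))
```

Each label can then become its own column or list entry in the dataframe, next to the status code and the KEEP/REMOVE/CHECK value.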

Using the dataframe from that notebook, I looked at domain names and any correlation with the "REMOVE" value. This is the domain_name_exploration.ipynb notebook.

Run the domain_name_exploration.ipynb notebook in binder.
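
One simple way to look at how a single subdomain label relates to the "REMOVE" value is to compute, for each label, how often the sites containing it were marked REMOVE. The sketch below assumes each row is a (labels, decision) pair; the function name and the sample rows are made up for illustration and are not data or results from the notebook.

```python
from collections import defaultdict


def remove_rate_by_label(rows):
    """Map each label to (number of sites containing it, share marked REMOVE).

    Hypothetical helper for illustration; not the notebook's own code.
    """
    totals = defaultdict(int)
    removed = defaultdict(int)
    for labels, decision in rows:
        for label in set(labels):  # count a label once per site
            totals[label] += 1
            if decision == "REMOVE":
                removed[label] += 1
    return {label: (totals[label], removed[label] / totals[label]) for label in totals}


# Made-up rows, for illustration only
rows = [
    (["test", "seas"], "REMOVE"),
    (["www", "seas"], "KEEP"),
    (["test", "law"], "REMOVE"),
    (["www", "law"], "CHECK"),
]
rates = remove_rate_by_label(rows)
for label, (count, rate) in sorted(rates.items()):
    print(label, count, rate)
```

Labels with a high REMOVE share and a reasonable count would be natural candidates for an exclusion list.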

And finally, I used the apriori method to look at associations between subdomains and the "REMOVE" value to see if there were patterns involving multiple subdomains. This is the Apriori-association-exploration.ipynb notebook.

Run the Apriori-association-exploration.ipynb notebook in binder.
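
To make the idea concrete, here is a small pure-Python sketch of an Apriori-style search. It treats each site as a set of items (its subdomain labels plus its KEEP/REMOVE/CHECK value), finds the itemsets that meet a minimum support, and then reports the confidence of rules pointing at "REMOVE". The function names, the threshold, and the transactions are all assumptions made for illustration; the notebook may well use a library implementation instead.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori-style search. Returns {frozenset: support}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    current = [frozenset([item]) for item in {i for t in transactions for i in t}]
    result = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold
        level = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Join frequent k-sets into (k+1)-sets, then prune any candidate
        # that has an infrequent k-subset
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return result


def rules_to(target, itemsets):
    """Confidence of (itemset without target) -> target, per frequent itemset."""
    rules = {}
    for itemset, support in itemsets.items():
        if target in itemset and len(itemset) > 1:
            antecedent = itemset - {target}
            # The antecedent is frequent too, since it is a subset of a frequent set
            rules[antecedent] = support / itemsets[antecedent]
    return rules


# Made-up transactions, for illustration only
transactions = [
    {"test", "seas", "REMOVE"},
    {"test", "law", "REMOVE"},
    {"www", "seas", "KEEP"},
    {"test", "seas", "REMOVE"},
]
rules = rules_to("REMOVE", frequent_itemsets(transactions, min_support=0.5))
for antecedent, confidence in rules.items():
    print(sorted(antecedent), round(confidence, 2))
```

In this toy data the pair {test, seas} predicts REMOVE no better than the single label "test" does, which is the kind of outcome where multi-label rules add nothing over single labels.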

In the end, the apriori association rules were interesting but did not turn up any information beyond what looking at single domain names had already found. The list of subdomains used to exclude a site is in the exclusion_list.csv file.

derekjackson-das/domain-exploration
