To do: similarity index to find frequently edited configuration files. #28

Open
SankBad opened this issue Mar 24, 2021 · 0 comments

SankBad commented Mar 24, 2021

Given the large variety of configuration file types, we want to find the configuration files that users edit frequently, since those are the most interesting for our data analysis.

We can define a similarity index to find frequently edited configuration files. The index estimates the fraction of users who edited a configuration file by measuring how similar the copies of that file are across different systems. We assume that frequently edited config files have a lower similarity index. For example, if sssd.conf is frequently edited by users on their systems, we should observe a low similarity index when we compare all copies of that file.
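To make the idea concrete, here is a minimal sketch of one possible index. It is an assumption, not a settled definition: the index is taken to be the mean pairwise Jaccard similarity over the sets of non-blank lines in each copy of a file. The names `jaccard` and `similarity_index` are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_index(copies: list[str]) -> float:
    """Mean pairwise Jaccard similarity over copies of one config file.

    Assumption: each copy is reduced to the set of its stripped,
    non-blank lines. Other representations are possible.
    """
    line_sets = [
        {line.strip() for line in text.splitlines() if line.strip()}
        for text in copies
    ]
    pairs = list(combinations(line_sets, 2))
    if not pairs:
        return 1.0  # fewer than two copies: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

With this definition, a file that is identical on every system scores 1.0, and the score drops toward 0 as the copies diverge.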

There are potentially two types of methods to calculate the similarity index:
Parsing-based methods: We parse the configuration files into key-value pairs, then apply a similarity measure such as Jaccard similarity or cosine similarity to calculate the similarity index.
NLP-based methods: We use NLP techniques such as n-grams or Doc2Vec to create intermediate representations of the files, then apply a similarity measure such as Jaccard similarity or cosine similarity to calculate the similarity index.
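The two approaches could be sketched as follows. This is only an illustration under stated assumptions: the parser handles simple `key = value` lines and ignores section names, the NLP variant uses character trigrams rather than Doc2Vec (which would need an external library), and all function names are hypothetical.

```python
import math
from collections import Counter


def parse_key_values(text: str) -> set:
    """Parse 'key = value' lines into a set of (key, value) pairs.

    Assumption: blank lines, comments (# or ;) and [section] headers
    are skipped, so keys are not scoped by section.
    """
    pairs = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;[" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs.add((key.strip(), value.strip()))
    return pairs


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up example contents, not taken from a real system.
default_conf = "[sssd]\nservices = nss, pam\ndomains = example\n"
edited_conf = "[sssd]\nservices = nss, pam, sudo\ndomains = example\n"

kv_score = jaccard(parse_key_values(default_conf), parse_key_values(edited_conf))
ngram_score = cosine(char_ngrams(default_conf), char_ngrams(edited_conf))
```

One design difference worth noting: the key-value variant treats any change to a value as a complete mismatch for that key, while the n-gram variant gives partial credit for small textual edits, so the two can rank the same files differently.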

Acceptance criteria: Given a set of configuration files, we can compute a similarity index for each one that indicates which config files are frequently edited.
