Given the large variety of configuration files, we want to identify the ones that users edit frequently, since those are the most interesting for our data analysis.
We can define a similarity index to find frequently edited configuration files. The similarity index serves as an estimate of the fraction of users who edited a configuration file, and is calculated by comparing copies of that file collected from different systems. We assume that frequently edited config files have a lower similarity index. For example, if sssd.conf is frequently edited by users on their systems, we should observe a low similarity index when we compare all copies of that file. Files with a low similarity index are therefore the ones of interest for our analysis.
There are two candidate types of methods for calculating the similarity index.
Parsing-based methods: We parse the configuration files to convert them into key-value pairs. Next, we apply a similarity measure such as Jaccard similarity or cosine similarity to calculate the similarity index.
NLP-based methods: We use NLP techniques such as N-grams or Doc2Vec to create intermediate representations of the files. Next, we apply a similarity measure such as Jaccard similarity or cosine similarity to calculate the similarity index.
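A minimal sketch of both approaches, using only the Python standard library. It assumes a simple `key = value` file syntax, character trigrams as the N-gram representation, and the mean pairwise similarity as the similarity index. None of these choices is fixed by this issue, and all function names are hypothetical. Doc2Vec is not covered here.

```python
from collections import Counter
from itertools import combinations
import math


def parse_key_values(text):
    """Parse 'key = value' lines; skip blanks, comments and section headers.

    Assumption: INI-like syntax. Real config formats may need a proper parser.
    """
    pairs = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            pairs.add((key.strip(), value.strip()))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty files are treated as identical
    return len(a & b) / len(a | b)


def char_ngrams(text, n=3):
    """Count character n-grams (assumption: n=3) as a bag-of-n-grams vector."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity of two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_index(items, measure):
    """Mean pairwise similarity over all copies of one config file.

    Assumption: the index is the average over all unordered pairs.
    """
    pairs = list(combinations(items, 2))
    if not pairs:
        return 1.0  # a single copy is trivially similar to itself
    return sum(measure(x, y) for x, y in pairs) / len(pairs)
```

For the copies of one file (for example, every collected sssd.conf), `similarity_index([parse_key_values(t) for t in texts], jaccard)` would give the parsing-based index and `similarity_index([char_ngrams(t) for t in texts], cosine)` the N-gram-based one. Under the assumption stated above, files with the lowest index are the candidates for "frequently edited".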
Acceptance criteria: Given a set of configuration files, a similarity index is calculated that indicates which config files are frequently edited.