
Commit Message functionalities #277

Open
Leo-Send opened this issue Jan 24, 2025 · 5 comments
Leo-Send commented Jan 24, 2025

Description

Currently, commit messages are included in coronet but we do not provide any functionality for them. This issue should serve as a collection of ideas and potential functionalities or metrics that we might want to add.

Functionalities

  • NLP functionalities: stemming, tokenization, lemmatization, stop word removal
  • keyword search
  • regex search
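A minimal base-R sketch of how keyword and regex search over commit messages could look; the data frame layout ('hash', 'message') and the function name are assumptions for illustration, not coronet's actual API:

```r
## Hypothetical search helper: 'fixed = TRUE' treats the pattern as a
## literal keyword, otherwise it is interpreted as a regular expression.
search.commit.messages <- function(commits, pattern, fixed = FALSE) {
    matches <- grepl(pattern, commits[["message"]], fixed = fixed, ignore.case = !fixed)
    return(commits[matches, ])
}

commits <- data.frame(
    hash = c("a1", "b2", "c3"),
    message = c("Fix bug in parser", "Add tests", "fix typo"),
    stringsAsFactors = FALSE
)

## regex search, case-insensitive: matches "Fix bug in parser" and "fix typo"
search.commit.messages(commits, "fix")
```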

Metrics

  • word count (per commit, average per developer/group of developers)
  • keyword count (per developer/group of developers)
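The word-count metric could be sketched in base R as follows; column names ('hash', 'author.name', 'message') are assumptions for illustration:

```r
## Sketch of the word-count metric: per commit and averaged per developer.
commits <- data.frame(
    hash = c("a1", "b2", "c3"),
    author.name = c("alice", "alice", "bob"),
    message = c("Fix bug in parser", "Add tests", "fix typo"),
    stringsAsFactors = FALSE
)

## word count per commit: split on whitespace and count the tokens
word.counts <- vapply(strsplit(commits[["message"]], "\\s+"), length, integer(1))
words.per.commit <- data.frame(hash = commits[["hash"]], word.count = word.counts)

## average word count per developer
words.per.author <- aggregate(word.counts, by = list(author = commits[["author.name"]]), FUN = mean)
```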
Leo-Send commented
I have a few comments about what we discussed today:
@hechtlC mentioned that he would prefer if the methods took a list of commits to work on rather than authors or a time range. There is, of course, the option to add helper methods later that allow for this if we find that these are common use cases.

Until now I have assumed that these methods would need the project/commit data to work on. Strictly speaking, however, they do not require it if we already give them the commits they are supposed to operate on. I would still pass them the project data so that callers do not need to provide the full commits as arguments and can instead pass the hashes or commit IDs of the desired commits.

In general, I see multiple options of handling these functionalities and metrics. The basic distinction is whether or not we want to change the commit data. If we want to use stemming on all commits, does it make more sense to return a list of stemmed commit messages, or should we replace the commit messages in the commit data? What should happen if we want to use stemming on only a subset of all commits?

Leo-Send commented
After further discussion we have decided to provide the functions with the project data and optionally with a list of commit hashes. Currently, I have implemented the stemming as follows:

  1. The function takes the project data, a list of hashes (default NULL, meaning all commits are considered), and a list of desired preprocessing steps. Currently implemented are "lowercase", "punctuation", "stopwords", and "whitespaces" for lowercasing and removal of punctuation, stop words, and extra whitespace, respectively. These are all enabled by default.
  2. Then a corpus is built out of the whole commit message, meaning title and body.
  3. Preprocessing steps are applied (here I use the package 'tm', which seems to be a well-maintained NLP package).
  4. Stemming is applied (again using 'tm').
  5. A data frame is returned containing the columns 'hash' and 'stemmed.message'.
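The steps above can be sketched as follows. This is a simplified reconstruction using the `tm` API, not the actual coronet implementation; 'commits' stands in for the commit data derived from the project data (already restricted to the given hashes), and column names are assumptions. Note that `stemDocument()` additionally requires the SnowballC package.

```r
library(tm)

get.stemmed.commit.messages <- function(commits,
                                        steps = c("lowercase", "punctuation",
                                                  "stopwords", "whitespaces")) {
    ## (2) build a corpus out of the whole commit messages (title and body)
    corpus <- VCorpus(VectorSource(commits[["message"]]))

    ## (3) apply the requested preprocessing steps
    if ("lowercase" %in% steps)
        corpus <- tm_map(corpus, content_transformer(tolower))
    if ("punctuation" %in% steps)
        corpus <- tm_map(corpus, removePunctuation)
    if ("stopwords" %in% steps)
        corpus <- tm_map(corpus, removeWords, stopwords("english"))
    if ("whitespaces" %in% steps)
        corpus <- tm_map(corpus, stripWhitespace)

    ## (4) apply stemming (requires SnowballC)
    corpus <- tm_map(corpus, stemDocument)

    ## (5) return a data frame with columns 'hash' and 'stemmed.message'
    return(data.frame(
        hash = commits[["hash"]],
        stemmed.message = sapply(corpus, as.character),
        stringsAsFactors = FALSE
    ))
}
```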

Now, the question is: what helper methods are desired for this? Do we want to disable some or all preprocessing by default?
I don't really see a reason not to apply preprocessing steps aside from a runtime concern, although I doubt that there would be a significant impact.

If you agree with the basic structure, I would implement the tokenization similarly.

@bockthom bockthom added this to the v5.1 milestone Feb 3, 2025

bockthom commented Feb 3, 2025

I have a few comments about what we discussed today: @hechtlC mentioned that he would prefer if the methods took a list of commits to work on rather than authors or a time range. There is, of course, the option to add helper methods later that allow for this if we find that these are common use cases.

Hm, we could also think about providing it as a method of a ProjectData object (and, thus, also for RangeData objects via inheritance)? But I am not sure about the use cases – maybe this would be used only rarely.

I agree with @hechtlC: A list of commits would be more convenient – the user can determine which commits should be analyzed and just pass them to the function (e.g., get all commits from the ProjectData object, get commits of specific authors, get commits authored on specific weekdays, get commits of core developers, get commits of developers whose name starts with L, and so on... 😄 ). So, if we let the user decide which commits to analyze and just pass them to the new function, we would end up with the highest flexibility. Helper functions can be added if there are justified use cases for them, but I would like to avoid them as long as we have filter functions that allow filtering the commits in the requested way.

Until now I have assumed that these methods would need the project/commit data to work on. Strictly speaking however, they do not require those if we already give them the commits they are supposed to be used on. I would still pass them the project data so that they do not need the full commits as arguments and you can pass them the hashes or commit ids of the desired commits instead.

Having the ProjectData/RangeData object available does not hurt. We may need to access information that is not provided by the user but available in the data object.

In general, I see multiple options of handling these functionalities and metrics. The basic distinction is whether or not we want to change the commit data. If we want to use stemming on all commits, does it make more sense to return a list of stemmed commit messages, or should we replace the commit messages in the commit data? What should happen if we want to use stemming on only a subset of all commits?

Replacing the commit messages is not a good idea. We might think about adding the information elsewhere instead - either by adding additional columns that contain the "processed" data, or by storing it in a new data frame. What should we do if we have multiple tasks that we would like to perform on a commit message? Just storing one result is not a good option then. So, it might be useful to store the result in a data structure that is capable of storing the outcomes of multiple text-processing steps separately (e.g., stemming and stop word removal) as well as in a combined way (e.g., stemming after stop word removal). In general, I would assume that we store the outcomes separately - if a user wants to combine both functionalities, it is the user's task to configure them to be performed one after the other. Regarding where and how to store the outcomes – I don't have an appropriate solution in mind right now, but I'd be happy to hear and comment on your ideas 😉
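One possible shape for such a data structure, sketched under the assumption that every processing function returns a data frame keyed by 'hash': each step (or each configured combination of steps) contributes its own column, and separate outcomes can be combined on demand by merging on the hash. All column names are illustrative.

```r
## Each text-processing outcome lives in its own column, keyed by 'hash'.
stemmed <- data.frame(
    hash = c("a1", "b2"),
    stemmed.message = c("fix pars bug", "add test"),
    stringsAsFactors = FALSE
)
without.stopwords <- data.frame(
    hash = c("a1", "b2"),
    message.without.stopwords = c("Fix parser bug", "Add tests"),
    stringsAsFactors = FALSE
)

## combine separately stored outcomes by merging on 'hash'
results <- merge(stemmed, without.stopwords, by = "hash")
```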


bockthom commented Feb 3, 2025

  1. The function takes the project data, a list of hashes (default NULL, meaning all commits are considered), and a list of desired preprocessing steps

Sounds like a good idea. Combines the two ideas from my previous post 😄

(currently implemented are "lowercase", "punctuation", "stopwords", and "whitespaces" for lowercasing and removal of punctuation, stop words, and extra whitespace, respectively; these are all enabled by default)

What does "default" mean? That depends on which other functionalities will be available. It is also worth thinking about whether the order of these steps matters (not only for the ones you provided, but in general). Some steps might lead to different results depending on which steps have been performed beforehand. It might be necessary to specify them as an ordered list (where ordered does not mean lexicographically, but the order in which they should be performed).
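A small base-R illustration of why the order matters, using gsub() as a stand-in for stop word removal (stop word lists such as tm's stopwords("english") are lowercase, so removing them before lowercasing misses capitalized occurrences):

```r
msg <- "The parser fixes The bug"

## stop word removal before lowercasing: the capitalized "The" survives,
## because the removal pattern is lowercase
a <- tolower(gsub("\\bthe\\b", "", msg))

## lowercasing before stop word removal: all occurrences are removed
b <- gsub("\\bthe\\b", "", tolower(msg))

## 'a' and 'b' differ, so the two orderings are not interchangeable
```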

  2. Then a corpus is built out of the whole commit message, meaning title and body

Hate me for this comment, but I would like this to be configurable 🙈

  3. Preprocessing steps are applied (here I use the package called 'tm', which seems to be a well-maintained NLP package)

Package tm has also been used in Codeface, so I am aware of it. Whether it is well-maintained, I don't know.

  4. Stemming is applied (again using 'tm')
  5. A data frame is returned containing the columns 'hash' and 'stemmed.message'

Regarding the return value, please see my previous comment. There might be multiple steps to be performed separately or in a row, and all should be storable somehow... I don't have a concrete idea yet, but you might come up with a few ideas to be discussed...

Now, the question is: what helper methods are desired for this? Do we want to disable some or all preprocessing by default? I don't really see a reason not to apply preprocessing steps aside from a runtime concern, although I doubt that there would be a significant impact.

Again, please consider the last paragraph of my previous comment here.


Leo-Send commented Feb 4, 2025

I will again collect what we talked about today and comment on it:

  • Regarding the return type, we came to the conclusion that we want the methods to return a single dataframe with the columns as described in my previous comment.
  • We also discussed adding a wrapper function, where you could choose the kind of processing you want (preprocessing steps and stemming, tokenization, ...), which then combines the results into a single data frame with the columns 'hash', 'preprocessed.message', 'stemmed.message', 'tokenized.message'. We are undecided whether we want/need such a method.
  • Preprocessing steps will be taken out of the function for stemming to make them available on their own and for other processes such as tokenization.
  • I will look into additional suitable preprocessing steps that might be interesting to us.
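The discussed wrapper could be sketched as below, assuming each processing function returns a data frame keyed by 'hash'; the wrapper merges the requested results into a single data frame. All names are illustrative, and the tokenized message is shown collapsed to one string for simplicity (a list column would also be possible).

```r
## Hypothetical wrapper: merge per-step results into one data frame by 'hash'.
combine.message.processing.results <- function(preprocessed, stemmed, tokenized) {
    result <- merge(preprocessed, stemmed, by = "hash")
    result <- merge(result, tokenized, by = "hash")
    return(result)
}

preprocessed <- data.frame(hash = "a1", preprocessed.message = "fix parser bug",
                           stringsAsFactors = FALSE)
stemmed <- data.frame(hash = "a1", stemmed.message = "fix pars bug",
                      stringsAsFactors = FALSE)
tokenized <- data.frame(hash = "a1", tokenized.message = "fix parser bug",
                        stringsAsFactors = FALSE)

combined <- combine.message.processing.results(preprocessed, stemmed, tokenized)
## columns: 'hash', 'preprocessed.message', 'stemmed.message', 'tokenized.message'
```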
