
Commit Message functionalities #277

Open
Leo-Send opened this issue Jan 24, 2025 · 5 comments
Leo-Send commented Jan 24, 2025

Description

Currently, commit messages are included in coronet but we do not provide any functionality for them. This issue should serve as a collection of ideas and potential functionalities or metrics that we might want to add.

Functionalities

  • NLP functionalities: stemming, tokenization, lemmatization, stop word removal
  • keyword search
  • regex search
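A minimal base-R sketch of how keyword and regex search over commit messages could look; the data frame layout ('hash', 'message') and the function name are assumptions for illustration, not coronet's actual API:

```r
## Hypothetical search helper: 'fixed = TRUE' treats the pattern as a
## literal keyword, otherwise it is interpreted as a regular expression.
search.commit.messages <- function(commits, pattern, fixed = FALSE) {
    matches <- grepl(pattern, commits[["message"]], fixed = fixed, ignore.case = !fixed)
    return(commits[matches, ])
}

commits <- data.frame(
    hash = c("a1", "b2", "c3"),
    message = c("Fix bug in parser", "Add tests", "fix typo"),
    stringsAsFactors = FALSE
)

## regex search, case-insensitive: matches "Fix bug in parser" and "fix typo"
search.commit.messages(commits, "fix")
```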

Metrics

  • word count (per commit, average per developer/group of developers)
  • keyword count (per developer/group of developers)
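The word-count metric could be sketched in base R as follows; column names ('hash', 'author.name', 'message') are assumptions for illustration:

```r
## Sketch of the word-count metric: per commit and averaged per developer.
commits <- data.frame(
    hash = c("a1", "b2", "c3"),
    author.name = c("alice", "alice", "bob"),
    message = c("Fix bug in parser", "Add tests", "fix typo"),
    stringsAsFactors = FALSE
)

## word count per commit: split on whitespace and count the tokens
word.counts <- vapply(strsplit(commits[["message"]], "\\s+"), length, integer(1))
words.per.commit <- data.frame(hash = commits[["hash"]], word.count = word.counts)

## average word count per developer
words.per.author <- aggregate(word.counts, by = list(author = commits[["author.name"]]), FUN = mean)
```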
Leo-Send commented
I have a few comments about what we discussed today:
@hechtlC mentioned that he would prefer if the methods took a list of commits to work on rather than authors or a time range. There is, of course, the option to add helper methods later that allow for this if we find that these are common use cases.

Until now I have assumed that these methods would need the project/commit data to work on. Strictly speaking, however, they do not require it if we already give them the commits they are supposed to operate on. I would still pass them the project data so that callers do not need to provide the full commits as arguments and can instead pass the hashes or commit IDs of the desired commits.

In general, I see multiple options of handling these functionalities and metrics. The basic distinction is whether or not we want to change the commit data. If we want to use stemming on all commits, does it make more sense to return a list of stemmed commit messages, or should we replace the commit messages in the commit data? What should happen if we want to use stemming on only a subset of all commits?

Leo-Send commented
After further discussion we have decided to provide the functions with the project data and optionally with a list of commit hashes. Currently, I have implemented the stemming as follows:

  1. The function takes the project data, a list of hashes (default NULL, meaning all commits are considered), and a list of desired preprocessing steps. Currently implemented are "lowercase", "punctuation", "stopwords", and "whitespaces" for lowercasing and removal of punctuation, stop words, and extra whitespace, respectively. These are all enabled by default.
  2. Then a corpus is built out of the whole commit message, meaning title and body.
  3. Preprocessing steps are applied (here I use the package 'tm', which seems to be a well-maintained NLP package).
  4. Stemming is applied (again using 'tm').
  5. A data frame is returned containing the columns 'hash' and 'stemmed.message'.
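The steps above can be sketched as follows. This is a simplified reconstruction using the `tm` API, not the actual coronet implementation; 'commits' stands in for the commit data derived from the project data (already restricted to the given hashes), and column names are assumptions. Note that `stemDocument()` additionally requires the SnowballC package.

```r
library(tm)

get.stemmed.commit.messages <- function(commits,
                                        steps = c("lowercase", "punctuation",
                                                  "stopwords", "whitespaces")) {
    ## (2) build a corpus out of the whole commit messages (title and body)
    corpus <- VCorpus(VectorSource(commits[["message"]]))

    ## (3) apply the requested preprocessing steps
    if ("lowercase" %in% steps)
        corpus <- tm_map(corpus, content_transformer(tolower))
    if ("punctuation" %in% steps)
        corpus <- tm_map(corpus, removePunctuation)
    if ("stopwords" %in% steps)
        corpus <- tm_map(corpus, removeWords, stopwords("english"))
    if ("whitespaces" %in% steps)
        corpus <- tm_map(corpus, stripWhitespace)

    ## (4) apply stemming (requires SnowballC)
    corpus <- tm_map(corpus, stemDocument)

    ## (5) return a data frame with columns 'hash' and 'stemmed.message'
    return(data.frame(
        hash = commits[["hash"]],
        stemmed.message = sapply(corpus, as.character),
        stringsAsFactors = FALSE
    ))
}
```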

Now, the question is: what helper methods are desired for this? Do we want to disable some or all preprocessing by default?
I don't really see a reason not to apply preprocessing steps aside from a runtime concern, although I doubt that there would be a significant impact.

If you agree with the basic structure, I would implement the tokenization similarly.

@bockthom bockthom added this to the v5.1 milestone Feb 3, 2025

bockthom commented Feb 3, 2025

I have a few comments about what we discussed today: @hechtlC mentioned that he would prefer if the methods took a list of commits to work on rather than authors or a time range. There is, of course, the option to add helper methods later that allow for this if we find that these are common use cases.

Hm, we could also think about providing it as a method of a ProjectData object (and, thus, also for RangeData objects via inheritance)? But I am not sure about the use cases – maybe this would be used only rarely.

I agree with @hechtlC: A list of commits would be more convenient – the user can determine which commits should be analyzed and just pass them to the function (e.g., get all commits from the ProjectData object, get commits of specific authors, get commits authored on specific weekdays, get commits of core developers, get commits of developers whose name starts with L, and so on... 😄 ). So, if we let the user decide which commits to analyze and just pass them to the new function, we would end up with the highest flexibility. Helper functions can be added if there are justified use cases for them, but I would like to avoid them as long as we have filter functions that allow filtering the commits in the requested way.

Until now I have assumed that these methods would need the project/commit data to work on. Strictly speaking however, they do not require those if we already give them the commits they are supposed to be used on. I would still pass them the project data so that they do not need the full commits as arguments and you can pass them the hashes or commit ids of the desired commits instead.

Having the ProjectData/RangeData object available does not hurt. We may need to access information that is not provided by the user but available in the data object.

In general, I see multiple options of handling these functionalities and metrics. The basic distinction is whether or not we want to change the commit data. If we want to use stemming on all commits, does it make more sense to return a list of stemmed commit messages, or should we replace the commit messages in the commit data? What should happen if we want to use stemming on only a subset of all commits?

Replacing the commit messages is not a good idea. We might think about adding the information elsewhere instead - either by adding additional columns that contain the "processed" data, or by storing it in a new data frame. What should we do if we have multiple tasks that we would like to perform on a commit message? Just storing one result is not a good option then. So, it might be useful to store the result in a data structure that is capable of storing the outcomes of multiple text-processing steps separately (e.g., stemming and stop word removal) as well as in a combined way (e.g., stemming after stop word removal). In general, I would assume that we store the outcomes separately - if a user wants to combine both functionalities, it is the user's task to configure them to be performed one after the other. Regarding where and how to store the outcomes – I don't have an appropriate solution in mind right now, but I'd be happy to hear and comment on your ideas 😉
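One possible shape for such a data structure, sketched under the assumption that every processing function returns a data frame keyed by 'hash': each step (or each configured combination of steps) contributes its own column, and separate outcomes can be combined on demand by merging on the hash. All column names are illustrative.

```r
## Each text-processing outcome lives in its own column, keyed by 'hash'.
stemmed <- data.frame(
    hash = c("a1", "b2"),
    stemmed.message = c("fix pars bug", "add test"),
    stringsAsFactors = FALSE
)
without.stopwords <- data.frame(
    hash = c("a1", "b2"),
    message.without.stopwords = c("Fix parser bug", "Add tests"),
    stringsAsFactors = FALSE
)

## combine separately stored outcomes by merging on 'hash'
results <- merge(stemmed, without.stopwords, by = "hash")
```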


bockthom commented Feb 3, 2025

  1. The function takes the project data, a list of hashes (default NULL, meaning all commits are considered), and a list of desired preprocessing steps

Sounds like a good idea. Combines the two ideas from my previous post 😄

(currently implemented are "lowercase", "punctuation", "stopwords", and "whitespaces" for lowercasing and removal of punctuation, stop words, and extra whitespace, respectively; these are all enabled by default)

What does "default" mean? That depends on which other functionalities will be available. It is also worth thinking about whether the order of these steps matters (not only for the ones you provided, but in general). Some steps might lead to different results depending on which steps have been performed beforehand. It might be necessary to specify them as an ordered list (where ordered does not mean lexicographically, but the order in which they should be performed).
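A small base-R illustration of why the order matters, using gsub() as a stand-in for stop word removal (stop word lists such as tm's stopwords("english") are lowercase, so removing them before lowercasing misses capitalized occurrences):

```r
msg <- "The parser fixes The bug"

## stop word removal before lowercasing: the capitalized "The" survives,
## because the removal pattern is lowercase
a <- tolower(gsub("\\bthe\\b", "", msg))

## lowercasing before stop word removal: all occurrences are removed
b <- gsub("\\bthe\\b", "", tolower(msg))

## 'a' and 'b' differ, so the two orderings are not interchangeable
```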

  2. Then a corpus is built out of the whole commit message, meaning title and body

Hate me for this comment, but I would like this to be configurable 🙈

  3. Preprocessing steps are applied (here I use the package called 'tm', which seems to be a well-maintained NLP package)

Package tm has also been used in Codeface, so I am aware of it. Whether it is well-maintained, I don't know.

  4. Stemming is applied (again using 'tm')
  5. A data frame is returned containing the columns 'hash' and 'stemmed.message'

Regarding the return value, please see my previous comment. There might be multiple steps to be performed separately or in a row, and all should be storable somehow... I don't have a concrete idea yet, but you might come up with a few ideas to be discussed...

Now, the question is: what helper methods are desired for this? Do we want to disable some or all preprocessing by default? I don't really see a reason not to apply preprocessing steps aside from a runtime concern, although I doubt that there would be a significant impact.

Again, please consider the last paragraph of my previous comment here.


Leo-Send commented Feb 4, 2025

I will again collect what we talked about today and comment on it:

  • Regarding the return type, we came to the conclusion that we want the methods to return a single dataframe with the columns as described in my previous comment.
  • We also discussed adding a wrapper function, where you could choose the kind of processing you want (preprocessing steps and stemming, tokenization, ...), which then combines the results into a single data frame with the columns 'hash', 'preprocessed.message', 'stemmed.message', 'tokenized.message'. We are undecided whether we want/need such a method.
  • Preprocessing steps will be taken out of the function for stemming to make them available on their own and for other processes such as tokenization.
  • I will look into additional suitable preprocessing steps that might be interesting to us.
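The discussed wrapper could be sketched as below, assuming each processing function returns a data frame keyed by 'hash'; the wrapper merges the requested results into a single data frame. All names are illustrative, and the tokenized message is shown collapsed to one string for simplicity (a list column would also be possible).

```r
## Hypothetical wrapper: merge per-step results into one data frame by 'hash'.
combine.message.processing.results <- function(preprocessed, stemmed, tokenized) {
    result <- merge(preprocessed, stemmed, by = "hash")
    result <- merge(result, tokenized, by = "hash")
    return(result)
}

preprocessed <- data.frame(hash = "a1", preprocessed.message = "fix parser bug",
                           stringsAsFactors = FALSE)
stemmed <- data.frame(hash = "a1", stemmed.message = "fix pars bug",
                      stringsAsFactors = FALSE)
tokenized <- data.frame(hash = "a1", tokenized.message = "fix parser bug",
                        stringsAsFactors = FALSE)

combined <- combine.message.processing.results(preprocessed, stemmed, tokenized)
## columns: 'hash', 'preprocessed.message', 'stemmed.message', 'tokenized.message'
```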
