Commit Message functionalities #277
Comments
I have a few comments about what we discussed today: Until now I have assumed that these methods would need the project/commit data to work on. Strictly speaking, however, they do not require it if we already give them the commits they are supposed to be used on. I would still pass them the project data, so that they do not need the full commits as arguments and you can pass the hashes or commit ids of the desired commits instead. In general, I see multiple options for handling these functionalities and metrics. The basic distinction is whether or not we want to change the commit data. If we want to use stemming on all commits, does it make more sense to return a list of stemmed commit messages, or should we replace the commit messages in the commit data? What should happen if we want to use stemming on only a subset of all commits?
After further discussion we have decided to provide the functions with the project data and optionally with a list of commit hashes. Currently, I have implemented the stemming as follows:
Now, the question is: what helper methods are desired for this? Do we want to disable some or all preprocessing by default? If you agree with the basic structure, I would implement the tokenization similarly.
Hm, we could also think about providing it as a method of a ProjectData object (and, thus, also for RangeData objects via inheritance)? But I am not sure about the use cases – maybe this would be used only rarely. I agree with @hechtlC: A list of commits would be more convenient – the user can determine which commits should be analyzed and just pass them to the function (e.g., get all commits from the ProjectData object, get commits of specific authors, get commits authored at specific weekdays, get commits of core developers, get commits of developers whose name starts with L, and so on... 😄 ). So, if we let the user decide which commits to analyze and just pass them to the new function, we end up with the highest flexibility. Helper functions can be added if there are justified use cases for them, but I would like to avoid them as long as we have filter functions that allow filtering the commits in the requested way.
Having the ProjectData/RangeData object available does not hurt. We may need to access information that is not provided by the user but available in the data object.
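To make the agreed interface (project data plus an optional list of commit hashes) concrete, here is a minimal sketch. It is written in Python purely for illustration and is not coronet's API: the dictionary standing in for the project data and the name `select_messages` are assumptions.

```python
from typing import Dict, Optional, Sequence

# Hypothetical stand-in for a project data object: commit hash -> commit message.
ProjectData = Dict[str, str]


def select_messages(
    project_data: ProjectData,
    commit_hashes: Optional[Sequence[str]] = None,
) -> Dict[str, str]:
    """Return the messages of the requested commits, or of all commits if no hashes are given."""
    if commit_hashes is None:
        return dict(project_data)  # copy, so callers cannot modify the project data
    unknown = [h for h in commit_hashes if h not in project_data]
    if unknown:
        raise KeyError(f"unknown commit hashes: {unknown}")
    return {h: project_data[h] for h in commit_hashes}


project = {"a1": "Fix parser bug", "b2": "Add tests", "c3": "Update README"}
subset = select_messages(project, ["c3", "a1"])  # only the commits the user filtered
everything = select_messages(project)            # no hashes given: all commits
```

The filtering itself stays with the caller, who can use whatever filter functions already exist and pass only the resulting hashes.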
Replacing the commit messages is not a good idea. We might think about adding the information elsewhere - either by adding additional columns that contain the "processed" data, or by storing them in a new data frame. What should we do if we have multiple tasks that we would like to perform on a commit message? Just storing one result is not a good option then. So, it might be useful to somehow store the result in a data structure that is capable of storing the outcomes of multiple text-processing steps separately (e.g., stemming and stop word removal) as well as in a combined way (e.g., stemming after stop word removal). So, in general I would assume that we store the outcomes separately - if a user wants to combine both functionalities, it is the user's task to configure both to be performed one after the other. Regarding where and how to store the outcomes – I don't have an appropriate solution in mind right now, but I'd be happy to hear and comment on your ideas 😉
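One possible shape for such a data structure, sketched in Python for illustration only (not coronet code): results are stored under the ordered tuple of step names that produced them, so separate and combined outcomes sit side by side and the original messages are never touched. The step functions, the stop word list, and the crude suffix stripping used as a placeholder for a real stemmer are all assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Tokens = List[str]
STOP_WORDS = frozenset({"the", "a", "an", "to"})  # tiny illustrative list


def remove_stop_words(tokens: Tokens) -> Tokens:
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def toy_stem(tokens: Tokens) -> Tokens:
    """Crude suffix stripping; a placeholder, not a real stemming algorithm."""
    stemmed = []
    for token in tokens:
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed


STEPS: Dict[str, Callable[[Tokens], Tokens]] = {
    "stop_words": remove_stop_words,
    "stem": toy_stem,
}

# Results keyed by the ordered step names, then by commit hash.
Store = Dict[Tuple[str, ...], Dict[str, Tokens]]


def process(messages: Dict[str, str], step_names: Sequence[str], store: Store) -> None:
    """Apply the named steps in order and record the outcome under their combined key."""
    result = {}
    for commit_hash, message in messages.items():
        tokens = message.split()
        for name in step_names:
            tokens = STEPS[name](tokens)
        result[commit_hash] = tokens
    store[tuple(step_names)] = result


messages = {"abc123": "Fixed the failing tests"}
store: Store = {}
process(messages, ["stem"], store)                # stemming only
process(messages, ["stop_words"], store)          # stop word removal only
process(messages, ["stop_words", "stem"], store)  # stemming after stop word removal
```

In a data frame setting, the same idea would correspond to one additional column (or one separate data frame) per step combination.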
Sounds like a good idea. Combines the two ideas from my previous post 😄
What does "default" mean? That depends on which other functionalities will be available. It is also worth thinking about whether the order of these steps matters (not only for the ones you provided, but in general). Some steps might lead to different results depending on what steps have been performed beforehand. It might be necessary to specify them as an ordered list (where ordered does not mean lexicographically, but the order in which they should be performed).
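A small illustration of why the order can matter, again as a Python sketch with made-up step functions rather than anything from coronet: a case-sensitive stop word filter gives different results depending on whether lowercasing runs before or after it.

```python
from typing import Callable, List, Sequence

Tokens = List[str]
Step = Callable[[Tokens], Tokens]

STOP_WORDS = frozenset({"the", "a", "an"})  # lowercase entries only


def lowercase(tokens: Tokens) -> Tokens:
    return [t.lower() for t in tokens]


def remove_stop_words(tokens: Tokens) -> Tokens:
    # Deliberately case-sensitive, to show the dependency on earlier steps.
    return [t for t in tokens if t not in STOP_WORDS]


def run_pipeline(tokens: Tokens, steps: Sequence[Step]) -> Tokens:
    """Apply the steps in exactly the order given."""
    for step in steps:
        tokens = step(tokens)
    return tokens


tokens = "Fix The parser".split()
lower_first = run_pipeline(tokens, [lowercase, remove_stop_words])
filter_first = run_pipeline(tokens, [remove_stop_words, lowercase])
print(lower_first)   # ['fix', 'parser']
print(filter_first)  # ['fix', 'the', 'parser']
```

Taking the steps as an ordered sequence leaves the decision about the order with the user, which would also cover the request for configurability.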
Hate me for this comment, but I would like this to be configurable 🙈
Package
Regarding the return value, please see my previous comment. There might be multiple steps to be performed separately or in a row, and all should be storable somehow... I don't have a concrete idea yet, but you might come up with a few ideas to be discussed...
Again, please consider the last paragraph of my previous comment here.
I will again collect what we talked about today and comment on it:
Description
Currently, commit messages are included in coronet but we do not provide any functionality for them. This issue should serve as a collection of ideas and potential functionalities or metrics that we might want to add.
Functionalities
Metrics