Mining commits

Alexander Bakhtin edited this page May 14, 2024 · 1 revision

Mining commits from CLI

This wiki page describes how to use command-line options to mine the commits of a Git repository with MiSON. MiSON's functions can also be imported into a script; in that case, consult their respective docstrings.

The result is a CSV file with one row for each modified file in each commit. The saved fields are commit_hash, author_name, author_email, committer_name, committer_email, commit_date, additions, deletions, and filename.
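Because the output is a plain CSV file with the columns listed above, it can be post-processed with standard tooling. The sketch below sums additions per file from a small in-memory excerpt; the column names are the documented ones, but the sample rows are invented for illustration:

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-row excerpt of a mined commit table (documented columns, invented values).
sample = """commit_hash,author_name,author_email,committer_name,committer_email,commit_date,additions,deletions,filename
abc123,Alice,alice@example.com,Alice,alice@example.com,2024-05-01,10,2,services/order/app.py
abc123,Alice,alice@example.com,Alice,alice@example.com,2024-05-01,3,1,README.md
"""

# Sum the additions column per modified file across all commits.
added = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample)):
    added[row["filename"]] += int(row["additions"])

print(dict(added))
```

The same pattern works on a real table produced by mison commit by replacing the in-memory sample with an open file handle.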

Additionally, it is possible to map modified files to the corresponding microservice by providing a custom mapping function. Then the saved table will also contain the field microservice (see Optional arguments).

Common arguments

The main command for mining commits with MiSON is mison commit.

Mandatory arguments

The following options are mandatory in all cases:

  • --repo Path to the repository. Depending on the backend used, this can be a local path or a URL
  • --backend Which backend to use; current options are github and pydriller
  • --commit_table Output path to save the CSV table of all mined commits and their file modifications. Can be default, in which case the file name has the format mison_BACKEND_commit_table_TIMESTAMP.csv

Optional arguments

Provide these arguments if modified file names should be mapped to their corresponding microservice.

  • --import_mapping_file The name of the file from which a user-defined function of signature str -> str is imported. Can be a *.py file, in which case the default expected function name is microservice_mapping (can be modified with --import_mapping_func, see below). Another option is to provide the name of a module defined in mison.mappings, for example mison.mappings.trainticket (feel free to submit a pull request adding mappings for common benchmarks!)
  • --import_mapping_func The name of the function to import from the custom file specified with --import_mapping_file
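A minimal mapping file might look as follows. The function name microservice_mapping and the str -> str signature are the documented defaults; the services/<name>/... directory layout is only an assumption made for this sketch and should be adapted to the repository being mined:

```python
# Hypothetical mapping file, e.g. my_mapping.py, to pass via --import_mapping_file.
# Assumes (for illustration only) that each microservice lives under services/<name>/.
def microservice_mapping(filename: str) -> str:
    """Map a modified file path to the name of its microservice."""
    parts = filename.split("/")
    if len(parts) >= 2 and parts[0] == "services":
        return parts[1]
    return "unknown"


print(microservice_mapping("services/order/src/main.py"))  # -> order
```

With the default function name, only --import_mapping_file my_mapping.py needs to be passed; --import_mapping_func is required only if the function is named differently.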

Mining backends

The following lists all available backends and their specific CLI options.

GitHub

This backend queries the GitHub API of a hosted repository to mine its commits.

  • --github_token A token for accessing the GitHub API. It must have permission to access the desired repository
  • --per_page Passed to the API request; the number of results per page

PyDriller

This backend uses the PyDriller Python library to mine the commits of a local or remote repository. The additional parameters are filters accepted by the pydriller.Repository class constructor.

  • --since Only commits after this date will be analyzed (converted to a datetime object)
  • --from_commit Only commits after this commit hash will be analyzed
  • --from_tag Only commits after this commit tag will be analyzed
  • --to Only commits up to this date will be analyzed (converted to a datetime object)
  • --to_commit Only commits up to this commit hash will be analyzed
  • --to_tag Only commits up to this commit tag will be analyzed
  • --order Order in which to traverse commits; options: date-order, author-date-order, topo-order, reverse
  • --only_in_branch Only analyzes commits that belong to this branch
  • --only_no_merge Only analyzes commits that are not merge commits
  • --only_authors Only analyzes commits made by these authors (accepts a list of names)
  • --only_commits Only these commits will be analyzed (accepts a list of values)
  • --only_releases Only analyzes commits that are tagged (a "release" is a GitHub concept and does not exist in Git itself)
  • --filepath Only commits that modified this file will be analyzed
  • --only_modifications_with_file_types Only analyzes commits in which at least one modification was made to a file of that type