Consider alternative to append-only CHANGELOG.md files #978
Comments
I realized that there is a public repo that uses beachball (and not Azure DevOps) that demonstrates this behavior as well. The
And here are its top paths by disk size:
After repacking with
And now, the CHANGELOG files are not completely dominating the top files by on-disk size:
I was recently investigating Git repository size bloat and came across Beachball’s CHANGELOG.md files as contributing significantly to repository growth in some internal JavaScript repositories. Inefficiencies in Git and Azure DevOps are more at fault here, but those tools are harder to fix on a short timeline. I’m creating this issue to communicate the problem to the Beachball team and see what options they believe are possible here.
For some of the technical Git details, I will be using terms from the following articles:
The Size Issue
I began looking at an internal JavaScript monorepo and noticed that the size of the Git blob objects was growing much faster than what would be justified by the number of contributors to the repo. The tree objects were growing at a reasonable rate. This allowed me to focus on the size of the blobs.
As someone familiar with the Git codebase, I was able to prototype a tool that scans the Git object graph grouping object ids by a path they appear at from at least one commit. From these per-path batches, I was able to collect the top 100 paths that contributed to the on-disk size. In this case, I was looking at the packfile that was cloned from Azure DevOps.
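The prototype described above is a change to Git itself and is not reproduced here. As a rough approximation only, a similar per-path grouping can be sketched with stock Git plumbing: git rev-list --objects --all lists each reachable object together with a path it was found at, and git cat-file --batch-check can report each object's on-disk size. The function names and the limit parameter below are hypothetical, and the numbers depend on how the local packfile happens to be packed.

```python
import subprocess
from collections import defaultdict


def top_paths_by_disk_size(batch_lines, limit=100):
    """Aggregate '<type> <disk-size> <path>' lines into per-path totals.

    Returns a list of (path, total_disk_bytes, version_count) tuples,
    largest first. Only blobs are counted; commits and trees are skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # path -> [disk bytes, version count]
    for line in batch_lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) < 3 or parts[0] != "blob" or not parts[2]:
            continue
        totals[parts[2]][0] += int(parts[1])
        totals[parts[2]][1] += 1
    ranked = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(path, size, count) for path, (size, count) in ranked[:limit]]


def scan_repository(repo_dir="."):
    """Run the two Git commands and rank paths (holds all output in memory)."""
    rev_list = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    )
    # %(rest) echoes whatever followed the object id on the input line,
    # which for rev-list --objects is the path the object was found at.
    batch = subprocess.run(
        ["git", "-C", repo_dir, "cat-file",
         "--batch-check=%(objecttype) %(objectsize:disk) %(rest)"],
        input=rev_list.stdout, check=True, capture_output=True, text=True,
    )
    return top_paths_by_disk_size(batch.stdout.splitlines())
```

Calling scan_repository() inside a clone returns the largest paths first. Because rev-list reports each object only once, at one path, a blob that appears at several paths is attributed to just one of them.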
Of these top 100 paths, almost all of them were the CHANGELOG.md and CHANGELOG.json files produced by beachball! The only other file in the list was the yarn.lock file that is changed frequently as part of the natural development process.
In the worst case, one CHANGELOG.md file was contributing 1.9 GB to the size of the clone! This is across 6,769 versions of the file.
This struck me as odd, since it was clear that these CHANGELOG.md files were being appended to and Git should be storing these files using delta compression. Please see the following figure for how these append-only files are stored as full snapshots and then should be stored using efficient delta compression.
Note how the small gray rectangles do not look like much data. In a well-packed repository, this is indeed how Git would choose to compress these files for efficient storage and network communication.
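To put rough numbers on the difference between full snapshots and deltas, here is a back-of-the-envelope sketch. It assumes every version appends one entry of a fixed size and ignores zlib compression and delta overhead, so it illustrates the growth rates only. The 500-byte entry size is a hypothetical value; only the version count comes from the case above.

```python
def snapshot_bytes(versions, bytes_per_entry):
    """Total bytes if every version is stored as a full copy of the file."""
    # Version i holds i entries, so the total is k * (1 + 2 + ... + n).
    return bytes_per_entry * versions * (versions + 1) // 2


def delta_bytes(versions, bytes_per_entry):
    """Total bytes if each version stores only the newly appended entry."""
    return bytes_per_entry * versions


versions = 6769        # version count quoted above
bytes_per_entry = 500  # hypothetical average entry size, not from the issue

full = snapshot_bytes(versions, bytes_per_entry)
delta = delta_bytes(versions, bytes_per_entry)
print(f"full snapshots: {full:,} bytes")
print(f"append deltas:  {delta:,} bytes")
print(f"ratio:          {full // delta}x")
```

Under these assumptions the ratio is (n + 1) / 2 regardless of entry size, so full snapshots grow quadratically in the number of versions while ideal deltas grow linearly.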
However, there are two issues that are affecting this process:
git push is not considering the previous version of the CHANGELOG.md file as a potential delta base, and thus is creating an inefficient packfile to send to the server.

We are pursuing fixes to both of these issues, as this inefficiency is affecting more repositories than just the ones using beachball.
However, beachball repositories are affected particularly badly due to this structure! Based on these findings, we expanded our search to two other internal JavaScript repos that use beachball. The tool that looked for the “biggest paths” also identified CHANGELOG files as a major factor in growth. These three repos had the following behavior after cloning from Azure DevOps and then repacking with git repack -adf --window=250:

What can Beachball do?
There are a few things that beachball can do to try and improve this situation, both in the short term (before Git and Azure DevOps inefficiencies can be fixed) and in the long term.
CHANGELOG.md files
I’ve called out the CHANGELOG.md files because these are currently growing without bound, leading to the most obvious issues. Since the files at HEAD are growing without any truncation of old changelog entries, there are two issues happening: one affecting git checkout, and one affecting git blame or git log -p.

While Git’s delta compression can reduce the on-disk size of these append-only files, the abstract size of the blobs is still growing linearly over time. Thus, Git operations need to "inflate" the objects to full size in order to check them out at tip or to do diff operations.
One possible way to limit this issue is to consider truncating old entries from these files, as in this figure:
Here, the on-disk size doesn’t change in the optimal storage, but the size of the snapshots decreases. This helps both git checkout and git blame types of commands.

This also helps in the very short term, before git push is fixed, as the size of the objects being sent by git push reduces greatly. If the objects at tip are smaller, then even bad delta bases will lead to smaller pushes.

CHANGELOG.json files
The other side of things is the creation of CHANGELOG.json files.
I believe that these files already have mechanisms for limiting the number of entries in their internal structure. (Edit: I stand corrected, the length-limiting aspect is custom to a use of beachball that I saw, but is not the default behavior.) The content of the files may be inefficient to compress as well.

In one case, we saw that one CHANGELOG.json file was getting updated even though the version was not changing, but the included date was changing. This led to more versions than necessary. On top of that, the dates are harder to compress, but that’s a minor concern.
The ask here is to double-check that the data being stored in these .json files are sensible and serving a valuable purpose for users.
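One way to check for the date-only churn described above is to compare two versions of a CHANGELOG.json file while ignoring date fields. This is a minimal sketch, not beachball's behavior: it assumes the dates live under keys literally named "date", and both function names are hypothetical.

```python
import json


def strip_dates(value):
    """Return a copy of parsed JSON with every "date" key removed."""
    if isinstance(value, dict):
        # Assumption: dates are stored under keys named exactly "date".
        return {k: strip_dates(v) for k, v in value.items() if k != "date"}
    if isinstance(value, list):
        return [strip_dates(v) for v in value]
    return value


def only_dates_changed(old_text, new_text):
    """True when two JSON documents differ, but only in their "date" values."""
    old, new = json.loads(old_text), json.loads(new_text)
    return old != new and strip_dates(old) == strip_dates(new)
```

A tool that writes the file could use a check like this to skip the write, and therefore the extra blob, when nothing but the date would change.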
Philosophy: Committing Build Artifacts to Source Repository
A general piece of advice I give to monorepo maintainers is to avoid committing build artifacts to their Git repositories. Git is (usually) good at storing human-generated changes to text files. When an automated process creates Git data in a way that differs from that type of behavior, then issues with scale are likely to occur.
For this reason, I want to suggest that we develop recommendations for how beachball consumers could adopt alternatives to committing the changelog files to their repo. Is there a way that the end goals of beachball could be accomplished without committing that data to Git?
Open source repositories on GitHub could use beachball output to generate release notes for GitHub releases. Private repositories could consider publishing changelogs to their package feeds or something like Azure Artifacts. These represent a change to where changelog consumers would need to look for things, so is a more substantial ask than the previous two options.
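As an illustration of the release-notes idea, the sketch below renders one changelog entry as plain text that could be pasted into a GitHub release or attached to a package feed. The entry shape (a "version" string plus "comments" grouped by change type, each with a "comment" field) is an assumption made for this example rather than a documented beachball schema, and render_release_notes is a hypothetical helper. It does not call any GitHub or Azure API.

```python
def render_release_notes(package_name, entry):
    """Render one changelog entry (assumed shape, see above) as text."""
    lines = [f"{package_name} {entry['version']}", ""]
    for change_type, comments in entry.get("comments", {}).items():
        lines.append(f"{change_type}:")
        lines.extend(f"- {c['comment']}" for c in comments)
        lines.append("")
    # Collapse trailing blank lines to a single newline.
    return "\n".join(lines).rstrip() + "\n"


entry = {
    "version": "1.2.3",
    "comments": {"patch": [{"comment": "Fix crash on empty input"}]},
}
print(render_release_notes("example-package", entry))
```

The point of the sketch is that the rendered text is produced at publish time and never needs to be committed.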
Questions
How can I reproduce the checks for the on-disk size of paths?
This is currently based on a prototype change to Git, but I will be contributing it to the Git project soon. As that branch solidifies, I can point to it for those who want to compile their own Git version with the change. I’ll try to remember to point out when the tool is in a released version of Git for those who don’t want to build it from source.
Why is git push so inefficient?

The area of Git’s code that creates the packfile during ‘git push’ has been optimized over and over by contributors focused on the application within Git servers. Serving a clone or a fetch has significantly different needs than pushing a small topic branch. Moreover, most Git servers are implemented using Git on the backend, and these inefficiencies are typically overwritten by the server doing a full repack of the repository. This “repack everything” approach that is typically done can hide these issues in many cases.
In my prototype to fix this, not only does the packfile shrink, but the time needed to compute it also shrinks. So there are end-user performance gains outside of this issue.
Why is Azure DevOps trusting deltas?
Unlike most other Git servers, Azure DevOps does not use the open source Git project on its backend. That means that nearly everything has been reimplemented from scratch (with some help from the libgit2 project). One benefit of that choice is that certain architectural choices were made to enable Git scale that is not known to be possible on other servers.
One such choice is that Azure DevOps avoids anything that requires rewriting the entire repository contents. The git repack command that recomputes delta bases is typically run with the entire repository under consideration, so that model is not possible to follow. In this case, "incremental delta compression" was a problem that was left unsolved. This investigation has increased the priority of that work.