Consider alternative to append-only CHANGELOG.md files #978

Open
derrickstolee opened this issue Sep 5, 2024 · 1 comment
derrickstolee commented Sep 5, 2024

I was recently investigating Git repository size bloat and found that Beachball’s CHANGELOG.md files contribute significantly to repository growth in some internal JavaScript repositories. Inefficiencies in Git and Azure DevOps are more at fault here, but those tools are harder to fix on a short timeline. I’m creating this issue to share the findings with the Beachball team and to ask which mitigations they consider feasible.

For some of the technical Git details, I will be using terms from the following articles:

The Size Issue

I began looking at an internal JavaScript monorepo and noticed that the total size of the Git blob objects was growing much faster than the number of contributors to the repo would justify. The tree objects were growing at a reasonable rate, which allowed me to focus on the blobs.

As someone familiar with the Git codebase, I was able to prototype a tool that scans the Git object graph and groups object IDs by a path at which they appear in at least one commit. From these per-path batches, I collected the top 100 paths by contribution to on-disk size. In this case, I was looking at the packfile that was cloned from Azure DevOps.
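The prototype itself is a change to Git and is not shown here. As a rough approximation of the same grouping, the sketch below pairs `git rev-list --objects --all` (which prints each object with one path it appears at) with `git cat-file --batch-check` and its `%(objectsize:disk)` atom, then sums on-disk bytes per path. The function names `top_paths` and `blob_records` are hypothetical, and the numbers it produces will not necessarily match the prototype's.

```python
import subprocess
from collections import defaultdict

def top_paths(records, n=100):
    """Rank paths by total on-disk bytes.

    records: iterable of (path, disk_size) pairs, one per blob version.
    Returns a list of (path, version_count, total_disk_size).
    """
    totals = defaultdict(lambda: [0, 0])  # path -> [count, disk bytes]
    for path, disk_size in records:
        totals[path][0] += 1
        totals[path][1] += disk_size
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(path, count, size) for path, (count, size) in ranked[:n]]

def blob_records(repo="."):
    """Yield (path, disk_size) for each blob reachable from any ref."""
    listing = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    paths = {}
    for line in listing.splitlines():
        oid, _, path = line.partition(" ")
        if path:  # commits and root trees have no path
            paths[oid] = path
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objectname) %(objecttype) %(objectsize:disk)"],
        input="\n".join(paths) + "\n",
        capture_output=True, text=True, check=True,
    ).stdout
    for line in sizes.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "blob":
            yield paths[parts[0]], int(parts[2])
```

Calling `top_paths(blob_records("path/to/clone"))` on a fresh clone would give a per-path ranking comparable in spirit to the one described above.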

Of these top 100 paths, almost all of them were the CHANGELOG.md and CHANGELOG.json files produced by beachball! The only other file in the list was the yarn.lock file, which changes frequently as part of the natural development process.

In the worst case, one CHANGELOG.md file was contributing 1.9 GB to the size of the clone! This is across 6,769 versions of the file.

This struck me as odd, since it was clear that these CHANGELOG.md files were being appended to and Git should be storing these files using delta compression. The following figure shows how these append-only files look as full snapshots and how they should be stored using efficient delta compression.

[Figure: Append-only files in Git]

Note how the small gray rectangles do not look like much data. In a well-packed repository, this is indeed how Git would choose to compress these files for efficient storage and network communication.
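To make the difference concrete, here is a minimal simulation that compares storing every version as its own compressed snapshot against storing only what each version adds to its predecessor. It is a sketch under stated assumptions: the entries are synthetic, the "delta" is simply the appended suffix, and `zlib` stands in for Git's object compression. Git's real delta format and base selection are more involved.

```python
import hashlib
import zlib

def build_versions(count):
    """Simulate an append-only changelog: version i holds entries 0..i."""
    versions, content = [], b""
    for i in range(count):
        digest = hashlib.sha256(str(i).encode()).hexdigest()
        content += f"## 1.0.{i}\n- change {digest}\n\n".encode()
        versions.append(content)
    return versions

def snapshot_bytes(versions):
    """Cost when every version is stored as its own compressed blob."""
    return sum(len(zlib.compress(v)) for v in versions)

def suffix_delta_bytes(versions):
    """Cost when each version stores only what it adds to its predecessor."""
    total = len(zlib.compress(versions[0]))
    for prev, cur in zip(versions, versions[1:]):
        if not cur.startswith(prev):
            raise ValueError("not an append-only history")
        total += len(zlib.compress(cur[len(prev):]))
    return total

if __name__ == "__main__":
    history = build_versions(500)
    print("full snapshots:", snapshot_bytes(history))
    print("suffix deltas: ", suffix_delta_bytes(history))
```

The snapshot total grows with the square of the number of versions, while the delta total grows roughly in proportion to it, which is why a poor delta base choice is so costly for this kind of file.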

However, there are two issues that are affecting this process:

  1. git push is not considering the previous version of the CHANGELOG.md file as a potential delta base, and thus is creating an inefficient packfile to send to the server.
  2. Azure DevOps trusts this delta choice and does not recompute the delta base.

We are pursuing fixes to both of these issues, as this inefficiency is affecting more repositories than just the ones using beachball.

However, beachball repositories are affected particularly badly due to this structure! Based on these findings, we expanded our search to other internal JavaScript repos that use beachball. The tool that looked for the “biggest paths” also identified CHANGELOG files as a major factor in their growth. The repos had the following behavior after cloning from Azure DevOps and then repacking with git repack -adf --window=250:

  • Repo A: 6.5 GB to 416 MB
  • Repo B: 40.1 GB to 2.0 GB
  • Repo C: 51 GB to 3.5 GB
  • Repo D: 138 GB to 33.3 GB
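For a sense of scale, the reduction factor for each repo can be computed directly from the figures above. This small sketch assumes decimal units (1 GB = 1000 MB); the original measurements may have used binary units, which would shift the Repo A factor slightly.

```python
# Sizes in MB, taken from the list above (assuming 1 GB = 1000 MB).
REPO_SIZES_MB = {
    "Repo A": (6_500, 416),
    "Repo B": (40_100, 2_000),
    "Repo C": (51_000, 3_500),
    "Repo D": (138_000, 33_300),
}

def reduction_factor(before_mb, after_mb):
    """How many times smaller the repo is after a full repack."""
    return before_mb / after_mb

if __name__ == "__main__":
    for name, (before, after) in REPO_SIZES_MB.items():
        print(f"{name}: {reduction_factor(before, after):.1f}x smaller")
```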

What can Beachball do?

There are a few things that beachball can do to try and improve this situation, both in the short term (before Git and Azure DevOps inefficiencies can be fixed) and in the long term.

CHANGELOG.md files

I’ve called out the CHANGELOG.md files because these are currently growing without bound, leading to the most obvious issues. Since the files at HEAD are growing without any truncation of old changelog entries, there are two issues happening:

  1. The size of each CHANGELOG.md file at tip grows linearly over time, causing scale issues with git checkout.
  2. The total size of the CHANGELOG.md snapshots across history grows quadratically over time, causing scale issues with history commands such as git blame or git log -p.

While Git’s delta compression can reduce the on-disk size of these append-only files, the abstract size of the blobs is still growing linearly over time. Thus, Git operations need to "inflate" the objects to full size in order to check them out at tip or to do diff operations.
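The linear and quadratic growth rates follow from simple arithmetic. The sketch below models a file that gains one fixed-size entry per version; the uniform entry size is a simplifying assumption, since real changelog entries vary in length.

```python
def tip_size(versions, entry_bytes):
    """Inflated size of the newest snapshot: one entry per version so far."""
    return versions * entry_bytes

def history_size(versions, entry_bytes):
    """Inflated size of all snapshots together.

    Snapshot k holds k entries, so the total is
    entry_bytes * (1 + 2 + ... + versions).
    """
    return entry_bytes * versions * (versions + 1) // 2

if __name__ == "__main__":
    # Hypothetical entry size of 100 bytes, for illustration only.
    for n in (1_000, 2_000, 4_000):
        print(n, tip_size(n, 100), history_size(n, 100))
```

Doubling the number of versions doubles the tip size but roughly quadruples the total that history commands may need to inflate.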

One possible way to limit this issue is to consider truncating old entries from these files, as in this figure:

[Figure: Capped files in Git]

Here, the on-disk size doesn’t change in the optimal storage, but the size of the snapshots decreases. This helps both git checkout and git blame types of commands.

This also helps in the very short term, before git push is fixed, as the size of the objects being sent by git push reduces greatly. If the objects at tip are smaller, then even bad delta bases will lead to smaller pushes.
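As an illustration of what such truncation could look like, here is a minimal sketch that keeps only the newest entries of a changelog. It is not beachball's implementation. It assumes each entry starts with a `## ` heading and that the newest entry comes first, and the function name `cap_changelog` is hypothetical.

```python
import re

# Assumption: every changelog entry begins with a level-2 markdown heading.
ENTRY_HEADING = re.compile(r"^## ", re.MULTILINE)

def cap_changelog(text, max_entries):
    """Keep the file header plus the first max_entries entries."""
    starts = [m.start() for m in ENTRY_HEADING.finditer(text)]
    if len(starts) <= max_entries:
        return text  # already within the cap
    cutoff = starts[max_entries]  # start of the first entry to drop
    return text[:cutoff].rstrip("\n") + "\n"
```

Dropped entries would still be available in Git history or in whatever archive the maintainers choose, so the cap trades convenience at tip for smaller snapshots.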

CHANGELOG.json files

The other side of things is the creation of CHANGELOG.json files. I believe that these files already have mechanisms for limiting the number of entries in their internal structure. (Edit: I stand corrected, the length-limiting aspect is custom to a use of beachball that I saw, but is not the default behavior.) The content of the files may be inefficient to compress as well.

In one case, we saw that one CHANGELOG.json file was getting updated even though the version was not changing, but the included date was changing. This led to more versions than necessary. On top of that, the dates are harder to compress, but that’s a minor concern.
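One way to avoid such date-only versions is to compare the old and new content with the date fields ignored and skip the write when nothing else differs. The sketch below shows the idea. The key name `date` and the function names are assumptions for illustration, not a description of beachball's file schema or code.

```python
import json

def without_dates(value, date_key="date"):
    """Return a copy of parsed JSON with every date_key field removed."""
    if isinstance(value, dict):
        return {k: without_dates(v, date_key)
                for k, v in value.items() if k != date_key}
    if isinstance(value, list):
        return [without_dates(v, date_key) for v in value]
    return value

def needs_rewrite(old_text, new_text):
    """True only if something other than a date field changed."""
    return without_dates(json.loads(old_text)) != without_dates(json.loads(new_text))
```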

The ask here is to double-check that the data being stored in these .json files are sensible and serving a valuable purpose for users.

Philosophy: Committing Build Artifacts to Source Repository

A general piece of advice I give to monorepo maintainers is to avoid committing build artifacts to their Git repositories. Git is (usually) good at storing human-generated changes to text files. When an automated process creates Git data in a way that differs from that type of behavior, then issues with scale are likely to occur.

For this reason, I want to suggest that we consider recommendations for how beachball consumers could consider alternatives to committing the changelog files to their repo. Is there a way that the end goals of beachball could be accomplished without committing that data to Git?

Open source repositories on GitHub could use beachball output to generate release notes for GitHub releases. Private repositories could consider publishing changelogs to their package feeds or something like Azure Artifacts. These options change where changelog consumers would need to look for things, so this is a more substantial ask than the previous two options.

Questions

How can I reproduce the checks for the on-disk size of paths?

This is currently based on a prototype change to Git, but I will be contributing it to the Git project soon. As that branch solidifies, I can point to it for those who want to compile their own Git version with the change. I’ll try to remember to point out when the tool is in a released version of Git for those who don’t want to build it from source.

Why is git push so inefficient?

The area of Git’s code that creates the packfile during git push has been optimized over and over by contributors focused on its use within Git servers. Serving a clone or a fetch has significantly different needs than pushing a small topic branch. Moreover, most Git servers are implemented using Git on the backend, and these inefficiencies are typically overwritten when the server does a full repack of the repository. This typical “repack everything” approach can hide these issues in many cases.

In my prototype to fix this, not only does the packfile shrink, but the time needed to compute it also shrinks. So there are end-user performance gains outside of this issue.

Why is Azure DevOps trusting deltas?

Unlike most other Git servers, Azure DevOps does not use the open source Git project on its backend. That means that nearly everything has been reimplemented from scratch (with some help from the libgit2 project). One benefit of that choice is that certain architectural choices were made to enable Git scale that is not known to be possible on other servers.

One such choice is that Azure DevOps avoids anything that requires rewriting the entire repository contents. The git repack command that recomputes delta bases is typically run with the entire repository under consideration, so that model is not possible to follow. In this case, "incremental delta compression" was a problem that was left unsolved. This investigation has increased the priority of that work.

@derrickstolee
Author

I realized that there is a public repo that uses beachball (and not Azure DevOps) that demonstrates this behavior as well.

The microsoft/fluentui repo uses beachball and has these stats on clone:

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10444116 |      15092790
      Trees | 276503 |  40228587 |     244429615
      Blobs | 294500 | 633502881 |   10791187920

And here are its top paths by disk size:

TOP FILES BY DISK SIZE
=========================================================================================================
                                                                 Path | Count | Disk Size | Inflated Size
----------------------------------------------------------------------+-------+-----------+--------------
                                                           /yarn.lock |   802 |  56499126 |     889120488
                           /packages/react-experiments/CHANGELOG.json |   505 |  28457640 |     252723999
                      /packages/office-ui-fabric-react/CHANGELOG.json |  1270 |  25556509 |     902756623
           /packages/react-components/react-components/CHANGELOG.json |   176 |  20099290 |     244936649
                              /packages/react-charting/CHANGELOG.json |   590 |  20073030 |     208224460
                    /packages/react-docsite-components/CHANGELOG.json |   559 |  15873725 |     189061764
                              /packages/react-examples/CHANGELOG.json |   577 |  13615569 |     234949961
                                /packages/react-charting/CHANGELOG.md |   564 |  11146607 |     104337986
                                 /packages/experiments/CHANGELOG.json |   569 |  10596377 |     123662770
                        /packages/office-ui-fabric-react/CHANGELOG.md |  1263 |   8154247 |     261494258
                      /packages/react-docsite-components/CHANGELOG.md |   534 |   8098216 |      96846669
                                /packages/react-examples/CHANGELOG.md |   559 |   8052921 |     109921376
                             /packages/react-date-time/CHANGELOG.json |   484 |   7530590 |      86830898
                                       /packages/react/CHANGELOG.json |   577 |   7181528 |     747565376
                             /packages/react-experiments/CHANGELOG.md |   484 |   6766386 |      96479925
                         /packages/react-monaco-editor/CHANGELOG.json |   551 |   6406416 |      95392294
                                   /packages/utilities/CHANGELOG.json |   419 |   6367949 |      45048020
                                /packages/azure-themes/CHANGELOG.json |   682 |   6263035 |      78990294
                                 /packages/react-cards/CHANGELOG.json |   662 |   5481460 |      76464666
                               /packages/react-date-time/CHANGELOG.md |   460 |   5162339 |      52333668

After repacking with git repack -adf --window=50, I get the following space reduction:

TOTAL OBJECT SIZES BY TYPE
================================================
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10434515 |      15092790
      Trees | 276503 |  29176837 |     244429615
      Blobs | 294500 | 159878227 |   10791187920
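The before and after totals can be compared mechanically. The sketch below parses the two pipe-delimited tables as printed in this comment and computes the on-disk reduction per object type. The parser is a hypothetical helper written for this layout only.

```python
BEFORE = """\
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10444116 |      15092790
      Trees | 276503 |  40228587 |     244429615
      Blobs | 294500 | 633502881 |   10791187920
"""

AFTER = """\
Object Type |  Count | Disk Size | Inflated Size
------------+--------+-----------+--------------
    Commits |  20579 |  10434515 |      15092790
      Trees | 276503 |  29176837 |     244429615
      Blobs | 294500 | 159878227 |   10791187920
"""

def parse_size_table(text):
    """Map object type -> (count, disk_size, inflated_size)."""
    rows = {}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Data rows have four cells with a numeric count; skip header and rule.
        if len(cells) == 4 and cells[1].isdigit():
            rows[cells[0]] = tuple(int(cell) for cell in cells[1:])
    return rows

def disk_reduction(before, after, object_type):
    """Ratio of on-disk size before the repack to on-disk size after it."""
    return before[object_type][1] / after[object_type][1]

if __name__ == "__main__":
    before, after = parse_size_table(BEFORE), parse_size_table(AFTER)
    for object_type in before:
        print(object_type, round(disk_reduction(before, after, object_type), 2))
```

The inflated sizes are identical in both tables, as expected: the repack changes how objects are stored, not their content.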

And now, the CHANGELOG files are not completely dominating the top files by on-disk size:

TOP FILES BY DISK SIZE
===========================================================================================================
                                                                   Path | Count | Disk Size | Inflated Size
------------------------------------------------------------------------+-------+-----------+--------------
                                                             /yarn.lock |   802 |   4005164 |     889120488
                        /packages/office-ui-fabric-react/CHANGELOG.json |  1270 |   3676258 |     902756623
      /packages/fabric-website/src/files/OfficeBrandGuide_16Sep2016.pdf |     1 |   2106334 |       2186005
                /packages/dashboard-grid-layout/src/images/download.jpg |     1 |   1845249 |       1846117
                                /packages/react-examples/CHANGELOG.json |   577 |   1712605 |     234949961
                                   /packages/experiments/CHANGELOG.json |   569 |   1541330 |     123662770
             /packages/react-components/react-components/CHANGELOG.json |   176 |   1471271 |     244936649
  /packages/fluentui/docs/src/public/images/fluent-ui-logo-inverted.png |     3 |   1370372 |       1447493
                                     /packages/utilities/CHANGELOG.json |   419 |   1347729 |      45048020
                                                 /.yarn/releases/cli.js |     1 |   1335657 |       6741614
           /packages/fluentui/docs/src/public/images/fluent-ui-logo.png |     3 |   1272902 |       1341139
       /packages/fluentui/docs/src/public/images/fluent-ui-logo-dev.png |     3 |   1130989 |       1186897
                                /packages/web-components/CHANGELOG.json |   262 |   1126131 |      21774112
                         /packages/web-components/public/SegoeUI-VF.ttf |     1 |   1074046 |       1844524
                                /common/config/rush/npm-shrinkwrap.json |   138 |    952556 |      89326567
                                  /packages/merge-styles/CHANGELOG.json |   205 |    947635 |       9957729
                  /packages/react-components/react-table/CHANGELOG.json |   125 |    936879 |      16127439
                                     /packages/dashboard/CHANGELOG.json |   130 |    872942 |       8204838
                                    /apps/fabric-website/CHANGELOG.json |   452 |    869565 |      64803780
    /packages/fabric-website/src/files/ColorAccessibility_29Sep2016.pdf |     1 |    856621 |        927268
