Introducing Scalar: Git at scale for everyone

Derrick Stolee

Git is a distributed version control system, so by default each Git repository has a copy of all files in the entire history. Even moderately-sized teams can create thousands of commits adding hundreds of megabytes to the repository every month. As your repository grows, Git may struggle to manage all that data. Time spent waiting for git status to report modified files or git fetch to get the latest data is time wasted. As these commands get slower, developers stop waiting and start switching context. Context switches harm developer productivity.

At Microsoft, we support the Windows OS repository using VFS for Git (formerly GVFS). VFS for Git uses a virtualized filesystem to bypass many assumptions about repository size, enabling the Windows developers to use Git at a scale previously thought impossible.

While supporting VFS for Git, we identified performance bottlenecks using a custom trace system and collecting user feedback. We made several contributions to the Git client, including the commit-graph file and improvements to git push and sparse-checkout. Building on these contributions and many other recent improvements to Git, we began a project to support very large repositories without needing a virtualized filesystem.

Today we are excited to announce the result of those efforts – Scalar. Scalar accelerates your Git workflow, no matter the size or shape of your repository. And it does so in ways we believe can all make their way into Git, with Scalar doing less and Git doing much more over time.

[Figure: Scalar logo]

Scalar is a .NET Core application with installers available for Windows and macOS. Scalar maximizes your Git command performance by setting recommended config values and running background maintenance. You can clone a repository using the GVFS protocol if your repository is hosted by Azure Repos. This is how we will support the next largest Git repository: Microsoft Office.

What about Linux?
There is potential for porting Scalar to Linux, so please comment on this issue if you would use Scalar on Linux.

In the rest of this post, I’ll share three important lessons that informed Scalar’s design:

  1. Focus on the files that matter.
  2. Reduce object transfer.
  3. Don’t wait for expensive operations.

Finally, I'll share our plan for contributing these features to the Git client. You can get started with Scalar using the instructions below.

Quick start for existing repositories

Scalar accelerates Git commands in your existing repositories, no matter what service you use to host those repositories. All you need to do is register your biggest repositories with Scalar and then see how much faster your Git experience becomes.

To get started, download and install the latest Scalar release. Scalar currently requires a custom version of Git. We plan to remove that requirement after we contribute enough features to the core Git client.

Before beginning, ensure you have the correct versions:

$ git version
git version 2.25.0.vfs.1.1

$ scalar version
scalar 20.01.165.7

From the working directory of your Git repository, run scalar register to make Scalar aware of your repository.

$ scalar register
Successfully registered repo at '/Users/stolee/_git/git'

When you register your repository, Scalar sets up some local Git config options and starts running background maintenance. If you decide that you do not want Scalar running maintenance, then scalar pause will delay all maintenance for 12 hours, or scalar unregister will stop all future maintenance on the current repository.

You can watch what Scalar does by checking the log files in your .git/logs directory. For example, here is a section of logs from my repository containing the Git source code:

[2020-02-05 11:24:00.9711 -05:00] run (Start) {"Version":"20.01.165.7","EnlistmentRoot":"/Users/stolee/_git/git","Remote":"https://github.com/git/git","ObjectsEndpoint":"https://github.com/git/git","MaintenanceTask":"commit-graph","PackfileMaintenanceBatchSize":"","EnlistmentRootPathParameter":"/Users/stolee/_git/git","StartedByService":true,"Area":"run_Verb","Verb":"run"}
[2020-02-05 11:24:00.9797 -05:00] TryWriteGitCommitGraph (Start)
[2020-02-05 11:24:00.9806 -05:00] RunGitCommand (Start) {"Area":"CommitGraphStep","gitCommand":"WriteCommitGraph"}
[2020-02-05 11:24:01.2120 -05:00] RunGitCommand (Stop) {"DurationMs":229}
[2020-02-05 11:24:01.2297 -05:00] Information {"Message":"commit-graph list after write: graph-6928d994cab880ad7e30fa9f406d01bd0c7bbe6c.graph;graph-cf5d2151c2cfac0451686fafdd6de8bb9111d0d9.graph;commit-graph-chain;graph-0c676dd4d1ff904528c8563a39de8c0e3928ba01.graph;"}
[2020-02-05 11:24:01.2298 -05:00] RunGitCommand (Start) {"Area":"CommitGraphStep","gitCommand":"VerifyCommitGraph"}
[2020-02-05 11:24:01.2518 -05:00] RunGitCommand (Stop) {"DurationMs":21}
[2020-02-05 11:24:01.2518 -05:00] TryWriteGitCommitGraph (Stop) {"DurationMs":272}
[2020-02-05 11:24:01.2522 -05:00] run (Stop) {"DurationMs":333}

These logs show the details from updating the Git commit-graph in the background, the equivalent of the scalar run commit-graph command.

You can run maintenance in the foreground using the scalar run command. When given the all option, Scalar runs all maintenance steps in a single command:

$ scalar run all
Setting recommended config settings...Succeeded
Fetching from remotes...Succeeded
Updating commit-graph...Succeeded
Cleaning up loose objects...Succeeded
Cleaning up pack-files...Succeeded

The scalar run command exists so you can run maintenance tasks on your own schedule or in conjunction with the background maintenance schedule provided by scalar register.

Quick start for using the GVFS protocol

If you are considering using Scalar with the GVFS protocol and Azure Repos, then you can try cloning a new enlistment using scalar clone <url>. Scalar automatically registers this new enlistment, so it will benefit from all the config options and maintenance described above.

By following the snippet below, you can clone a mirror of the Scalar source code using the GVFS protocol:

$ scalar clone https://dev.azure.com/ms-scalar/_git/scalar
Clone parameters:
  Repo URL:     https://dev.azure.com/ms-scalar/_git/scalar
  Branch:       Default
  Cache Server: Default
  Local Cache:  /Users/stolee/.scalarCache
  Destination:  /Users/stolee/_git/t/scalar
  FullClone:     False
Authenticating...Succeeded
Querying remote for config...Succeeded
Using cache server: None (https://dev.azure.com/ms-scalar/_git/scalar)
Querying remote for repo info...Succeeded
Cloning...Succeeded
Fetching from origin (no cache server)...Succeeded
Registering repo...Succeeded

Note that this repository is not large enough to really need the GVFS protocol. We have not set up a GVFS cache server for this repository, but any sufficiently large repository being used by a large group of users should set up a co-located cache server for handling GVFS protocol requests. If you do not have the resources to set up this infrastructure, then perhaps the GVFS protocol is not a good fit, and instead you could use scalar register on an existing Git repository using the Git protocol.

When using scalar clone, the working directory contains only the files at root using the Git sparse-checkout feature in cone mode. You can expand the files in your working directory using the git sparse-checkout set command, or fully populate your working directory by running git sparse-checkout disable.

$ cd scalar/src
$ ls
AuthoringTests.md      Directory.Build.targets  SECURITY.md      global.json
CONTRIBUTING.md          License.md               Scalar.ruleset   nuget.config
Dependencies.props     Protocol.md              Scalar.sln
Directory.Build.props    Readme.md                Signing.targets

$ git sparse-checkout set Scalar Scripts/Mac
Receiving packfile 1/1 with 45 objects (bytes received): 127638, done.

$ ls
AuthoringTests.md      Directory.Build.targets  SECURITY.md     Scripts
CONTRIBUTING.md        License.md               Scalar          Signing.targets
Dependencies.props     Protocol.md              Scalar.ruleset  global.json
Directory.Build.props  Readme.md                Scalar.sln      nuget.config

$ ls Scalar
CommandLine  Images  Program.cs  Scalar.csproj

Note that the clone created a scalar directory and placed the working directory inside a src directory one level down. This layout lets you create sibling directories for build output files, keeping them out of the working directory and reducing the work Git must do to manage your repository. This leads to the first big lesson we learned about making Git as fast as possible.

Lesson 1: Focus on the files that matter

The most common Git commands are git status to see what changes are available, git add to stage those changes before committing, and git checkout to update your working directory to match a different version. We call these the core commands.

Each core command inspects the working directory to see how Git’s view of the working directory compares with what is actually on disk. There are a few different ways to measure how “big” this work can be: the index size, the populated size, and the modified size.

[Figure: Venn diagram of the index size, populated size, and modified size]

Index size

The Git index is a list of every tracked path at your current HEAD. This file is read and written by each core command, so its size represents a minimum amount of work for those commands.

Pro Tip!
If you are struggling with the size of your index, then you can try running git config feature.manyFiles true to take advantage of the updated index version and Git’s untracked cache feature.

In the Windows OS repository, the index contains over three million entries. We minimize the index file size by using an updated version of the index file format, which compresses the index file from 400 MB to 250 MB. Since this size primarily impacts reading and writing a stream from a single file, the average time per index entry is very low.
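If you are curious which index version your repository uses, one low-level way to check is to read the index header: the four bytes after the DIRC signature store the version. A minimal sketch, assuming a Unix-like shell with xxd available:

$ git config feature.manyFiles true    # opt in to index version 4 and the untracked cache
$ git update-index --index-version 4   # rewrite the index in the newer format right away
$ xxd -l 8 .git/index                  # header: "DIRC" signature, then the version
00000000: 4449 5243 0000 0004                      DIRC....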

Populated size

How many paths in the index are actually in your working directory? This is normally equal to the number of tracked files in the index, but Git’s sparse-checkout feature can make it smaller. It takes a little bit of work to design your repository to work with sparse-checkout, but it can allow most developers to populate a fraction of the total paths and still build the components necessary for their daily work.

Scalar leans into the sparse-checkout feature, so much so that the scalar clone command creates a sparse working directory by default. At the start, only the files in the root directory are present. It is up to the user to request more directories, increasing the populated size. This mode can be overridden using the --full-clone option.
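If you registered an existing repository rather than using scalar clone, you can opt into cone-mode sparse-checkout yourself. A short sketch; the directory names are placeholders for whatever components you work on:

$ git sparse-checkout init --cone          # populate only root-level files
$ git sparse-checkout set Scalar Scripts   # add the directories you need
$ git sparse-checkout disable              # restore a fully populated working directory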

The populated size is always at most the number of tracked files. The average cost of populating a file is much higher than adjusting an index entry due to the amount of data involved, so it is more critical to minimize the number of populated files than to minimize the total number of paths in the repository. It is even more expensive to determine which populated files were modified by the user.

Modified Size

The modified size is the number of paths in the working directory that differ from the version in the index. This includes all files that are untracked or ignored by Git. This size determines the minimum amount of work that Git must do to update the index and its caches during the core commands.

Without assistance, Git needs to scan the entire working directory to find which paths were modified. As the populated size increases, this can become extremely slow.

fsmonitor in action
For some developers in the Microsoft Office team, their sparse-checkout definition requires around 700,000 populated files among the three million tracked files. When there are no modified files, git status takes 12.2 seconds with fsmonitor disabled and only 1.5 seconds with it enabled.

Scalar painlessly configures your Git repository to work better with modified files using the fsmonitor Git feature and the Watchman tool. Git uses the fsmonitor hook to discover the list of paths that were modified since the last index update, then focuses its work on inspecting only those paths instead of every populated path. Our team originally contributed the fsmonitor feature to Git, and we continue to contribute improvements.
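Scalar wires this up automatically, but a rough sketch of the manual setup, assuming Watchman is installed and using the sample hook that ships with Git, looks like this:

$ cp .git/hooks/fsmonitor-watchman.sample .git/hooks/query-watchman
$ git config core.fsmonitor .git/hooks/query-watchman   # ask Watchman for changes before each index refresh
$ git config core.untrackedCache true                   # cache untracked-directory scans as well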

Lesson 2: Reduce object transfer

Now that the working directory is under control, let’s investigate another expensive dimension of Git at scale. Git expects a complete copy of all objects, both currently referenced and all versions in history. This can be a massive amount of data to transfer — especially when you only need the objects near your current branch to do a checkout and get on with your work.

For example, in the Windows OS repository, the complete set contains over 100 GB of compressed data. This is incredibly expensive for both the server and the client. Not only is that a lot of data to transfer over the network, but the client needs to verify that all 90 million Git objects hash to the correct values.

We created the GVFS protocol to significantly reduce object transfer. This protocol is currently only available on Azure Repos. It solved one of the major issues with adapting Git to very large repositories by relaxing the distributed nature of Git to become slightly more coupled to a central server for missing objects. It has since inspired the Git partial clone feature which has very similar goals.
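If your repository is hosted elsewhere, the closest upstream analogue is a “blobless” partial clone, which likewise downloads commits and trees up front and fetches blobs on demand. A sketch, assuming your host supports partial clone:

$ git clone --filter=blob:none <url>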

When using the GVFS protocol, an initial clone downloads a set of pack-files containing only commits and trees. A clone of the Windows OS repository downloads about 15 GB of data containing 40 million commits and trees. With these objects on-disk, we can generate a view of the working directory and examine commit history using git log.

The GVFS protocol also allows dynamically downloading Git objects as-needed. This pairs well with our work to reduce the populated size using sparse checkout, since reducing the populated size reduces the number of required objects.

To reduce latency and increase throughput, we allow the GVFS protocol to be proxied through a set of cache servers that are co-located with the end users and build machines. This has an added bonus of reducing stress on the central server. We intend to contribute this idea to the Git protocol.

Lesson 3: Don’t wait for expensive operations

There is no free lunch. Large repositories require upkeep. We can’t make users wait, so we defer these operations to background processes.

Git typically handles maintenance by running garbage collection (GC) with the git gc --auto command at the end of several common commands, like git commit and git fetch. Auto-GC checks your .git directory to see if certain thresholds are met to run garbage collection. If the thresholds are met, it completely rewrites all object data, a process that includes a CPU-intensive compression step. This can cause simple commands like git commit to be blocked for minutes. A rewrite of tens of gigabytes of data can also bring your entire system to a standstill because it consumes all the CPU and memory resources it can.

You can already disable automatic garbage collection by setting gc.auto to zero. However, this has the downside that your Git performance will decay slowly as you accumulate new objects through your daily work.

VFS for Git and Scalar both solve this problem by maintaining the repository in the background. This is also done incrementally to reduce the extra load on your machine. Let’s explore each of these background operations and how they improve the repository.

The config step updates your Git config settings to some recommended values. The config step runs in the background so that new versions of Scalar can update the registered repositories after install. As new config options are supported, we will update the list of settings accordingly.

Some of the noteworthy config settings are listed below, with a sketch of the equivalent git config commands after the list:

  1. We disable auto-GC by setting gc.auto=0. This prevents your Git commands from being blocked by expensive maintenance. The background maintenance keeps your Git object database clean.

  2. We disable writing the commit-graph during git fetch by setting fetch.writeCommitGraph=false, because we write it in the background (see below).

  3. We set status.aheadBehind=false to remove the calculation of how far ahead or behind your branch is compared to the remote-tracking branch. This message is frequently ignored, but can cost precious seconds when you just want to see your unstaged changes.

  4. We set core.fsmonitor to a hook that communicates with Watchman, if Watchman is installed.
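Scalar applies all of these for you, but the list above amounts to roughly the following invocations; the fsmonitor hook path here is a placeholder:

$ git config gc.auto 0
$ git config fetch.writeCommitGraph false
$ git config status.aheadBehind false
$ git config core.fsmonitor .git/hooks/query-watchman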

Fetch in the background

The fetch step runs git fetch about once an hour. This allows your local repository to keep its object database close to that of your remotes. This means that the time-consuming part of git fetch that downloads the new objects happens when you are not waiting for your command to complete.

We intentionally do not change your local branches or your remote-tracking branches in refs/remotes. You still need to run git fetch in the foreground when you want ref updates from your remotes. Instead, we run git fetch with a custom refspec that puts all remote refs into a separate namespace: refs/scalar/hidden/<remote>/<branch>. This gives us starting points when writing the commit-graph.
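For the origin remote, such a fetch looks something like the sketch below (the exact refspec Scalar uses may differ):

$ git fetch --no-tags origin "+refs/heads/*:refs/scalar/hidden/origin/*"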

Write the commit-graph

The Git commit-graph is critical to performance in repositories with hundreds of thousands of commits. While it is enabled and written during git fetch by default since Git 2.24.0, that does require a little bit of extra overhead in foreground fetches. To recover that time during git fetch while maintaining performance, we update the commit-graph in the background.

By running git commit-graph write --split --reachable, we update the commit-graph to include all reachable commits (including those reachable from refs in refs/scalar/hidden) and use the incremental file format to minimize the cost of these background operations.
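You can run the same maintenance yourself and then check the result, mirroring the write and verify pair visible in the background logs shown earlier:

$ git commit-graph write --reachable --split
$ git commit-graph verify --shallow   # check only the newest layer of the chain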

Clean up loose objects

As you work, Git creates “loose” objects by writing the data of a single object to a file named according to its SHA-1 hash. This is very quick to create, but accumulating too many objects like this can have significant performance drawbacks. It also uses more disk space than necessary, since Git’s pack-files can compress data more efficiently using delta encoding.

To reduce this overhead, the loose objects step cleans them up: it prunes loose objects that already appear in pack-files and batches the remainder into a new pack-file.
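In plain Git, the closest equivalents are counting your loose objects and pruning the ones that are already packed; a sketch of that first half:

$ git count-objects -v   # the "count" and "size" lines report loose objects
$ git prune-packed       # delete loose objects already stored in pack-files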

Index multiple pack-files

Pack-files are very efficient ways to store a set of Git objects. Each .pack file is paired with a .idx file called the pack-index, which allows Git to find the data for a packed object quickly. As pack-files accumulate, Git needs to inspect a long list of pack-indexes to find objects, so a previously fast operation becomes slow. Normally, garbage collection would occasionally group these pack-files into a single pack-file, improving performance.

But what happens if we have too much data to efficiently rewrite all Git data into a single pack-file? How can we keep the performance of a single pack-file while also performing smaller maintenance steps?

Our solution is the Git multi-pack-index file. Inspired by a similar feature in Azure Repos, the multi-pack-index tracks the location of objects across multiple pack-files. This file keeps Git’s object lookup time the same as if we had repacked into a single pack-file. Scalar runs git multi-pack-index write in the background to create the multi-pack-index.
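You can try this in any repository with a recent Git version:

$ git multi-pack-index write    # writes .git/objects/pack/multi-pack-index
$ git multi-pack-index verify   # check the file against each pack-index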

Clean up pack-files

[Figure: The multi-pack-index maintenance loop]

However, there is still a problem. If we let the number of pack-files grow without bound, Git cannot hold file handles to all pack-files at once. Rewriting pack-files could also reduce space costs due to better delta encoding.

To solve this problem, Scalar has a pack-file maintenance step which performs an incremental repack by selecting a batch of small pack-files to rewrite. The multi-pack-index is a critical component of this rewrite: after the new pack-file is added to the multi-pack-index, the old pack-files are still listed there, but every one of their objects now resolves to the new pack-file. Any Git process looking at the new multi-pack-index will never read from the old pack-files.

Concrete results for Windows
When we deployed these maintenance steps to the Windows OS developers, we saw that some repositories had thousands of packs that summed to 150-200 gigabytes. These repositories now have fewer than one hundred packs totaling 30-50 gigabytes.

The git multi-pack-index repack command collects a set of small pack-files and creates a new pack-file containing all of the objects the multi-pack-index references from those pack-files. Then, Git adds the new pack-file to the multi-pack-index and updates those object references to point to the new pack-file. We then run git multi-pack-index expire which deletes the pack-files that have no referenced objects. By performing these in two steps, we avoid disrupting other Git commands a user may run in the foreground.
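A sketch of that two-step loop, using the batch-size option to limit each pass to small pack-files (the 2g value here is only an example):

$ git multi-pack-index repack --batch-size=2g   # rewrite a batch of small packs into one new pack
$ git multi-pack-index expire                   # delete packs whose objects have all moved away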

Scalar and the future of Git

We are intentionally making Scalar do less and investing in making Git do more. Scalar is simply a way to get the performance we need today. As Git improves, you will eventually be able to transition away from Scalar and use only the core Git client.

Scalar also serves as an example for the kinds of features we need in Git to remove these management layers on top. Here are a few of our planned Git contributions for the coming years.

  • Scalar relies on a stable and correct filesystem watcher to scale growth in modified size, and Watchman does that decently well. However, Watchman is a much more general tool than we need, and it isn’t “Git aware.” It doesn’t know when a directory matches a .gitignore pattern and that we don’t need to scan it for changes. By creating a custom filesystem watcher in Git itself, we can optimize this interface to our needs.

  • The sparse-checkout feature is how we scale growth in populated size. While the recent updates to the sparse-checkout feature made it faster and easier to use, we have a long way to go before that feature is complete.

  • Now that we are using sparse-checkout instead of a virtualized filesystem, we have new bottlenecks for Git commands. In particular, git checkout is not as fast as when using VFS for Git. With virtualization tricks, VFS for Git can act as if the filesystem is updated, delaying the cost of the populated size to later operations. We are investigating a parallel version of git checkout to improve performance.

  • The GVFS protocol allowed Azure Repos to quickly support the Windows OS repository. After that success, a cross-community group created the partial clone feature in Git. Partial clones do not have a local copy of every reachable object and request missing objects when needed. Partial clone needs a few client-side improvements and support from service providers. When implementing Scalar, we reworked how Git interacts with the GVFS protocol to be inside the partial clone interface, so improvements to one experience will benefit the other. As the Microsoft Office team onboards to Scalar, we expect to find new ways that Git can better interact with partial clone.

  • To truly scale Git services to the demands of thousands of engineers and build machines interacting with a central server, Git needs a notion similar to the GVFS cache servers. It could be as simple as a fetch-objects URL in addition to the fetch and push URLs in the remote config. While the branch updates would still come from the central authority, clients could download the pack-file from the fetch-objects URL. We plan to propose this concept on the mailing list soon.

  • We mentioned earlier how the Git client depends on periodic foreground garbage collection to keep repositories running smoothly. This is simply not feasible for very large repositories, and we plan to contribute a form of background maintenance to the core Git client. This will be an opt-in feature, and we hope to create a command such as git maintenance start that is as easy to use as scalar register.

I will be presenting these ideas and more at Git Merge 2020, so please check out the livestream at 12:00pm PT on March 4, 2020.

Please, give Scalar a try and let us know if it helps you. Is there something it needs to do better? Please create an issue to provide feedback.

5 comments


  • Vincent Thorn

    MS made a wrong, dilettantish decision in selecting Git as a DVCS. Now they pay for that mistake by inventing crutches for this rubbish. Mercurial is a real DVCS, properly designed from the very start. It’s a pity that a big company like MS makes such silly decisions.

    • Max Vasilyev

      Is that why Bitbucket is removing support for Hg this year?

    • Sun Kim

      I don’t know whether Mercurial is technically superior to Git, but sometimes that matters less than the community support behind a product. Clearly, Git has many more users and far more community support.

  • Paul Dunn

    Hi Derrick,

    Is there a help guide or instructions for setting up a GVFS cache server against a Git repo hosted by Azure DevOps? Will you demo using a cache server in your presentation at Git Merge 2020?

    Thanks,
    Paul Dunn

  • Eli Black

    This sounds super cool!

    I particularly like the idea of automatically running fetch in the background, to speed up download times.
