Avoiding git-lfs bandwidth waste with GitHub and CircleCI

If you work on a software project, there’s a good chance you’re using a version control system and a continuous integration system, and that the project requires some kind of binary assets. Assets are usually not stored directly in the code repository, for reasons that range from the bandwidth required by clone operations to git’s performance issues with binary files, especially ones that change often. git-lfs is a great tool that lets you commit binary files while minimizing these issues, and it reduces the need for additional infrastructure to handle such files outside the repository.

This article explores how to avoid wasting bandwidth when managing git-lfs repositories with GitHub and CircleCI.

git-what?

The raison d’être for git-lfs is to provide a git workflow to work with large binary files without impacting the speed of the usual git operations or the size of the repository.

In order to do so, git-lfs needs to be configured (per-repository) with a list of files it needs to track. When tracking a file, git-lfs does not store the real file in the git repository: it stores a pointer file instead. A pointer file is a small file containing a little bit of metadata about the real file. This means that git only ever sees the small pointer files, and its operations are not slowed down by the need to operate on very large files. A downside of this approach is that git-lfs needs both client and server support to be able to seamlessly push and pull the tracked files.
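
To make the idea of a pointer file concrete, here is a minimal sketch of what one looks like and how its metadata can be read. The three-field layout (version, oid, size) follows the git-lfs pointer format as I understand it; the oid and size values are invented for illustration, and the temporary file stands in for a tracked file in a real repository.

```shell
# Write an illustrative pointer file (the oid and size are invented values).
pointer_file="$(mktemp)"
cat > "$pointer_file" <<'EOF'
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
EOF

# Extract the object id and the size of the real file from the pointer.
oid="$(awk '$1 == "oid" { sub(/^sha256:/, "", $2); print $2 }' "$pointer_file")"
size="$(awk '$1 == "size" { print $2 }' "$pointer_file")"

echo "oid:  $oid"
echo "size: $size bytes"

rm -f "$pointer_file"
```

The pointer is a few lines of text regardless of how large the real asset is, which is why git operations stay fast.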

If the server does not support git-lfs, the assets will not be uploaded. If a collaborator does not have git-lfs installed, they will not be able to pull the assets (even if they are on the server) and will get just the pointer files instead. Luckily, GitHub natively supports git-lfs. So you only have to tell git-lfs which files to track and make sure the project’s collaborators have git-lfs installed, and the usual git workflow will keep working, until you add continuous integration to the mix.

The CI system

Continuous integration (CI) is incredibly useful if you wish to automate work, like running tests on each push to confirm the code is correct, automating the build system and so on and so forth. Obviously, for a continuous integration system to do anything useful it needs to have access to your code.

It follows logically that for a continuous integration system to correctly pull your code, including those pesky assets, it must support git-lfs. Luckily, CircleCI also supports git-lfs natively, so technically nothing else needs to be done. The default configuration, however, does nothing to save bandwidth. Of course we aim to fix that.

The bandwidth issue

GitHub provides 1GB of git-lfs storage and 1GB of git-lfs bandwidth for free, with additional storage and bandwidth available as data packs. At the time of this writing, each data pack costs $5 and provides 50GB of storage and 50GB of bandwidth.

So, if your continuous integration system springs into action on each push (which is usually how CI systems are set up) and it needs to pull a lot of assets each time (which is especially likely if you vendor your dependencies), you might see 50GB of bandwidth deplete very fast. This is because CircleCI needs to perform a full clone to guarantee a clean working environment in which to build your code and perform tests. That’s not a problem during the normal development cycle, since an existing clone only downloads the assets that changed, and most assets won’t change very often.
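
To get a feeling for how fast that can happen, here is a back-of-the-envelope sketch. The asset size and the number of builds per day are hypothetical figures, not measurements, and 1GB is treated as 1000MB to keep the arithmetic simple.

```shell
# Hypothetical figures: adjust them to your own project.
assets_mb=500        # git-lfs assets downloaded by each full clone
builds_per_day=20    # CI builds triggered per day

pack_mb=$((50 * 1000))                      # one data pack: 50GB of bandwidth
daily_mb=$((assets_mb * builds_per_day))    # bandwidth used per day
builds_per_pack=$((pack_mb / assets_mb))    # builds until the pack is exhausted
days_per_pack=$((pack_mb / daily_mb))       # days until the pack is exhausted

echo "bandwidth per day: ${daily_mb}MB"
echo "builds per data pack: ${builds_per_pack}"
echo "days per data pack: ${days_per_pack}"
```

With these made-up numbers a data pack lasts 100 builds, or about a working week.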

The strategy to avoid wasting bandwidth here is to use CircleCI’s caching mechanism and a nifty functionality exposed by git-lfs.

Step 1: clone just the pointer files

Exporting the GIT_LFS_SKIP_SMUDGE=1 environment variable will make git-lfs pull the pointer files rather than the assets during a clone. This is ideal because we avoid using GitHub’s git-lfs bandwidth.

So, a minimal CircleCI config at this step might look something like this:

version: 2

jobs:
  pull-code:
    docker:
      - image: some/image

    environment:
      GIT_LFS_SKIP_SMUDGE: "1"

    steps:
      - checkout

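I called the job that pulls the code pull-code, but you can call it anything you like. Keep in mind that a version 2 config without a workflows section only runs a job named build, so either use that name or reference your job from a workflow.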

checkout is a built-in CircleCI step, which basically clones your repository. The environment option is set so that git-lfs will only clone the pointer files as explained above. Of course, if the assets are needed to compile and/or test your code, this configuration is inadequate. In that case we’ll need to fetch the actual assets via git-lfs, but only when strictly needed, to avoid wasting bandwidth.

Luckily CircleCI lets us define caches so that we can reuse files without the need to re-fetch, re-build, re-generate or re-anything them unless needed.

Step 2: cache the assets

CircleCI caches are identified by a unique key and they are immutable. The cache key is user-defined, and CircleCI provides some ways to make it easier to generate unique keys.

One of the ways we can do this is by letting CircleCI calculate a checksum of a file and use it as (part of) the key. This has the secondary advantage that when the checksum of the file changes the key changes too, and the cache will thus be automatically invalidated.

You can create a cache with the save_cache command and restore it using the restore_cache command.

Considering all of the above we can add the following to our configuration:

- restore_cache:
    key: v1-my-cache-key-{{ checksum ??? }}

- run:
    command: |
      git lfs pull

- save_cache:
    key: v1-my-cache-key-{{ checksum ??? }}
    paths:
      - .git/lfs

Let’s look at the example in greater detail. The steps are executed in the order they’re written, so we first and foremost try to restore an existing cache. If no cache with the specified key is found, nothing happens.

Only then do we run git lfs pull, which ignores GIT_LFS_SKIP_SMUDGE since a pull operation is explicitly requested, but does not download any tracked file that is already present locally and up-to-date with origin.

Then we save the cache using the same key, specifying the paths that contain the files that we want to put in the cache itself. In our case we’re lucky since git-lfs stores all its tracked files in the .git/lfs directory, so we don’t need to hunt for them.

The next time CircleCI runs it will find the cache with the specified key and will restore it from its own servers, at which point git lfs pull will have nothing to download, thus avoiding wasted bandwidth.

The ??? in the above example is a placeholder for a single file, which we still have to choose.

Another thing: you may have noticed that the cache key starts with v1. That’s a trick so that if for some reason you need to manually invalidate the cache you can always force it simply by bumping v1 to v2.

The missing step: correctly calculating the cache key

Let’s recap: we have a way to download the pointer files rather than the assets to avoid wasting bandwidth, and CircleCI lets us programmatically create unique cache keys based on a checksum. The checksum is only computed for a single file and it must change when the assets change, so that the cache gets invalidated.

The only thing we’re missing at this point is a way to calculate a checksum that will change when any of the assets change. There are undoubtedly many ways to do this. One possible way is to use the git lfs ls-files -l command, which lists all the files that are tracked by git-lfs, along with their unique identifiers.

git lfs ls-files -l | cut -d' ' -f1 | sort > .assets-id

This lists the unique identifier (typically the sha256 sum) of each file tracked by git-lfs along with its path, keeps only the unique identifiers (cut -d' ' -f1), sorts them, and saves the output to a file that will be used to compute part of the cache key.
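
To see what the pipeline does without needing a git-lfs repository at hand, here is a small sketch that feeds it an invented listing. The layout of each line (identifier, a marker, then the path) mirrors what git lfs ls-files -l prints as far as I know; the identifiers and paths are made up and shortened for readability.

```shell
# Invented sample of `git lfs ls-files -l` output: "<oid> <marker> <path>".
# The ids and paths are made up; real ids are 64 hex characters.
sample_listing='ccc333 - assets/textures/wall.png
aaa111 - assets/logo.png
bbb222 * vendor/lib/blob.bin'

# Same pipeline as above, fed from the sample instead of git-lfs:
# keep the first space-separated field (the id), then sort.
assets_id="$(printf '%s\n' "$sample_listing" | cut -d' ' -f1 | sort)"

printf '%s\n' "$assets_id"
```

Only the identifiers survive, one per line and in a stable order, so renaming or moving an asset does not change the result while modifying its contents does.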

Please note that sorting the output is non-optional because ls-files does not guarantee any kind of sorting. If two different calls to ls-files return differently-sorted lists even when the assets themselves are the same, the cache keys will be different even if the assets did not in fact change.
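
The effect of sorting can be checked with a quick sketch. It assumes sha256sum (GNU coreutils) is available and uses it as a stand-in for the checksum CircleCI computes over the file contents; the identifiers are invented.

```shell
# The same three (invented) ids, listed in two different orders.
list_a="$(printf '%s\n' aaa111 bbb222 ccc333)"
list_b="$(printf '%s\n' ccc333 aaa111 bbb222)"

# Helper: print only the hex digest of whatever arrives on stdin.
digest() { sha256sum | cut -d' ' -f1; }

# Checksums of the unsorted listings differ even though the ids are the same.
raw_a="$(printf '%s\n' "$list_a" | digest)"
raw_b="$(printf '%s\n' "$list_b" | digest)"

# Checksums of the sorted listings are identical.
sorted_a="$(printf '%s\n' "$list_a" | sort | digest)"
sorted_b="$(printf '%s\n' "$list_b" | sort | digest)"

[ "$raw_a" != "$raw_b" ] && echo "unsorted: checksums differ"
[ "$sorted_a" = "$sorted_b" ] && echo "sorted: checksums match"
```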

Putting it all together

That was the last piece of the puzzle! We now have a file whose checksum will change when and only when any asset changes.

Our final configuration looks like this:

version: 2

jobs:
  pull-code:
    docker:
      - image: some/image

    environment:
      GIT_LFS_SKIP_SMUDGE: "1"

    steps:
      - checkout

      - run:
          command: |
            git lfs ls-files -l | cut -d' ' -f1 | sort > .assets-id

      - restore_cache:
          key: v1-my-cache-key-{{ checksum ".assets-id" }}

      - run:
          command: |
            git lfs pull

      - save_cache:
          key: v1-my-cache-key-{{ checksum ".assets-id" }}
          paths:
            - .git/lfs

And that’s it! You can now unleash your CI system on your GitHub repositories that make use of git-lfs, without the fear of wasting bandwidth.