git clone --filter + git sparse-checkout downloads only the required files
E.g., to clone only the files under the subdirectory small/ of this test repository: https://github.com/cirosantilli/test-git-partial-clone-big-small-no-bigtree
git clone -n --depth=1 --filter=tree:0 \
https://github.com/cirosantilli/test-git-partial-clone-big-small-no-bigtree
cd test-git-partial-clone-big-small-no-bigtree
git sparse-checkout set --no-cone small
git checkout
You could also select multiple directories for download with:
git sparse-checkout set --no-cone small small2
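If you are unsure what ended up being selected, you can print the sparse-checkout patterns currently in effect with:
git sparse-checkout list
which on the setup above should just show the patterns you passed to set.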
This method doesn't work for individual files, however; another method that does is described at: How to sparsely checkout only one single file from a git repository?
In this test, clone is basically instantaneous, and we can confirm that the cloned repository is very small as desired:
du --apparent-size -hs * .* | sort -hs
giving:
2.0K small
226K .git
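As an extra sanity check, we can also ask Git for the largest objects that are actually present in the local object database; if the partial clone worked, nothing should come anywhere near the 10 MB of the big files (a quick sketch, exact sizes will vary):
# Print every local object as "<oid> <type> <size>" and show the five largest.
git cat-file --batch-all-objects --batch-check | sort -k 3 -n | tail -n 5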
That test repository contains:
- a big/ subdirectory with 10x 10 MB files
- 10x 10 MB files 0, 1, ..., 9 at the toplevel (this is because certain previous attempts would download toplevel files)
- small/ and small2/ subdirectories with 1000 files of size one byte each
All contents are pseudo-random and therefore incompressible, so we can easily notice if any of the big files were downloaded, e.g. with ncdu.
So if you download anything you didn't want, you would get 100 MB extra, and it would be very noticeable.
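For reference, the layout described above could be reproduced with something along these lines (just a sketch of the structure, not necessarily how the repository was actually generated):
# Recreate the described layout with incompressible pseudo-random data.
mkdir -p big small small2
for i in $(seq 0 9); do
  head -c 10M /dev/urandom > "big/$i"
  head -c 10M /dev/urandom > "$i"   # toplevel copies of the big files
done
for i in $(seq 0 999); do
  head -c 1 /dev/urandom > "small/$i"
  head -c 1 /dev/urandom > "small2/$i"
done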
On the above, git clone downloads a single object, presumably the commit:
Cloning into 'test-git-partial-clone-big-small'...
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Receiving objects: 100% (1/1), done.
and then the final checkout downloads the files we requested:
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 3 (delta 0), pack-reused 0
Receiving objects: 100% (3/3), 10.19 KiB | 2.04 MiB/s, done.
remote: Enumerating objects: 253, done.
remote: Counting objects: 100% (253/253), done.
remote: Total 253 (delta 0), reused 253 (delta 0), pack-reused 0
Receiving objects: 100% (253/253), 2.50 KiB | 2.50 MiB/s, done.
Your branch is up to date with 'origin/master'.
Tested on git 2.37.2, Ubuntu 22.10, in January 2023.
TODO also prevent download of unneeded tree objects
The above method downloads all Git tree objects (i.e. directory listings, but not actual file contents). We can confirm that by running:
git ls-files
and seeing that it lists the paths of the large files, such as:
big/0
In most projects this won't be an issue, as these should be small compared to the actual file contents, but the perfectionist in me would like to avoid them.
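One way to see this is to group the locally present objects by type: with --filter=tree:0 plus the sparse checkout above, I would expect to see the commit, the tree objects pulled in by the checkout, and only the small blobs (a sketch, counts will vary):
# Count local objects by type (commit / tree / blob / tag).
git cat-file --batch-all-objects --batch-check | awk '{print $2}' | sort | uniq -c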
I've also created a very extreme repository with some very large tree objects (100 MB) under the directory big_tree: https://github.com/cirosantilli/test-git-partial-clone-big-small
Let me know if anyone finds a way to clone just the small/ directory from it!
About the commands
The --filter option was added together with an update to the remote protocol, and it truly prevents objects from being downloaded from the server.
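If you want to see the filtering in action, Git can print the objects it knows are referenced but has not downloaded; on a partial clone they show up prefixed with a question mark (I've mostly checked this with blob:none filters, so treat it as a rough probe):
# Print referenced-but-absent objects; each missing object is prefixed with "?".
git rev-list --objects --missing=print HEAD | grep '^?' | head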
The sparse-checkout part is also needed, unfortunately. You can also download only certain files with the much more understandable:
git clone --depth 1 --filter=blob:none --no-checkout \
https://github.com/cirosantilli/test-git-partial-clone-big-small
cd test-git-partial-clone-big-small
git checkout master -- d1
but that method for some reason downloads files one by one very slowly, making it unusable unless you have very few files in the directory.
Another less verbose but failed attempt was:
git clone --depth 1 --filter=blob:none --sparse \
https://github.com/cirosantilli/test-git-partial-clone-big-small
cd test-git-partial-clone-big-small
git sparse-checkout set small
but that downloads all files in the toplevel directory: How to prevent git clone --filter=blob:none --sparse from downloading files on the root directory?
The dream: any directory can have web interface metadata
This feature could revolutionize Git.
Imagine having all the code base of your enterprise in a single monorepo without ugly third-party tools like repo.
Imagine storing huge blobs directly in the repo without any ugly third party extensions.
Imagine if GitHub would allow per file / directory metadata like stars and permissions, so you can store all your personal stuff under a single repo.
Imagine if submodules were treated exactly like regular directories: just request a tree SHA, and a DNS-like mechanism resolves your request, first looking in your local ~/.git, then to closer servers (your enterprise's mirror / cache), and finally ending up on GitHub.
I have a dream.
The test cone monorepo philosophy
This is a possible philosophy for monorepo maintenance without submodules.
We want to avoid submodules because it is annoying to have to commit to two separate repositories every time you make a change that has a submodule and non-submodule component.
Every directory with a Makefile or analogous should build and test itself.
Such directories can depend on either:
- every file and subdirectory directly under it, at their latest versions
- external directories, which can be relied upon only at specified versions
Until git starts supporting this natively (i.e. submodules that can track only subdirectories), we can support this with some metadata in a git tracked file:
monorepo.json
{
  "path": "some/useful/lib",
  "sha": "12341234123412341234"
}
where sha refers to the usual SHA of the entire repository. Then we need scripts that will check out such directories, e.g. under a gitignored monorepo folder:
monorepo/some/useful/lib
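A hypothetical helper for that checkout step could look something like this (assumes jq is installed and that the pinned sha is reachable from origin; not part of the test repositories above):
#!/bin/sh
# Materialize the dependency described in monorepo.json under monorepo/.
set -e
path=$(jq -r .path monorepo.json)
sha=$(jq -r .sha monorepo.json)
mkdir -p monorepo
# Make sure the pinned commit is available locally (skip if already present
# or if the server refuses fetching by SHA), then extract just that
# subdirectory at that commit into the gitignored monorepo/ folder.
git fetch --quiet origin "$sha" || true
git archive "$sha" "$path" | tar -x -C monorepo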
Whenever you change a file, you have to go up the tree and test all directories that have a Makefile. This is because directories can depend on subdirectories at their latest versions, so you could always break something above you.
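A rough sketch of that upward walk (hypothetical script, assuming each Makefile has a test target):
#!/bin/sh
# Run "make test" in every ancestor directory of the changed file that has a
# Makefile, stopping at the repository root.
set -e
dir=$(cd "$(dirname "$1")" && pwd)
root=$(git rev-parse --show-toplevel)
while true; do
  [ -f "$dir/Makefile" ] && make -C "$dir" test
  [ "$dir" = "$root" ] && break
  dir=$(dirname "$dir")
done
Usage would be e.g.: ./test-up.sh some/useful/lib/main.c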
Related: