How to find/identify large commits in git history?


I have a 300 MB git repo. The total size of my currently checked-out files is 2 MB, and the total size of the rest of the git repo is 298 MB. This is basically a code-only repo that should not be more than a few MB.

I suspect someone accidentally committed some large files (video, images, etc), and then removed them... but not from git, so the history still contains useless large files. How can find the large files in the git history? There are 400+ commits, so going one-by-one is not practical.

NOTE: my question is not about how to remove the file, but how to find it in the first place.


🚀 A blazingly fast shell one-liner 🚀

This shell script displays all blob objects in the repository, sorted from smallest to largest.

For my sample repo, it ran about 100 times faster than the other ones found here.
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5.6 million objects in just over a minute.

The Base Script

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

When you run above code, you will get nice human-readable output like this:

0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.


To achieve further filtering, insert any of the following lines before the sort line.

To exclude files that are present in HEAD, insert the following line:

grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |

To show only files exceeding given size (e.g. 1 MiB = 220 B), insert the following line:

awk '$2 >= 2^20' |

Output for Computers

To generate output that's more suitable for further processing by computers, omit the last two lines of the base script. They do all the formatting. This will leave you with something like this:

0d99bb93129939b72069df14af0d0dbda7eb6dba 542455 path/to/some-image.jpg
2ba44098e28f8f66bac5e21210c2774085d2319b 12446815 path/to/hires-image.png
bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f 65183843 path/to/some-video-1080p.mp4


File Removal

For the actual file removal, check out this SO question on the topic.

Understanding the meaning of the displayed file size

What this script displays is the size each file would have in the working directory. If you want to see how much space a file occupies if not checked out, you can use %(objectsize:disk) instead of %(objectsize). However, mind that this metric also has its caveats, as is mentioned in the documentation.

More sophisticated size statistics

Sometimes a list of big files is just not enough to find out what the problem is. You would not spot directories or branches containing humongous numbers of small files, for example.

So if the script here does not cut it for you (and you have a decently recent version of git), look into git-filter-repo --analyze or git rev-list --disk-usage (examples).

Can I make 'git diff' only display the line numbers AND changed file names?

Rollback a Git merge