Git Under the Hood

This blog post demystifies the inner workings of Git by breaking down its core concepts into digestible sections. It covers Git Objects, Git References, and The Three Trees, using practical examples to illustrate how Git tracks changes. By exploring the .git folder's internal structure, readers will gain confidence in using Git more effectively.

Git is one of the most widespread programming tools, but there are many developers who despite years of daily usage still fear making mistakes during specific operations. This is because many developers ascribe too much magic to Git and build a false mental model of Git, thinking that it must be far too complicated to understand. But in fact, it is quite simple!

Today our goal is to learn the right abstractions, which will give you more confidence when working with Git in the future. One of the most efficient ways to build up a deep understanding of Git is to take a look at how it works under the hood. Therefore, today we will take a deep dive into the inner workings of Git.

The concepts you will know about by the end of this post can grouped be into three categories:

We will dedicate a section to each category of concepts. Let’s go 🚀

📝 References

Some excellent materials to learn about Git:

📚 Introduction

📜 History

Linus Torvalds created Git in 2005 for the development of the Linux kernel. Git is a distributed version control system for tracking changes (no central repository).

🌰 Git in a Nutshell

Git operates by capturing snapshots of your project and storing these snapshots in the .git folder within your repository. Delving into this repository reveals that a project’s history is composed of a series of commits, with each commit acting as a snapshot at a certain moment. Branches and tags serve as references, pointing to these commits, which helps in managing the project’s version history efficiently.

📂 What’s in the .git folder?

Let’s find out! Therefore, we create a new dummy folder and initialize a new git repository:

$ mkdir dummy && cd dummy
$ git init

Git created a new .git folder, let’s take a look:

$ tree .git
  .git
  ├── branches # historical artifact
  ├── config
  ├── description
  ├── HEAD # pointer to the current branch or last commit (detached head)
  ├── index # staging area
  ├── info
  │   └── exclude
  ├── objects # database of git objects (seems pretty empty atm)
  │   ├── info
  │   └── pack
  └── refs # git references
      ├── heads # branches
      └── tags # tags

Just looking at the folder structure, we can already recognize the three concepts mentioned above:

  • HEAD and index are two of the three trees.
  • objects is a folder, but it seems to be empty at the moment.
  • the refs folder has two subfolders heads and tags

Okay, that was a good start! Let’s continue by taking a deeper look at the objects folder.

🗃️ Git Objects

As a small convenience, we open a second terminal where we use the watch command to continuously monitor 🔎 the changes in the .git/object folder:

$ watch tree .git/objects

Now, we can see what happens when we run a git command in the other terminal.

📄 What happens if we create a new file?

We create a simple Python program that greets the user:

$ echo 'print("hello!")' > hello.py
$ python hello.py
hello!

After we run

$ git add hello.py

add to add our program to the staging area, Git seems to create a new file in the objects folder.

.git/objects/
├── 87
│   └── 1d653255f9209504b5614b3f631a6bdec187e3
...

What kind of file could this be? Let’s try to print it:

$ cat .git/objects/87/1d65...
xK��OR04c((��+�P�H���WT��_�(

Hmm… looks just like gibberish 😅

Do you have an idea?

Maybe the file is compressed 🤔? Let’s try to uncompress it using zlib compression.

$ cat .git/objects/87/1d65... | zlib-flate -uncompress
blob 16print("hello!")

Tada 🎉 That worked! Git seems to store a zlib-compressed version of our hello.py file in the .git/objects folder. Furthermore, it seems to have prefixed its contents with the word blob plus the length of the file in bytes. Interesting!

But how does Git come up with this ridiculously long name❓️

Easy! Git just uses the SHA-1 hash of the content we just decoded.

We can verify this by running the command above again, but this time additionally pipe it through the sha1sum command:

$ cat .git/objects/87/1d65... | zlib-flate -uncompress | sha1sum
871d653255f9209504b5614b3f631a6bdec187e3

Nice 👍️ Let’s do a small recap: Every time we git add a file, Git creates a new blob object stored in the .git/objects folder. The file gets prefixed with a header blob #bytes and compressed using zlib compression. Finally, the name of the file is determined by the SHA-1 hash of the prefix content.

So, in some sense the .git/objects folder can be seen as a key-value store, that is content-addressable (meaning that the keys are derived from the contents of a file).

➡️ What happens if we create a commit?

In the previous section, we saw that files are stored as blob objects in the .git/objects database. But are there other kind of objects?

Let’s try what happens if we create a new commit.

$ git commit -m "first commit"

We take a look at our objects database on our second terminal:

.git/objects/
├── 81
   └── 9dcf8f883e6f376f502ac22d745a204cd1ebaf
├── 87 # blob object (hello.py)
   └── 1d653255f9209504b5614b3f631a6bdec187e3
├── bb
... └── eb0a397f7b73ef01553bc901185e459dea661d

If you follow along one of these files should have a different name for you. Can you figure out why?

Running git commit seems to have created two more files. Let’s investigate 🔍️! Again, we can try to use the zlib-flate command to uncompress the two new object files. Running the decompression on the 87d5f9 object

zlib-flate -uncompress < .git/objects/bb/eb0a...

yields:

commit 205tree 819dcf8f883e6f376f502ac22d745a204cd1ebaf
author Felix Andreas <[email protected]> 1647514185 +0100
committer Felix Andreas <[email protected]> 1647514185 +0100

first commit

This must be the commit object! Really? That’s it? That’s all of Git’s magic? A commit seems to be a plain text file containing the author & committer name as well as the commit message. Pretty simple, right?

Hmm … wait look at the first line: There it says tree. This is a concept we have not come across yet! But, if we look closely 🧐, we can see that the hash is the same one as of the third file in our .git/objects database.

Let’s print the third file.

$ zlib-flate -uncompress < .git/objects/81/9dcf...

Running this command yields:

tree 36100644 hello.py�e2U� ��aK?c�k����

The format of the output seems familiar! First a type, then the number of bytes, and the content of the object. Let me guess 🤔 … this must be the tree object! The tree objects seem to contain a reference to our blob object. The jibberish after hello.py is the 20-byte binary representation of the hash of our 871d65... blob object. But how to decode it - that I leave to you!

Degression - Header Format of Git Objects

The format which Git uses to store its object seems to be always the same: A concatenation of its type, length, a null byte, and the object’s content:

type length|content
 ^     ^   ^   ^
 |     |   |   |
 |     |   |   raw content
 |     |   null byte
 |     length in bytes
 blob, commit, or tree

🌳 What are tree objects?

We are still not sure what these “tree” things are. But I have an idea 💡! As the tree objects contain a reference to a blob object (which is a file), a tree object is probably just Git’s “fancy” name for a folder!

So, we can use these terms interchangeably:

  • file <-> blob
  • folder <-> tree

Let’s remember, the commit object points to the tree object which in turn points to a blob object. We can visualize their relationship:

Loading graph...

We need more trees 🌳! Let’s create a src folder and move our hello.py into it.

$ mkdir src
$ mv hello.py src

Let’s add and commit our changes:

$ git add -A
$ git commit -m "second commit"
.git/objects/
├── 14
   └── 6b3ae97b5c3dc65aa55c1577ed87f6caec5932
├── a2
   └── 36711ffcc61e1dc67182bbe58a4ddce15fd822
...

So, there seem to be two new objects in our database. Why? Let’s find out! But first I will let you in on a secret. We don’t have to manually decompress the git objects like we did before. Git already has a built-in function to inspect objects! It’s called git cat-file and you just have to pass it the first four letters of the object’s hash.

$ git cat-file -p 146b3a
tree a236711ffcc61e1dc67182bbe58a4ddce15fd822
parent bbeb0a397f7b73ef01553bc901185e459dea661d
author Felix Andreas <[email protected]> 1647514536 +0100
committer Felix Andreas <[email protected]> 1647514536 +0100

second commit

This seems to be the new commit object. That makes sense, right?

Ohh … there seems to be a new field parent which points to our first commit. Because the hash of the last commit is part of the next commit’s content, a commit’s hash depends on its parent’s hash. Therefore, commits form a chain, which provides some integrity.*

A commit object always points to a tree object, which corresponds to the root directory of the repository. In this case, the hash it points to is the hash of the other new object file.

$ git cat-file -p a23671
040000 tree 819dcf8f883e6f376f502ac22d745a204cd1ebaf  src

And, this indeed seems to be a new tree object! Did you notice that Git did not create new object files for the src folder and the hello.py file? This is because Git stores objects in a content-addressable manner. And if the content of a file or folder does not change there is no need to store it again in the database! Pretty clever 🤓, isn’t it?

After the second commit, we have five objects in our database and their relationship looks somewhat like this:

Loading graph...

Quite a picture!

📝 Recap - Git Objects

So, that was a lot to digest. Let’s recap!

Git stores objects in a content-addressable database located at .git/objects. There are different kinds of objects:

  • blob objects - basically a file
  • tree objects - basically a folder
  • commit objects - commit, points to a tree and a previous commit

All objects are immutable. Changing their content would require changing their hash which would make them new objects

The commit objects point to the previous commit, which creates a chain.

👉️ Git References

What we discussed in the last section, is in principle sufficient to have a working version control system. But it would be very cumbersome to only work with commit hashes. Therefore Git provides an abstraction called references which are pointers to commit.

Let’s see how Git stores these references. Therefore we change the watch command in our second terminal to print out the contents of the .git/refs folder

$ watch tree .git/refs

which should print something like:

.git/refs
├── heads # these are the branches
│   └── main
└── tags # tags are used to define releases

Hmm … pretty empty at the moment. But under the heads folders, there is a file called main. Remember that this is also the name of our current branch:

$ git branch
* main

Git seems to store branches in the heads folder. But what are branches exactly? Let’s try to print the contents of the main file:

$ cat .git/refs/heads/main 
146b3ae97b5c3dc65aa55c1577ed87f6caec5932

Wow, is it really that simple? Branches are just plain text files, which contain the hash of the commit they are pointing to. So in simple terms, branches are just references to a certain commit.

In theory, this means, that we should be able to create a branch feature by ourselves by just writing a commit hash to the .git/refs/heads/feature file. Let’s try:

$ echo 146b3a... > .git/refs/heads/feature

Let’s verify if that worked by running the git branch command:

$ git branch
  feature
* main

Indeed, we now have a new branch that points to the same commit as the main branch. We can switch over to our new feature branch by typing:

$ git switch feature
Switched to branch 'feature'

As we already have a program that greets somebody, let’s implement a goodbye feature 👋:

$ echo 'print("goodbye!")' > src/goodbye.py
# run the file
$ python src/goodbye.py
goodbye!

We add and commit the program to our Git repository:

$ git add src/goodbye.py
$ git commit -m "third commit"

If we now run

> git branch -v
* feature 98ed5f6 third commit
  main    146b3ae second commit

we can see that Git moved the feature branch to point to the third commit.

After the last operation, we can visualize the current state of our Git repository. It looks like this:

Loading graph...

📝 Recap - Git References

Let’s recap! Git references are just pointers to a particular commit. They are stored in the .git/refs folder and are plain text files containing the hash of the commit. When we run a command like git commit, Git automatically moves the pointer of the branch that we are on to the new commit. What exactly it means to be on a branch, is what we discuss in the next section. Hint: It has something to do with the HEAD thing.

🌴🌳🌲 The Three Trees of Git

Now we will discuss the third important concept of Git. In the context of this section, a tree is a snapshot of your project. To anticipate a little, in Git there are three kinds of trees:

  • the working tree - what you see in your file explorer
  • index - binary file .git/index (sometimes called staging area)
  • HEAD - last commit, next parent

The working tree is your file system tree. You can see this as your sandbox. It is the easiest tree to manipulate. You can just open it in your file explorer or code editor.

The index is the tree where you move something to when you run git add. It is located in .git/index and is stored in a binary format, which is beyond the scope of this post to explain. When you run the git commit command the contents of your next commit will match the contents of the current index. You can therefore conceptualize the index as the next proposed commit.

Finally, the HEAD is a pointer to your current branch which in turn points to a commit. So basically the HEAD is the commit you have currently checked out. Git uses this commit as the parent for your next commit. The HEAD can also directly point to a particular commit (without going through a branch). We call this state a detached HEAD state. The HEAD is stored in .git/HEAD and is also just a plain text file.

cat .git/HEAD 
ref: refs/heads/feature

In our dummy repository HEAD points to the feature branch, which makes sense as we switched to it in the last section.

Git provides multiple commands to manipulate these trees. Some of them are visualized in the figure below:

Loading graph...

A brief overview of what these commands do:

  • The git add command moves a file from your working tree to the index.
  • The git commit command creates a new commit object from the current index and points HEAD to it (or the branch HEAD is pointing to)
  • The git restore command can restore files in the working tree or index using the index or HEAD as source

That’s it! Knowing these commands, you should be confident enough to manipulate all of Git’s trees 🌳.

🔄 Git Reset

There is one more thing! The git reset command is a versatile tool that operates on the three trees of Git we’ve discussed: The working tree, the index, and HEAD. Its power lies in its ability to unstage changes, revert commits, and alter the history of a branch. Unlike git restore, which selectively changes files or folders, git reset alters the entire state of the repository at once.

To help visualize the effect of the git reset command on the three trees, let’s consider a simple branch where we’ve made two commits:

Loading graph...

git reset can be used in three main modes, each affecting these trees in different ways:

  • Soft Reset (--soft): This mode affects only the HEAD. It moves the HEAD to a specified commit but does not alter the index or the working tree. This is useful for undoing commits while keeping all changes staged.
git reset --soft <commit-hash>

Loading graph...

  • Mixed Reset (--mixed): The default mode of git reset. This moves the HEAD to a specified commit and also updates the index to match this commit. However, it leaves the working tree untouched. This effectively unstages any changes made after the specified commit, making it useful for undoing additions to the staging area.
git reset <commit-hash>

Loading graph...

  • Hard Reset (--hard): This is the most drastic form of reset. It moves the HEAD back to a specified commit, and it also makes the index and the working tree exactly match this commit. Any changes made after the specified commit will be lost. This mode is useful when you need to revert all changes and return to a known good state.
git reset --hard <commit-hash>

Loading graph...

It’s essential to use git reset with caution, especially with the --hard option, as it can lead to loss of work. Always make sure your work is backed up or you’re sure about reverting changes before running a hard reset.

🌟 Wrapping Up

This exploration under the hood of Git has demystified its core mechanics and shown that, at its heart, Git is elegantly simple. We’ve seen how Git’s seemingly complex behavior is grounded in three fundamental concepts: Git objects, Git references, and the three trees. These building blocks work together to create a powerful, flexible version control system that feels almost magical in its ability to manage and merge complex project histories.

As you continue your journey with Git, keep exploring and experimenting. The knowledge of what’s happening beneath the command line will give you confidence and inspire you to leverage Git to its full potential, ensuring your projects are well-managed and your history is clean and navigable.

If you still want to know more about the three trees of Git, I can recommend this excellent talk.

❓️ Some Questions to test your knowledge

  • Name three different kinds of Git objects
  • Name three different kinds of Git references
  • Name the three trees of Git