Mastering Git
Mastering Git: From Advanced Workflows to Internal Architecture
An Intermediate Guide to Git
Chase Zhang Splunk Inc. May 21, 2017
Table of Contents
- Advanced Techniques
- Warming Up
- History Management
- Other Toys
- Underneath Git Command
- Git’s Object Model
- Git’s Internal Files
- Optimizations
- Git HTTP Server
- Provide info/refs
- Serving Fetch Request
- Accept Push Request
- References
Advanced Techniques
Warming Up
If you are not familiar with previous commands, please refer to some tutorials on the Internet:
Basic Status and Commits
# Show current status
git status
# Show commit logs on current branch
git log
# Add changes under current directory
git add .
# Reset status of HEAD
git reset HEAD
# Commit
git commit -m "commit message"Branching and Checking Out
# Checkout a branch
git checkout branch_name
# Checkout a commit
git checkout 23abef
# Revert unstaged changes
git checkout .
# Create a new branch
git checkout -b branch_name
# Rename a branch
git branch -m new_branch_name
# Delete a branch
git branch -d branch_name
# Merge from another branch
git merge another_branch_nameRemote Operations
# Add a remote upstream
git remote add origin git@remote.repo.url
# Fetch from upstream repo
git fetch
# Pull from upstream repo
git pull
# Push onto upstream repo
git pushHistory Management
Resetting Commits
# Revert commit softly. Changes will be kept
git reset --soft HEAD~3
# Revert commit hardly. Changes will lose
git reset --hard HEAD~3
# Revert commit mixed. Stage will be clear,
# changes will be kept (default)
git reset --mixed HEAD~3Reset Modes Comparison
| Mode | HEAD | Staging Area | Working Directory | Use Case |
|---|---|---|---|---|
--soft | ✅ Moves | ✅ Keeps changes | ✅ Keeps changes | Re-commit with different message |
--mixed | ✅ Moves | ❌ Clears | ✅ Keeps changes | Unstage and recommit selectively |
--hard | ✅ Moves | ❌ Clears | ❌ Clears | Completely discard commits |
Restoring Lost Commits (Reflog)
# Accidentally reverted a commit?
# No panic, there is a way to restore it.
git reflog
# Checkout the hash point you'd like be back to
git checkout 42efba
# Replace the origin branch
git checkout -B masterCherry-Picking
# Pick a commit from other branch, without merging it
git cherry-pick 892bfe
# Pick several commits at once
git cherry-pick 892bfe..42fdab
# If there is a conflict, you have to resolve it
# And use this to continue
git cherry-pick --continue
# Or use this to abort
git cherry-pick --abortCherry-Pick Workflow
Key Points:
- Cherry-pick copies a commit, creating a new commit with a different SHA
- The original commit remains in the source branch
- Useful for hotfixes: pick a fix from develop to main without full merge
Rebase
A more convenient way is to use rebase:
# Rebase HEAD with a previous commit in the same branch
git rebase HEAD^10
# Rebase current branch to master
git rebase masterInteractive Rebase
Use interactive rebase, we can change the history easily!
git rebase -i HEAD~3What can interactive rebase do:
- Pick or drop a commit
- Change commit orders
- Change commit message
- Modify edit contents
- Squash two commits into one
Merge vs. Rebase
It is highly recommended that we always use rebase instead of merge when syncing local developing branches with upstream.
Question: What’s the difference between git merge and git rebase?
Branch Model: Git Merge
Scenario: You are on develop branch and want to sync with master.
master
↓
A --- B --- C
\
D --- E
↑
develop master
↓
A --- B --- C
\ \
D --- E --- M (merge commit)
↑
developResult: A new merge commit M is created, combining C and E. Both branch histories are preserved.
Branch Model: Git Rebase
Scenario: You are on develop branch and want to sync with master.
master
↓
A --- B --- C
\
D --- E
↑
develop master
↓
A --- B --- C --- D' --- E'
↑
developResult: Commits D and E are replayed on top of C, creating new commits D' and E' with different SHAs. The history becomes linear.
Comparison:
- git merge
- Will not change any commits in both master and develop branches.
- Will generate a new merge commit at develop branch.
- Once merged back, merge commits in develop will appears in master.
- git rebase
- Will rewrite commits from develop branch.
- Will eliminate empty commits and keep logs clean.
Duplicate Commits Scenario
If two developers make the same change, git rebase can eliminate the second commit (the redundant one), whereas merge keeps both.
Figure: git rebase will eliminate empty commits
Key Point: The second commit (which was a duplicate) is automatically eliminated during rebase because it produces no diff when replayed.
Branch Flow: Git Merge vs. Git Rebase
| Aspect | git merge | git rebase |
|---|---|---|
| Commit History | Preserves all original commits | Rewrites commits (new SHA) |
| Merge Commits | Creates merge commit | No merge commit |
| Timeline | Non-linear (branches visible) | Linear (flat history) |
| Conflicts | Resolve once | Resolve per commit |
| Safety | Safe for public branches | Dangerous for public branches |
| Use Case | Merging feature to main | Syncing feature with main |
The Golden Rule The golden rule of git rebase is to never use it on public branches.
Additional Notices
- Once rebase is applied, you may have to use
git push -fto push local changes to remote. You should never do this on master, but it’s ok for your own branches. git pullis actually equivalent togit fetch + git mergeby default. You can override this by:git pull --rebase
Other Toys
Some git commands you may not know:
- git grep:
- Multi threads, will be much faster than bare grep command.
- Output to less by default.
- git clean:
- Clean uncommitted files.
- Can’t be reverted!
- gitk, git gui: build-in GUI client of Git.
Visualizing the Tree Let git log show tree graph other than just a commit list.
git log --graph --oneline --decorateConvenient Alias It is convenient to make an alias command:
git config alias.lg "log --color --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr) %C(bold blue)<%an>%Creset' --abbrev-commit"
git config alias.ci commit
git config alias.st statusToys and Resources with Git Hooks
- lolcommits: take a photo of yourself every time you make a commit. (Link)
- Continuous Delivery: Heroku, Dokku, Deis.
- git-jira-hook: Automatically retrieve JIRA number from branch name and prepend to each commit. (Link)
- What The Commit: Humorous commit message generator. (Link)
Underneath Git Command
Git’s Object Model
To describe Git’s Object Model is actually the same as answering the question below because our file system is just a tree like data structure.
Question: How can we make an immutable in memory tree, which we can look up its modification history and revert to previous state at will?
Four Object Types
Git stores everything as objects, identified by SHA-1 hash:
| Object Type | Description | Content |
|---|---|---|
| blob | File content | Raw file data (no filename!) |
| tree | Directory | List of blobs and trees with names |
| commit | Snapshot | Points to tree + parent commit + metadata |
| tag | Named reference | Points to commit + tag metadata |
Git Object Relationship
Key Insight: When file2.txt is modified:
- A new blob (
B3) is created for the new content - A new tree (
T2) references the unchangedB1and newB3 - The unchanged
file1.txtblob is reused (content-addressable storage)
Git Tree Model
The Simplest Approach (Inefficient)
- For every change, copy everything and save them somewhere.
- Use an array of pointers pointing to each copy according to their order.
- Look up old time points by retrieving copies from the array.
- Result: Costs too much memory and is very slow.
Optimized Git Object Tree
The PDF shows two diagrams: First, a single tree with "Change This" marker, then Track Nodes added:
Key Points from PDF Figure:
- Track Node 1 has a pointer to Track Node 0 (previous node / parent commit)
- Both Track Nodes point to their respective Root copies
- File A is shared between both versions (unchanged content reuses same object)
- Only File B' is new (the modified file)
Conclusion: Optimizations
- Partial Copying: Once there is a change, we don’t copy everything, but only the path from root to the node that has been modified. This saves memory and time.
- Track Nodes: Use a series of track nodes to preserve the order of each time point. Each node has a pointer to the copy of root and a pointer to the previous node.
- Traversal: To find a history state, find the track node, get the root copy, and traverse the tree to get the snapshot.
Summary
- Git makes an object for every directory and file and saves them into
.git/objects. - Each TrackNode corresponds to a commit in Git’s object model; commits are also saved in
.git/objects. - Every operation maintains these object files and ensures consistency with source directories.
Git’s Internal Files
Let’s have an exploration of Git’s internal files (some files are omitted):
.git
├── HEAD
├── objects
│ ├── 0c
│ ├── 0d
│ ├── info
│ └── pack
├── refs
│ ├── heads
│ └── tags- objects: Contains all objects. Indexed by SHA1 hash. First two characters form directory names (categories).
- refs: Contains all pointers like branches and tags.
- HEAD: Points to the commit hash of the current working directory.
Viewing Objects Git’s objects are compressed with zlib. You can view content using Python (zlib.decompress) or the native command:
git cat-file -p 5b67fd90081Manual Commit Interaction
You can commit files manually without git commit:
# Write a new file object
git hash-object -w test.txt
# Create a new tree object and write a tree
git update-index --add --cacheinfo 100644 83baae61804e65cc73a7201a7252750c76066a30 test.txt
git write-tree
# Commit a tree
git commit-tree d8329f
# Read a tree from storage and apply to current working directory
git read-tree 5fd87Bare Repositories Update and maintain working directory is costly and unsafe. For remote repos, we usually init them without a working directory:
git init --bareOptimizations
Ideally, we should only save the diff if a huge file changed slightly. Git can pack files to save space and speed up lookups.
# Let git perform GC and pack files
git gc
# git gc will run every time you execute git push
git pushPackfiles are saved under .git/objects/pack.
Pack File Structure
Delta Compression:
- Similar objects are stored as deltas (differences from a base object)
- Reduces storage by 10-100x for large repositories
- Index file allows O(1) lookup by SHA
Shortcomings & Solutions Git's object model is slow for huge files or too many files.
- GLFS (Git Large File System): Open source tool saving huge files to external cloud storage. (Link)
- GVFS (Git Virtual File System): Microsoft project; won't download an object until it is read for the first time. (Link)
Git HTTP Server
An implementation of a Git HTTP server might be simpler than imagined.
Provide info/refs
- Before fetch/push, client sends requests to
info/refs. - Client finds minimum amount of objects to transfer.
- This is an RPC invoke requiring a Git client on the server.
Handling Requests
Requests have a URL parameter named service. Call the local git command:
git [service] --stateless-rpc --advertise-refsStream STDOUT back to the client via HTTP, prepending a head block:
Figure: Head Block Structure
001E # service = [service] 0000
↑ ↑
Length of head block EOF marker
(in HEX, 4 chars)Serving Fetch Request
Support fetching request is very simple. As the client has already known what to fetch from the info/refs call. It will fetch all files by itself and we’ll just serve objects directory as static files.
Accept Push Request
Push request is a little more complex than fetch. Although client git has already known what to push to the server, for the sake of security and concurrency, writing action must be performed in a safer and faster way. It is an RPC invoke too.
Figure: Handle Push Requests
Execution
Push requests will have a URL parameter named command, as before, we pass this command to git directly:
git [command] --stateless-rpcIf command name is receive-pack, we have to do one step more:
git update-server-infoThat’s it! We now have a workable Git HTTP Server. (More info: Git Internals - Transfer Protocols).