Coda File System

What is GIT?

Linus Torvalds needed a new way to manage the Linux kernel source tree and (besides BitKeeper) didn't find an SCM system that seemed to fit well in his existing workflow. So he wrote a set of small programs to efficiently archive, update, and distribute tree snapshots.

Git does this by storing a compressed copy of each file under a name derived from the SHA1 checksum of the file's contents. As a result, unmodified files need no extra storage. Once the file blobs are stored, the directories (now referencing the new SHA1 objects) are updated and stored the same way. Finally a commit object is added, which refers to the new top-level tree and to the parent commit(s) that the current tree is derived from. Checking out a tree starts with the commit, and then recursively extracts each tree and file object.
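As a rough illustration of the naming scheme, the sketch below computes a blob name with ordinary tools. It assumes sha1sum from GNU coreutils is available and relies on the fact that git hashes a short 'blob <size>' header followed by a NUL byte ahead of the file contents. The scratch directory and file names are made up for the example, and compression is left out.

```shell
# Compute a git-style blob ID: SHA1 over "blob <size>\0" followed by the contents.
blob_id() {
    size=$(wc -c < "$1")
    { printf 'blob %d\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

demo=$(mktemp -d)                   # hypothetical scratch directory
printf 'hello\n'  > "$demo/a.txt"
printf 'hello\n'  > "$demo/b.txt"   # same contents under a different name
printf 'hello!\n' > "$demo/c.txt"   # modified contents

id_a=$(blob_id "$demo/a.txt")
id_b=$(blob_id "$demo/b.txt")
id_c=$(blob_id "$demo/c.txt")

echo "$id_a a.txt"
echo "$id_b b.txt"
echo "$id_c c.txt"
```

The first two files map to the same object name, so the second one costs no extra storage, while the modified file gets a new name.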

Why does this work so well with Coda?

Objects are only written once. Any modification will change the SHA1 checksum and create a new object in the repository. So the only possible conflicts are either directory conflicts, which we can already resolve automatically, or two people simultaneously committing an identical change (i.e. they applied the same patch) and creating the exact same file. That case is technically not really a conflict: the file contents are identical, so either version would be perfectly fine. Coda will still mark this as a conflict since the two copies will have slightly different metadata, but it is a fairly trivial fix.

Git also encourages branching. If the repository is shared, several developers can work on branches within the same repository without conflicting with each other's commits. In fact, there is no need to lock the repository for a commit. Every update starts with files, then updates directories, and adds the commit object last. As a result an update works well even when reintegrating over a slow link: others will only see the commit after everything it depends on is already available on the servers.

The repository layout and naming work well with respect to our directory size limits. The 'objects' tree contains 256 subdirectories. Objects are placed in one of these subdirectories based on the first byte of the SHA1 hash, which results in a fairly even distribution. The object names are also only 18 characters long, which means that the complete directory entry (which includes things like the vnode and uniquifier) fits in a single 32-byte directory allocation unit. This would mean that we can refer to close to 8000 files from any single subdirectory, and the repository should be able to store up to 2 million objects, by which time we are probably hitting other problems than our (admittedly low) directory size limit.
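The capacity estimate is just the product of the two figures quoted above. A quick check, with variable names chosen only for this example:

```shell
subdirs=256              # one subdirectory per possible first byte of the hash
entries_per_dir=8000     # approximate per-directory limit quoted above
capacity=$((subdirs * entries_per_dir))
echo "$capacity"
```

This comes to a little over 2 million objects, matching the estimate.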

The way new objects are added seems almost perfect for Coda's directory-based ACLs. When a new object is added, a temporary copy is written into the top-level objects directory, and then moved to the appropriate subdirectory and name. This means that an ordinary developer only needs read, lookup and insert rights for the subdirectories. Only an administrator might at some point see a need to (carefully) prune unreferenced objects, but nobody ever needs to overwrite anything.

As far as the top-level objects directory is concerned, the developer only needs lookup, insert and delete rights. There is no reason to read anything here, and writing is implicitly allowed for (virgin) files that were just created by the same user. Delete permission is necessary to complete the rename operation, but will be mostly harmless. The subdirectories are pinned down by the undeletable objects they contain. If someone removes a temporary object that is about to be committed, the developer performing the commit should notice fairly quickly, since the final rename will fail. The developer will at this point still have the original in his tree and can simply redo the commit.
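The write-then-rename sequence described above can be sketched with plain shell tools. This is only an illustration under stated assumptions: the repository is a scratch directory, the object is uncompressed example text, its name is a plain SHA1 of that text (via sha1sum from GNU coreutils), and the temporary file name is made up. The comments note which rights each step would need.

```shell
repo=$(mktemp -d)               # hypothetical stand-in for the shared repository
mkdir -p "$repo/objects"

content='example object'        # placeholder contents for the sketch
id=$(printf '%s' "$content" | sha1sum | cut -d' ' -f1)
dir="$repo/objects/$(printf '%s' "$id" | cut -c1-2)"   # subdirectory from the first byte
name=$(printf '%s' "$id" | cut -c3-)                   # remainder of the hash

# 1. Write a temporary copy at the top level (needs insert rights there).
tmp=$(mktemp "$repo/objects/tmp_obj_XXXXXX")
printf '%s' "$content" > "$tmp"

# 2. Move it into place (needs delete at the top level, insert in the subdirectory).
mkdir -p "$dir"
if [ -e "$dir/$name" ]; then
    rm -f "$tmp"                # already stored; existing objects are never overwritten
else
    mv "$tmp" "$dir/$name"
fi
```

Nothing in the sequence modifies an existing object, which is why write rights on the subdirectories are not needed.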

The only other places of contention are the refs/heads and refs/tags directories, where people share their branch information by creating a file that contains the SHA1 of the last commit to their branch. However, none of the core git commands care if there is a '/' character in the branch name; they will simply recurse. So it is possible to create per-developer branch directories and use branch names that look like 'jaharkes/devel'. Only gitk needs a small patch to recurse through these per-developer directories.
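A small sketch of what such a layout looks like on disk, using a scratch directory instead of the real repository. The developer name 'alice', the branch 'bugfix' and the commit IDs are placeholders invented for the example.

```shell
repo=$(mktemp -d)               # hypothetical stand-in for the shared repository
mkdir -p "$repo/refs/heads/jaharkes" "$repo/refs/heads/alice"

# Each branch is a file holding the SHA1 of its latest commit (placeholder IDs).
echo 1111111111111111111111111111111111111111 > "$repo/refs/heads/master"
echo 2222222222222222222222222222222222222222 > "$repo/refs/heads/jaharkes/devel"
echo 3333333333333333333333333333333333333333 > "$repo/refs/heads/alice/bugfix"

# Recursing through the per-developer directories yields the branch names.
branches=$(cd "$repo/refs/heads" && find . -type f | sed 's|^\./||' | LC_ALL=C sort)
echo "$branches"
```

Each developer then only needs insert rights in their own directory under refs/heads.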

Great, so how do I get git?

That's a somewhat harder question. Development is going at a rapid pace, but there isn't really an official release yet. Then there is a naming conflict with 'GNU Interactive Tools'. So even if distributions manage to work around the naming conflict, they will probably still have a hard time staying up to date with the current version. Luckily the underlying disk format is pretty much stabilized by now; only the command line options and some of the convenience scripts tend to change on a daily basis.

So your best bet at the moment is to try and build one of the hourly snapshots from codemonkey.co.uk.

Tracking Coda's (experimental) git repository

As my understanding (and the code) develop further, the following sections are constantly being revised.

Accessing the current development tree

Initial setup

We clone the central repository to a local copy. By specifying the -l and -s flags, git will not actually copy anything, but creates a reference to the main repository which will be checked whenever something is not found in our local tree.

repo=/coda/coda.cs.cmu.edu/project/coda/dev/coda.git
git clone -l -s $repo coda
cd coda
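To see what the -l and -s flags leave behind, the following sketch clones a throwaway repository in a scratch directory and prints the reference to the original object store. It assumes a reasonably recent git; 'central' and 'copy' are made-up names standing in for $repo and the local clone.

```shell
work=$(mktemp -d)               # scratch area for the example
git init -q "$work/central"     # 'central' stands in for the shared repository
cd "$work/central"
echo data > file
git add file
git -c user.name="Example User" -c user.email="user@example.invalid" \
    commit -q -m "initial"

cd "$work"
git clone -q -l -s "$work/central" copy

# The clone borrows objects through this file instead of copying them.
alternates="$work/copy/.git/objects/info/alternates"
cat "$alternates"
```

The file lists the object directory of the repository that was cloned, which is where git looks when an object is missing locally.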

We can also replace some references with symlinks to simplify tracking branches we're interested in. If we don't set up this link, we would have to run git fetch once in a while to update the local copy.

ln -sf $repo/refs/heads/master .git/refs/heads/origin

Building the checked out tree

We don't place any generated files in the repository, so we have to bootstrap the tree. This is the same procedure we use when building from a copy checked out from the CVS repository.

./bootstrap.sh
./configure # optionally add '--prefix=/usr'
make

View changes since your last update

We want to get the commit messages that are currently in CVS, but exclude the ones that are already part of our local checked out copy.

git log origin ^HEAD

or if you want to see the individual patches as well as the commit log messages,

git whatchanged -p origin ^HEAD
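The 'origin ^HEAD' notation selects commits reachable from origin but not from HEAD. The sketch below reproduces this in a scratch repository, assuming a reasonably recent git; the branch name 'upstream' is a made-up stand-in for the tracked origin head, and 'HEAD..upstream' is the equivalent two-dot spelling.

```shell
work=$(mktemp -d)               # scratch area; all names here are hypothetical
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.invalid"

echo one > file
git add file
git commit -q -m "first"
local_branch=$(git symbolic-ref --short HEAD)

# 'upstream' plays the role of the tracked branch and gains one commit.
git checkout -q -b upstream
echo two >> file
git commit -q -a -m "second"
git checkout -q "$local_branch"

# Commits we do not have yet, in both notations.
old_style=$(git log --format=%s upstream ^HEAD)
new_style=$(git log --format=%s HEAD..upstream)
echo "$old_style"
```

Both forms list only the commit that the local branch is missing.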

Bringing the checked out copy up-to-date (merging with another branch)

git merge "Merging CVS updates" HEAD origin

Really what we are doing here is merging any updates that went into the CVS/master branch in the shared repository into the currently checked out branch.

If we have no local changes, the merge is a trivial fast-forward and the last argument will not be used. Otherwise, if there were no merge conflicts, the last argument will be used in an automatically generated log message by prepending it with 'Merge'. If there were one or more conflicts, the files that failed are shown and will have to be fixed by hand.

Conflicts are represented much like failed CVS merges: the files will have '<<<<', '====', and '>>>>' markers around any conflicting regions.
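Before committing a hand-fixed merge it can help to check that no markers were left behind. A minimal sketch using grep on a scratch tree; the file names, their contents and the labels after the markers are invented for the example and will differ in a real merge.

```shell
tree=$(mktemp -d)               # hypothetical working tree
cat > "$tree/conflicted.c" <<'EOF'
int limit =
<<<<<<< HEAD
    10;
=======
    20;
>>>>>>> origin
EOF
printf 'int ok = 1;\n' > "$tree/clean.c"

# List files that still contain unresolved conflict markers.
unresolved=$(grep -rlE '^(<<<<<<<|>>>>>>>)( |$)' "$tree")
echo "$unresolved"
```

Only the file with leftover markers is reported; an empty result means the tree is clean as far as this check can tell.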

Creating a new branch

Let's say we want to create a new branch based on the current HEAD.

git checkout -b branch

Switching between branches

We can switch between branches with git checkout. By adding a -f option to 'git checkout' we tell it to discard any local changes that have not yet been committed. If we don't specify -f, git will correctly track added or removed files, but we will carry over any pending changes which we might not want to commit to the new branch.

git checkout branch

Reviewing local (uncommitted) changes

Checking for modifications in the working tree,

git status
git diff # modified files that are not yet selected to be committed
git diff HEAD # modified files ready to be committed

Adding new files to be included in the next commit,

git add path/to/file

Committing local changes

Committing the changes,

git commit --all

git has a problem figuring out the correct email address, at least on my system. Adding the following COMMITTER and AUTHOR exports to .bashrc (or their setenv equivalents to .cshrc) will avoid a lot of headaches in the long run.

export GIT_COMMITTER_NAME="My Name"
export GIT_COMMITTER_EMAIL="me@my.domain"
export GIT_AUTHOR_NAME="My Name"
export GIT_AUTHOR_EMAIL="me@my.domain"

Working on a development branch in Coda

The initial setup is pretty much identical, but we want to be able to write our modifications into the shared repository in /coda instead of to the local disk. A fairly reliable method is to replace .git/objects with a symlink to $repo/objects.

Then we set the local .git/HEAD as a symlink to $repo/refs/heads/user/branch. I actually replaced .git/refs with a symlink to $repo/refs. This way it is possible for others to track our branch and for git-fsck-cache to identify unreferenced objects that are not actually used by anyone. Of course you also need a Coda login account that is a member of the Coda:Dev group in order to commit any changes to the repository.
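The objects and HEAD links could be set up roughly as follows. This is a sketch on scratch directories, not on a real checkout: the paths, user and branch names are placeholders, and removing .git/objects is only safe when the local store holds nothing that is missing from the shared one.

```shell
scratch=$(mktemp -d)
repo="$scratch/coda.git"        # hypothetical stand-in for the shared repository in /coda
user=jaharkes                   # example developer and branch names
branch=devel
mkdir -p "$repo/objects" "$repo/refs/heads/$user"
mkdir -p "$scratch/coda/.git/objects"      # stand-in for the local clone
cd "$scratch/coda"

# Replace the local object store with a link to the shared one.
rm -rf .git/objects
ln -s "$repo/objects" .git/objects

# Point HEAD at our branch in the shared repository.
ln -sf "$repo/refs/heads/$user/$branch" .git/HEAD
```

With both links in place, new objects and the updated branch head end up directly in the shared repository.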

An alternate solution is to push updates back to the shared repository, and optionally prune the local tree. The following can be used to push the locally checked out 'HEAD' branch back to the shared repository as 'foo/bar'. The extra refs/heads/ prefix for the destination branch is only necessary if the 'foo/bar' branch does not yet exist in the shared repository.

git push origin HEAD:refs/heads/foo/bar