Coda File System

Re: CODA Scalability

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 21 Aug 2002 13:40:38 -0400
On Thu, Aug 22, 2002 at 01:33:09AM +1000, Nick Andrew wrote:
> On Wed, Aug 21, 2002 at 10:17:09AM +0200, Ivan Popov wrote:
> > > I am in the process of setting up a _home_ fileserver with twin 80-gig
> > > disks (RAID-1 mirrored) and am looking for a distributed filesystem
> >
> > It should be doable, depending on how big your files are - i.e. it is the
> > ===> number of files <===
> > being the limitation, not the total size of the files.
> 
> At present I have about 150,000 files consuming about 40 gigs. The
> average file size will probably increase over time.
> 
> > The 4%-estimation is based on "typical" file size distribution that can
> > vary a lot.
> 
> Are you working on some number of bytes in the RVM per file?

Yup, only the file metadata is stored in RVM: pathnames, version
vectors, creator/owner/last author, references to copy-on-write
versions of the file in backup volumes, etc.
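Schematically, such a per-file record would look something like this
(an illustrative sketch only; the field names are made up, not the
actual structs in the code):

    #include <stdint.h>

    /* Hypothetical sketch -- not Coda's real data structures. */
    struct file_metadata {
        uint32_t volume;            /* volume the file belongs to */
        uint32_t vnode;             /* file identifier within the volume */
        uint32_t version_vector[8]; /* one counter per replica server */
        uint32_t owner, author;     /* creator/owner and last author */
        uint32_t backup_vnode;      /* copy-on-write ref in a backup volume */
        char     name[256];         /* pathname component */
    };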

> One thing which Coda's documentation does not explain clearly is
> its relationship to the underlying filesystem. There's a note in
> the docs which says "do not fsck this filesystem" but it doesn't
> explain why.

Ah, that's from the time that Coda servers used a special 'inodefs' hack
to get direct access to the underlying filesystem. Nowadays we store
files in a tree structure, which adds a bit of overhead, but is far more
generic and fsck can't mess with us (too much) anymore.

> As I was trying to figure it out, I considered some diverse possibilities,
> like (a) Coda (server) implements its own filesystem structure (allocation
> algorithms, etc) completely replacing any other filesystem, to (b) Coda
> creates huge files within the underlying filesystem, one per volume, and
> stores all managed files within each, to (c) Coda stores one managed
> file per physical file in the underlying filesystem. Each of those
> rationales had some problem:

(c).

> (c) ... would be a disaster for performance (I recall somewhere in
> the documentation it said that Coda did not create directories). The
> size of the directories (remember I'm looking at 150k files) would
> kill the system, I'd have to use reiserfs as a base. Surely the Coda
> developers did not do this. Plus it doesn't explain the fsck issue.

Ah, we do have directories, but we store them as files. The only
remaining problem is that there are no double or triple indirect
blocks in the in-file directory representation. As a result, Coda's
directories are limited to about 256KB in size; i.e. it is impossible
even to have a single directory containing all RFCs.
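The arithmetic behind that limit, from memory (the constants may
differ slightly in the actual source):

    /* Directory contents live in fixed-size pages reachable only
     * through direct (single-level) references, so size is bounded. */
    #define DIR_PAGESIZE 2048   /* bytes per directory page */
    #define DIR_MAXPAGES  128   /* no double/triple indirect blocks */
    /* 2048 * 128 = 262144 bytes, the ~256KB ceiling mentioned above */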

In a way it's really funny: the directory lookup code uses extensive
hashing to ensure that a lookup can be done very quickly even when we
have huge directories, but the actual directory data structure can't
scale to such sizes. I'd rather have had a scalable structure with a
dumb but simple linear search, because that would have been easier to
fix and optimize.

> Finally I arrived at rationale (d) which I hope you'll confirm ...

That used to be the case, but I dropped the whole (device,inode) file
access; it was causing too many problems when the underlying
filesystem was trying to do journalling etc. Basically we could use
Coda only on ext2; now we have no problems with ext3, reiserfs, tmpfs,
ramfs, vfat, etc. Probably even XFS will work fine now. The
access-through-a-filehandle stuff only really stabilized recently, in
the linux-2.4.19 pre-patches.

> For example with 32-bit inode numbers (0x12345678) a 3-level
> 2-character directory tree could be used, so the stored file
> would be "12/34/56/78" ... 256 top-level directories, under

That's exactly what we do in the venus cache and the /vicepa partitions
on the server.
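In C it boils down to something like this (a minimal sketch; the real
code differs in naming and error handling):

    #include <stdio.h>
    #include <stdint.h>

    /* Map a 32-bit identifier such as 0x12345678 onto the container
     * path "12/34/56/78": three 2-hex-digit directory levels plus a
     * 2-hex-digit filename, giving a fan-out of 256 per level. */
    void container_path(uint32_t id, char *buf, size_t len)
    {
        snprintf(buf, len, "%02x/%02x/%02x/%02x",
                 (id >> 24) & 0xff, (id >> 16) & 0xff,
                 (id >>  8) & 0xff,  id        & 0xff);
    }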

> > Windows client is considered in alpha stage but I haven't seen complaints
> > on the list, so it may work rather well.
> 
> Ok. Were there no Windows clients a few years ago? I first considered
> Coda in around 1998 as I was looking for a way to share disk and
> improve redundancy in my ISP systems infrastructure. I thought there
> were Windows clients then.

Flaky Windows 95/98 clients. The latest Windows client is using an
external filesystem development kit (from OSR), and seems to be getting
pretty reliable.

> > I think you have to stay below 2G of metadata yet. Not sure.
> > And you have to have more *virtual* memory than your RVM - that is
> > the sum of your physical RAM and your swap has to exceed that size,
> > say for 1.9G RVM you would need say 1G RAM and 1G swap giving 2G
> > virtual memory.
> 
> I guess it's a linux thing but I can't figure out why an mmap'ed
> file needs to be backed by swap capacity. If the host runs short
> on memory, I don't see why it can't just page out a block from
> the mmap'ed area back to disk, after all it can read it back anytime.

Simple: it is not a shared mapping but an anonymous mapping, i.e. the
code 'mallocs' as much space as the RVM-data partition and reads all
of it into memory(/swap). Again, hysterical raisins: Coda started off
running under Mach and interacted directly with the OS's pager/VM
systems. The first ports to 'normal' UNIX systems already had enough
obscure things to port, so they took the easy way in some areas.
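So startup behaves roughly like this sketch (illustrative, not the
actual RVM code):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Why RAM + swap must exceed the RVM size: the whole segment is
     * anonymous memory filled by reading the data file, so the OS can
     * only evict these pages to swap, never re-read them from disk. */
    char *load_rvm_data(const char *path, size_t len)
    {
        char *seg = malloc(len);   /* anonymous, swap-backed memory */
        int fd = open(path, O_RDONLY);
        read(fd, seg, len);        /* pull the whole segment in */
        close(fd);
        return seg;
    }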

Phil Nelson implemented private mappings about 2 years ago. This
greatly improves startup times for large servers, and only dirtied
pages have to go to swap, so swap fills up only slowly. We can't just
page out a dirty block from the mmapped area (i.e. use shared
mappings) because, first of all, there can be 'committed changes'
mixed up with 'uncommitted changes' and we don't know when the OS will
write the data back. It would be possible to munmap/mremap a known
fully committed private page, but there are some efficiency issues
here. Once we munmap the page, the only way to get it back is to read
it from disk. We also don't know whether the system really needs us to
free up some dirty memory, so we might be actively reading back data
on a system that has more than enough memory available; or, on the
other side of the coin, we might be freeing up pages that have already
been swapped out (forcing a swap-read, vma split-up, disk read, and
vma merge). And if we're repeatedly modifying the same page, the
system can at least use page-aging to decide whether it is worth
writing it to swap or keeping it around in memory.
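What the private mapping looks like, again as a rough sketch:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* With MAP_PRIVATE the file itself backs the clean pages, so they
     * are demand-paged at startup, and only pages we dirty (through
     * copy-on-write) ever need swap space. */
    char *map_rvm_data(const char *path, size_t len)
    {
        int fd = open(path, O_RDONLY);
        char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);
        close(fd);                 /* the mapping survives the close */
        return seg;
    }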

> One possible issue with InterMezzo is update delay from server to
> client - it occurs on file close. But Coda is the same, right?

Correct, AFS semantics. It is far more efficient for a userspace file
manager, because we don't get hit by a context switch on every write,
and we don't have to play games actively purging memory pages from the
VM to provide reliable UNIX sharing semantics.

> you can be sure that I will be replicating my email, so I might
> start using "maildir" as a storage arrangement. If you're not
> familiar with maildir it's a non-locking mailbox storage arrangement

Maildir works fine with Coda as long as you replace 'link/unlink'
with 'rename'. We don't allow cross-directory links, our rename is
atomic, and Coda declares a 'conflict' whenever it notices that a
rename is trying to remove a file that the client didn't know about
(an update/delete sharing conflict).
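A Coda-friendly maildir delivery is then roughly (a sketch; real
maildir filenames carry timestamp/pid/hostname components):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Write into tmp/ and rename() into new/ instead of the usual
     * link() + unlink() pair; rename is atomic in Coda and avoids
     * the disallowed cross-directory hard link. */
    int deliver(const char *msg, size_t len, const char *unique)
    {
        char tmppath[512], newpath[512];
        snprintf(tmppath, sizeof tmppath, "Maildir/tmp/%s", unique);
        snprintf(newpath, sizeof newpath, "Maildir/new/%s", unique);

        int fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (fd < 0)
            return -1;
        write(fd, msg, len);
        close(fd);
        return rename(tmppath, newpath);
    }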

> > Union mounts are available for linux, see e.g.
> > http://kernelnewbies.org/status/latest.html

Those are union mounts, not the union/overlay filesystem that people
always talk about.

Jan
Received on 2002-08-21 13:43:50