Coda File System

Re: CODA Scalability

From: Nick Andrew <>
Date: Thu, 22 Aug 2002 10:43:15 +1000
On Wed, Aug 21, 2002 at 01:40:38PM -0400, Jan Harkes wrote:
> Yup, only the file metadata is stored in RVM: pathnames, version
> vectors, creator/owner/last author, references to copy on write
> versions of the file in backup volumes etc.

Do you know an average size per file?

> Ah, that's from the time that Coda servers used a special 'inodefs' hack
> to get direct access to the underlying filesystem. Nowadays we store
> files in a tree structure, which adds a bit of overhead, but is far more
> generic and fsck can't mess with us (too much) anymore.


> Ah, we do have directories, but store them as files.

I was actually referring to no directories being used on the
underlying filesystem, which you have said is no longer the case.
The performance hit I envisaged was having 10,000 files in a
single ext2/ext3 directory.
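
For illustration, the usual way to avoid one huge directory is to fan files out over nested subdirectories. The sketch below is not Coda's actual on-disk layout; the root path, fan-out and hex naming are assumptions made up for this example.

```python
def container_path(root: str, file_number: int, fanout: int = 256, levels: int = 2) -> str:
    """Map a file number to a nested path.

    Hypothetical layout: each directory level is picked by one base-`fanout`
    digit of the number, so a level never has more than `fanout` subdirectories.
    """
    parts = []
    n = file_number
    for _ in range(levels):
        parts.append(format(n % fanout, "02x"))  # low-order digit first
        n //= fanout
    leaf = format(file_number, "x")              # full number keeps leaf names unique
    return "/".join([root, *reversed(parts), leaf])
```

With these assumed defaults, 10,000 consecutively numbered files each land in a different leaf directory instead of sharing one.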

> The only existing
> problem is that there are no double or triple indirect blocks in the
> in-file directory representation. As a result Coda's directories are
> limited to about 256KB in size, i.e. it is not even possible to have a
> single directory with all RFCs.

Ah, that limitation is probably worth adding to the docs with an
update on the filesystem access mode.

> It's in a way really funny, because the directory lookup code uses
> extensive hashes to ensure that a directory lookup can be done very
> quickly even when we have huge directories, but the actual directory
> data structure can't scale to such sizes. I'd rather have had a scalable
> structure with a dumb, but simple linear search because that would have
> been easier to fix and optimize.

Perhaps you could use more of the facilities of the underlying
filesystem, e.g. if it were ReiserFS just let the filesystem manage
the directory work.

> > Finally I arrived at rationale (d) which I hope you'll confirm ...
> Used to be the case, but I dropped the whole (device,inode) file access,
> it was causing too many problems when the underlying filesystem was
> trying to do journalling etc.

Ok, I guessed correctly what you used to have based on old docs :-)

> Simple, it is not a shared mapping, but an anonymous mapping. I.e. the
> code 'mallocs' as much space as the RVM-data partition and reads all
> of it into memory(/swap).

mmap's not my area, my brain just groups it with the black arts of
internationalization, the X protocol, glibc and satanism, so I'll
just take your word for it.
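
For readers in the same boat, a rough sketch of the difference Jan describes: the data is copied into anonymous memory rather than mapped from the file, so changes made in memory do not reach the file by themselves. This is only an illustration in Python, not the RVM code; the function names and the file name are made up for the example.

```python
import mmap
import os
import tempfile


def load_private_copy(path: str) -> mmap.mmap:
    """Copy a whole file into an anonymous mapping (no file backing)."""
    with open(path, "rb") as f:
        data = f.read()
    if not data:
        raise ValueError("cannot create an empty mapping")
    buf = mmap.mmap(-1, len(data))  # fileno -1 means anonymous memory
    buf.write(data)                 # one-time copy of the file contents
    buf.seek(0)
    return buf


def demo() -> tuple[bytes, bytes]:
    """Modify the in-memory copy and show that the file is left untouched."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "rvm-data.img")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"metadata")
        buf = load_private_copy(path)
        buf[0:4] = b"META"                      # changes memory only
        in_memory = bytes(buf[:])
        buf.close()
        with open(path, "rb") as f:
            on_disk = f.read()
    return in_memory, on_disk
```

A shared file-backed mapping would instead let the kernel write modified pages back to the file, which is exactly what this approach avoids.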

> Maildir works fine with Coda as long as you replace the 'link/unlink'
> with 'rename', we don't allow cross-directory links

Ah, I expected no cross-volume links, didn't realise it was more

> and our rename is
> atomic and Coda declares a 'conflict' whenever it notices that rename is
> trying to remove a file that the client didn't know about (update/delete
> sharing conflict).

You mean there will be a conflict when the server is too fast to
create and rename the message? Maildir writes to "new" subdir and when
it is done it moves (somehow) the file to "cur".
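
To make the 'rename' variant concrete, here is a minimal sketch of a delivery step that uses rename in place of link/unlink, assuming the usual Maildir convention of writing into "tmp" and publishing into "new". The helper names and the unique-name scheme are assumptions for this example, not taken from any particular MTA.

```python
import os
import socket
import tempfile
import time


def make_maildir(root: str) -> None:
    """Create the three standard Maildir subdirectories."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)


def deliver(maildir: str, message: bytes) -> str:
    """Write a message under tmp/, then publish it into new/ with one rename."""
    # Illustrative unique name: timestamp, pid, hostname.
    unique = f"{time.time_ns()}.{os.getpid()}.{socket.gethostname()}"
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "xb") as f:   # "x": fail if the name already exists
        f.write(message)
        f.flush()
        os.fsync(f.fileno())          # data is on disk before it becomes visible
    os.rename(tmp_path, new_path)     # replaces link(tmp, new) + unlink(tmp)
    return new_path
```

One caveat: unlike link, rename on POSIX silently replaces an existing target, so this variant relies entirely on the file name being unique.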

> > > Union mounts are available for linux, see e.g.
> > >
> Those are union mounts, not the union/overlay filesystem that people
> always talk about.

I guess it must be hard to do, if people always talk about it but
never do it. I imagine that the union filesystem code has to map the
inode numbers from its two (or more) underlying filesystems, into one
inode space for return to the kernel and user applications. Otherwise,
tools which traverse a filesystem might have a bad time as every time
they stat a file they find it has different major/minor, or maybe the
same inode number as another file, yet they are different files.
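
A toy sketch of the mapping I have in mind, assuming the union layer simply hands out its own inode numbers on first sight of each (layer, inode) pair. The class and method names are hypothetical and this is not how any existing union filesystem is known to do it.

```python
from itertools import count


class InodeMapper:
    """Map (layer, underlying inode) pairs into one unified inode space."""

    def __init__(self) -> None:
        self._next = count(1)   # next unused unified inode number
        self._table = {}        # (layer, inode) -> unified inode

    def unified(self, layer: int, inode: int) -> int:
        """Return a stable unified inode for a file on the given layer."""
        key = (layer, inode)
        if key not in self._table:
            self._table[key] = next(self._next)
        return self._table[key]
```

Two files with the same inode number on different layers get different unified numbers, and repeated lookups of the same file return the same number. The table lives only in memory here, so the numbers would not survive a remount, which is presumably part of what makes the real thing hard.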

Received on 2002-08-21 20:44:43