Coda File System

From: Jan Harkes <jaharkes_at_cs.cmu.edu> Date: Sat, 4 Mar 2006 01:42:39 -0500

On Fri, Mar 03, 2006 at 09:18:28AM -0500, Greg Troxel wrote:
>   But that is not really true, the inode we return in the coda_open
>   reply is not a Coda inode, but the device/inode pair of the container
>   file, so it really is a true ino_t.
> 
> I have changed NetBSD-current back to ino_t.  But this means an ABI
> change in the kernel/user protocol.  My inclination is to say "oh
> well, you really have to compile venus on the same system you run it
> on", but then I wonder if we should be bumping the number.  Presumably

I don't see it as an ABI change, since venus correctly uses ino_t. It
was an incorrect kernel-side change from ino_t to uint32_t that broke
things.

> the Linux ABI is different from the NetBSD ABI anyway - I wouldn't

Yes, although the difference is really small. For some reason the Linux
kernel module actually parses the same BSD directory format and
translates it into a Linux equivalent inside the kernel. I think the
only difference is that Linux uses CODA_OPEN_BY_FD, which expects an
open file-descriptor for the container file.

This was needed because not all filesystems actually guarantee unique
inode numbers. ReiserFS was the first, where inode numbers may collide
(I think the combination of the parent directory inode number and the
file inode number is still guaranteed unique). ramfs and tmpfs only
exist at the pagecache level, and some journalling filesystems associate
journalling with the file handle. So even if we manage to open a file
based on the device/inode number we might actually lose updates since
they are never committed to the journal.

But really that should be the only change. Interestingly even on Linux,
venus will still accept the CODA_OPEN upcall, so you should mostly be
able to run a Linux compiled binary on FreeBSD (and possibly NetBSD) if
you have the necessary run-time support for Linux binaries. You are
right that there may be some issues around the mounting of /coda.

> I have been having a lot of crashes on recent NetBSD with an old
> venus, in coda_readlink.  I wonder if this is related to the ino_t
> changes (venus compiled with old ino_t definition, kernel with wrong
> type in coda.h, so they mostly match).  But a system with a recent
> kernel still crashes.

I don't really think it can be related, since readlink uses its own
upcall and doesn't rely on CODA_OPEN to get the link contents. Is this a
kernel crash, or does venus die?

> NetBSD's coda.h  has:
> 
> static inline ino_t coda_f2i(CodaFid *fid)
> {
>         if (!fid) return 0;
>         return (fid->opaque[1] + (fid->opaque[2]<<10) + (fid->opaque[3]<<20));
> }
> 
> This will not necessarily produce unique inode numbers, but one can't
> collapse the opaque fields down to inodes anyway.  I presume that's ok
> and these are just to provide an inode to user space.  But perhaps we
> should do better with the 64-bit space on NetBSD.

Correct, we have a 128-bit file identifier in venus, which we try to map
onto a 32-bit value. Linux still uses 32-bit inode numbers, but the
iget4 operation makes collisions not that critical anymore; we
effectively use the 32-bit space to identify hash buckets. The only
problem is userspace programs that try to keep track of inode numbers
to find hardlinks. I guess as long as we avoid using hardlinks, so that
every object has i_nlink == 1, this should be fine.
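
As a rough illustration of the hash-bucket idea (this is not the Linux
iget4 code; struct fid4, struct node, NBUCKETS, insert and lookup are
hypothetical names, and the fid is assumed to be four 32-bit words), the
inode number only selects a chain, and the full fid decides which object
on that chain is the right one:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 64                 /* arbitrary table size for the sketch */

struct fid4 { uint32_t opaque[4]; };   /* stand-in for the real CodaFid */

struct node {
    struct fid4  fid;               /* full 128-bit identity */
    uint32_t     ino;               /* 32-bit number, may collide */
    struct node *next;              /* chain within one bucket */
};

static struct node *buckets[NBUCKETS];

/* Put a node at the head of the chain selected by its inode number. */
static void insert(struct node *n)
{
    struct node **head = &buckets[n->ino % NBUCKETS];
    n->next = *head;
    *head = n;
}

/* Find a node: ino picks the bucket, the full fid picks the entry. */
static struct node *lookup(uint32_t ino, const struct fid4 *fid)
{
    for (struct node *n = buckets[ino % NBUCKETS]; n; n = n->next)
        if (memcmp(&n->fid, fid, sizeof *fid) == 0)
            return n;
    return NULL;
}
```

Two objects that end up with the same ino simply share a chain, so a
collision costs a slightly longer walk rather than a wrong answer.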

But back to the coda_f2i function. This one should really be kept in sync
between userspace and kernel space because of the way directory contents
are passed down to the kernel. Venus creates BSD-style directory entries
with (name,ino) pairs, and uses its own copy of coda_f2i to map from
fids to inode numbers. Now the kernel should use the same function
otherwise we end up with different inode numbers identifying the same
object.

This is really the only place where venus knows about Coda inodes, and
it would be a lot cleaner if it just sent down (name,fid) tuples for the
directory entries and left all the fid->ino_t mapping up to the kernel.

I was kind of surprised not to see the opaque[0] (realm) value being
used, and there is a pretty big difference: venus actually seems to use a
very different calculation, which seems to make a bit more sense for
trying to avoid collisions,

    static __inline__ ino_t coda_f2i(struct CodaFid *fid)
    {
        if (!fid) return 0;
        return (fid->opaque[3] ^ (fid->opaque[2]<<10) ^
                (fid->opaque[1]<<20) ^ fid->opaque[0]);
    }

The adds and shifts are somewhat intentional. The fid consists of Realm,
Volume, Vnode, Uniquifier. The realm is essentially a pointer, so it is
somewhat arbitrary, but will be within the range of venus's address
space (so a fairly limited 'random' value). The volume identifiers have
2 distinct parts, the top byte identifies the server (or 0x7f for
replicated), the lowest bytes the volume number, which is bumped by one
for every created volume. The vnode numbers are assigned by the server
and count up from 0 from the time the volume was created (first vnode
1), while the uniquifier is typically assigned by the client and is a
counter that is initialized with a random value when venus is
initialized.
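
To make the mismatch concrete, here is a minimal sketch that puts the two
variants side by side. The struct fid4 type, the function names and the
sample fid values are made up for illustration, and the opaque words are
assumed to be 32-bit; only the two formulas are the ones quoted in this
message.

```c
#include <stdint.h>

/* Hypothetical stand-in for CodaFid: realm, volume, vnode, uniquifier. */
struct fid4 { uint32_t opaque[4]; };

/* Formula quoted from NetBSD's coda.h: additive, realm is not used. */
static uint32_t f2i_kernel(const struct fid4 *fid)
{
    if (!fid) return 0;
    return fid->opaque[1] + (fid->opaque[2] << 10) + (fid->opaque[3] << 20);
}

/* Formula quoted from venus: xor-based, realm is mixed in. */
static uint32_t f2i_venus(const struct fid4 *fid)
{
    if (!fid) return 0;
    return fid->opaque[3] ^ (fid->opaque[2] << 10) ^
           (fid->opaque[1] << 20) ^ fid->opaque[0];
}

/* Nonzero when both sides would report the same inode for this fid. */
static int f2i_consistent(const struct fid4 *fid)
{
    return f2i_kernel(fid) == f2i_venus(fid);
}
```

For a made-up fid of { 0x1000, 0x7f000001, 2, 5 } the first formula gives
0x7f500801 and the second gives 0x101805, so a directory entry written by
venus and the inode the kernel builds for the same object would carry
different numbers.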

Now if you have 64-bits available, I would expect it to be slightly
better to xor the randomizing ids (realm,unique) with the counter values
(volume,vnode),

    ino = (ino_t)(opaque[0] ^ opaque[2]) << 32 | (ino_t)(opaque[1] ^ opaque[3]);

You would have to make a corresponding change in userspace though.
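
Spelled out as a function, the 64-bit idea could look roughly like this.
It is only a sketch: struct fid4 and f2i_64 are hypothetical names, the
opaque words are assumed to be 32-bit, and uint64_t stands in for a
64-bit ino_t.

```c
#include <stdint.h>

/* Hypothetical stand-in for CodaFid: realm, volume, vnode, uniquifier. */
struct fid4 { uint32_t opaque[4]; };

/* High word: realm ^ vnode, low word: volume ^ uniquifier, so each
 * half pairs one "random" id with one counter value. */
static uint64_t f2i_64(const struct fid4 *fid)
{
    if (!fid) return 0;
    return ((uint64_t)(fid->opaque[0] ^ fid->opaque[2]) << 32) |
            (uint64_t)(fid->opaque[1] ^ fid->opaque[3]);
}
```

The explicit parentheses around the shifted half are not needed for
precedence, they just make the grouping obvious. The same function would
have to be used by venus when it builds directory entries.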

Jan

Re: recent venus not running on recent netbsd