Coda File System

Re: My experiences with Coda and why I went back to NFS

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Mon, 29 Jan 2001 04:41:06 -0500
On Sun, Jan 28, 2001 at 11:27:58PM +0100, Petr Tuma wrote:
> Hello,
> 
> seeing some points I'd like to comment on (my problem with upgrade from 
> .11 to .12 miraculously went away after I rebooted all servers several 
> times to get fresh logs for Jan, now it's just tons of repairs, thanx 
> still :).

I don't believe in miracles; there could still be some underlying
problem, although it is possible that it won't show itself anymore now
that the upgrade has gone through.

> > There are several big hurdles we have taken and still have to take to
> > get Coda nicely integrated on Windows platforms.
> 
> Maybe focusing on NTs/2Ks would make more sense ? I mean, I don't see 
> much point in trying to bend over backwards to accommodate an outdated 
> virtual machine concept ?

Michael Callahan and Peter Braam started the Win95 effort, which was
later improved by Marc Schnieder. However, it never was all that stable
and broke easily on the various `subreleases' of Win9x. Shafeeq
Sinnamohideen has been working hard at getting it back into a usable
shape, and I'm expecting he'll put the results up next week.

Peter also managed to get readonly access working on NT using a third
party file system development kit. Phil Nelson, who also did the Solaris
kernel module for Coda, has been delving into Windows NT/2K support
using an updated version of the same toolset, although he might at some
point decide to develop a 'native' driver.

Of course, kernel support is only one part, admittedly the most
difficult one, especially when a new filesystem has to be inserted into
a black box. However, there is also a lot of userspace code, which is
not really Windows/GUI oriented. It is great to be able to pull things
together into a script that creates a replicated volume on several
remote servers for you, but when you need to support a point-and-click
GUI, that approach simply doesn't work.

> > The only effect that clock skew has is when applications that use RVM
> > are restarted and the time has warped back. The client-server
> > interaction has definitely no time dependent parts, we wouldn't even
> > dare consider calling this a _Distributed_ File System if it did. 
> 
> I have different experience here. I had definite problems when one of 
> the replicated servers had a date set one day back compared to all the 
> others. Mostly it was losing authentication tokens some time in the 
> middle of working with the volume (empirically, it was when the one 
> server with wrong time figured the token should expire), which caused 
> "false" conflicts to appear, and made the system basically unusable. 
> After syncing all with NTP the problems went away.

Ok, true. When the token was obtained from the server that lagged by a
day, it had technically already expired. The client successfully puts
the data on the servers with the correct time, but the lagging server
rejects the operation (and the client drops the token). The ACL will
then block the unauthenticated client from triggering server
resolution.

However, when the client regains a valid token, it should be a trivial
resolution. The only consistently problematic false conflicts I get
confronted with are cross-directory renames and inconsistent symlinks.

> I definitely agree the documentation should be updated. At least the 
> "quick install" guide. What I personally missed in my hmm, month or so ? 
> with Coda, was:
> 
>   - Outdated installation instructions. In the end, it turned out to be 
> just "rpm -U coda*.rpm", but I kept worrying if it is really so easy 
> when all the docs says things should be started manually (not through 
> rc.d scripts) etc.

We've been switching back and forth between updating the Coda HOWTO and
the Coda FS User and System Administration manual. At the moment the
`manual' is the more up to date of the two.

>   - Also, I tended to get myself lost in the various IDs used. After 
> several adds and removes of volumes, I kinda figured it out, but a few 
> lines in the docs (e.g. that one is supposed to come up with IDs of 
> groups for VSGDB) would help.

On the one hand, that would need to be documented; on the other hand,
the code is still changing rapidly and I'm hoping to abstract the VSG
concept away. The clients don't use the real VSG identifiers anymore,
and I'm working on reintroducing the VSG on a more conceptual level to
group the multirpc connections.

For the server side I'm hoping to be able to move to specifying the VSG
only during volume creation, at which point it should become possible
to say, for instance, "createvol_rep rep_vol server1,server2,server3
/vicepa". Logically the VSG is still there, but once we get rid of the
VSG numbering we can start doing things like growing, shrinking, or
migrating the replication group for specific volumes.
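To make the difference concrete, here is a rough sketch of current
usage next to the proposed one; the VSG number E0000104 and the server
names are made-up examples, and the second command is the hypothetical
future syntax described above, not something that works today:

```shell
# Current (5.3.x) usage: the replication group must already exist as a
# line in /vice/db/VSGDB, roughly of the form
#   E0000104 server1 server2 server3
# and the volume is then created against that VSG number:
createvol_rep rep_vol E0000104 /vicepa

# Hypothetical future syntax: name the servers directly at creation
# time, with no VSG number ever assigned by the administrator:
createvol_rep rep_vol server1,server2,server3 /vicepa
```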

>   - The documentation seems to give the system a more unstable feel than 
> it probably deserves. One example I recall is the private mmap support, 
> which I did not dare try due to warnings in the config file comments, 
> but switched to after I saw a message here that suggested it should work 
> OK, and it was a great improvement (one of my servers is low on memory 
> and this helped a lot).

Private mmap went in a while ago and has definitely proven to be very
stable; I'm considering switching the clients to using it by default.
However, it does have limitations, such as not being able to mmap a raw
partition on Linux. It also hides the memory cost of RVM: the longer
the server runs, the more pages are dirtied and stay pinned in memory,
so your server will eventually run itself out of memory. There is
nothing that `cleans' these dirty pages after the operations have been
applied to the RVM data segment by log truncation. A thread that
compares dirtied pages to the on-disk data could provide reduced memory
usage in the long term, in exchange for more CPU/IO usage.

> I also encountered the problem with local vs. fully qualified hostname. 
> What seems to work is keep /vice/hostname short and names in server list 
> and everywhere else long. This is something one has to do manually after 
> install.

Maybe /vice/hostname could be long if /vice/db/scm is long as well.
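As a quick sanity check, something along these lines could compare the
two files; /vice/hostname and /vice/db/scm are the standard server
paths you mention, but the check_scm function itself is just a made-up
sketch:

```shell
# check_scm: report whether /vice/hostname matches the SCM name in
# /vice/db/scm. Takes an optional alternate /vice directory, so it can
# be pointed at a test tree instead of a live server.
check_scm() {
    vicedir=${1:-/vice}
    host=$(cat "$vicedir/hostname") || return 1
    scm=$(cat "$vicedir/db/scm") || return 1
    if [ "$host" = "$scm" ]; then
        echo "ok: $host"
    else
        echo "mismatch: hostname=$host scm=$scm"
        return 1
    fi
}
```

Run as `check_scm` on a server (or `check_scm /some/test/vice` against
a copy); a short name in one file and a fully qualified one in the
other shows up as a mismatch.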

> There are bugs in scripts. At my servers, purgevol_rep (or whatever is 
> the script for deleting volumes called) fails, obviously looking for 
> some list of volumes that does not exist in the current version.

A fixed and significantly improved purgevol_rep went into 5.3.12.

> The questions I'm running into now are more of the sort, is Coda really 
> the system I was looking for in my particular situation ? I have three 
> machines I work on, geographically distant and not always connected very 
> well. I used to have a script that used rsync to keep the home 
> directories synchronized, but this was not really reliable (rsync has 
> problems with links, changes that involve modifying directory structure, 
> and bidirectional updates, among others).

Coda's design assumes that all servers in a VSG are located close to
each other. Directory resolution is a 5-phase process in which, during
every phase, all servers need to ship data to each other; this is much
more communication overhead than a remote client that reintegrates with
the servers. So a better layout would be to have 2 or 3 servers in one
location and to rely on large client caches in the other two locations
to hide the network latency.

Some preliminary ideas for client-cache sharing and read-only 2nd level
replicas are being played around with to help reduce perceived network
latency in the remote locations.

>   - I have to use write disconnected mode, otherwise the delays in 
> propagating the updates to all the servers are too big. In that mode, 
> however, conflicts appear much more often (and even in situations where 
> I don't think they should).

Possibly this is caused by weak reintegration, i.e. reintegration to a
single server followed by server-server resolution. The replicas are
not in sync until the resolution has finished, so there could very well
be a relatively large window of vulnerability for bugs to crop up.

>   - Some of the files I have are logs, of several megs in size, that 
> only get appended to. In the write disconnected mode, it seems that the 
> updates are propagated in form of entire files, not just change logs, 
> which means huge amounts of data get passed over the net for even small 
> changes. (Just a hunch looking at the network traffic.)

Correct, that is the whole-file transfer issue: Coda does not examine
the content of your files, and every store of a file is considered to
be a whole new file. A bulk transport for RPC2 based on the rsync
algorithm could help here.

>   - I use up twice as much disk space, because I have both the client 
> and the server running on a machine where I'd normally have just one 
> copy of the data.
> 
> Now, did I pick Coda right, or am I trying to use it in a situation 
> where something else (what ?) would be better ?

For many replication problems, cron + rsync + ssh is a good solution.
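For example, a minimal one-way setup could be a crontab entry along
these lines; the host and path names are made up, and note that this
only pushes in one direction, so it doesn't by itself address the
bidirectional-update problem you mentioned:

```shell
# Hypothetical crontab entry: every hour, push the local home directory
# to a remote mirror over ssh. rsync's delta algorithm transfers only
# the changed parts of files (good for append-only logs), and --delete
# removes files from the mirror that were deleted locally.
0 * * * *  rsync -az --delete -e ssh /home/petr/ backup.example.com:/home/petr/
```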

Jan
Received on 2001-01-29 04:41:14