Coda File System

Re: My experiences with Coda and why I went back to NFS

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Sat, 27 Jan 2001 21:51:12 -0500
On Fri, Jan 26, 2001 at 01:38:09PM -0500, Douglas C. MacKenzie wrote:
> As a thank you to this mailing list and the Coda developers
> I would like to pass on my thoughts and experiences with Coda.

We appreciate that. Any feedback is good feedback. I've commented on
your annoyances with Coda, not to make them seem less important, but
more to show some of my views and attempts to delve to the causes.

> I ran Coda for about 8 months on a small office cluster of 
> 4 workstations and one server.  I really liked the promise
> of disconnected operations, and looked forward to running
> a coda client on my win98 laptop, but it never sounded 
> stable enough to bother with loading it up.

There are several big hurdles we have taken and still have to take to
get Coda nicely integrated on Windows platforms. First of all the
problem of `multitasking'. Contrary to common belief, the Win32 API
implementation for 95/98/ME is not reentrant, and all applications
run in the same virtual machine (VM). Bouncing VFS calls up to
userspace and going back into the kernel leads to deadlocks. Luckily
Michael Callahan found a workaround: DOS boxes happen to run in their
own VM, and he added socket and mmap APIs for DOS applications.

Then there is the difference in how applications look at the filesystem.
Windows filesystems are case insensitive and, even worse, still have the
8.3 filenames deep in their bowels. A filesystem like Coda is hit by
nice things like "Creat file_blah.txt" "Store FILE_BL~.TXT" "Open
FiLe_bLaH.tXt" (all from the same `save file to disk' operation) and is
expected to do the right thing.

> I had three major Coda problems over the 8 months.  The first
> was due to clock skew.  Coda needs to have the clocks set pretty
> closely and we kept getting reconnect conflicts until we 
> started running ntpd.  The daylight savings time roll over was
> really a pain.

The only effect that clock skew has is when applications that use RVM
are restarted and the time has warped back. The client-server
interaction definitely has no time-dependent parts; we wouldn't even
dare consider calling this a _Distributed_ File System if it did. The
only timestamps that are ever transferred are the file mtimes,
and these are never used by either clients or servers. We have the
non-time-based version vectors for conflict detection and resolution.
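
To illustrate why version vectors need no clocks, here is a minimal
sketch of a generic version-vector comparison. It is not Coda's actual
code or data layout; the `compare' function and the replica names
"s1"/"s2" are made up for this example.

```python
def compare(a, b):
    """Compare two version vectors.

    Each vector is a dict mapping a replica id to the number of updates
    that replica has seen (hypothetical representation). No wall-clock
    time is involved anywhere.
    """
    keys = set(a) | set(b)
    # A vector is "ahead" if it has seen more updates on some replica.
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "conflict"      # concurrent updates, needs repair
    if a_ahead:
        return "a dominates"   # b is simply stale
    if b_ahead:
        return "b dominates"   # a is simply stale
    return "equal"


print(compare({"s1": 2, "s2": 1}, {"s1": 1, "s2": 1}))
print(compare({"s1": 2, "s2": 1}, {"s1": 1, "s2": 2}))
```

Only the counters are compared, so two machines with wildly different
clocks still reach the same verdict.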

> The second problem was on-going, the clients would continually
> disconnect and reconnect, even when on a fast network connection.
> This caused no end of random clients running disconnected.

Were you going through masquerading firewalls or is your network very
congested?

I know of some connectivity problems on `normal' networks, but those are
not really `fast reliable' network connections. PPP connections suffer
from the default in-kernel route queues, which hold only 10 packets,
while SFTP sends 8-packet bursts, so there is a high likelihood that the
last packets in the sequence are dropped. ADSL lines are a problem
because RPC2 assumes a symmetric connection and fails to get a proper
RTT estimate, so it times out too quickly.
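
To make the PPP arithmetic concrete, here is a toy tail-drop model. Only
the 10-packet queue and the 8-packet burst come from the text above; the
function name, the occupancy value, and the assumption that the queue
does not drain during the burst are simplifications for illustration.

```python
def tail_drop(capacity, occupied, burst):
    """Return the 1-based numbers of the burst packets that get dropped.

    Assumes a simple tail-drop queue that does not drain while the
    burst arrives (a deliberate simplification).
    """
    free = max(capacity - occupied, 0)
    accepted = min(burst, free)
    # Everything after the accepted packets falls off the end.
    return list(range(accepted + 1, burst + 1))


# Queue of 10 with 4 packets already waiting (hypothetical occupancy):
print(tail_drop(capacity=10, occupied=4, burst=8))
# An empty queue absorbs the whole burst:
print(tail_drop(capacity=10, occupied=0, burst=8))
```

With any other traffic already queued, it is the tail of the burst that
is lost, which matches the behaviour described above.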

> A major problem with Coda is that there is no way for a casual
> user (the developers on our network) to quickly decide if they
> are running connected or disconnected.  There should be some
> obvious alarm given when a client disconnects.  Something like
> the popup dialogs that UPS software provides when the power fails.

But there are several feedback mechanisms. There is `cmon', which shows
the running status of our production servers. I've also got a modified
WindowMaker dockapp, which shows server names and a little green/red
`led' to indicate up/down status, while clicking the servername opens an
ssh-connection to the machine.

Then there is smon which pulls down statistics and records them in a
RRD database (sort of like MRTG). And there is a machine in the lab
running netsaint + rpc2ping which sends me direct email whenever any of
our Coda-servers doesn't respond to the ping.

> Losing your network connection is easily as critical.  Anyway,
> I spent a lot of time helping people get their clients back
> reconnected.  (I saw an e-mail on the list which suggested
> that this problem was fixed in the latest version, but I gave
> up before I got to try it.)

That is why there already is a lot of software available to monitor your
network. Netsaint is one example, but there are also MON, MRTG, etc. You
normally shouldn't need Coda to tell you when you've `pulled the plug'.

To keep an eye on what venus is doing, there is codacon. As long as it
displays "Create" "Store" etc., you are connected. When it doesn't, you
are disconnected. In fact, as far as venus is concerned network
connectivity is almost like Schroedinger's paradigm. You won't know
whether you are connected until you try to make an RPC2 call. And in
most cases venus really doesn't know until we're already in disconnected
mode.

> The final problem was the killer.  One day the coda server 
> core dumped with an assert and wouldn't restart.  I fooled
> around with it for a day and got the server running and
> found that read operations from clients would work OK but
> that the server core dumped again on the first write operation.

Sounds like RVM allocations failed. We had that on both verdi and
viotti. It turns out that the RVM allocator assumes it can defragment as
a last resort, i.e. when allocations start failing. However, by that
time there are hardly any fragments left to merge. Viotti couldn't
allocate 32KB even though 120MB was still `free'. The only solution
I have at the moment is norton-reinit -dump state / reinit RVM /
-restore state.
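
A small sketch of how this can happen: plenty of memory is free in
total, yet no single free block is large enough. This is not the RVM
allocator; the free-list layout and the 16KB fragment size are invented
for the example, and only the 32KB request and the 120MB total come from
the case above.

```python
KB = 1024
MB = 1024 * KB


def free_stats(free_blocks):
    """Return (total free bytes, largest contiguous free block)."""
    total = sum(size for _, size in free_blocks)
    largest = max((size for _, size in free_blocks), default=0)
    return total, largest


def can_allocate(free_blocks, request):
    """A request succeeds only if one contiguous block can hold it."""
    return any(size >= request for _, size in free_blocks)


# Hypothetical heap: 7680 free 16KB fragments, each separated by a
# 16KB allocated region, so no two free fragments are adjacent.
free_blocks = [(i * 32 * KB, 16 * KB) for i in range(7680)]

total, largest = free_stats(free_blocks)
print(total // MB, "MB free, largest block", largest // KB, "KB")
print("32KB allocation possible:", can_allocate(free_blocks, 32 * KB))
```

Merging adjacent fragments cannot help here because none are adjacent,
which is why defragmenting as a last resort comes too late.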

> My basic conclusion is that Coda is not usable by anyone 
> other than very dedicated researchers until you get rid
> of all the asserts in the software and replace them with
> meaningful error messages.  My biggest frustration was
> trying to track down what a particular assert really meant.

An assert implies that something took a wrong turn somewhere and things
have really, really gone wrong. Luckily RVM is transactional, and the last
transaction is aborted. When the server is restarted it doesn't even
remember it took a wrong turn. In some places we might be able to add
enough code to get out safely, but going through every possible path
that might lead to the error case and making sure we handle those return
paths correctly is a lot of work.

> Dumping the asserts, adding an alert mechanism to report
> when clients disconnect, and modernizing the conflict 
> repair mechanism are the three short comings that I would

Instead of dumping the asserts, I'd rather try to improve the code so
that we avoid making those bad turns in the first place.

The alert mechanism was done 3 or 4 years ago in the AdviceMonitor. It
never really got off the ground.

Modernizing the conflict repair. Well, we've been working hard at both
avoiding `false-conflicts' and getting repair to do the right thing
most of the time. Several improvements already went into the repair
tools, venus, and 2.4 linux kernels.

> suggest working on first to make Coda ready for the 
> real world.  As it was, I (Ph.D. in Computer Science
> and used to advanced system administration problems) 
> just was spending way to much time keeping it going
> and didn't see how any of my users would ever be able
> to take over any Coda system admin.  When that happens
> I'm ready to give it another try.
> 
> Thanks for all your help,
> 
>     Doug


I'm sorry to see anybody become disappointed in Coda, and sincerely hope
that some day we are able to achieve those goals and see you back on the
list.

Jan
Received on 2001-01-27 21:51:24