Coda File System

Re: replicated servers freezing under load

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 2 Jun 2004 15:34:09 -0400
On Wed, Jun 02, 2004 at 06:43:42PM +0200, Jim Page - emailsystems.com wrote:
> > But I would be interested in your current log, and if it is reproducible
> > a log at level 10 (volutil setdebug 10), and level 30. Just bump that
> > loglevel somewhere after the onslaught has started and hopefully before
> > the lockup.
> 
> Unfortunately loglevel 10 changes the dynamics and the error doesn't happen,
> as you thought. I have attached some sample SrvLogs anyway.
> 
> I will send more stuff when I can generate something useful - I'm under a
> bit of pressure at the mo but I'll get to it.

I noticed a couple of things. First of all, all accesses seem to go only
to the root volume (coda.root); there are never any files on the
mta.pending volume. Are you sure it's mounted in the right place?
(not that it should have an effect on this bad behaviour)

Second, since your application is unlikely to benefit from lookaside, it
isn't necessary to do the SHA checksum calculations. Disabling that
should speed up your servers a lot. In the file /etc/coda/server.conf
set the option allow_sha=0.

For the rest, there are a couple of places where it dumps an error
because the volume is already locked, but none of those seem to be
fatal. There must be some operation that tries to lock objects in the
wrong order which triggers a deadlock.
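
To make the suspected failure mode concrete, here is a minimal C++ sketch
of lock ordering. It is not taken from the Coda sources: Object, transfer
and run_opposing_transfers are hypothetical names, and it assumes every
lockable object carries a unique, stable id. If two threads each took
"their" first argument's lock first, one could hold a while waiting for b
and the other hold b while waiting for a. Always locking the lower id
first removes that cycle.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a lockable server object; not a Coda type.
struct Object {
    explicit Object(std::uint32_t i) : id(i) {}
    std::uint32_t id;   // assumed unique and stable for the object's lifetime
    std::mutex lock;
    int value = 0;
};

// Take both locks in ascending id order, regardless of argument order, so
// two threads working on the same pair can never each hold one lock while
// waiting for the other.
void transfer(Object &from, Object &to, int amount) {
    if (&from == &to) return;  // same object: nothing to move
    Object &first  = (from.id < to.id) ? from : to;
    Object &second = (from.id < to.id) ? to : from;
    std::lock_guard<std::mutex> g1(first.lock);
    std::lock_guard<std::mutex> g2(second.lock);
    from.value -= amount;
    to.value += amount;
}

// Two threads name the same objects in opposite order. With ordered
// locking both loops finish; returns true if the updates balanced out.
bool run_opposing_transfers(int iterations) {
    Object a(1), b(2);
    std::thread t1([&] {
        for (int i = 0; i < iterations; ++i) transfer(a, b, 1);
    });
    std::thread t2([&] {
        for (int i = 0; i < iterations; ++i) transfer(b, a, 1);
    });
    t1.join();
    t2.join();
    return a.value == 0 && b.value == 0;
}
```

Calling run_opposing_transfers(10000) should return true. Replacing the
ordered locking with "lock from, then lock to" is the variant that can
hang under load, and like the problem described here it may disappear
once timing changes.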

I wonder if I can reproduce this. I know that Steve Simitz is running
mostly read-only webservers on top of his Coda clients. He had to give
his clients more worker threads than the default to avoid client-side
problems. How many concurrent delivery processes do you think might be
running at any given time? I know that venus only has about 20 worker
threads. Any additional requests should just get queued and handled a
bit later, but this could also be a client-side only lockup. One way to
test that would be to have a third client that doesn't really do
anything and see if it can still access the servers when the other
clients are locked up. I noticed that volutil still works, which would
indicate that the server is at least still processing new requests.

> I have also spotted what I think is a mistake in the code that is preventing
> things like 'cfs fr <dir>' from working properly. Source release coda 6.0.6:
> file vproc_pioctl.cc, line 1002: looks like it's missing a 'break' to me.

Good find, you are absolutely right. It looks like a couple of the
writeback ioctls that were added around the same time are missing a
break statement as well.
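
For anyone following along, this is the general shape of a missing-break
bug, as a small C++ sketch. The Cmd codes and both dispatch functions are
made up for illustration and are not the actual code in vproc_pioctl.cc.

```cpp
#include <string>

// Hypothetical command codes; not the real Coda pioctl opcodes.
enum class Cmd { Flush, Reset, Stat };

// Buggy: the Flush case has no 'break', so control falls through into the
// Reset case and its result overwrites the one Flush just set.
std::string dispatch_buggy(Cmd c) {
    std::string result = "unhandled";
    switch (c) {
    case Cmd::Flush:
        result = "flushed";
        // missing break: execution continues into the next case
    case Cmd::Reset:
        result = "reset";
        break;
    case Cmd::Stat:
        result = "stat";
        break;
    }
    return result;
}

// Fixed: every case ends with 'break', so each command is handled once.
std::string dispatch_fixed(Cmd c) {
    std::string result = "unhandled";
    switch (c) {
    case Cmd::Flush:
        result = "flushed";
        break;
    case Cmd::Reset:
        result = "reset";
        break;
    case Cmd::Stat:
        result = "stat";
        break;
    }
    return result;
}
```

Here dispatch_buggy(Cmd::Flush) returns "reset" while
dispatch_fixed(Cmd::Flush) returns "flushed". Recent GCC and Clang
versions can flag the buggy version with -Wimplicit-fallthrough.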

Jan
Received on 2004-06-02 15:35:37