Coda File System

Re: coda client hangs

From: Ivan Popov <pin_at_medic.chalmers.se>
Date: Tue, 24 May 2005 17:24:14 +0200

Hello Patrick,

> We're going to have to abandon coda and start investigating
> commercial solutions soon if we can't resolve this.

It would be a pity indeed.

> 	The clients have hung again, but because we were in the middle of
> testing some other things, I couldn't take the time to gdb it.  It seems
> somewhat coordinated since 3 out of 4 coda clients were all hung and
> needed to be restarted.

I have not observed that before. Were the hung clients the ones doing updates,
or the ones just accessing Coda readonly?

If some clients aggressively fetch files from Coda servers at the same time
as other clients update files, either of them may see big variations
in server response time and possibly disconnect. It might lead to unexpected
conflicts, but I did not see crashes or hangs because of that.

> 	Another issue: although we have a cron job that gets fresh tokens 3
> times per day, root (and possibly other users) sometimes lose their
> tokens.  I suspect this is because we run several clog commands at the

That should mean that clog fails three times one after another...
Can you see the authentication attempts/fails in the AuthLog?

Which clog do you use? The default one with Coda password authentication?
You may want to try the modular one and see if it helps; if it does not,
push me hard to fix it :)

> same time for different users (user nobody, user root, etc., all try to
> get tokens at the same time).  Is it possible that this would cause a
> problem?

A good question, never tried. My scripts do clog for several accounts
in a "for" loop, so that there are no simultaneous clogs from cron.
They should just work, but you never know.
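
A minimal sketch of such a serial loop, written in Python rather than shell.
The names refresh_tokens and run_clog are hypothetical. It assumes that clog
reads the password from standard input when not run on a terminal and that it
exits with status 0 on success; check both against your clog before relying
on it. Each account is tried up to three times, one account at a time, so no
two clogs ever run simultaneously.

```python
import subprocess
import time


def run_clog(user, password):
    """Run clog once for one user; True if it exited with status 0.

    Assumption: clog takes the password on stdin when stdin is not a tty.
    """
    result = subprocess.run(
        ["clog", user],
        input=password + "\n",
        text=True,
        capture_output=True,
    )
    return result.returncode == 0


def refresh_tokens(accounts, run=run_clog, attempts=3, delay=5.0,
                   sleep=time.sleep):
    """Get tokens for each (user, password) pair strictly one after another.

    Returns the list of users for which every attempt failed.
    """
    failed = []
    for user, password in accounts:
        for attempt in range(attempts):
            if run(user, password):
                break
            if attempt + 1 < attempts:
                sleep(delay)  # give the auth server a moment before retrying
        else:
            # The inner loop never hit "break": all attempts failed.
            failed.append(user)
    return failed
```

Called from a single cron entry, e.g. refresh_tokens([("root", pw1),
("nobody", pw2)]), it replaces the separate per-user cron jobs, and the
returned list tells you whose tokens to worry about.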

> > 	Twice in recent times the coda client has hung.  Restarting venus fixed
> > the problem.  When this happens next time I'll attach gdb to the process
> > to try to see what happened.  In the meantime, all I have is the console
> > and venus log files.

> > 	The very end of venus.log looks like this:
> > 
> > [ W(1783) : 0000 : 21:54:57 ] Cachefile::SetLength 7016538
> > [ D(1804) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> > [ W(821) : 0000 : 21:55:00 ] WAITING(SRVRQ):
> > [ W(823) : 0000 : 21:55:00 ] *****  FATAL SIGNAL (11) *****

I would not just restart venus after a crash, but reinitialize it instead - as
the rvm state is probably corrupted and you can expect another crash or other
weird behaviour.

Some kind of "universal cure" is to watch for venus crashing (recompile to
stop it from turning into a zombie?), then set the reinit flag for venus and
reboot the machine.
(Hm, one more reason not to put clients on the same machines as servers -
though in theory a server reboot might be acceptable, it does no good anyway.)

If your services are redundant, you would be able to survive even with
such drastic measures.
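
A rough sketch of the watching part, in Python. It looks for the
"FATAL SIGNAL" line shown in the quoted venus.log and, if found, creates a
flag file. Both paths are parameters because they differ between
installations. The assumption that creating a file (often named INIT in the
venus cache directory) makes venus reinitialize on its next start should be
verified for your version. The function names are hypothetical, and the
reboot itself is left to the caller.

```python
from pathlib import Path

# Text taken from the crash line in the quoted venus.log.
FATAL_MARKER = "FATAL SIGNAL"


def venus_crashed(log_path, tail_lines=20):
    """True if one of the last tail_lines lines of the log reports a crash."""
    try:
        lines = Path(log_path).read_text(errors="replace").splitlines()
    except FileNotFoundError:
        return False
    return any(FATAL_MARKER in line for line in lines[-tail_lines:])


def request_reinit(flag_path):
    """Create the flag file (assumed to trigger a reinit at next startup)."""
    Path(flag_path).touch()


def check_and_flag(log_path, flag_path):
    """Set the reinit flag if the log ends in a crash; report what was done."""
    if venus_crashed(log_path):
        request_reinit(flag_path)
        return True
    return False
```

Run it from cron every minute or so and reboot when it returns True. The old
log has to be rotated away before the next check, otherwise the same crash
line is found again.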

> > 	Finally, to my questions: 1) is there something I can do to prevent
> > future signal 11's?  2) If such a signal (whatever it means) happens,

Separating the clients that do writes from the readonly clients
will definitely make the "readonly" clients a lot more stable.

> > can coda just restart itself instead of going into a zombie state and
> > causing httpd and proftpd to hang?

Unfortunately that is hardly possible without rebooting the computer,
and then you might be better off reinitializing venus at the same time.
It will reduce performance and cause extra load on the servers,
but it is at least a certain way to recover a readonly client or a client
with local-global conflicts.

Hope it helps, somehow.

On the other hand, if you manage to keep your system running despite being
annoyed by the drawbacks, it will certainly help to get the bugs fixed:
the more they are exposed, the sooner they get fixed.

At the extreme, your company might consider contributing to the development,
say by setting a price tag on a fix for a certain problem... - of course
right now it is only Jan who can - or can't - help. More time would give
more certainty in finding a qualified volunteer and fighting the problem.

Regards,
--
Ivan
Received on 2005-05-24 11:25:46