Coda File System

Re: read and write hangs

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 10 Aug 2001 16:20:26 -0400
On Fri, Aug 10, 2001 at 09:20:51PM +0000, irbis_at_orcero.org wrote:
> 
> 
>  Hello, coda hackers!
> 
>  I am working with a replicated coda server, and it works fine. However,
> from time to time reading and writing on a client "hangs": I cannot
> read or write, the applications wait forever, and an ls -l also waits
> forever.
> 
>  There is no collision (I also tested with only one venus client), and
> venus tells me:
> 
> 20:57:22 DispatchWorker: signal received (seq = 134401)
> 20:58:38 DispatchWorker: signal received (seq = 135274)
> 20:59:54 DispatchWorker: signal received (seq = 136919)
> 21:00:06 DispatchWorker: signal received (seq = 136964)
> 21:00:22 DispatchWorker: signal received (seq = 137024)
> 21:01:54 DispatchWorker: signal received (seq = 137126)

Ok, these are a result of pressing ^C to interrupt the hanging process.
However the worker threads in venus are in most cases not aborted,
because they might have locks on volumes or they are in the RPC2 layers
waiting for the server reply. There are about 25 worker threads so it's
relatively easy to run out once something in the system locks up.

And I think I know who/what locked up. There is probably some volume
with a CML that is owned by an unusual user (such as root). Any other
user that tries to access the volume is put on hold until the CML has
been reintegrated, which blocks the worker thread. However the CML owner
user isn't getting tokens anytime soon and before you know it all worker
threads are blocked waiting for reintegration to start.

Probably around this point even simple cfs operations are beginning to
fail (they need a worker thread as well), and soon it even becomes
impossible to pass the token that starts the reintegration up to venus,
because that too requires an available thread.
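
To make the starvation concrete, here is a minimal Python sketch. It is
not venus code, just a toy model under the assumptions stated in this
mail: a fixed pool of worker threads, ordinary requests that block until
reintegration has happened, and a token delivery that itself needs a
free worker. All names (simulate, access_volume, pass_token) are made up
for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def simulate(num_workers=25, blocked_requests=25, wait=0.2):
    """Toy model: can the token still be delivered once requests pile up?"""
    # Set once the (hypothetical) CML has been reintegrated.
    reintegrated = threading.Event()

    def access_volume():
        # An ordinary request parks its worker until reintegration is done.
        reintegrated.wait()
        return "served"

    def pass_token():
        # Handing over the token also needs a worker; it unblocks the rest.
        reintegrated.set()
        return "token delivered"

    pool = ThreadPoolExecutor(max_workers=num_workers)
    try:
        # Requests from other users arrive first and occupy the workers.
        for _ in range(blocked_requests):
            pool.submit(access_volume)
        token = pool.submit(pass_token)
        try:
            return token.result(timeout=wait)
        except TimeoutError:
            return "stuck: no free worker for the token"
    finally:
        # Release the simulated workers so the script itself can exit.
        reintegrated.set()
        pool.shutdown(wait=True)


if __name__ == "__main__":
    print(simulate(blocked_requests=24))  # one worker is still free
    print(simulate(blocked_requests=25))  # every worker is blocked
```

With one worker left over the token gets through and releases everybody;
once the last worker is taken, the only request that could unblock the
others is stuck in the queue behind them.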

You might have to restart venus to be able to see which volume has the
CML and who is the CML owner (and authenticate/reintegrate or purge it).

Jan
Received on 2001-08-10 16:20:35