Coda File System

Re: frozen venus, and cvs

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 23 Apr 2003 20:53:30 -0400
On Wed, Apr 23, 2003 at 06:06:33PM -0400, Steve Simitzis wrote:
> i'm running venus on two production machines, both linux 2.4.20. today
> i woke up to find venus frozen on both machines.
> 
> by "frozen" i mean:
> 
> (1) cfs listvol showed that the volumes were connected.
> (2) there were no reports of any crashes in any log.
> (3) venus.log activity appeared normal (nothing but the occasional
>     BeginRvmTruncate and EndRvmTruncate message).
> (4) no file access could take place, to the point where a simple ls
> of a volume or any file access would hang indefinitely.

That sounds like venus ran out of worker threads to deal with new
upcalls. Possibly caused by a lock-ordering problem, or by some thread
not releasing a critical resource.

I've not seen anything like that lately. The only unfixed case that I
know of is when a user doesn't have tokens and there is a reintegration
log. An incoming write from another user then blocks to wait for the
reintegration to 'complete', but it never does because the CML owner
doesn't have tokens. The second user kills the write (^C) but the thread
is still waiting, and when the user retries the operation he simply
'locks up' another thread.

> i, at first, suspected the coda server, since both venus clients had
> stopped responding at the same time. but restarting it fixed

Well, from the tcpdumps you sent me related to the 15-30 second stalls
during small file fetches, it looks like your network is dropping bursts
of packets every couple of minutes. Something I really wouldn't expect
on a switched 100base-T-FD local area network.

Perhaps both clients were affected in some way by some period of
sustained packet loss around that time?

> once i restarted each venus client, however, everything was fine, as
> if nothing had ever happened.

Sure sounds like some kind of worker thread starvation. Venus only has
about 20 worker threads. When files or attributes are fetched the
worker typically has a lock on the object to avoid concurrent fetches
for the same object. Other upcalls that try to access the same object
have to wait for the lock on that object to clear.
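
To make the starvation scenario concrete, here is a toy model of it. This
is only a sketch and not venus code: the function name, the pool size of
20 and the assumption that the stalled fetch never completes during the
simulated window are all illustrative.

```python
def simulate(pool_size, upcalls, stalled_object):
    """Toy model of worker starvation (not actual venus logic).

    upcalls is a list of object names in arrival order. The fetch of
    stalled_object is assumed to hang for the whole window, so every
    worker that touches it stays tied up, either doing the fetch or
    waiting for the object lock. Upcalls for other objects are assumed
    to finish instantly and hand their worker back to the pool.

    Returns (served, blocked, queued).
    """
    free = pool_size
    served = blocked = queued = 0
    for obj in upcalls:
        if free == 0:
            # No worker is left to even pick up the upcall.
            queued += 1
        elif obj == stalled_object:
            # Worker is now stuck on the locked object.
            free -= 1
            blocked += 1
        else:
            # Worker completes immediately and becomes free again.
            served += 1
    return served, blocked, queued


# 25 upcalls for a popular file, then 5 for an unrelated one.
burst = ["index.html"] * 25 + ["other.html"] * 5
print(simulate(20, burst, "index.html"))
```

In this model the unrelated file is not served either once all workers
are stuck behind the one locked object, which matches the "everything
hangs" symptom.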

Now if the network drops a packet we have to retransmit the request, and
the retransmission delay can be anything from 300 msec up to 15 seconds
(pathetic worst case behaviour).

So if we have a bunch of apache processes (>20) that receive about
50-100 requests per second we get about 15-30 new upcalls during the
300 msec timeout (and 750-1500 during a 15 second stall). Now typical
web accesses are pretty much focussed on the index.html files so we can
very quickly run out of available worker threads.
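
The arithmetic above can be checked with a few lines. This is a rough
sketch: the function name is made up, the worker count of 20 is the
approximate figure mentioned earlier, and it assumes every upcall that
arrives during the stall ends up waiting.

```python
WORKER_THREADS = 20  # approximate venus worker count, per this message


def upcalls_during_stall(requests_per_sec, stall_msec):
    """New upcalls arriving while one request is stalled.

    Uses integer milliseconds to avoid floating point rounding.
    """
    return requests_per_sec * stall_msec // 1000


for stall_msec in (300, 15000):
    low = upcalls_during_stall(50, stall_msec)
    high = upcalls_during_stall(100, stall_msec)
    exhausted = high >= WORKER_THREADS
    print(f"{stall_msec} msec stall: {low}-{high} upcalls, "
          f"pool of {WORKER_THREADS} exhausted at high rate: {exhausted}")
```

Even the shortest retransmit delay can already use up the whole pool at
the higher request rate if most of those upcalls hit the same object.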

Reducing the number of apache processes, or increasing the number of
available worker threads (-maxworkers 50) could very well help a lot in
keeping the system running.

Jan
Received on 2003-04-23 20:55:11