Coda File System

Re: problems with random disconnects

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 18 Apr 2003 22:54:42 -0400
On Fri, Apr 18, 2003 at 07:49:46PM -0400, Steve Simitzis wrote:
> it started at 6am, when venus died randomly:
> 
> Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0
> Apr 18 06:40:10 sg2 last message repeated 25 times
> Apr 18 06:40:10 sg2 kernel: coda_upcall: Venus dead on (op,un) (7.1564091) flags 8
> Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0
> Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0

These are all messages from the kernel module, printed when venus no longer
accepts upcalls (i.e. venus had died around this time).

> (sorry, i don't have the venus log from that time, as this is a
> production machine, causing me to act hastily to bring it back up.)

I know the feeling; I hardly ever use our production servers or
webserver for debugging purposes.

> out of desperation, i tried running venus-setup again to get
> everything into a known state. but it seems that everytime the volumes
> are disconnected, i get something like this in the logs:

An equivalent way to totally reset a client's state is to recursively
remove all the container files in /usr/coda/venus.cache (or
/var/lib/coda/cache), and then create an empty 'INIT' file, which tells
venus to reinitialize RVM on the next startup. That is pretty much
identical to installing a client from scratch, and because we remove
the container files it does a bit more than just rerunning venus-setup.

  rm -rf /usr/coda/venus.cache/*
  touch /usr/coda/venus.cache/INIT
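
If your client keeps its cache under /var/lib/coda instead, the same
two steps apply; the exact directory depends on how venus was set up
(it should match the cache directory configured for your venus), so
treat this as a sketch and adjust the path:

  rm -rf /var/lib/coda/cache/*
  touch /var/lib/coda/cache/INIT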

> [ W(23) : 0000 : 15:46:11 ] *** Long Running (Multi)Fetch: code = -2001, elapsed  = 30761.0 ***
> [ W(23) : 0000 : 15:46:11 ] Cachefile::SetValidData 52224

Ouch, it took 31 seconds to fetch a 52KB file.

> [ W(21) : 0000 : 15:46:11 ] WAIT OVER, elapsed = 14896.6

And another thread was blocked for about 15 seconds because of this.

> [ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message

This is pretty serious, all worker threads are blocked and we have to
queue a couple of requests coming from the kernel.
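
As a side note, if this happens regularly it may help to start venus
with a larger worker pool. I believe venus takes a -maxworkers option
for this, but double-check the flag name against the manpage for your
version before relying on it:

  venus -maxworkers 40

That only treats the symptom though; the real problem is the slow
fetches that keep all the workers blocked.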

> [ W(22) : 0000 : 15:46:27 ] *** Long Running (Multi)Fetch: code = -2001, elapsed  = 30232.0 ***
> [ W(22) : 0000 : 15:46:27 ] Cachefile::SetValidData 2270

The worker threads were probably blocked because of this fetch; we took
30 seconds to fetch a lousy 2KB.

> [ W(27) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15025.2
> [ W(25) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15026.8
> [ W(24) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15026.6

And here 3 of the blocked threads start running again.

> what i've assumed is happening is that venus is trying to rebuild a
> rather large cache of files, but is tripping over itself in the
> process.

Possibly, but because the waits on a fetch are about 30 seconds and the
typical rpc2 timeout is about 15 seconds, this looks like some strange
double timeout before the operation succeeds.

I'm assuming that you have multiple servers and that one of them has
trouble with the sftp transfer connecting back to the client. I'm not
entirely sure how we end up with a 30-second stall, but once the sftp
connection setup fails (15 seconds) an error is returned and the client
connects to another server, which actually does the data transfer.
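
One way to check that theory is to run cfs checkservers on the client,
which reports which servers the client currently considers unreachable:

  cfs checkservers

If one server consistently shows up as down while the others are fine,
that would match a 15 second connection setup failure followed by a
retry against another replica. Though if the problem is only with the
sftp data connection coming back to the client, the servers may well
all show up as alive.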

Gettimeofday is used a lot in rpc2; it is used to timestamp outgoing
packets and to check timeouts for retransmissions. Hmm, perhaps you
could grab a tcpdump of the traffic. I would prefer a raw dump because
I can then decode the rpc2 and sftp headers to see what is going over
the wire.

    tcpdump -s 512 -w tcpdump.out port codasrv or port codasrv-se
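
To sanity check the capture before sending it along, you can read it
back with tcpdump itself (-n just skips the DNS lookups):

    tcpdump -n -r tcpdump.out | head

That should at least show whether there are long gaps between the rpc2
and sftp packets.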

Don't worry about 'Hoarding'; it is the internal state name for
fully-connected mode, as we 'hoard' data before a disconnection.

Jan
Received on 2003-04-18 22:56:24