Coda File System

problems with random disconnects

From: Steve Simitzis <steve_at_saturn5.com>
Date: Fri, 18 Apr 2003 19:49:46 -0400
Hello. I'm running Venus on two high-traffic web servers. Both
machines had been running fine with Venus for a couple of weeks, with
no problems. For reasons I haven't been able to diagnose, one of the
web servers had to be rebooted. Since then, I've had numerous problems
with Venus on that machine.

It started at 6am, when Venus died randomly:

Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0
Apr 18 06:40:10 sg2 last message repeated 25 times
Apr 18 06:40:10 sg2 kernel: coda_upcall: Venus dead on (op,un) (7.1564091) flags 8
Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0
Apr 18 06:40:10 sg2 kernel: No pseudo device in upcall comms at f89cbdc0

(Sorry, I don't have the Venus log from that time; this is a
production machine, so I had to act hastily to bring it back up.)

Since it died, I haven't been able to keep Venus running for very long
without my volumes going disconnected, a problem that greets visitors
to our website with broken images.

Out of desperation, I tried running venus-setup again to get
everything into a known state. But it seems that every time the
volumes go disconnected, I get something like this in the logs:

[ W(27) : 0000 : 15:46:10 ] Cachefile::SetValidData 53987
[ W(27) : 0000 : 15:46:10 ] Cachefile::SetValidData 61650

[ W(23) : 0000 : 15:46:11 ] *** Long Running (Multi)Fetch: code = -2001, elapsed = 30761.0 ***
[ W(23) : 0000 : 15:46:11 ] Cachefile::SetValidData 52224

[ W(21) : 0000 : 15:46:11 ] WAIT OVER, elapsed = 14896.6

[ W(27) : 0000 : 15:46:11 ] volent::Enter: observe with proc_key = 1463
[ W(27) : 0000 : 15:46:11 ] WAITING(VOL): sg.media, state = Hoarding, [1, 0], counts = [2 0 1 0]
[ W(27) : 0000 : 15:46:11 ] CML= [0, -666], Res = 0
[ W(27) : 0000 : 15:46:11 ] WAITING(VOL): shrd_count = 2, excl_count = 0, excl_pgid = 0

[ W(25) : 0000 : 15:46:11 ] volent::Enter: observe with proc_key = 1463
[ W(25) : 0000 : 15:46:11 ] WAITING(VOL): sg.media, state = Hoarding, [1, 0], counts = [2 0 2 0]
[ W(25) : 0000 : 15:46:11 ] CML= [0, -666], Res = 0
[ W(25) : 0000 : 15:46:11 ] WAITING(VOL): shrd_count = 2, excl_count = 0, excl_pgid = 0
[ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message
[ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message
[ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message
[ X(00) : 0000 : 15:46:16 ] DispatchWorker: out of workers (max 20), queueing message
[ W(22) : 0000 : 15:46:27 ] *** Long Running (Multi)Fetch: code = -2001, elapsed = 30232.0 ***
[ W(22) : 0000 : 15:46:27 ] Cachefile::SetValidData 2270

[ W(27) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15025.2

[ W(25) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15026.8

[ W(24) : 0000 : 15:46:27 ] WAIT OVER, elapsed = 15026.6

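As a quick way to triage a flood of log lines like the above, a small
script can tally how often the worker pool is exhausted versus how many
fetches run long. This is just a minimal sketch I put together against
the log format shown here; the field layout is assumed from these
excerpts, not from any Venus documentation:

```python
import re

# Matches venus.log lines of the form:
# [ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message
LINE_RE = re.compile(r"\[ ([WX])\((\d+)\) : \d+ : (\d\d:\d\d:\d\d) \] (.*)")

def summarize(log_text):
    """Count worker-pool exhaustion events and long-running fetches."""
    out_of_workers = 0
    long_fetches = 0
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        msg = m.group(4)
        if msg.startswith("DispatchWorker: out of workers"):
            out_of_workers += 1
        elif "Long Running (Multi)Fetch" in msg:
            long_fetches += 1
    return {"out_of_workers": out_of_workers, "long_fetches": long_fetches}

sample = """\
[ W(22) : 0000 : 15:46:27 ] *** Long Running (Multi)Fetch: code = -2001, elapsed = 30232.0 ***
[ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message
[ X(00) : 0000 : 15:46:15 ] DispatchWorker: out of workers (max 20), queueing message
"""
print(summarize(sample))
```

If the "out of workers" count climbs in step with the long fetches,
that would support the cache-revalidation-stampede theory below.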

Now, bear in mind that the Coda client and server sit on the same
100 Mbps switched Ethernet. I'm even using cfs strong to force a
strong connection.

What I assume is happening is that Venus is trying to rebuild a
rather large cache of files, but is tripping over itself in the
process.

I'm also unclear on why it's going into "Hoarding" mode. I have never
seen this behavior before, and I'm not sure why it would start
happening all of a sudden. I've never had a problem with Venus caching
files after a restart, so I'm a bit perplexed. If anyone has
suggestions on what else to look for, I'd appreciate it.

By the way, I am running Linux 2.4.20, coda-debug-client 5.3.20,
rvm 1.7, and rpc2 1.15.

Thanks :)

-- 

steve simitzis : /sim' - i - jees/
          pala : saturn5 productions
 www.steve.org : 415.282.9979
  hath the daemon spawn no fire?
Received on 2003-04-18 22:26:34