Coda File System

Re: coda server crashed and won't recover

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 16 Aug 2000 12:00:12 -0400

On Tue, Aug 15, 2000 at 10:37:35PM -0400, Stephan Koledin wrote:
> While I had a read-only volume restored and mounted from a previous dump and
> directly after dumping some backup volumes to disk, my codasrv crashed and
> won't come back up. It seems to be looking for the restored volume, but
> can't find it, I assume because it was only temporarily restored from a
> file. The previously restored volume was given id 1000004, as you can see in
> the log below, where the server tries to startup and recover. 
> 
> Is there anyway to recover from this, or will I just need to rebuild the
> server again? 

Ok, first we'll see whether we can get the server running, then we can
remove the offending volumes.

> 16:22:46 Entering DCC(0x1000004)
> Magic wrong in Page i           
> 16:22:46 DCC: Bad Dir(0x1000004.6d.68e9) in rvm...Aborting
> 16:22:46 JE: directory vnode 0x1000004.6d.68e9: invalid entry ; 
> 16:22:46 JE: child vnode not allocated or uniqfiers dont match; cannot
> happen

Ok, this is the bad volume. Create a file called /vice/vol/skipsalvage
with the following content:

-8<--/vice/vol/skipsalvage------------------------------------------------
1
1000004
-8<-----------------------------------------------------------------------

This tells the server not to salvage or attach the volume during
startup. Then start the server and see whether it crashes on anything
else.
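
As a rough sketch of the step above, the following Python pulls volume
ids out of "DCC: Bad Dir(...)" lines like the ones in the quoted log and
renders skipsalvage content in the same shape as the example (a count
line followed by one id per line). The function names are made up for
illustration, and the file layout is inferred from the one-volume
example in this mail, so treat it as an assumption rather than a
specification.

```python
import re

# Matches the volume part of log lines such as
#   DCC: Bad Dir(0x1000004.6d.68e9) in rvm...Aborting
BAD_DIR = re.compile(r"DCC: Bad Dir\(0x([0-9a-fA-F]+)\.")


def bad_volume_ids(log_lines):
    """Return hex volume ids (no 0x prefix) in first-seen order, without duplicates."""
    seen = []
    for line in log_lines:
        match = BAD_DIR.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def render_skipsalvage(volume_ids):
    """Assumed layout: number of volumes on the first line, then one id per line."""
    lines = [str(len(volume_ids))] + list(volume_ids)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        "16:22:46 Entering DCC(0x1000004)",
        "16:22:46 DCC: Bad Dir(0x1000004.6d.68e9) in rvm...Aborting",
    ]
    print(render_skipsalvage(bad_volume_ids(sample)), end="")
```

The output would still have to be saved as /vice/vol/skipsalvage by hand
before starting the server.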

Once the server is up and running, we know which volumes need to be
purged, but that can't be done while the server is running. So we have
to shut the server down and use another tool, norton, to purge the
volume from RVM.

$ volutil shutdown
...wait for the server to shut down...
$ cat /vice/srv.conf
-rvm <logdevice> <datadevice> <datasize>
$ norton <logdevice> <datadevice> <datasize>
Loading rvm...
norton> delete volume 0x1000004
norton> quit
$ rm /vice/vol/skipsalvage
$
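
If more than one volume has to go, the norton commands can be generated
rather than typed out. A small sketch, using only the "delete volume"
and "quit" commands shown above; the helper name is hypothetical, and
this mail does not say whether norton accepts commands on standard
input, so the safe assumption is to paste the lines at the norton>
prompt.

```python
def norton_commands(volume_ids):
    """Build one 'delete volume 0x<id>' line per hex volume id, then 'quit'."""
    commands = [f"delete volume 0x{vid}" for vid in volume_ids]
    commands.append("quit")
    return commands


if __name__ == "__main__":
    # Volume ids are hex strings without the 0x prefix, as in skipsalvage.
    for command in norton_commands(["1000004"]):
        print(command)
```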

At this point we have a server that, during its startup, nicely garbage
collects all memory and data that used to be associated with the purged
volumes.

> **** Here's some of the final SrvLog data from before the crash. Seems like
> my SrvLog has been much much busier than normal, but perhaps I just turned
> on more detailed logging somehow? You can spot the crash at the end of this
> segment pretty easily, but there don't seem to really be any clues as to
> what caused it since all the previous actions finished properly.
> 

> 16:12:13 S_VolNewDump:  volume dump succeeded
> 16:12:22 ****** FILE SERVER INTERRUPTED BY SIGNAL 11 ******

9 seconds is a long time, so this is possibly unrelated to the volume
dump that just succeeded. Maybe your restored volume has the same name
as the next backup volume, and the creation fails. But that is just a
guess.

> 16:12:22 Becoming a zombie now ........
> 16:12:22 You may use gdb to attach to 1853
...
> **** couldn't seem to attach a debugger to it either, looked like it just
> crashed and didn't even zombie like it said it would...

Ah, after several requests we've modified the server so that it no
longer becomes a zombie when it crashes. Create a file /vice/srv/ZOMBIFY
(which makes startserver pass a -zombify flag to codasrv) to get the old
behaviour back.

Jan
Received on 2000-08-16 12:02:25