Coda File System

an experience of recreating a lost replica

From: Ivan Popov <pin_at_math.chalmers.se>
Date: Sun, 2 Mar 2003 10:33:39 +0100 (MET)

Hello,

I want to share a success story and document the steps I took
(in essence, I followed Jan's recommendations and read some man pages).

I am running a two-server setup with all volumes replicated.

One of the servers got a corrupted volume and refused to start,
complaining about an assertion failure while running the salvager on one
of the volumes.

[think of an irreparable fsck problem on a central NFS server? ;-]

Well, my other server stayed online, so the system kept running.

The steps to revive the crashing server:

get the info about the volume:

 [any of the servers]# grep <volname> /vice/db/VRList
you get
 <volname> <groupid> <replica_num> <volid_serv1> <volid_serv2> 0 0 0 0 0 0 <VSG>

now you have to find which <volid> you are interested in:

 [any of the servers]# grep <host_with_dead_server> /vice/db/servers
you get
 <hostname> <serverid>

the <serverid> is a small number that matches the beginning of the <volid>,
e.g. my dead server has number 2 and the matching volid was 200004d
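
If you would rather not match the ids by eye, here is a small sh sketch that
picks the right <volid> out of a VRList line. It is not part of Coda; the
function name pick_volid and the sample line in the usage note are made up for
illustration. It assumes that a volid is a 32-bit hex number whose top byte is
the server id, which is consistent with my example (server 2, volid 200004d)
but is my assumption and not something I checked for every setup.

```sh
# pick_volid SERVERID "VRLIST_LINE"
# Hypothetical helper: print the replica volid that belongs to SERVERID.
# Assumption: the top byte of the 32-bit hex volid is the server id.
pick_volid() {
    want=$1
    # Split the VRList line on whitespace into positional parameters:
    # <volname> <groupid> <replica_num> <volid_serv1> <volid_serv2> ... <VSG>
    set -- $2
    count=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ] && [ "$#" -gt 0 ]; do
        id=$1
        # Interpret the volid as hex and keep only its top byte.
        if [ $(( 0x$id >> 24 )) -eq "$want" ]; then
            echo "$id"
            return 0
        fi
        shift
        i=$((i + 1))
    done
    # No replica of this volume lives on the given server.
    return 1
}
```

For example, pick_volid 2 "$(grep <volname> /vice/db/VRList)" should print the
volid that lives on server number 2, or print nothing and return non-zero if
that server holds no replica of the volume.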

Let us get more information about the volume (and destroy it, as it is
broken)

 [the host with the dead server]# grep rvm_ <where-you-have-it>/server.conf
 rvm_log="<LOG>"
 rvm_data="<DATA>"
 rvm_data_length="<LENGTH>"

 [the host with the dead server]# norton -mapprivate <LOG> <DATA> <LENGTH>
(running norton with -mapprivate is a lot faster than without it)
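
To avoid copying the three rvm_ values by hand, something like the following
sh sketch can build the norton command line from server.conf. The function
names conf_value and norton_cmd are my own, not part of Coda, and the sketch
assumes the key="value" layout shown in the grep output above. It only prints
the command so that you can check it before running it yourself.

```sh
# conf_value KEY FILE
# Hypothetical helper: print the value of KEY from FILE, without the
# optional double quotes around it.
conf_value() {
    sed -n "s/^$1=\"\{0,1\}\([^\"]*\)\"\{0,1\}[[:space:]]*\$/\1/p" "$2"
}

# norton_cmd FILE
# Hypothetical helper: print (but do not run) the norton command line
# built from the rvm_log, rvm_data and rvm_data_length settings in FILE.
norton_cmd() {
    conf=$1
    log=$(conf_value rvm_log "$conf")
    data=$(conf_value rvm_data "$conf")
    len=$(conf_value rvm_data_length "$conf")
    # Refuse to print a half-filled command if any setting is missing.
    if [ -z "$log" ] || [ -z "$data" ] || [ -z "$len" ]; then
        echo "missing rvm_* setting in $conf" >&2
        return 1
    fi
    echo "norton -mapprivate $log $data $len"
}
```

Call it as norton_cmd <where-you-have-it>/server.conf and compare the printed
line with the grep output before starting norton.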

 norton> show volume <volid>
    Id: 0x<volid>       Name: <name>.<digit>     Parent: 0x200004d
    GoupId: 0x<groupid> Partition: <partition>

as I had to remove the broken volume:

 norton> delete volume <volid>
 norton> Ctrl/D

Now start the server, watch "tail -f /vice/srv/SrvLog" and see the volume
being destroyed instead of the server crashing.

When the server is up, the moment of truth has come.

Run (substituting the values from above):

 [the host with the now-alive server, missing one volume]# \
   volutil create_rep <partition> <name>.<digit> 0x<groupid> 0x<volid>

Now go to a well-connected client and do "ls -alR" on the corresponding
mountpoint.

You can watch the resolution happen by looking at the output of

 volutil -h <hostname-of-serverN> info <volid_servN> | grep diskused

for the servers concerned (use output of "grep <volname> /vice/db/VRList"
above)
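
If you want to keep an eye on both replicas at once, a small sh wrapper around
the command above may help. The function name show_diskused is made up; it
does nothing more than run the same volutil and grep pipeline for each
<hostname> <volid> pair you give it and label the result.

```sh
# show_diskused HOST1 VOLID1 [HOST2 VOLID2 ...]
# Hypothetical helper: for each <hostname> <volid> pair, run the
# "volutil ... info | grep diskused" pipeline and prefix its output
# with the pair it belongs to.
show_diskused() {
    while [ "$#" -ge 2 ]; do
        printf '%s %s: ' "$1" "$2"
        # Print a placeholder if volutil gave no diskused line.
        volutil -h "$1" info "$2" | grep diskused || echo "(no diskused line)"
        shift 2
    done
}
```

Run it in a loop, for example

 while sleep 5; do show_diskused <serv1> <volid_serv1> <serv2> <volid_serv2>; done

and stop it with Ctrl/C once the numbers for the two replicas have caught up
with each other.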

Enjoy Coda!
--
Ivan
Received on 2003-03-02 04:36:13