Coda File System

Re: How To Repopulate A Server

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Mon, 11 Sep 2006 15:09:34 -0400
On Mon, Sep 11, 2006 at 08:36:02AM -0600, Patrick J. Walsh wrote:
> 	Looking at the source code, I *think* we hit the limit for the  
> number of files we can have in a directory.  Luckily, and for some  

Looks like it; the error seems to be EFBIG (file too large), returned when
the server tries to add a new entry to the directory.

> odd reason, our other coda server was still running without  
> problems.  So we turned off the problematic coda server and pruned  
> out the directories.

That's fortunate; servers have an annoying habit of dying at the same time
in these cases. I guess your client was weakly connected and tried to
reintegrate with only this replica. Actually, the server log you attached
seems to indicate that the server starts up fine, but then dies during a
resolution attempt. So the problem may actually be in the server that is
still running and is being propagated to the crashing server during
log-based resolution.

The safest thing right now would be to create a backup tarball of
anything in that volume that you care about. Destroying and re-resolving
the replica on the crashing server will use a different resolution
mechanism (runt resolution), which may work and solve the problem
(successful resolution truncates the resolution logs, so the bad create
won't get sent anymore), but it may also cause the still-running server
to realize something is wrong and die.
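
For example, from a connected client, something like the following (the
tarball path is just a hypothetical choice, and /coda/path/to/volume
stands in for wherever this volume lives in your tree):

    tar czf /root/volume-backup.tar.gz /coda/path/to/volume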

> 	Now the question is, how can we get the problematic coda server  
> started back up?  Assuming there isn't some other problem, is there a  
> way to start up the coda server and have it wipe out its existing  
> knowledge of what files are on what volumes and then rebuild that  
> knowledge from the working server, similar to how we set it up in the  
> first place (with an ls -lR or something)?

If your server really crashes during the salvage phase, we can
temporarily skip salvaging the broken volume and make sure there are no
other volumes with problems.

    cat > /vice/vol/skipsalvage << EOF
1
2000004
EOF
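
(As for the format of the skipsalvage file: the first line gives the
number of entries that follow, and each subsequent line is one volume
replica id, in hex without the 0x prefix, to skip during salvage.)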

Then start the server and see if it comes up. Because the volume will
not be attached, there are going to be errors in the logs about VLDB
lookup failures when clients attach and try to revalidate the missing
replica.
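
For example (assuming the default server log location; adjust the path
if your logs live elsewhere):

    startserver &
    tail -f /vice/srv/SrvLog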

If this worked, we can shut the server back down and use 'norton' to mark
the volume so that it will get deleted during startup, before the server
tries to fsck (salvage) everything. Then the server should be able to
start without the broken replica. Finally, we have to recreate the
underlying volume replica that was marked for destruction and purged
during startup.

You'll need to gather some information, which is probably easier to get
now, before we start blowing away replicas and such. Besides, it is good
information to have so we can double-check that we're actually blowing
away the right volume.

It looks like the broken replica is 2000004; you need to find which
replicated volume it belongs to.

    grep -i 2000004 /vice/db/VRList

The replicated volume number is the one in the second column, starting
with 7f. Also note the position this replica has in the list.

e.g.
    vm:u.jaharkes 7f000604 2 d1000129 c80000df 00000000 00000000 00000000 ...

    replicated volume id = 7f000604
    replica index for d1000129 = 0
    replica index for c80000df = 1

Knowing the index is useful because the replicas are named based on the
replicated volume name + index. So in my example volume d1000129 has the
name vm:u.jaharkes.0 and volume c80000df has the name vm:u.jaharkes.1.

You also need to get the rvm log and data parameters from
/etc/coda/server.conf.

    grep ^rvm /etc/coda/server.conf
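
The output should look something like this (the paths and length value
here are just illustrative assumptions; use whatever your server.conf
actually contains):

    rvm_log=/vice/LOG
    rvm_data=/vice/DATA
    rvm_data_length=0x4000000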

The file is simple enough that bash can source it directly, which we
will do below. So now we'll shut down the server.

    volutil shutdown
... check the log to see if the server is completely shut down.
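
For example, again assuming the default server log location:

    tail /vice/srv/SrvLog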

    . /etc/coda/server.conf
    norton -rvm $rvm_log $rvm_data $rvm_data_length

Then with norton we can double-check the values we have,

    norton> show volume 0x2000004

This should show the name and the replicated volume id (groupid?). If
everything matches up correctly, we can mark the volume for
deletion,

    norton> delete volume 0x2000004
    norton> quit

Now we can remove the skipsalvage file; the volume will be completely
purged, so there is no reason to skip it during salvage,

    rm /vice/vol/skipsalvage

Then we restart the server; it will take a while because it is going to
delete everything related to that volume.

    startserver &

We start it in the background so we can keep an eye on the server log.
Once the server is back we can recreate the volume replica.

    volutil create_rep /vicepa <volume replica name> <replicated volume id> \
	0x2000004

(with my example the <volume replica name> is something like
vm:u.jaharkes.1 and <replicated volume id> is 0x7f000604)
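
Putting that together with the example values, the command on the server
that holds replica index 1 would look like:

    volutil create_rep /vicepa vm:u.jaharkes.1 0x7f000604 0x2000004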

At this point running 'cfs checkservers' and 'ls -lR /coda/path/to/volume'
should trigger runt resolution and rebuild the contents of the newly
created replica.
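
That is, on a client:

    cfs checkservers
    ls -lR /coda/path/to/volume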

Jan
Received on 2006-09-11 15:11:42