Coda File System

Re: venus crashed my data are zombies !

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 17 Jan 2007 17:30:01 -0500
On Wed, Jan 17, 2007 at 11:05:18PM +0100, François Cerbelle wrote:
> I wanted to put a lot of important files from my home client on the
> server. I began to do a massive move of the files. As it often happens, my
> client got Replicated/writedisconnected. It does not matter. But when I
> wanted to reintegrate.. Everything became wrong.
> 
> without being logged in (cunlog) :
> cfs  cs : All servers are UP
> cfs lv /coda/sd-858.dedibox.fr/users/* : Permission denied for all volumes
> except for the one with the problem and it tells me :
> replicated/disconnected, 4914 CML pending !
> cfs la /coda/sd-858.dedibox.fr/users/* : some perms for every volume
> except for the one which has problem : "Connection timed out" (wrong, the
> network is good and I am still connected on the other volumes)

I think many of these problems are the result of the client cache being
totally filled up with locally modified files that have not yet been
reintegrated.

As a result we cannot fetch anything from the servers because we would
have to throw something out of the cache that was locally modified.

Btw, we can never get ACLs for directories that have unreintegrated
changes. As a side effect, the GetACL RPC also tries to fetch the latest
version of the attributes (to detect possible conflicts), and doing that
would lose any local changes, so GetACL is blocked.

> I clog as admin (the root account for my coda) and :
> cfs reconnect : nothing happens
> cfs wr /coda/sd.../users/* : list me the volumes without any other
> information
> cfs fr /coda/sd.../users/natalia (the problematic volume) : only one line
> of answer : 4914 CML entries remaining for volume u.natalia
> it does not reintegrate

Ah, the CML is owned by the user that made the changes. The error log
file (/usr/coda/etc/console or /var/log/coda/venus.err) may contain some
indication of that, such as:

    Reintegrate <volume name> pending tokens for uid = <local uid>
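A quick way to look for that message is to grep both candidate log files. This is only a sketch: the helper name find_pending_tokens is made up for illustration, and it assumes the message contains the text "pending tokens for uid" as in the line above.

```shell
# find_pending_tokens <log file>...
# Prints every "pending tokens" line found in the readable log files given.
# Returns 0 if at least one such line was found, 1 otherwise.
find_pending_tokens() {
    found=1
    for log in "$@"; do
        # Only one of the two log locations normally exists; skip the rest.
        [ -r "$log" ] || continue
        if grep 'pending tokens for uid' "$log"; then
            found=0
        fi
    done
    return "$found"
}
```

It could be called as `find_pending_tokens /usr/coda/etc/console /var/log/coda/venus.err`; the uid at the end of any matching line is the local user whose tokens are missing.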

If this is the case, it should be as simple as getting tokens for that
local user. Another reason for not reintegrating may be a conflict; this
should also get logged, probably in venus.log. Conflict repair can be
tricky, especially when there are potentially a lot of conflicts (up to
4914 in your case). In either case you should be able to run

    cfs checkpointml /coda/.../natalia

which will create a local checkpoint (tar archive) that contains all the
files that we failed to reintegrate. This can be found as
    /usr/coda/spool/<local uid>/<volume name>.tar, or
    /var/lib/coda/spool/<local uid>/<volume name>.tar

If you were mostly creating files and everything is in that tar archive,
then it is possible to copy that tarball to a safe place and purge the
modification log, or restart venus with the -init flag to really wipe its
memory, and then untar the archive to get the remaining files copied.

    cfs purgeml /coda/.../natalia  # flush all non-reintegrated updates.
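Before purging, it is worth making sure the tarball really is somewhere safe and readable. The following is a rough sketch of that step only. The function name save_checkpoint and the SPOOL_DIRS variable are hypothetical, and the spool paths are the two mentioned above (paths containing spaces are not handled).

```shell
# Spool directories named above; override SPOOL_DIRS if your install differs.
SPOOL_DIRS=${SPOOL_DIRS:-"/usr/coda/spool /var/lib/coda/spool"}

# save_checkpoint <local uid> <volume name> <destination directory>
# Copies the checkpoint tarball out of the spool and checks that tar can
# read the copy. Prints the path of the copy on success.
save_checkpoint() {
    uid=$1; volume=$2; dest=$3
    for spool in $SPOOL_DIRS; do
        tarball="$spool/$uid/$volume.tar"
        [ -f "$tarball" ] || continue
        cp -p "$tarball" "$dest/" || return 1
        # Listing the archive is a cheap readability check before purging.
        tar -tf "$dest/$volume.tar" >/dev/null || return 1
        echo "$dest/$volume.tar"
        return 0
    done
    echo "no checkpoint found for uid $uid, volume $volume" >&2
    return 1
}
```

Only once this succeeds (and `tar -tf` on the copy lists the files you expect) would you go on to purge the CML and untar the copy back into the volume.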

If you still have the original data around, it may be simplest to just
purge the CML or reinitialize venus, and use rsync -av to resume the
copy. rsync skips files that already match on the destination; by default
it compares size and modification time, and with -c it compares file
checksums instead.
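A possible shape for the resumed copy is sketched below. The function name resume_copy is hypothetical, and the -c flag is an addition to the -av mentioned above: it makes rsync compare existing files by checksum rather than by size and modification time, at the cost of reading every file on both sides.

```shell
# resume_copy <source directory> <destination directory>
# Copies whatever is missing or different from source to destination.
resume_copy() {
    src=$1; dest=$2
    # -a preserves attributes and recurses, -v lists transferred files,
    # -c compares checksums for files that already exist on both sides.
    # The trailing slashes copy the contents of src into dest.
    rsync -avc "$src"/ "$dest"/
}
```

For example, `resume_copy ~/important /coda/.../natalia`, with the source and destination replaced by your own paths.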

Jan
Received on 2007-01-17 17:32:42