Coda File System

Re: reintegration problems

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Mon, 19 Apr 2004 13:58:31 -0400
On Mon, Apr 19, 2004 at 04:08:52PM +0200, Johannes Martin wrote:
> I've got a problem reintegrating my laptop after it was disconnected. The
> laptop had been suspended for a few hours and files had been modified on
> the server. The laptop was then woken up but was still in disconnected
> mode as venus hadn't realized yet that the network connection was up.

Over the past two weeks I found a bunch of reintegration/repair
related problems. They look like they have been lurking for about half a
year (or more), and stem from a local variable in a loop that shadowed
an identically named one at function scope, and from a missed pointer
dereference when moving conflicting CML entries to a special 'local
repair volume'.

> Now there are a few different scenarios:
> - sometimes, venus falls asleep as soon as I clog:
> 	07:25:04 fatal error -- cmlent::thread: can't find (5086c288.7f000001.1.1)

Fixed this one. Because of the bad dereference, CML entries were not
correctly renamed, and the symptom was that we couldn't find the repair
fid. The bad fix was to map them back to the original file identifier.
However, repair-related CML entries are created in several different
ways, so the universal 'map to global fid' approach really doesn't work.
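The dereference bug class is similarly mundane. Schematically (again
with made-up names, not the real CML code) it amounts to assigning to
the local copy of a pointer parameter instead of writing through it:

```cpp
struct Fid { unsigned volume; unsigned vnode; };

// Illustrative only.  The callee receives a pointer to the entry's fid
// pointer so it can retarget the entry at the repair volume's fid.
void retarget_buggy(Fid** entry_fid, Fid* repair_fid) {
    entry_fid = &repair_fid;   // BUG: overwrites the local parameter;
                               // the caller's pointer is untouched
}

void retarget_fixed(Fid** entry_fid, Fid* repair_fid) {
    *entry_fid = repair_fid;   // writes through; caller sees the change
}
```

With the buggy version the CML entry keeps pointing at its old fid,
which matches the "can't find <fid>" symptom above.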

> - when I tried just now, I was able to clog, but repair failed:
> 	repair > beginrepair
> 	Pathname of object in conflict? []: jmartin
> 	No such replica vid=0xffffffff
> 	Could not allocate replica list
> 	beginrepair failed.
>   cfs beginrepair had actually been executed and ls -l jmartin showed the
>   following:
> 	total 4
> 	lrw-r--r--    1 root     nogroup        43 Apr 19 16:01 global -> \@7f000001.00000001.00000001\@notamusica.com
> 	drwxrwxrwx   28 root     nogroup      4096 Apr 18 20:38 local/

Hmm, I guess that volume is replicated across multiple servers, because
it looks like there is a server-server conflict which originally caused
the reintegration to fail.

> Any hints on how to repair this problem (if it helps, I don't need any of
> the data that is locally cached).

With the existing 6.0.5 code this probably can't be fixed. It does
recover a bit when restarted, but not enough to reliably repair the
conflict. If you had cared about the local data, a snapshot containing
any new files should be in /usr/coda/spool/<userid>/<volume>.tar
(actually, it is probably in /var/lib/coda/spool on Debian).

The easiest way out is to reinitialize the client: kill venus, unmount
/coda, and restart venus with the -init flag:

    killall -9 venus
    umount /coda
    venus -init &

I was hoping to track down at least one more (very slow) memory leak in
the servers before making a new release, but if I haven't found it by
Wednesday I'll probably start building whatever is currently in CVS as a
new release.

Jan
Received on 2004-04-19 14:00:17