Coda File System

Re: Unresolvable Conflicts

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 21 Jul 2004 15:06:18 -0400

On Wed, Jul 21, 2004 at 02:12:08PM -0400, Martin Emrich wrote:
> I have 2 problems on my notebook:
> 
> a)
> I have a conflicting subdirectory ("DEBIAN") which should not conflict, 
> because it never existed on the server or any client except my notebook. So, 
> if I try to repair the conflict, repair says
> 
> repair > beginrepair DEBIAN
> Too few directory entries
> Could not allocate replica list
> beginrepair failed.
> repair > quit
> martin_at_gwaihir:/coda/darkzone/packaging/baghira/baghira-engine-0.4b/debian/baghira-engine$ 
> ls
> DEBIAN  usr
> 
> with "DEBIAN" being a directory containing the local directory and a dead 
> symlink to the global copy, which never existed. Another repair on this 
> fails.

Well, it is a reintegration conflict, but since the file doesn't exist
on the server we can't expand it correctly. So repair fails and even
forgets to re-collapse the expanded tree.

This is a combination of several problems. First of all, the conflict
probably should have been on the parent directory. I don't know why it
didn't mark that one instead; maybe it still had active references.
Those can be caused by things like an application or shell keeping the
directory cache entry pinned, in which case we can't turn the thing into
a dangling link.

The second problem is that repair is very paranoid and refuses to do
anything if it can't reach all copies. That really shouldn't be
necessary in all cases. If a server is dead or unreachable, it would
still be useful to perform a partial repair, even though the conflict
would come back as soon as the missing server returns.

Finally, the repair tool forgot to collapse the expanded tree when it
failed. That can be done by hand with 'cfs endrepair'.

> Unless I resolve this, this volume won't be reintegrated (I already have 
> accumulated 957 CMLs ;-)

cd out of the parent directory, then run 'cfs er baghira-engine' to
collapse the tree and hopefully flush the cached data from the kernel.
Then do an 'ls -l'; hopefully the parent will show up as a conflict. But
removals are an area where repair is probably pretty weak in general.
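
The steps above could be sketched as a small shell function. This is
only an illustration: the function name collapse_expanded, the run
helper and the DRY_RUN switch are made up for this example, and the
only Coda command used is the 'cfs er' form quoted in this message. By
default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the manual collapse. With DRY_RUN=1 (the default here) the
# commands are only printed; set DRY_RUN=0 on a real Coda client.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# collapse_expanded DIR
#   DIR is the directory that should end up marked as the conflict
#   (baghira-engine in this thread). The function name is hypothetical.
collapse_expanded() {
    dir=$1
    above=$(dirname "$dir")
    name=$(basename "$dir")

    # Leave DIR first so the shell holds no reference to it.
    run cd "$above" || return 1
    # Collapse the expanded tree, using the short form from the message.
    run cfs er "$name"
    # Check whether DIR is now listed as a conflict.
    run ls -l
}
```

Called with the full path of the baghira-engine directory, it changes
to the directory above it, runs 'cfs er baghira-engine' there and then
lists that directory.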

> On another volume, I had this strange behaviour today:
> 
> martin_at_gwaihir:/coda/darkzone/organizer$ ls -l
> ls: kcal-remote: Die Wartezeit für die Verbindung ist abgelaufen
> ls: knotes: Die Wartezeit für die Verbindung ist abgelaufen
> insgesamt 4
> drwxr-xr-x    2 martin   nogroup      2048 2004-06-09 07:55 adressen
> drwxr-xr-x    2 martin   nogroup      2048 2004-07-01 19:52 kalender
> 
> (German for "Connection timed out"). I already restarted the server components 
> and the client, nothing happens. This volume stays disconnected, too.

Well, that could be the other reason why the global replica isn't
accessible. Maybe the server is unreachable for some reason. Do you have
multiple network addresses for the server? Are there any firewalls
between the client and the server? Does 'cfs cs' (checkservers) help?

> What can I do (except backing up everything from another client, removing the 
> two volumes and making new ones) ?

You can copy the tarball from /usr/coda/spool/<userid>/volumename.tar,
which should contain all the changed files that are in the CML. It
doesn't have symlink, rename or remove operations, though.

Then a 'cfs purgeml' will flush all the pending operations from the
local cache, at which point it should be possible to bring the volume
back into connected state.
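
Since the purge throws away everything in the CML, it may be worth
making the copy and the purge a single step that stops if the copy
fails. A rough sketch follows. The function name save_and_purge and
the CFS override variable are hypothetical, the tarball path is passed
in rather than guessed, and handing the volume directory to
'cfs purgeml' as an argument is an assumption, since the message names
only the bare command.

```shell
#!/bin/sh
# Sketch: keep a copy of the spooled tarball, then purge the CML.
# CFS can be overridden (e.g. CFS=echo) to rehearse without touching
# the client cache.
CFS=${CFS:-cfs}

# save_and_purge TARBALL BACKUP_DIR VOLUME_DIR   (hypothetical helper)
save_and_purge() {
    tarball=$1
    backup_dir=$2
    volume_dir=$3

    # Refuse to go on if there is nothing to save.
    [ -f "$tarball" ] || { echo "no tarball at $tarball" >&2; return 1; }

    # Copy to a directory outside /coda and compare the copy byte for
    # byte before anything is purged.
    cp "$tarball" "$backup_dir/" || return 1
    cmp -s "$tarball" "$backup_dir/$(basename "$tarball")" || return 1

    # Assumption: purgeml accepts a directory inside the volume.
    "$CFS" purgeml "$volume_dir"
}
```

Running it once with CFS=echo shows the purge command without
executing it. Remember that the tarball holds only changed files, so
symlink, rename and remove operations still have to be redone by hand.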

Jan
Received on 2004-07-21 15:08:25