Coda File System

Re: venus failure during file rename

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Fri, 7 Apr 2006 00:20:38 -0400
On Thu, Mar 30, 2006 at 10:30:14AM +0300, Denis Chapligin wrote:
> 1)"psdev_write: msg (12,545) not found" message in client's dmesg
> 2)"08:29:37 GetVolObj: Volume (1000001) already write locked                      08:29:37 RS_LockAndFetch: Error 11 during GetVolObj for 1000001.1041.c066" message at the one server's log
> 3)And very interesting log on the second server (attached)

I've been digging through the log file and found a couple of interesting
things. After your server came up, it started to resolve some things
from the other replica.

Very early on, the resolution hits a remove/update conflict while resolving
a rename.

    16:53:05 ChkResRenSem: RUConflict on target 2b48.b34a
    16:53:05 Error 199 from CheckResolveRenameSemantics
    16:53:05 Incorrect Res Rename: src = conf_global.php (2b4c.b34b), tgt = conf_global-bak.php (2b48.b34a)s

My guess is that this was an operation like,

    mv conf_global.php conf_global-bak.php

Such a rename operation implicitly removes the target if it already exists.
The Remove/Update conflict tells me that there was a different version of
conf_global-bak.php (the update) on the server that was restarted, and Coda
doesn't want to lose that version, so it errors out on the rename.
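
Just to spell out the implicit remove (filenames from above, the path is
made up):

    cd /coda/yourrealm/some/dir             # example path
    echo 'old backup' > conf_global-bak.php
    echo 'new config' > conf_global.php
    mv conf_global.php conf_global-bak.php
    cat conf_global-bak.php                 # prints 'new config', the old backup is gone

On a single replica that is the end of it, but with replication the remove
of the old target has to be reconciled with whatever version the restarted
server still had, and that is where the Remove/Update conflict comes from.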

    16:53:05 conf_global.php exists already N/N conflict, mark parent 0x203.ad8f inc
    16:53:05 InstallVV: Error 199 getting object 203.ad8f

Now this is strange. It is probably just a logging error; my guess is that
it prints the wrong filename and really meant to say that
'conf_global-bak.php' exists already.

In any case, it marks the parent directory as a conflict. And because of
the conflict we do not update the version vector.
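
By the way, the conflict should be visible from any client: an object that
has been marked inconsistent shows up in /coda as a dangling symlink
pointing at its fid, something like the following (the path is invented and
the exact link format may vary):

    ls -l /coda/yourrealm/path/to
    # ... parent-dir -> @7f000000.203.ad8f

That dangling symlink is what you will eventually point the repair tool at.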

Then we get a lot of
    16:55:02 VGetVnode: vnode 2000001.5672 is not allocated

This is probably because of clients revalidating their caches and
hitting files that haven't been resolved yet. Files cannot get resolved
before their parent directory, otherwise we'd end up with orphaned
file objects that might never get removed. So the server is simply
returning errors in the hope that at some point the client hits the
parent directory. The resolution of a parent replays the create operations,
which creates empty file objects, and then file resolution can take over
and propagate the data.
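
(Resolution is driven by client accesses, so you can usually give it a push
by just walking the affected part of the tree from any client, for example:)

    # example path; walking the tree forces venus to revalidate it,
    # which in turn triggers resolution on the servers
    ls -lR /coda/yourrealm/path/to/tree > /dev/null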

We see a couple of
    16:55:03 ComputeCompOps: fid(0x7f000000.1f05.a50e)

This indicates that we are in fact hitting some directories once in a while.
There actually don't seem to be that many missing objects; the client(s)
just seem to be very persistent in trying to validate them over and over
again. That would suggest they are unable to recurse up to the parent
directory; my guess is that the parent is the directory that was marked as
a conflict earlier on.

Around 17:15 we finally get to see a couple of FileResolve operations. This
is almost 20 minutes after the server was started, which seems like a
pretty long time. Maybe this is caused by some client that is walking up
the tree (possibly your client, as you are trying to figure out why things
aren't working).

Another 20 minutes later, we start to see that one of the servers is
returning EAGAIN errors; I'm not sure what that is about.
    17:35:58 CheckRetCodes: server 10.0.100.3 returned error 11

The other interesting thing here is that some client seems to be
repeatedly asking us to resolve the same object over and over again,

    17:36:13 Entering RecovDirResolve 7f000000.1055.c06b
    17:36:13 ComputeCompOps: fid(0x7f000000.1055.c06b)
    17:36:13 RS_ShipLogs - returning 0
...
    17:36:33 Entering RecovDirResolve 7f000000.1041.c066
    17:36:33 RegDirResolution: VECTORS ARE ALREADY EQUAL
    {[ 18 1 0 0 0 0 0 0 ] [ 167797763 1143613965 ] [ 0 ]}
    17:36:33 RecovDirResolve: RegDirResolution succeeded
...
    17:36:34 Entering RecovDirResolve 7f000000.1041.c066
    17:36:34 RegDirResolution: VECTORS ARE ALREADY EQUAL
    {[ 18 1 0 0 0 0 0 0 ] [ 167797763 1143613965 ] [ 0 ]}
    {[ 18 1 0 0 0 0 0 0 ] [ 167797763 1143613965 ] [ 0 ]}
    17:36:34 RecovDirResolve: RegDirResolution succeeded
...
... snipped a couple of dozen attempts.
...
    17:48:28 Entering RecovDirResolve 7f000000.1055.c06b
...
...
    08:29:23 Entering RecovDirResolve 7f000000.1055.c06b

So according to the server/version vectors this directory is already
resolved, but for some reason the clients think something is wrong. Maybe
they are getting a VNOVNODE error from one of the directory's children
and are trying to recurse up the tree so that directory resolution can
create the file.

I also never saw a repair operation for the directory that was marked as
inconsistent by the rename conflict (7f000000.203.ad8f). This could be
related; maybe that is a sub- or parent directory of the one that we're
repeatedly trying to resolve.

I would first try to find the parent directory of conf_global.php, and
make sure it is repaired. Then you can try to find the directory that
we are repeatedly trying to resolve.
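
For the first part: since conflicts show up in /coda as dangling symlinks,
the conflicted parent (and anything else that got marked inconsistent) can
be hunted down with something like the following; the realm path is just a
placeholder.

    # broken symlinks under the realm are most likely objects in conflict
    find /coda/yourrealm -noleaf -xtype l -ls

Once you have its path, an interactive session with the repair tool
(beginrepair, comparedirs, dorepair, endrepair; its built-in help lists the
exact arguments) should clear the 203.ad8f conflict.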

You might be able to find that second directory with,

    # get a list of all directories, feed it to cfs getfid,
    # and grep for the fid we are after
    find /coda/realm -noleaf -type d -print0 | \
        xargs -0 cfs getfid | \
        grep 7f000000.1055.c06b

This might be kind of slow, since it will basically query every single
directory. Another way is to hope that venus already has the object
cached.

    coda-client-logrotate            # rotate the venus.log file
    vutil stats                      # dump detailed statistics and cache
                                     # contents to venus.log
    less /usr/coda/etc/venus.log
    /1055\.c06b                      # search for the fid from within less

If this finds the object, the entry has a lot of information that may help
figure out what/where it is in the tree (comp = .. is the filename,
parent = .. is the fid of the parent directory, etc.)

Jan
Received on 2006-04-07 00:22:59