Coda File System

Re: venus, she is strange sometimes :-)

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Mon, 11 Mar 2002 13:48:18 -0500
On Mon, Mar 11, 2002 at 09:40:07AM +0100, Ivan Popov wrote:
> Now about the problems:
> 
>  - the problem has been triggered by "abcde" (A Better CD Encoder),
> which is a complicated shell script running a lot of posix-style programs.
> All of the programs it runs are (very simple) shell wrappers to the real
> binaries, running a statically linked shell as the interpreter, from Coda,
> like
> 
> #!/coda/<some-path-approximately-this-long>/sh
> 
> Once in a while I get messages like
> "/coda/<some-path-here>/grep: bad interpreter: Is a directory"
> or
> "/coda/<some-path-here>/tail: bad interpreter: Is a directory"

Interesting, you are possibly on the trail of yet another kernel
problem. This time I would say it could be related to how we handle
the kernel's directory entry cache.

Do /coda/<some-path-here>/{grep,tail} actually exist? Are those the
wrappers you are describing? And their first line is the #!/coda/<xxx>/sh
thing? It almost looks like we're returning the wrong object during the
lookup. Strace -f output of such a failing execution is probably a bit
big, and possibly the strace even makes the race condition go away. But
I'm curious what syscall is returning EISDIR.
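
If a trace can be captured at all, e.g. with something like
`strace -f -o trace.log <the failing command>`, the log can be narrowed
down to the calls that failed with EISDIR. The following is only a rough
sketch: the helper name `find_errno` and the log file name are made up for
illustration, and the pattern assumes the usual one-line strace format
(`pid  syscall(args) = -1 ERRNO (text)`); calls that strace splits into
"unfinished"/"resumed" pairs are not matched.

```python
import re

# Matches failed calls in typical strace output, for example:
#   1234  execve("/coda/x/grep", ["grep"], 0x7ffd /* 20 vars */) = -1 EISDIR (Is a directory)
#   [pid  1234] open("/coda/x/sh", O_RDONLY) = -1 EISDIR (Is a directory)
_FAILED_CALL = re.compile(
    r"^(?:\[pid\s+(\d+)\]\s+|(\d+)\s+)?"   # optional pid prefix (two common styles)
    r"(\w+)\(.*\)\s+=\s+-1\s+(E[A-Z]+)\b"  # syscall(args) = -1 ERRNO
)


def find_errno(lines, errno_name="EISDIR"):
    """Return (pid, syscall, line) for every call that failed with errno_name.

    pid is None when the line carries no pid prefix.
    """
    hits = []
    for raw in lines:
        line = raw.strip()
        m = _FAILED_CALL.match(line)
        if m and m.group(4) == errno_name:
            pid = m.group(1) or m.group(2)
            hits.append((int(pid) if pid else None, m.group(3), line))
    return hits
```

Calling `find_errno(open("trace.log"))` then lists which process and which
syscall saw the error, which is the part I'm interested in.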

> After that message it is sometimes possible to (re)run the script,
> but venus is likely to behave strangely, *some* (sometimes all) processes
> hang while trying to access coda. Restart (reboot as I can't umount /coda

Curious, I know of one location where we can get that kind of hanging
behaviour. If the CML is owned by another user, the threads block until
the CML is reintegrated. Now if there are too many threads trying to
access the same volume, there aren't enough threads left over to
actually accept the token that enables us to reintegrate.
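
The starvation pattern can be shown with a toy model. This is not Venus
code: the fixed-size pool, the event standing in for "the CML has been
reintegrated", and the names `simulate`, `access_volume` and `accept_token`
are all assumptions made up for the illustration. Requests block until the
event is set, but the task that would set it needs a free worker itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def simulate(pool_size, blocked_requests, wait_timeout=0.5):
    """Return how many requests gave up before 'reintegration' happened."""
    reintegrated = threading.Event()  # stands in for "CML reintegrated"

    def access_volume():
        # Blocks until reintegration is done; True if it happened in time.
        return reintegrated.wait(timeout=wait_timeout)

    def accept_token():
        # The step that unblocks everyone, but it needs a worker to run on.
        reintegrated.set()

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        waiters = [pool.submit(access_volume) for _ in range(blocked_requests)]
        pool.submit(accept_token)  # queued behind the blocked requests
        results = [w.result() for w in waiters]
    return results.count(False)
```

With `simulate(pool_size=2, blocked_requests=2)` every worker is occupied by
a blocked request, so `accept_token` only runs after one of them times out.
With one more worker than blocked requests, nothing has to time out. The
timeout exists only so the toy terminates; without it the pool would hang.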

>  - another problem (seen on the clients with smaller caches) is that after
> bigger file updates that trigger write-disconnected mode, after some or
> many reintegrating...SUCCESS messages on the console, venus dies and then
> gets assertion failed (about "volume -> is replicated") on restarts.
> That is curable by cache reinit.

This is a known problem that I haven't been able to fix yet. Whenever a
local-global conflict is detected, the locally cached objects are moved
into a fake volume. But they are only correctly recovered when repair
succeeds. If no repair is attempted, or repair fails, the CML entries
associated with these objects are relinked to the local volume instead of
the original volume. This is not good, so the local volume doesn't allow
anyone to link CML entries to it (the local volume is not replicated,
which is what the assertion trips on).

Jan
Received on 2002-03-11 13:49:24