Coda File System

Re: new coda issue: touch a file and coda dies

From: Patrick Walsh <pwalsh_at_esoft.com>
Date: Thu, 07 Jul 2005 13:47:09 -0600
	I have now tried experimenting with the servers.  Restarting codasrv on
both servers didn't affect the problem.  Stopping the server on dir225
and trying to access the problem directory (which, I should mention,
worked fine apart from one problem file -- until I tried to delete that
file) still triggers the familiar issue.  Stopping the server on dir224
and then trying to ls in the directory gives a different error, one that
doesn't show up in the logs:

# ls /coda/director/snapin/pool_scm/r
ls: /coda/director/snapin/pool_scm/r/readline-4.2-2.i386.rpm: Connection
timed out

	And I've hit upon what I think must be the problem:

# cfs whereis /coda/director/snapin/pool_scm
  dir224  dir225  dir225

	A quick look at VRList on the server shows:

/snapin 7f000003 3 1000004 2000004 200000a 0 0 0 0 0 0

	So it appears we are triply replicating a volume across two servers.
I have no idea how this happened -- we've automated the setup of Coda
and that code hasn't changed for some time.  So I'll look into this and
try to figure out what's going on.  Sorry to waste your time with a bad
setup.  I just can't figure out how it got set up wrong.
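
	In case anyone else hits this, here is a rough way to spot it
(assuming the cfs whereis output looks like the above and that VRList
lives in /vice/db/VRList on the SCM): print any server name that shows
up more than once for the volume, and pull the replica count and ids
out of the VRList entry.  Here that should give something like:

# cfs whereis /coda/director/snapin/pool_scm | tr -s ' ' '\n' | sort | uniq -d
dir225
# awk '$1 == "/snapin" { print $1, "has", $3, "replicas:", $4, $5, $6 }' /vice/db/VRList
/snapin has 3 replicas: 1000004 2000004 200000a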

..Patrick



On Thu, 2005-07-07 at 14:22 -0400, Jan Harkes wrote:
> On Thu, Jul 07, 2005 at 09:41:03AM -0600, Patrick Walsh wrote:
> > # ls /root/pool_scm/r
> > readline2.2.1-2.2.1-4.i386.rpm
> > rpm-4.0.4-7x.20.i386.rpm
> > readline-4.2-2.i386.rpm
> > rsh-0.17-18.AS21.2.i386.rpm
> > restore-default-system-1.0-20031001.i386.rpm
> > rsh-0.17-18.AS21.4.i686.rpm
> > rootfiles-7.2-1.noarch.rpm
> > # du -s -h /root/pool_scm/r
> > 2.6M    /root/pool_scm/r
> > # ls r
> > ls: r/readline-4.2-2.i386.rpm: No such device
> 
> Ok, 7 directory entries wouldn't be enough to fill a directory.
> 
> > 	At this point, venus has crashed.  The console.log file has the
> > erroneous seeming errors that I pasted before, but to show again:
> > 
> > ***LWP (0x810ec50): Select returns error: 4
> > 
> > 09:28:28 worker::main Got a bogus opcode 36
> > 09:29:30 readline-4.2-2.i386.rpm (606e1fc8.7f000003.1018.4de)
> > inconsistent!
> > 09:29:30 fatal error -- fsobj::dir_Create: (dir225,
> > 606e1fc8.7f000003.fffffffc.80002) Create failed!
> 
> This is very strange. I looked at the source; we are trying to add a
> directory entry to some unknown directory (the name or fid of the
> parent in which we are trying to create the entry is not logged). We do
> know that the new entry has the name "dir225" and that it points at a
> fake object in the same volume as the inconsistent rpm file.
> 
> However, server-server conflicts do not in any way try to create names
> or anything. The lookup or getattr operation returns EINCONS, and this
> is mapped to faked stat data right before we send the reply back to the
> kernel. As far as I know there isn't even an actual filesystem object
> associated with the inconsistent object, since the servers disagree
> about its contents. Only reintegration-related expansion changes
> directory contents, since in that case we do have a locally cached copy
> of the object and it has to be modified before we can show the global
> version.
> 
> I also don't see how anything in that volume would even have a name
> like 'dir225'; there are only the [a-z] directories and a bunch of
> *.rpm files.
> 
> But somehow these two must be related, since they seem to happen so
> reliably right after each other.
> 
> > 	I should have mentioned that I already tried this.  And as you can see
> > from the above terminal transcript, it had little effect.
> > 
> > 	Any other thoughts?
> 
> No idea; it just doesn't make sense. I don't see how a server-server
> conflict could possibly get into the expansion code that is used when a
> reintegration fails, if you are simply doing an 'ls'. I also don't
> understand why it is trying to create an entry named 'dir225' when all
> the names in the volume are either a single character 'a-z' or '*.rpm'.
> 
> Maybe start venus with loglevel 100 (venus -init -d 100) and repeat the
> same steps. The log might then show how we end up in this code path,
> and whether those two events (the inconsistency and the crash) are
> really related or not.
> 
> Jan
> 
-- 
Patrick Walsh
eSoft Incorporated
303.444.1600 x3350
http://www.esoft.com/

Received on 2005-07-07 15:48:05