Coda File System

Replication on more than two servers?

From: Daniel Schmitt <pnambic_at_kidata.de>
Date: Wed, 13 Oct 1999 11:47:35 +0200
Greetings,

I've been trying to come to terms with the problem detailed below for the better
part of two days now, and would be extremely grateful for any hint, however
small.

The situation:

I'm trying to replicate a Coda volume over four identical Intel-based servers
running Linux 2.2.12, patched for Beowulf ethernet channel bonding, glibc2.1,
based on a vanilla SuSE 6.2 installation. This doesn't work, as shown below.

First try:

Coda 5.3.1, set up according to the instructions in the HowTo. I was unable to
start a non-SCM server, while the SCM worked fine. This problem was mentioned
several times before on this mailing list, for example on February 2nd and 3rd
by Bernd Markgraf and on September 10th by Alex Fomin. After toying around with
this for a while, I scrapped the installation and moved on to the...

Second try:

Coda 5.2.7. Now this was slightly better. I was able to set up a replicated
volume, spread across two servers, by following the "exploring replication"
instructions in the HowTo. However, as soon as I try to add another, more widely
replicated volume, all non-SCM servers crash; the newly added ones on startup,
the already running one upon receipt of the new configuration files. Here's the
end of the log and the resulting backtrace from a server crashing on startup:

lqman: Creating LockQueue Manager.....LockQueue Manager starting .....
11:02:13 LockQueue Manager just did a rvmlib_set_thread_data()
 done
11:02:13 ****** FILE SERVER INTERRUPTED BY SIGNAL 11 ******
11:02:13 ****** Aborting outstanding transactions, stand by...
11:02:13 Uncommitted transactions: 0
11:02:13 Uncommitted transactions: 0
11:02:13 You may use gdb to attach to 388

(gdb) bt
#0  0x40125b71 in __libc_nanosleep () from /lib/libc.so.6
#1  0x40125aed in __sleep (seconds=1) at ../sysdeps/unix/sysv/linux/sleep.c:78
#2  0x8115973 in coda_assert (pred=0x8116047 "0", file=0x8116040 "srv.cc",
    line=314) at coda_assert.c:46
#3  0x804a8dd in zombie (sig=11) at srv.cc:314
#4  0x400af9b8 in __restore ()
    at ../sysdeps/unix/sysv/linux/i386/sigaction.c:125
#5  0x400b1141 in _quicksort (pbase=0xbffff220, total_elems=4294967294,
    size=4, cmp=0x80d1500 <cmpHost(long *, long *)>) at qsort.c:121
#6  0x400b17bb in qsort (b=0xbffff220, n=4294967294, s=4,
    cmp=0x80d1500 <cmpHost(long *, long *)>) at msort.c:114
#7  0x80d157a in vsgent::vsgent (this=0x8207d28, vsgaddr=3758096644,
    hosts=0xbffff220, nh=-2) at vsg.cc:67
#8  0x80d1bc2 in InitVSGDB () at vsg.cc:213
#9  0x80b0f0c in ResCommInit () at rescomm.cc:98
#10 0x804afe4 in main (argc=12, argv=0xbffff814) at srv.cc:510

I didn't save the backtrace from the already running one, but it, too, was
crashing within qsort while building the VSGDB.
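
One detail in the backtrace can be checked by hand: frame #7 shows nh=-2,
while frames #5 and #6 show qsort being asked to sort 4294967294 elements.
The short Python sketch below only illustrates that these two numbers are
the same bit pattern, assuming the count is converted to a 32-bit unsigned
size_t on i386. The helper name as_unsigned32 is made up for this
illustration and is not part of Coda.

```python
# Values copied from frames #5 to #7 of the backtrace above.
NH = -2                    # nh argument of vsgent::vsgent
TOTAL_ELEMS = 4294967294   # total_elems / n as seen by qsort


def as_unsigned32(value: int) -> int:
    """Reinterpret a signed integer as a 32-bit unsigned one.

    Assumption: size_t is 32 bits wide, as on the i386 servers described.
    """
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    print(NH, "as unsigned 32-bit:", as_unsigned32(NH))
    print("matches qsort's element count:", as_unsigned32(NH) == TOTAL_ELEMS)
```

If that reading is right, qsort is walking far beyond the end of the hosts
array, which would be consistent with the SIGNAL 11 in the log. It does not
explain why the host count is negative in the first place.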

This occurs regardless of whether I set up a two-server replicated volume first,
or whether I go for the full four-server setup immediately. My servers file
looks like this:

dogma-1		1
dogma-2		2
dogma-3		3
dogma-4		4

...and my VSGDB looks like this:

E0000100 dogma-1
E0000101 dogma-2
E0000102 dogma-3
E0000103 dogma-4
E0000104 dogma-1 dogma-2 dogma-3 dogma-4 
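
As a cross-check against the backtrace, the vsgaddr=3758096644 in frame #7
can be compared with the addresses in this file. The sketch below assumes
the VSGDB format is exactly what is shown above (a hexadecimal address
followed by host names, separated by whitespace); parse_vsgdb is a
hypothetical helper written for this illustration, not Coda's own parser.

```python
# The VSGDB contents exactly as listed above.
VSGDB_TEXT = """\
E0000100 dogma-1
E0000101 dogma-2
E0000102 dogma-3
E0000103 dogma-4
E0000104 dogma-1 dogma-2 dogma-3 dogma-4
"""

# vsgaddr argument from frame #7 of the backtrace (decimal).
VSGADDR_FROM_BACKTRACE = 3758096644


def parse_vsgdb(text: str) -> dict:
    """Map each VSG address (as an int) to its list of host names.

    Assumption: one entry per line, hex address first, hosts after it.
    """
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        entries[int(fields[0], 16)] = fields[1:]
    return entries


if __name__ == "__main__":
    entries = parse_vsgdb(VSGDB_TEXT)
    hosts = entries[VSGADDR_FROM_BACKTRACE]
    print(format(VSGADDR_FROM_BACKTRACE, "X"), hosts, len(hosts))
```

The decimal address from the backtrace is E0000104 in hexadecimal, so the
crash happens while the four-host entry is being built. A plain
whitespace split of that line yields four hosts, whereas the server reports
nh=-2 for it.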

OK, this is about as much information as I can supply. Does that ring a bell
with anyone?

Thanks a whole lot in advance,

Daniel.

-- 
daniel schmitt - lead system architect - kidata ag, koenigswinter, germany
Received on 1999-10-13 05:49:13