Coda File System

adding second server crashes SCM

From: Andria Thomas <andria_at_tovaris.com>
Date: Thu, 14 Jun 2001 17:27:19 -0400
I wrote about a week ago about being unable to set up a volume on a
second Coda server.  While that problem has been fixed (after upgrading
the SCM and the second server to FreeBSD 4.3-STABLE, and stopping the
updatesrv process on the second server), a related problem remains.  If
the SCM is aware of the second server (i.e. it has entries for the second
server in /vice/db/servers and /vice/db/VSGDB, and a volume entry for
the new server in /vice/db/VSList), *no* client can access /coda
anymore.  If the Coda services on the SCM are stopped, the references to
the second server are taken out of the relevant files, and the services
restarted, everything on the clients works fine again and they can
access SCM-stored files.  The SCM is running Coda server version
5.3.13_1, and the second server has 5.3.14.
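For reference, the entries in question look roughly like the following
(the hostnames, server IDs, and VSG numbers here are made-up examples,
not the actual values from my setup):

```
# /vice/db/servers -- one "hostname  server-id" pair per server
scm.example.com        1
second.example.com     2

# /vice/db/VSGDB -- volume storage group number, then its member servers
E0000100  scm.example.com
E0000101  second.example.com
E0000102  scm.example.com second.example.com
```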

The SCM has the following processes running:  auth2, rpc2portmap,
updatesrv, updateclnt, and codasrv.
The second server has the following processes:  auth2, updateclnt, and
codasrv (these are the only ones started by the
/usr/local/etc/rc.d/rc.vice script).

When the SCM is configured for both servers, the Coda processes start up
fine, but as soon as a client machine tries to access /coda (even with
an ls) the following error is returned on the client:

: andria_at_doj; ls -la /coda
ls: /coda: Device not configured

The second server doesn't seem to be logging anything of significance
(in either of the Update logs or the Srv logs) but the SCM has the
following SrvLog:

16:39:40 New SrvLog started at Thu Jun 14 16:39:40 2001

16:39:40 Resource limit on data size are set to 536870912

16:39:40 RvmType is Rvm
16:39:40 Main process doing a LWP_Init()
16:39:40 Main thread just did a RVM_SET_THREAD_DATA

16:39:40 Setting Rvm Truncate threshhold to 5.

Partition /data: inodes in use: 12633, total: 2097152.
16:39:55 Partition /data: 5127996K available (minfree=7%), 4369168K free.
16:39:55 The server (pid 488) can be controlled using volutil commands
16:39:55 "volutil -help" will give you a list of these commands
16:39:55 If desperate,
                "kill -SIGWINCH 488" will increase debugging level
16:39:55        "kill -SIGUSR2 488" will set debugging level to zero
16:39:55        "kill -9 488" will kill a runaway server
16:39:55 Vice file system salvager, version 3.0.
16:39:55 SanityCheckFreeLists: Checking RVM Vnode Free lists.
16:39:55 DestroyBadVolumes: Checking for destroyed volumes.
16:39:55 Salvaging file system partition /data
16:39:55 Force salvage of all volumes on this partition
16:39:55 Scanning inodes in directory /data...
16:39:59 SFS: There are some volumes without any inodes in them
16:39:59 Entering DCC(0x37000001)
16:39:59 DCC: Salvaging Logs for volume 0x37000001


16:40:04 done:  13527 files/dirs,       514649 blocks
16:40:04 SFS:No Inode summary for volume 0x37000003; skipping full salvage
16:40:04 SalvageFileSys: Therefore only resetting inUse flag
16:40:04 SFS:No Inode summary for volume 0x37000004; skipping full salvage
16:40:04 SalvageFileSys: Therefore only resetting inUse flag
16:40:04 Entering DCC(0x37000006)
16:40:04 DCC: Salvaging Logs for volume 0x37000006

16:40:04 done:  135 files/dirs, 82148 blocks
16:40:04 SalvageFileSys completed on /data
16:40:04 VAttachVolumeById: vol 37000001 (/data.0) attached and online
16:40:04 VAttachVolumeById: vol 37000002 (u.andria.0) attached and online
16:40:04 VAttachVolumeById: vol 37000003 (u.adrian.0) attached and online
16:40:04 VAttachVolumeById: vol 37000004 (u.jmalone.0) attached and online
16:40:04 VAttachVolumeById: vol 37000006 (admin.0) attached and online
16:40:04 Attached 5 volumes; 0 volumes not attached
lqman: Creating LockQueue Manager.....LockQueue Manager starting .....
16:40:04 LockQueue Manager just did a rvmlib_set_thread_data()

done
16:40:04 CallBackCheckLWP just did a rvmlib_set_thread_data()

16:40:04 CheckLWP just did a rvmlib_set_thread_data()

16:40:04 ServerLWP 0 just did a rvmlib_set_thread_data()

16:40:04 ServerLWP 1 just did a rvmlib_set_thread_data()

16:40:04 ServerLWP 2 just did a rvmlib_set_thread_data()

16:40:04 ServerLWP 3 just did a rvmlib_set_thread_data()

16:40:04 ServerLWP 4 just did a rvmlib_set_thread_data()

16:40:04 ServerLWP 5 just did a rvmlib_set_thread_data()

16:40:04 ResLWP-0 just did a rvmlib_set_thread_data()

16:40:04 ResLWP-1 just did a rvmlib_set_thread_data()

16:40:04 VolUtilLWP 0 just did a rvmlib_set_thread_data()

16:40:04 VolUtilLWP 1 just did a rvmlib_set_thread_data()

16:40:04 Starting SmonDaemon timer
16:40:04 File Server started Thu Jun 14 16:40:04 2001

16:40:04 client_GetVenusId: got new host 10.0.0.21:2430
16:40:04 Building callback conn.
16:40:04 No idle WriteBack conns, building new one
16:40:04 Writeback message to 10.0.0.21 port 2430 on conn 38780c5b succeeded
16:40:04 ValidateVolumes: 0x7f000000 failed!
16:40:04 ValidateVolumes: 0x7f000005 failed!
16:40:04 GetVolObj: VGetVolume(7f000005) error 103
16:40:04 GrabFsObj, GetVolObj error Volume not online
16:40:06 GetVolObj: VGetVolume(7f000005) error 103
16:40:06 GrabFsObj, GetVolObj error Volume not online
16:40:06 GetVolObj: VGetVolume(7f000005) error 103
16:40:06 GrabFsObj, GetVolObj error Volume not online
16:40:10 GetVolObj: VGetVolume(7f000000) error 103
16:40:10 GrabFsObj, GetVolObj error Volume not online
16:40:10 GetVolObj: VGetVolume(7f000000) error 103
16:40:10 GrabFsObj, GetVolObj error Volume not online
16:41:17 client_GetVenusId: got new host 10.0.0.35:2430
16:41:17 Building callback conn.
16:41:17 No idle WriteBack conns, building new one
16:41:17 Writeback message to 10.0.0.35 port 2430 on conn c5ee03c succeeded
16:41:17 RevokeWBPermit on conn c5ee03c returned 0
16:41:17 GetVolObj: VGetVolume(7f000000) error 103
16:41:17 GrabFsObj, GetVolObj error Volume not online
16:46:07 Worker3: Unbinding RPC connection 808164070
16:46:31 RevokeWBPermit on conn 38780c5b returned 0
16:46:31 GetVolObj: VGetVolume(7f000003) error 103
16:46:31 GrabFsObj, GetVolObj error Volume not online
17:00:53 RevokeWBPermit on conn c5ee03c returned 0
17:00:53 GetVolObj: VGetVolume(7f000000) error 103
17:00:53 GrabFsObj, GetVolObj error Volume not online
17:00:56 GetVolObj: VGetVolume(7f000000) error 103
17:00:56 GrabFsObj, GetVolObj error Volume not online
17:00:56 GetVolObj: VGetVolume(7f000000) error 103
17:00:56 GrabFsObj, GetVolObj error Volume not online
17:00:56 GetVolObj: VGetVolume(7f000000) error 103
17:00:56 GrabFsObj, GetVolObj error Volume not online
17:01:08 RevokeWBPermit on conn 38780c5b returned 0

And on and on and on....

The volumes above that are failing (7f000000, 7f000003, 7f000005) are
all the ones I tried to access from the client, and they are all
singly-replicated volumes (stored on the SCM).  The tokens are the same
on both servers, the servers can otherwise communicate normally, and
they each have a single entry in their /vice/db/vicetab files, like the
following:

irs     /data           ftree   width=128,depth=3

Any help, or suggestions for what to try next, would be greatly
appreciated.

Thanks!
Andria
Received on 2001-06-14 17:28:06