Coda File System

crash in rvmlib_free during repair

From: Piotr Isajew <pki_at_ex.com.pl>
Date: Thu, 4 Jul 2013 07:25:44 +0200

Hi,

I've encountered a problem I'm not sure how to deal with.

I have a two-server Coda 6.9.5 cell running on Linux (kernel 3.9.8).

Recently I was copying a large number of files (well, relatively
large, a few gigabytes) to Coda, and during the copy the SCM
dropped off the network. Copying continued against the secondary
server and failed some time after that (I don't remember exactly
why, probably a venus crash).

After restarting venus, running purgeml and restoring
connectivity, I was connected to both servers in the cell, but
the SCM's replica was out of date (it was missing some files
that had been copied to the non-SCM replica).

This triggered several server/server conflicts for the
directories involved.

I was surprised that this kind of conflict is not resolved
automatically, but I tried to use repair on the directories
causing problems.

This appears to be the way to hell.

I'm able to run beginrepair, and comparedirs generates a reasonable fix:

replica 192.168.9.6 02000001 
	removed java

replica 192.168.10.1 01000002 


But if I invoke dorepair or removeinc, the non-SCM server crashes.
repair just reports an error due to lost connectivity with the
non-SCM.

SrvErr on the non-SCM (192.168.9.6) shows the following:

XMIT: Sent long packet (subsys 5893, opcode 20, length 2236)
XMIT: Sent long packet (subsys 5893, opcode 20, length 2236)
No waiters, dropped incoming sftp packet
XMIT: Sent long packet (subsys 5893, opcode -8, length 2224)
repair_getdfile: starting
: Success
repair_getdfile: file opened: Success
repair_getdfile: list created: Success
repair_getdfile: replicas parsed: Success
repair_getdfile: replica processed: Success
repair_getdfile: completed!: Success
RVMLIB_ASSERT: Error in rvmlib_free

Assertion failed: 0, file "rvmlib.c", line 258
***BackTrace***
/usr/sbin/codasrv(coda_assert+0x5f)[0x4a4bff]
/usr/sbin/codasrv(rvmlib_free+0x181)[0x4a2ab1]
/usr/sbin/codasrv(_ZN5recle8FreeVarlEv+0x1aa)[0x4726da]
/usr/sbin/codasrv(_Z8PurgeLogP9rec_dlistP6VolumeP7vmindex+0x86)[0x4715c6]
/usr/sbin/codasrv(_Z10PutObjectsiP6VolumeiP5dlistiii+0x9c1)[0x4263d1]
/usr/sbin/codasrv(FS_ViceRepair+0x105)[0x430455]
/usr/sbin/codasrv[0x449aac]
/usr/sbin/codasrv(srv_ExecuteRequest+0x125c)[0x454f6c]
/usr/sbin/codasrv[0x41f8b4]
/usr/lib64/../lib64/liblwp.so.2(+0x5fe2)[0x7f8db5446fe2]
/lib64/libc.so.6(+0x36aa0)[0x7f8db4ba7aa0]
/lib64/libc.so.6(sigsuspend+0x16)[0x7f8db4ba7d76]
/usr/lib64/../lib64/liblwp.so.2(lwp_makecontext+0x10e)[0x7f8db544713e]
/lib64/libc.so.6(fflush+0x6b)[0x7f8db4bde81b]
/lib64/libc.so.6(_longjmp+0x2b)[0x7f8db4ba78ab]
/usr/lib64/../lib64/liblwp.so.2(+0x5f72)[0x7f8db5446f72]
/usr/lib64/../lib64/liblwp.so.2(lwp_swapcontext+0x22)[0x7f8db5447022]
/usr/lib64/../lib64/liblwp.so.2(LWP_DispatchProcess+0x3bd)[0x7f8db5445f7d]
/usr/lib64/../lib64/liblwp.so.2(LWP_QWait+0x57)[0x7f8db5446987]
/usr/sbin/codasrv(main+0xdc6)[0x41c286]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f8db4b92a95]
/usr/sbin/codasrv[0x41cee9]
EXITING! Bye!
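
To make the call chain in a backtrace like the one above easier to read, the frames can be pulled apart with a short script. The following is only a minimal sketch: it assumes the lines follow the `binary(symbol+0xoffset)[0xaddress]` layout seen in this SrvErr excerpt, and the names `FRAME_RE`, `parse_frames` and `named_symbols` are hypothetical helpers, not part of Coda.

```python
import re

# Assumed line layout (taken from the SrvErr excerpt):
#   /path/to/binary(symbol+0xoffset)[0xaddress]
#   /path/to/binary(+0xoffset)[0xaddress]      <- no symbol name
#   /path/to/binary[0xaddress]                 <- no symbol info at all
FRAME_RE = re.compile(
    r"^(?P<binary>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frames(text):
    """Return one dict per backtrace line found in text, in order."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def named_symbols(frames, binary_suffix="codasrv"):
    """Return the symbol names of frames belonging to one binary.

    Frames without a symbol name (empty or missing) are skipped.
    """
    return [
        frame["symbol"]
        for frame in frames
        if frame["binary"].endswith(binary_suffix) and frame["symbol"]
    ]


if __name__ == "__main__":
    import sys

    # Usage: python frames.py < SrvErr
    for name in named_symbols(parse_frames(sys.stdin.read())):
        print(name)
```

The C++ names in the output are still mangled (for example `_ZN5recle8FreeVarlEv`); piping them through `c++filt` from binutils turns them into readable signatures.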


After restarting everything I still have the conflict in the same
node or its parent node, depending on the situation.

Is there any hack that would allow me to recover from this
situation?

Bests,

Piotr
Received on 2013-07-04 01:40:18