Coda File System

Re: codasrv crash on netbsd/sparc64 3.0

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Tue, 25 Apr 2006 11:53:46 -0400
On Tue, Apr 25, 2006 at 10:43:49AM -0400, Sean Caron wrote:
> On 4/25/06, Greg Troxel <gdt_at_ir.bbn.com> wrote:
> > In gdb, after attaching, do "bt" to get a stack backtrace.  Then do
> > "up" to move to where the signal was, and there "i frame" and "list".
> 
> Done! Here's the data:
> 
> (gdb) bt
> #0  0x403bc3a0 in sleep () from /usr/local/lib/libc.so.12
> #1  0x0008f2a8 in coda_assert (
>     pred=0xf9e <Error reading address 0xf9e: Invalid argument>,
>     file=0x94518 "srv.cc", line=302) at coda_assert.c:46
> #2  0x00013c64 in zombie(int) (sig=3998) at srv.cc:302
> #3  <signal handler called>
> 
> (gdb) up

Annoyingly, this doesn't actually show where the signal occurred, only
what happened after the signal was caught.

Another way to catch this in GDB is to run codasrv under gdb. Something
like,

    # gdb codasrv
    (gdb) run -d 1

(the -d 1 bumps the debug level slightly and should also prevent the
server from detaching from the console).

Then any signals will be trapped first by GDB.

You seem to be getting a sig10 (SIGBUS?), which commonly indicates an
unaligned memory access. A null-pointer dereference would have been
sig11 (SIGSEGV), and an assertion typically generates a sig6 (SIGABRT).
Because signals might also be used to set up thread stacks it could be
that the signal we're looking for doesn't happen until later, so you
might have to enter 'continue' a couple of times.
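
Since the numeric signal values differ between platforms, here is a small
sketch for turning a raw number back into a name. It is only an
illustration, not Coda code, and crash_signal_name is a made-up helper. On
NetBSD SIGBUS is 10, but on Linux/x86, for instance, 10 is SIGUSR1 and
SIGBUS is 7; SIGSEGV (11) and SIGABRT (6) are the same on both.

```c
#include <signal.h>

/* Hypothetical helper: map a raw signal number to the symbolic name of
 * the three crash-related signals discussed above. The comparison uses
 * the platform's own <signal.h> constants, so it stays correct even
 * where the numeric values differ. */
const char *crash_signal_name(int signo)
{
    if (signo == SIGBUS)
        return "SIGBUS";   /* often an unaligned memory access */
    if (signo == SIGSEGV)
        return "SIGSEGV";  /* e.g. a null-pointer dereference */
    if (signo == SIGABRT)
        return "SIGABRT";  /* raised by abort(), e.g. a failed assert() */
    return "other";
}
```

Calling crash_signal_name(10) on the machine in question should confirm
whether sig10 really is SIGBUS there.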

However, we used to have both clients and servers running on ARM with a
kernel that did no unaligned access fixups, so I thought we had already
dealt with most of those.
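
For anyone unfamiliar with this kind of fix, a minimal sketch of the usual
pattern follows. It is not taken from the Coda sources, and load_u32 and
store_u32 are names invented for the example. The idea is to copy through
memcpy instead of casting a byte pointer to a wider type, so the compiler
never emits a load or store that the CPU requires to be aligned.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address that may not be 4-byte aligned.
 * Dereferencing (const uint32_t *)p directly could raise SIGBUS on
 * strict-alignment CPUs; memcpy has no alignment requirement. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 32-bit value to a possibly unaligned address. */
void store_u32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

For example, store_u32(buf + 1, x) followed by load_u32(buf + 1) returns
x regardless of how buf itself is aligned.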

One other thing I am interested in: what is the output of LWP's
configure? I wonder which type of thread switching it picked. I think
NetBSD deprecated makecontext and friends, so it might be using tricks
with signal handlers (sigaltstack) to kickstart new threads, or it could
be falling back on the old assembly code which might not realize that
this is a 32-bit kernel/userspace. So it could be that the first thread
switch is trying to perform a 64-bit read which triggers the bus error.
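
To make the makecontext variant concrete, here is a rough sketch of a
first thread switch done with the ucontext calls. This is not LWP's
implementation; run_demo, thread_body and the 64 KB stack size are
assumptions made for the example, and it only builds where <ucontext.h>
is still provided.

```c
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, thread_ctx;
static int steps;

/* Body of the new "thread"; it runs on its own stack. */
static void thread_body(void)
{
    steps++;                              /* reached by the first switch */
    swapcontext(&thread_ctx, &main_ctx);  /* yield back to the caller */
    steps++;                              /* reached after being resumed */
    /* Returning here continues at uc_link, i.e. main_ctx. */
}

/* Hypothetical driver: returns the number of steps the thread ran,
 * or -1 if setup failed. */
int run_demo(void)
{
    size_t stack_size = 64 * 1024;        /* assumed size for the demo */
    void *stack = malloc(stack_size);

    if (stack == NULL)
        return -1;
    steps = 0;
    if (getcontext(&thread_ctx) == -1) {
        free(stack);
        return -1;
    }
    thread_ctx.uc_stack.ss_sp = stack;
    thread_ctx.uc_stack.ss_size = stack_size;
    thread_ctx.uc_link = &main_ctx;       /* where to go when the body returns */
    makecontext(&thread_ctx, thread_body, 0);

    swapcontext(&main_ctx, &thread_ctx);  /* first switch into the thread */
    swapcontext(&main_ctx, &thread_ctx);  /* resume it until it finishes */

    free(stack);
    return steps;
}
```

A thread package that cannot use these calls has to build the initial
stack frame some other way, which is where the sigaltstack and assembly
fallbacks come in.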

Jan
Received on 2006-04-25 11:55:27