Coda File System

Re: Checkpointing causes local/global conflict ?

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Thu, 24 Feb 2005 14:02:30 -0500
On Tue, Feb 15, 2005 at 03:22:27AM -0500, Christer Bernérus wrote:
> This is a Darwin issue, that I need help with again.
> 
> Quite frequently, we get a local/global conflict while writing data to 
> coda. This happens when there is only one client involved, so there is 
> no real cause for a conflict.

Even a single client can get a conflict. One simple situation is when a
store completes, but the client doesn't see the final ACK message. The
client assumes disconnection, logs the store operation in the
reintegration log and retries the operation when the server is heard
from again.

However, the store in the reintegration log has a different 'store id'
from the one that was already committed, and it still carries the
locally cached version vector of the original object. As a result, the
server believes this retried store is a conflicting update and declares
a conflict.
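
To make the sequence above concrete, here is a toy model in Python. It
is only a sketch and not Coda's code: ToyServer, store() and demo() are
made-up names, a single integer stands in for the version vector, and
the "duplicate" branch is an assumption about how a retry with an
unchanged store id could be recognised.

```python
# Toy model only: none of these names come from Coda's source.
class ToyServer:
    def __init__(self):
        self.version = 0              # stand-in for the object's version vector
        self.seen_store_ids = set()   # store ids that were already committed

    def store(self, store_id, client_version):
        """Commit the store only if the client's cached version is current."""
        if store_id in self.seen_store_ids:
            return "duplicate"        # assumed: same store id is seen as a retry
        if client_version != self.version:
            return "conflict"         # stale version looks like a rival update
        self.version += 1
        self.seen_store_ids.add(store_id)
        return "ok"


def demo():
    server = ToyServer()
    cached_version = server.version

    # The first attempt commits on the server, but we pretend the ACK is
    # lost, so the client never refreshes cached_version.
    first = server.store(store_id=1, client_version=cached_version)

    # The logged store is replayed with a new store id and the old version.
    retry = server.store(store_id=2, client_version=cached_version)
    return first, retry
```

In this model demo() commits the first store and reports a conflict for
the replayed one, although only one client was ever involved.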

> It seems to happen if a write coincides with venus making a checkpoint.
> 
> Is there any way I can turn off automatic checkpointing? In that case, 
> I'd like to do that and run some stress testing operations to see if
> turning it off helps.

Hmm, checkpointing shouldn't interfere. I thought the same thread that
does the reintegration is also responsible for the checkpointing (the
'voldaemon' thread). When reintegration returns an error we always
automatically create a checkpoint file. Originally, local/global repair
involved replaying the operations in the checkpoint file one by one
instead of using an ioctl to ask for the exact CML records involved in
the conflict.

Because of this automatic checkpointing when reintegration fails, it
might look like the checkpointing caused reintegration to fail, when in
fact we only checkpointed as a result of the failure.

In any case there is a 'VolCheckPointInterval' variable in
coda-src/venus/vol_daemon.cc, which controls the frequency of the
checkpointing. You could either set that to a large value, or comment
out the check later on in the file that triggers the periodic
checkpointing.
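
As a rough picture of why a large value effectively disables the
periodic checkpoint, here is a sketch in Python. Only the name
VolCheckPointInterval comes from the source file; the helper
checkpoint_due(), the use of seconds, and the 10 minute value are
assumptions made for the example, not what vol_daemon.cc actually does.

```python
# Hypothetical sketch; the value and the units are assumed, not Coda's.
VOL_CHECKPOINT_INTERVAL = 10 * 60   # seconds


def checkpoint_due(now, last_checkpoint, interval=VOL_CHECKPOINT_INTERVAL):
    """Return True once at least `interval` seconds have passed."""
    return now - last_checkpoint >= interval


# With a huge interval the periodic check never fires in practice.
NEVER = 10 ** 9
fires_normally = checkpoint_due(now=600, last_checkpoint=0)
fires_when_huge = checkpoint_due(now=600, last_checkpoint=0, interval=NEVER)
```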

> Then, if turning it off helps, should there be any kind of mutex in the 
> kernel module to serialize these operations ?

There is a mutex; it is tested in vdb::CheckPoint (vol_daemon.cc). If
the CML_lock is taken we print 'volume foo CML is busy, skip
checkpoint!' to the error log (/usr/coda/etc/console?).
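
The "test the lock and skip if busy" behaviour described above can be
sketched as follows. This is Python rather than the C++ in venus, and
try_checkpoint(), the threading.Lock stand-in for the CML lock, and the
list used as an error log are all assumptions made for illustration.
Only the log message text is taken from the description above.

```python
import threading


def try_checkpoint(volume_name, cml_lock, log):
    """Checkpoint only if the CML lock is free; otherwise log and skip."""
    if not cml_lock.acquire(blocking=False):
        log.append(f"volume {volume_name} CML is busy, skip checkpoint!")
        return False
    try:
        # A real implementation would write the checkpoint file here.
        return True
    finally:
        cml_lock.release()


log = []
lock = threading.Lock()               # stand-in for the CML lock
free_result = try_checkpoint("foo", lock, log)

lock.acquire()                        # pretend reintegration holds the CML
busy_result = try_checkpoint("foo", lock, log)
lock.release()
```

The point of the non-blocking acquire is that the checkpoint never
waits on the lock: it either runs right away or is skipped for this
round.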

Jan
Received on 2005-02-24 14:03:30