Coda File System

Re: Unresponsive repair operation lets CML grow

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Sun, 03 Apr 2011 02:06:42 -0400

On 04/02/2011 02:15 PM, Simon de Hartog wrote:
> we are quite worried about dropping data, but if the current situation
> only becomes worse, we choose to discard some data indeed.

There are periodic CML checkpoint files in /var/lib/coda/spool that 
typically can be used to recover files that were not yet reintegrated.

There is a user-specific directory which contains per-volume .cml and
.cpio files. The .cpio contains the file data, the .cml lists the
operations. I typically just grab a copy of the .cpio and extract it, so
that at least newly created files are recovered.
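
A minimal sketch of that recovery step, assuming a POSIX shell and a cpio
binary on the client. The function name and both arguments are
placeholders for illustration. The actual user directory and file names
under /var/lib/coda/spool have to be looked up on your system.

```shell
# Sketch: unpack a copy of a checkpoint archive into a scratch directory.
# recover_checkpoint is a hypothetical helper, not part of Coda.
recover_checkpoint() {
    cpio_file=$1   # a per-volume .cpio from the spool directory
    dest_dir=$2    # scratch directory, preferably outside /coda

    mkdir -p "$dest_dir" || return 1
    # Grab a copy first, so a later checkpoint cannot replace the
    # archive while it is being read.
    cp "$cpio_file" "$dest_dir/checkpoint.cpio" || return 1
    # -i extract, -d create leading directories, -m keep modification times
    (cd "$dest_dir" && cpio -idm < checkpoint.cpio)
}
```

It is worth listing the archive first with "cpio -it < file.cpio" to check
that the stored paths are relative, since some cpio versions will restore
absolute paths in place.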

Of course the checkpoint is only made when the volume doesn't have a 
write lock, so in your failure case it might not have been updated. 
Normally you can also force checkpointing with

   cfs ck /coda/path/to/vol

> voicemails. What happens is that a file is created to store the .wav
> info and then after it has been closed it is quickly renamed so that the
> filename contains the length of the voicemail in seconds.
>
> Could this be an issue that not always goes smoothly, i.e., creating a
> file and renaming it before it is created and reintegrated on the
> servers? I can't really imagine it would be a problem, but I'm not
> familiar with Coda sources.

Rename is very reliable when both the source and destination are the 
same directory and it is still pretty good when the destination 
directory is a child of the source directory. There are some situations 
where a rename to a sibling or parent directory can be problematic.

This is because resolution works per-directory. When both src and tgt
are in the same directory, everything can resolve in a single operation.
If they are in different directories and we fail to resolve the target
directory first, we use the resolution logs to find and try to resolve
the source directory, but there is a limit to how deeply we recurse when
resolution fails.

Then, when automatic resolution fails and we are left with a
server-server conflict, we have to manually repair the source directory
first. The default suggested repair option (to recreate missing files on
all replicas) won't work here, because the file is technically not
missing but located in a different directory.
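
As a rough illustration of the rename cases described above, an
application could check which case a rename falls into before doing it.
This is only a sketch: the function name, the category labels, and the
example paths are made up for illustration and are not Coda terminology.
It also treats any directory below the source directory as a "child",
which is a simplification, and it assumes both paths are absolute and
written in the same normalized form.

```shell
# Sketch: classify a rename by how the source and target directories relate.
# rename_case is a hypothetical helper; the labels are my own.
rename_case() {
    src_dir=$(dirname "$1")
    dst_dir=$(dirname "$2")
    case "$dst_dir" in
        "$src_dir")   echo "same-directory" ;;     # resolves in a single operation
        "$src_dir"/*) echo "into-child" ;;         # target is below the source directory
        *)            echo "sibling-or-parent" ;;  # the cases that can be problematic
    esac
}

# Hypothetical voicemail paths:
rename_case /coda/vm/inbox/tmp.wav /coda/vm/inbox/msg-42.wav
rename_case /coda/vm/inbox/a.wav /coda/vm/saved/a.wav
```

If the voicemail application renames the file within the directory where
it was created, it stays in the first case.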

Jan
Received on 2011-04-03 02:06:58