Coda File System

Re: Coda development

From: Jan Harkes <>
Date: Thu, 5 May 2016 00:22:09 -0400
On Wed, May 04, 2016 at 07:43:46PM -0400, Greg Troxel wrote:
>   The last big thing is to make conflict repair more robust, so that
>   normal people rarely lose.  It's quite good already, but the last
>   little bit is likely real work.

I've been thinking long and hard about this. I've pretty much been
trying to get repair working more reliably ever since I joined CMU.
We've had several undergraduate and master students do project work on
trying to make repair and application specific resolvers work reliably.

During the recent break from working on Coda, I've played around with
building applications using what you can call 'web-technology'. Simple
restful web apps, sharing state through sql/redis/nosql databases, even
handling things like push updates to client 'programs' consisting of the
javascript running in browsers. And there are a lot of ways to scale
with load balancing multiple webapps behind nginx or haproxy, sharding
data across databases etc.

Anyhow, it allowed me to take a step back and look at Coda from a
different perspective and the one thing that adds a lot of complexity
and ultimately causes every single server-server conflict is
replication. Now optimistic replication is totally awesome and Coda has
proven that it can be made to work well enough to have a usable system
for users that are not afraid to repair the occasional conflict.

But my wordprocessor doesn't want to deal with that conflict, neither
does my webserver, or many times me when I'm working on a deadline. So
lets look at the pros and cons of (optimistic) replication for a bit.

- Awesome concept to prove it can work, you can write papers about it.
- When every server but one crashes you can still continue working.
- When network routing messes up and different clients see a different
  subset you can continue on both sides of the split.
- When something gets corrupted in a client's volume replica data
  structure you can remove the replica and rebuild by walking the tree.

- Somewhat tricky fallbacks when operations cross directories (rename),
  or to handle how to deal with a resolution log when a directory is
  removed, especially when it contains a mix of source and target of
  rename operations. These are still the most common reasons for manual
  repair failures, it is a real pain when you need to repair
     mkdir foo ; touch foo/bar ; mv foo/bar . ; rmdir foo
- Extra metadata to track what we know of other replicas, version
  vectors, store identifiers.
- Extra protocol messages (COP2) to distribute said knowledge.
- Special set of heuristics for most common replica differences, missed
  COP2 by looking for identical storeids, missed update on one or more
  replicas by looking at version vectors, runt resolution to rebuild a
  replica, etc.
- Keep track of reintegration (operation) logs so we can merge directory
  operations from diverged replicas.
- A protocol that goes through 6 phases for each conflicting object to
  handle server resolution. This actually makes placing servers in
  different locations not work very well.
- As a result, the need to have all servers basically co-located, so we
  still can't handle datacenter outages or multiple site replication.
- Manual repair fallback when the heuristics and logs fail, which
  requires very different steps for directories (fixing log operations)
  and files (overwrite with new/preferred version) which isn't obvious
  to the user when repair starts.
- Need to have the concept of a high level 'replicated volume' and a low
  level 'volume replica' on the client.

I probably should stop here.

Now when there is no optimistic replication:
- we just need read-write and read-only volumes on clients which
  probably could even be represented by a single 'readonly' flag on a
  single volume data structure. Turning a readonly backup snapshot back
  in to a read-write volume may become just toggling that flag.
- If a server crashes or otherwise becomes unreachable we still have
  disconnected operation on a (hoarded) cache of files. 
- We only have to deal with reintegration conflict, however unlike
  server-server conflict they only block the local user who performed
  the operations, and the server side implementation is pretty much (or
  should be close to) the normal reintegation path. There is a also very
  cheap fix if the user doesn't care about the local changes in the form
  of a 'cfs purgeml'. A headless server automated conflict resolution
  could be a cfs checkpointml/cfs purgeml/send reintegration log with
  the conflict to the appropriate user.
- Reduce the size of a version vector to 2 values, a single data version
  and the store identifier, on the wire it can be even more compact
  because the client identifier part of the storeid does not change for
  the lifetime of a client-server connection so it could be passed on
  connection setup. This in return makes operations like ValidateAttrs
  more efficient because we can pack a lot more results in the response.
- The server can lose functionality related to directory operation
  logging, resolution, (server-server) repair.
- On the server side things like RAID can help deal with drive failures,
  and everyone is making backups, right?
- More exciting, back the server data with something like Ceph's RADOS
  object store which gives replication, self-healing and a lot of
  goodies, and have Coda provide disconnected operation/hoarding/etc.
- As servers become simpler and store data in a replicated backend store
  (rados/s3) and mostly deal with handling applying reintegration logs
  and breaking callbacks they become much easier to scale out, clients
  can be scattered across multiple frontends, grabbing a volume
  lock/lease and callback breaks can be distributed between frontends
  with publish/subscribe queues.

I probably should stop here, because it is getting quite a bit pie in
the sky and with the available manpower might just hit that magic 10
year window that Rune was talking about.

Received on 2016-05-05 00:22:25