Coda File System

Re: write failure issues

From: <shivers_at_cc.gatech.edu>
Date: Fri, 14 May 2004 19:08:13 -0400
> Ahh, cable modem, asymmetric network... I don't have DSL or cable
> myself and Coda used to only work reliably on networks that had
> identical up and download speeds.
> 
> What happens is that during the fetches we see an amazingly fast
> network, but we time out and get disconnected as soon as we try to write
> even a little bit of data because the acks are taking far too long. RPC2
> 'thinks' we have a 3MB/s symmetric network, so when sending several KB and
> not seeing the ack within a couple of milliseconds it believes the
> packet got lost and retransmits. This only makes the congestion on the
> uplink even worse. Once we hit about 5 retransmissions and haven't yet
> seen the ACK message, the client gives up and disconnects from the
> server.
> 
> > When this bottleneck causes enough reintegration data to build up, blammo.
> > The lossage is as I described in my last message: cfs lv shows the system
> > in some kind of disconnected state, and cfs wr won't make it reconnect.
> > 
> > So the message seems to be that if I don't press the system hard, it works.
> > Under pressure, it falls over. For me, that's progress. Now I want to
> > understand the current hosage. Can anyone help?
> 
> Well, one thing is that your connection really is 'weak' in Coda's
> terms. The uplink speed is probably in the order of 64 or 128Kb/s, so it
> prefers to work write-disconnected. You can tell it not to adapt to
> the estimated network bandwidth by using 'cfs strong'. This should prevent
> the (connected -> write-disconnected) transition. However you can still
> become write-disconnected because of the (connected -> disconnected ->
> write-disconnected) transition, in other words if RPC2 misses the bat
> and times out you end up logging the change and won't automatically
> return to connected state when we notice that the server hasn't really
> gone.
> 
> One reason your client isn't reintegrating may be that the pending
> changes haven't 'aged' long enough. Statistically, any file that hasn't
> been removed within 5 or 10 minutes after creation is likely going
> to be around for several months. So a lot of bandwidth is saved by
> delaying reintegration long enough so that short lived (temporary) files
> can be optimized away locally.
> 
> The other reason could be that the estimated bandwidth is so incredibly
> low, that the client thinks it can't even reintegrate a single record
> without blocking the user for a significant amount of time. I believe
> the formula was something like, size of reintegration / bandwidth has to
> be less than 15 seconds. The low bandwidth estimate would be caused by
> RPC2's own insistence on retransmitting 'lost' packets: if every packet
> is sent 4 or 5 times, the copies all eat up the available link bandwidth.
> 128Kb/s would end up looking more like 32Kb/s (4KB/s) which really is a
> trickle.
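
A rough sketch of the arithmetic in the quoted paragraph, in Python. This is
not Venus or RPC2 code: the function names are made up for illustration, the
15-second limit is only the figure recalled above ("I believe the formula was
something like"), and Kb is assumed to mean 1000 bits.

```python
# Back-of-the-envelope model of the quoted numbers. Hypothetical names,
# not taken from the Coda sources.

def effective_kbit_per_s(link_kbit_per_s, sends_per_packet):
    """Useful bandwidth left when every packet is sent several times."""
    return link_kbit_per_s / sends_per_packet

def reintegration_seconds(size_bytes, kbit_per_s):
    """Estimated time to ship size_bytes over a link of kbit_per_s."""
    bytes_per_s = kbit_per_s * 1000 / 8   # assumes 1 Kb = 1000 bits
    return size_bytes / bytes_per_s

def may_reintegrate(size_bytes, kbit_per_s, limit_s=15.0):
    """The 'size / bandwidth < 15 seconds' rule as recalled in the quote."""
    return reintegration_seconds(size_bytes, kbit_per_s) < limit_s

bw = effective_kbit_per_s(128, 4)   # 128 Kb/s uplink, each packet sent 4 times
print(bw, may_reintegrate(50_000, bw), may_reintegrate(100_000, bw))
```

Under these assumptions a 128Kb/s uplink behaves like 32Kb/s (4KB/s), so a
50KB reintegration would take about 12.5 seconds and pass the check, while
100KB would take about 25 seconds and be held back.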

Waittaminit.
  - I have a real live, continuous, non-telephone-modem net connection.
  - The server's up
So how come I am unable to convince the client that it is *not* disconnected? 
That seems like a bug to me.

Now that I'm disconnected, and unable to get the system to reconnect, it seems
like I'm hosed. I am trying to write a mess of data into /coda. But if the
client is disconnected, then the writes just pile up in the venus cache until
it fills up, then boom, trouble. And this is for a stationary box in
continuous connection to the net, no less.

I tried "cfs fr" as per your suggestion, but it just bombs out with a
mysterious error message that tells me nothing:

    $ cfs fr /coda/myserver/shivers
    VIOC_SYNCCACHE: Invalid argument
      VIOC_SYNCCACHE returns -1
    $

The server is fine, btw -- other clients on other systems are connected
& happy.

What is the distinction between "write-disconnected" and "disconnected"?
    -Olin
Received on 2004-05-17 15:25:20