Coda File System

Re: codasrv gets stuck

From: Steve Simitzis <steve_at_saturn5.com>
Date: Sat, 13 Dec 2003 20:22:51 -0800
i should add that i just decreased serverprobe on all of my venus clients
down to 120 seconds. hopefully that will help things. i've only noticed
this problem within the last month since our site's traffic has recently
doubled at the peaks. (the curse of good PR.) 

something to beware of for others who are running coda at scale.
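
to see why the probe interval matters, here is a rough python sketch of the timing jan describes in the quoted message below. it is a toy model, not anything from coda or netfilter: the function name and its parameters are hypothetical, the 180 and 300 second values are jan's approximate figures, and it assumes one new source port per expired NAT mapping (a real probe may open several connections).

```python
def count_source_ports(probe_interval_s, nat_timeout_s, duration_s):
    """Count the distinct source ports a server would see from one NATed client.

    Assumption: the NAT drops a UDP mapping once it has been idle for
    nat_timeout_s seconds, and then allocates a fresh source port for the
    next outgoing packet.
    """
    ports_seen = 0
    last_packet = None
    t = 0
    while t <= duration_s:
        # The first probe, or any probe sent after the mapping has expired,
        # shows up at the server as coming from a new port.
        if last_packet is None or t - last_packet >= nat_timeout_s:
            ports_seen += 1
        last_packet = t
        t += probe_interval_s
    return ports_seen


one_day = 24 * 60 * 60
# Probing every 5 minutes through a NAT that forgets after 3 minutes.
slow = count_source_ports(300, 180, one_day)
# Probing every 2 minutes keeps the same mapping alive.
fast = count_source_ports(120, 180, one_day)
print(slow, fast)
```

under these assumptions the 300 second interval produces a new port on every probe, while the 120 second interval keeps reusing the first one, so the server has no stale endpoints to accumulate from that client.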

On 12/13/03, Steve Simitzis <steve_at_saturn5.com> wrote: 

> interesting. do you have any suggestions for what i might do to
> get around the problem? this seems to be happening to me with
> increasing regularity.
> 
> On 12/05/03, Jan Harkes <jaharkes_at_cs.cmu.edu> wrote: 
> 
> > On Tue, Dec 02, 2003 at 08:25:05AM -0800, Steve Simitzis wrote:
> > > the problem is that codasrv will freeze, apparently unbind all its
> > > connections, and refuse to do much of anything. the only way to get it
> > > running again is to kill -9 codasrv, and restart everything.
> > 
> > I've seen similar freezes on our testserver and attributed those to
> > clients that are connecting from behind a masquerading firewall without
> > lowering the server-probe timeout.
> > 
> > The problem is that the netfilter/iptables UDP connection tracking
> > forgets about forwarded ports within 3 minutes, but the normal server
> > probe is only about once every 5 minutes. So each probe sets up a bunch
> > of new connections from a new port when it revalidates the local cache.
> > 
> > The server isn't very smart yet, and tracks a client based on the
> > ip-address. So over time it builds up more and more RPC2 connection
> > endpoints, but because some of these connections have always recently
> > been used it never expires them. After a couple of days (weeks) it
> > spends so much time looking for a matching connection endpoint for each
> > incoming packet that the server seems to freeze. This disconnects any
> > clients with pending operations, and they reconnect, only making the
> > problem worse.
> > 
> > This is my current 'theory' about what is causing this. A server
> > restart clearly fixes it for a while because that way we get rid of all
> > those 'dead' endpoints. Another solution is to pull the network wire for
> > about 10 minutes :)
> > 
> > I'm not yet sure where to 'attack' this problem. For one, the server
> > should become a little smarter about tracking clients and which
> > connections belong to them/are still active. But maybe rpc2 has an
> > exponential growth problem in the lookup path where it is matching
> > incoming packets.
> > 
> > Jan
> 
> -- 
> 
> steve simitzis : /sim' - i - jees/
>           pala : saturn5 productions
>  www.steve.org : 415.282.9979
>   hath the daemon spawn no fire?
> 

-- 

steve simitzis : /sim' - i - jees/
          pala : saturn5 productions
 www.steve.org : 415.282.9979
  hath the daemon spawn no fire?
Received on 2003-12-13 23:29:00