Coda File System

Re: What is the "normal" speed to copy to /coda Server

From: Jan Harkes <jaharkes_at_cs.cmu.edu>
Date: Wed, 7 Feb 2007 16:16:58 -0500
On Wed, Jan 31, 2007 at 10:07:59AM +0100, Reiner Dassing wrote:
> I have setup two coda servers, replicated and a client.
> Venus can connect, a token is there and I see my volumes.
> 
> When I am testing the connection from the client to the servers by
> cp -rv . /coda/tarzan1.net/usr/iersdc
> this copy is very very very slow:
> small files, some kilobytes, are taking 10 seconds and more.
> There is a 100 MBit/s net between the client and the servers
> and tests via scp are performing as expected.
> 
> What is the "normal" speed to expect for cp to /coda?
> Where to look for bottleneck?

I would expect the actual data transfers to run at something like 3MB/s. But
file and directory creation will be pretty slow compared to a local file
system: all updates are synchronous, and the client waits for the servers to
sync each update to disk before continuing.
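
As a rough illustration (the /coda path below is just this thread's example
volume, and the timing command is only a sketch), creating many small files
locally is nearly free, while each create against a strongly connected /coda
volume costs at least one server round trip:

```shell
# Rough timing sketch; substitute your own realm/volume for the example path.
# Local creates complete in microseconds:
mkdir -p /tmp/coda-cptest && cd /tmp/coda-cptest
rm -f file*
for i in $(seq 1 100); do echo data > "file$i"; done
COUNT=$(ls file* | wc -l)
echo "created $COUNT local files"

# Against a strongly connected Coda volume, each create waits for the servers
# to commit, so the same loop can take orders of magnitude longer:
#   time sh -c 'for i in $(seq 1 100); do
#       echo data > /coda/tarzan1.net/usr/iersdc/file$i; done'
```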

However, if your client is operating in write-disconnected mode, it should
be much faster for small copies (where "small" means smaller than the
available client cache size) and only slightly slower when a large amount of
data is copied.

In write-disconnected operation the client keeps a local log of pending
updates, which is written back to the servers periodically in the background;
we call this reintegration. Each reintegration combines and commits up to 100
operations at a time, which is considerably faster than sending the
operations individually.
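
A back-of-envelope sketch of why the batching helps; the 20 ms round-trip
time below is an assumed figure, purely for illustration:

```shell
# 1000 small updates: one server round trip each when synchronous, versus
# one round trip per 100-operation reintegration batch (rtt_ms is assumed).
SYNC_MS=$(awk 'BEGIN { print 1000 * 20 }')
REINT_MS=$(awk 'BEGIN { print (1000 / 100) * 20 }')
echo "synchronous:  ${SYNC_MS} ms"
echo "reintegrated: ${REINT_MS} ms"
```

This ignores per-batch overhead and disk sync time on the servers, but it
shows why 100-operation batches win so decisively on small-file workloads.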

But reintegration attempts only happen once every 5 seconds or so. During
that time we can build up a nice little backlog and optimize away useless
operations (e.g. a compilation may create temporary files that are removed
immediately and so never have to be sent to the server). If the cache fills
up too quickly, the application is slowed down, or even blocked, until
reintegration has a chance to catch up.
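
A toy sketch of that optimization (the textual op log here is made up; the
real client keeps a binary modification log): every operation on a file that
is both created and removed before reintegration cancels out and never
reaches the servers.

```shell
# Hypothetical textual op log; a create/remove pair on the same file lets
# all operations on that file be dropped before reintegration.
OUT=$(printf 'create tmp1\nstore tmp1\nremove tmp1\nstore keep\n' |
awk '{ ops[$2] = ops[$2] " " $1 }
     END {
       for (f in ops)
         if (!(index(ops[f], "create") && index(ops[f], "remove")))
           print f ":" ops[f]
     }')
echo "$OUT"
```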

Most of the time, when you see unusual slowdowns it is because we need to
break callbacks to clients that have disappeared. Sometimes it is simply a
mobile client that lost network connectivity, or a client that was restarted.
Sometimes it is caused by a masquerading firewall that times out its internal
state too soon, so that when the client reprobes it is assigned a different
outbound port on the firewall.

One thing I noticed on our servers is that the new crypto code was actually
draining the entropy pool of /dev/random pretty quickly during backups.
Every time a Coda/RPC2 application starts, it reads about 48 bytes to seed
its internal random number generator. In our case, backing up a few hundred
volumes forks off a couple of hundred volutil commands to check the last
backup time, clone the volume, and dump the data over a TCP connection to
the Amanda server. Once the pool is drained, processes start to block and
some backups fail.
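
On Linux you can watch the pool directly (the /proc path is Linux-specific;
the 48 bytes per process start is the figure from above):

```shell
# How much entropy /dev/random currently has available, in bits.
ENTROPY=$(cat /proc/sys/kernel/random/entropy_avail 2>/dev/null || echo 0)
echo "entropy_avail: ${ENTROPY} bits"

# Each RPC2 process start consumes ~48 bytes = 384 bits, so a burst of a few
# hundred short-lived volutil/clog processes needs far more than the pool
# typically holds.
BURST_BITS=$(awk 'BEGIN { print 300 * 48 * 8 }')
echo "300 process starts need ~${BURST_BITS} bits"
```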

The same thing happened when I tried to use pam_coda authentication for some
pages on the web server, which forked off a new clog process for every page
hit. But I wouldn't expect you to hit such situations when running just
venus and codasrv, since those aren't restarted all the time.

I'm not sure why your connections seem so slow. Maybe there is a network
issue that isn't triggered by TCP connections. For one, we don't really take
the link MTU into account; we assume that IP packets that are too big will
be fragmented and reassembled by the underlying network. Also, the data
transfer protocol (SFTP, Coda's own bulk transfer protocol, unrelated to
SSH's sftp) doesn't scale its window down to a single packet the way TCP
can. We assume that once we get an ack we can send at least 8 or 9 packets
(1KB each); a router may end up consistently dropping the last couple of
packets in such a burst, resulting in a timeout and retransmission of the
missing packets.
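
Two quick checks along those lines (the hostname is a placeholder, and the
-M do flag is Linux ping's don't-fragment option):

```shell
# 1. Does a full 1500-byte IP packet cross the path without fragmentation?
#    (commented out; needs a reachable server to actually run)
#   ping -c 3 -M do -s 1472 your-coda-server   # 1472 data + 28 header = 1500
#
# 2. Worst-case SFTP burst after one ack: up to 9 packets of 1KB.
BURST=$(awk 'BEGIN { print 9 * 1024 }')
echo "worst-case SFTP burst: ${BURST} bytes"
```

If the ping fails with "message too long", something on the path has a
smaller MTU and UDP fragments may be getting lost.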

If you run 'vutil -swap ; vutil -stats', venus will rotate the venus.log
file (-swap) and dump a lot of statistics (-stats) into the new log. The
file is at either /var/log/coda/venus.log or /usr/coda/etc/venus.log.

At the end of that file you can find the RPC2 and SFTP statistics:

RPC Packets:
RPC2:
  Sent:           Total             Retrys  Busies  Naks
    Uni:     211716 : 28257968        1739      15     0
    Multi:        0 : 0                  0       0     0
  Received:       Total              Replys         Reqs        Busies  Bogus  Naks
    Uni:     211037 : 21419816   152008 : 3   56214 : 1592    1111 : 0    112     0
    Multi:        0 : 0               0 : 0       0 : 0           0 : 0      0     0
SFTP:
  Sent:           Total           Starts       Datas     Acks  Naks  Busies
    Uni:      71120 : 61930461        0    57685 : 5    13435     0       0
    Multi:        0 : 0               0        0 : 0        0     0       0
  Received:       Total           Starts       Datas     Acks  Naks  Busies
    Uni:      82414 : 74268329     2764    71200 : 0     8450     0       0
    Multi:        0 : 0               0        0 : 0        0     0       0


These numbers are from my client, and looking at them now I wonder why we're
not getting any MultiRPC or MultiSFTP numbers. But some of these numbers may
give an indication of why your system seems slow.

For instance, my client sent 211K RPC2 requests and fewer than 1% needed to
be retransmitted. It sent a minimal number of BUSY packets, and bogus
packets are received when tokens expire and we were still trying to send
something to the server based on an expired key.
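
For example, the retransmission rate from the sample RPC2 "Sent" line above:

```shell
# Total sent = 211716 packets, Retrys = 1739 (numbers from the stats above).
RATE=$(awk 'BEGIN { printf "%.2f", 1739 / 211716 * 100 }')
echo "retransmit rate: ${RATE}%"
```

If your client shows a much higher ratio of Retrys to Total, packets are
being lost or delayed somewhere between client and servers.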

I don't see SFTP retransmissions counted anywhere, though, so this may not
really be enough information to analyze the problem.

Jan
Received on 2007-02-07 16:25:03