Coda File System

Re: crash in rvmlib_free (not necessarily) during repair

From: <u-codalist-rcma_at_aetey.se>
Date: Tue, 9 Jul 2013 19:15:42 +0000

Hello Piotr,

On Tue, Jul 09, 2013 at 06:52:43PM +0200, Piotr Isajew wrote:
> As of now it seems possible to crash both servers in a
> repeatable way, even when copying a small number of files to a single
> directory (both servers crashed on the same rvm assertion when
> copying 20 pdf files, 60M total).

Hmm. This looks much worse than my experience here. Our servers and
clients are not compiled from the official upstream git, but our patches
mostly touch the authentication part, not the file service.

We do, though, use more server threads than upstream. This might
make some difference.

> The most stable behaviour can be achieved by turning off non-SCM,
> performing copy operation, waiting for venus to reintegrate
> everything to SCM and then bringing up non-SCM and propagating
> changes to it. This, however, for larger sets of data gives "No
> space left in volume log." error on the SCM, and it crashes on

How large is your log? I was using 16384 for several years, but since
I once also got "No space left in volume log" I am now using 65536.
(Apparently the amounts of data have grown over time :)

Of course, a server being down during a massive update operation will
eventually lead to log overflow anyway - but your test operation is
not huge at all.

> another assertion. Turning on non-SCM in such a situation leads to
> repeatable suicide at its start, and the whole situation starts
> to look like a dog trying to catch his own tail.

You wrote
"Both servers communicate over WiFi, so it's possible that they will
lose connectivity for a while."

This may contribute to triggering odd bugs: the Coda service was
developed on the assumption that servers have reliable contact
with each other. It is the clients who are supposed/allowed to have
intermittent connections. A server going down is supported, but not
as a regular situation (and this is not the same as servers being
partitioned from each other).

I am not aware of any "fundamental" reasons preventing the servers from
working properly if they lose contact with each other. Nevertheless,
regrettably or naturally, Coda does not [try to] cover every possible
situation, and servers with an unreliable network are an unsupported
configuration. The relevant code paths are at best probably not fully
tested and at worst non-existent (which, among other things, leads
to asserts).

I have successfully run Coda with geographically spread servers, but
the network between them was reliable, and such a setup is not something
inherently supported by the Coda design.

Jan will hopefully correct me if I am wrong on the above.

Piotr, would it be feasible for you to run a test with the servers on
a wired connection?

Regards,
Rune
Received on 2013-07-09 15:25:11