Coda File System

Re: crash in rvmlib_free (not necessarily) during repair

From: Piotr Isajew <pki_at_ex.com.pl>
Date: Tue, 9 Jul 2013 22:13:24 +0200
On Tue, Jul 09, 2013 at 07:15:42PM +0000, u-codalist-rcma_at_aetey.se wrote:

> Hello Piotr,
> 
> On Tue, Jul 09, 2013 at 06:52:43PM +0200, Piotr Isajew wrote:
> > For now it seems possible to crash both servers repeatably,
> > even when copying a small number of files to a single
> > directory (both servers crashed on the same rvm assertion when
> > copying 20 pdf files, 60M total).
> 
> Hmm. This looks much worse than my experience here. Our servers and
> clients are not compiled from the official upstream git but our patches
> touch mostly the authentication part, not the file service.
> 
> We use though more server threads compared to upstream. This might
> create a certain difference.
> 
> > The most stable behaviour can be achieved by turning off the non-SCM,
> > performing the copy operation, waiting for venus to reintegrate
> > everything to the SCM and then bringing up the non-SCM and propagating
> > changes to it. This, however, for larger sets of data gives "No
> > space left in volume log." error on the SCM, and it crashes on
> 
> How large a log do you have? I was using 16384 for several years but since
> I once also got "No space left in volume log" I am now using 65536.
> (Apparently the data amounts have grown over time :)

I run the default setting for now. I can increase it of course,
but I like your idea of checking a wired connection first.

> 
> Of course, a server being down at a massive update operation will
> eventually lead to log overflow anyway - but your test operation is
> not huge at all.
> 
> You wrote
> "Both servers communicate over WiFi, so it's possible that they will
> lose connectivity for a while."
> 
> This may contribute to triggering odd bugs - Coda service was
> developed with an assumption of servers having reliable contact
> with each other. It is the clients who are supposed/allowed to have
> intermittent connections. A server going down is supported but not
> as a regular situation (and this is not the same as servers being
> partitioned from each other).
>
> I am not aware of any "fundamental" reasons preventing the servers from
> working properly if they lose contact with each other. Nevertheless:
> regretfully or naturally Coda does not [try to] cover every possible
> situation and servers with unreliable network are an unsupported
> configuration. The relevant code paths are in the best case probably
> not fully tested and in the worst case non-existent (among others -
> leading to asserts)

True. I don't want to go against the current. Having replicated
volumes would be great, but the main reason for my interest in Coda
is the client-side cache, so if all else fails, I'll just
go with a one-server system, and use a machine with RAID for it.


> 
> I have successfully run Coda with geographically spread servers - but
> the network between them was reliable and such a setup is not something
> inherently supported by Coda design.

I suspect that this kind of setup is extremely sensitive to
disconnections during writes, but can easily handle a scenario
where there are disconnections during reads, so depending on the
usage pattern it may work flawlessly or lead to the problems I have
now.


> 
> Jan will hopefully correct me if I am wrong on the above.
> 
> Would it be feasible for you Piotr to make a test with the servers on
> a wired connection?

I can do this. It's just that putting both servers on the same
Ethernet eliminates some redundancy in the setup.
Received on 2013-07-09 16:13:42