Coda File System

RE: venus crash

From: Florin Muresan <Florin.Muresan_at_atc.ro>
Date: Thu, 15 Mar 2007 17:14:08 +0200
And the logs ....

> -----Original Message-----
> From: Florin Muresan [mailto:Florin.Muresan_at_atc.ro]
> Sent: Thursday, March 15, 2007 5:03 PM
> To: Jan Harkes
> Cc: codalist_at_coda.cs.cmu.edu
> Subject: RE: venus crash
> 
> Hello!
> 
> Thank you Jan for your answer.
> 
> > -----Original Message-----
> > From: Jan Harkes [mailto:jaharkes_at_cs.cmu.edu]
> > Sent: Tuesday, March 13, 2007 4:57 PM
> > To: Florin Muresan
> > Subject: Re: venus crash
> >
> > On Tue, Mar 13, 2007 at 12:45:32PM +0200, Florin Muresan wrote:
> > > Hello everybody!
> >
> > I guess this was intended to be sent to codalist,
> 
> Yes, it was, but I missed the Reply All button. Sorry for that.
> 
> >
> > > I have a similar problem in my coda realm. Venus crashes when I try
> > > to copy (overwrite)/delete many files.
> > > I use coda in a production environment for a web hosting solution.
> > > During the testing period this situation never happened. My guess is
> > > that the problem occurs because of the high number of accessed files
> > > per second that triggers false conflicts.
> >
> > High rate of accesses isn't really something that would trigger false
> > conflicts. Also, false conflicts should not cause a client to crash.
> 
> I must say that you are right and I made a mistake saying that the Coda
> client crashes. In fact it just hangs. I have read some of the emails
> posted to codalist last year and I think I understand better how Coda
> works.
> 
> >
> > There are some problems that I have observed with our own web/ftp
> > servers, which get pretty much everything out of /coda.
> >
> > - The Coda client is a userspace process, so all file open and close
> > requests are forwarded to this single process where they are queued.
> > There is a set of about 20 worker threads that pick requests off the
> > queue and handle them. This is in a way similar to how a web-server
> > has one incoming socket where requests are queued, which are then
> > accepted by a web-server instance (thread or process).
> >
> > In-kernel filesystems have it a bit easier: when the webserver makes a
> > system call the kernel simply uses the application context, so there
> > is no queueing and it can handle as many requests in parallel as there
> > are server instances.
> >
> > Now we only really see open and close requests; all individual read and
> > write calls are handled by the kernel. So if the client has most or all
> > files cached the worker threads don't have to do much and are pretty
> > efficient. However, if some (or a lot) of files are updated on the
> > server, most of the locally cached data is invalidated and almost every
> > request ends up having to fetch data from the Coda servers. So each
> > request takes longer, and may hold some locks on a volume because it is
> > updating the local cache state. So in the worst case we only process a
> > single request at a time and the queue becomes very long, blocking all
> > web-server processes.
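
As an illustration of the queueing pattern described above, here is a
minimal C sketch (not venus code; the worker count, the request count and
the 100 ms cache-miss delay are all assumptions): one queue of open/close
requests drained by a small pool of worker threads. While each request is
cheap the queue stays short; once every request has to go to the server,
callers pile up behind the queue much like the blocked Apache processes.

    /* A minimal sketch (not venus code; names and numbers are made up)
     * of one upcall queue drained by a small pool of worker threads. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NWORKERS  20      /* roughly the size of the venus worker pool */
    #define NREQUESTS 200     /* pretend burst of open/close upcalls       */

    static int queue[NREQUESTS];
    static int head, tail;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

    static void handle_request(long id, int req)
    {
        usleep(100 * 1000);   /* assumed ~100 ms: a fetch from the server */
        printf("worker %ld finished request %d\n", id, req);
    }

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (head == tail)                 /* wait for queued work */
                pthread_cond_wait(&nonempty, &lock);
            int req = queue[head++];
            pthread_mutex_unlock(&lock);
            handle_request(id, req);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        for (long i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);

        /* The "kernel" side: every upcall lands on the single queue. */
        for (int i = 0; i < NREQUESTS; i++) {
            pthread_mutex_lock(&lock);
            queue[tail++] = i;
            pthread_cond_signal(&nonempty);
            pthread_mutex_unlock(&lock);
        }

        sleep(3);   /* 200 requests / 20 workers * 0.1 s ~= 1 second */
        return 0;
    }

Built with cc -pthread, the burst drains in about a second with 20
effective workers; serialize them down to one (as a volume lock would) and
the same burst takes around twenty seconds, which is what a stalled web
server sees.
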
> 
> I think this is exactly what happened when I tried to delete about 900
> files at one time, and the result was that the Apache webserver got
> blocked, slowing down the whole system.
> 
> My Coda setup consists of one Coda server and three Coda clients (for
> the moment). It is very important to have the same web documents on all
> three clients because I implemented a load-balancing solution and all
> clients must serve the same content to visitors. Coda solves this very
> elegantly.
> 
> Back to the point. Trying to solve the problem I terminated the Apache
> processes and then restarted venus. After the restart the whole /coda
> volume was in a disconnected state and I feared that I had lost all the
> data. I had to move quickly because all of my websites were down, and
> the only quick solution that I could think of was to purge and
> reinstall the coda-client package on the system where I had deleted the
> files. I thought this way I would avoid any conflicts that could appear.
> The curious thing is that the other two Coda clients were hanging after
> this problem occurred.
> 
> For the sake of readability I attached the relevant logs.
> 
> After I reinstalled the coda-client and restarted the Coda server,
> /coda became accessible on all the clients, but I had to wait for a
> while for it to reintegrate.
> 
> I believe now that it wasn't necessary to reinstall the coda-client,
> because all the conflicts would have been resolved automatically and a
> restart of the Coda server would have been enough.
> 
> >
> > - Another thing that can happen is that when one client is updating a
> > file, another client sees the update before the final commit. At this
> > point the versions are still skewed. A write first updates each replica
> > and then uses the success/failure status to synchronize the versions.
> > So if we see a changed object before the versions are synchronized, the
> > reading client believes there is a conflict and triggers server-server
> > resolution. As a result the servers lock down the volume, exchange their
> > copies of the file or directory, compare the differences and decide
> > which one would be considered the most up-to-date version.
> >
> > We detect the missed version synchronization because the contents are
> > identical; this is a 'weak-equality' type resolution and so the servers
> > reset the versions to be correct again. Then when the writing client
> > finalizes the operation, the versions end up getting bumped for a second
> > time, skewing them again, requiring the reading client to refresh its
> > cache and triggering another resolution. There is not a correctness
> > issue here, but the additional 2 resolution phases definitely slow down
> > everything because they add an additional 10 roundtrips and take an
> > exclusive lock on the volume, preventing all readers and writers even
> > for unrelated objects within the volume.
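
The skew described above can be pictured as a simple version comparison.
The following rough C sketch (the struct layout, field names and store id
are invented, not Coda's actual data structures) shows the distinction
that matters: when the counters differ but the last store is the same,
the replicas already hold identical data and only the version information
needs to be reset, the cheap 'weak-equality' case; a genuine mismatch
would require the full, volume-locking resolution.

    /* Rough sketch of the version check behind a 'false conflict'.
     * Invented layout; Coda's real version vectors are more involved. */
    #include <stdio.h>
    #include <string.h>

    #define NREPLICAS 2

    struct version {
        int  vv[NREPLICAS];   /* one update counter per server replica    */
        char storeid[16];     /* identifies the last store that succeeded */
    };

    enum cmp { EQUAL, DOMINATES, SUBMITS, INCONSISTENT };

    static enum cmp compare(const struct version *a, const struct version *b)
    {
        int greater = 0, smaller = 0;
        for (int i = 0; i < NREPLICAS; i++) {
            if (a->vv[i] > b->vv[i]) greater = 1;
            if (a->vv[i] < b->vv[i]) smaller = 1;
        }
        if (greater && smaller) return INCONSISTENT;
        if (greater)            return DOMINATES;
        if (smaller)            return SUBMITS;
        return EQUAL;
    }

    int main(void)
    {
        /* Both replicas hold the data from the same store, but only the
         * first one has had its counter bumped; the final commit of the
         * write has not reached the second replica yet. */
        struct version r0 = { { 2, 1 }, "store-42" };
        struct version r1 = { { 1, 1 }, "store-42" };

        if (compare(&r0, &r1) == EQUAL) {
            printf("replicas in sync, nothing to do\n");
        } else if (strcmp(r0.storeid, r1.storeid) == 0) {
            /* Unequal counters but the same last store: the cheap case,
             * the servers only resynchronize the version information. */
            printf("weak equality: reset versions, no data to move\n");
        } else {
            printf("replicas really differ: full resolution needed\n");
        }
        return 0;
    }

Run as-is it reports the weak-equality case; change one of the store ids
and the very same counters are reported as a real divergence that needs
the full resolution path.
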
> >
> > Neither of these would introduce crashes or conflicts though, mostly a
> > temporary performance degradation where all web servers are blocked
> > until the system catches up again with all the queued requests.
> 
> It's clear to me now why I get this performance degradation when trying
> to copy many files, but this is not at all desirable in a production
> environment.
> 
> >
> > > Do you think that if I install version 6.9.0 the problem with false
> > > conflicts will be avoided?
> >
> > Not sure if false conflicts are your problem. A crash is a bug; even
> > when there are conflicts we shouldn't crash. With 6.9 we basically end
> > up using an existing code-path that was normally only used during
> > disconnections or when connectivity was poor. That code has been around
> > for a long time, but really hadn't been tested all that well because it
> > was the fallback path. So I do actually expect it to be somewhat less
> > reliable. However, as it is the same code that is used by older clients
> > when things went wrong, it isn't really a step back. Any bugs that could
> > happen in unusual situations have become bugs that will happen.
> >
> > > What any other suggestions for this situation?
> >
> > What does your setup look like? Are there replicated volumes, or is
> > everything backed by a single Coda server (i.e. a single server would
> > never have resolution issues)? How many Coda clients are there? Are
> > updates being written by one client or by several clients? Which client
> > crashes, the one that is writing or another that is only reading?
> 
> I had described the setup above, but I must make some remarks. Typically,
> all the Coda clients will write from time to time. One of them is mainly
> used to write files, but only at specific times, and it was not doing
> any writing at the moment of the incident. In this case, the client that
> hung was one of the clients that rarely writes. At that time I was using
> it to delete the files.
> 
> >
> > What web server are you using? How many threads/processes does it use?
> > How many requests per second are we talking about?
> 
> On every client I run Apache 2.0.54 with mpm_prefork_module, currently
> set up for 200 MaxClients. The prefork module uses only one thread per
> process. The average request rate varies from 20 to 50 requests per
> client per second.
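
Plugging those numbers into a tiny back-of-the-envelope C sketch (the
per-open costs, the one-open-per-request assumption and the serialized
worst case are guesses, not measurements) shows why the setup normally
keeps up and why it falls behind as soon as a volume lock serializes the
workers:

    /* Back-of-the-envelope arithmetic with the numbers above. */
    #include <stdio.h>

    static void estimate(const char *label, double workers,
                         double secs_per_open, double opens_per_sec)
    {
        double capacity = workers / secs_per_open;  /* opens/s venus finishes */
        printf("%-12s capacity %7.0f opens/s, demand %3.0f opens/s -> %s\n",
               label, capacity, opens_per_sec,
               capacity >= opens_per_sec ? "keeps up"
                                         : "queue grows, Apache blocks");
    }

    int main(void)
    {
        double opens = 50;  /* ~50 requests/s per client, one open each */

        estimate("cache hit",   20, 0.001, opens);  /* warm cache, local    */
        estimate("cache miss",  20, 0.080, opens);  /* ~80 ms server fetch  */
        estimate("volume lock",  1, 0.080, opens);  /* workers serialized   */

        return 0;
    }
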
> 
> >
> > What is logged in /var/log/coda/venus.err when the client crashes?
> >
> > Jan
> 
> 
> Thank you for your time.
> Florin




Received on 2007-03-15 11:15:32