Coda File System

From: Eric McCoy <emccoy_at_haystacks.org> Date: Sat, 27 Nov 2004 09:58:07 -0500

Hello all.  I'm looking for some advice on whether Coda would be 
appropriate for my situation.  I have read the FAQ and docs, but it 
seems that most people are using Coda for a different application and 
all the docs have that in mind.  It may be nobody uses Coda the way I'm 
thinking because it's a stupid idea, so that's why I'm asking!

At present we have a small server farm which processes HTTP logs for a 
large number of websites (in the tens of thousands).  The way this works 
is we have seven processing servers which do nothing but parse logs and 
generate reports.  The reports, once created, are saved forever to a 
storage server (via NFS) which is directly attached to a big (~1.9TB) array.

There are two problems here.  First, if the storage server goes offline 
for whatever reason (like if NFS decides to flake out for a few 
seconds), all the processing servers hang and have to be power cycled. 
This is Very Bad.  And second, the processing servers all need to get 
the actual logs and store them on a local filesystem (because NFS is way 
too slow).  These logs can get very, very big, so if a bunch of requests 
arrive all at once - like when the monthly reports are automatically 
generated - they tend to run out of disk space and die.

I am thinking that Coda might be able to provide a solution to these 
problems - a better one than the one we are using now, anyway, which is 
"throw more hideously expensive servers at the problem and hope it goes 
away."

My rough-sketch thought process is that, naturally, the storage server 
would still provide all the disk space.  Each Coda client (the 
processing servers) would make a cache about the size of the partition 
it's using now for its temporary files (anywhere from 50-140GB).

The first problem would be solved, or at least mitigated, because the 
Coda clients could still do their thing if the storage server crapped 
out for a while.  Most of the time the processing servers run at 40% 
disk capacity so that should leave enough space for at least an hour or 
two of disconnected operation.  If the array fails or something, we're 
pretty screwed anyway, but at least we could process requests for the 
time it takes to reinitialize the thing.
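As a rough check on the "hour or two" estimate above, the arithmetic can be sketched as follows.  This is only a back-of-the-envelope model: it assumes the cache is the same size as the current temporary partition (50-140GB) and is 40% full when the storage server goes away.  The write rate is not given anywhere in this message, so the code instead derives the rate that the estimate implies; the function names are made up for illustration.

```python
# Back-of-the-envelope headroom estimate for disconnected operation.
# Assumptions (not measurements): the cache is as large as the current
# temp partition and is 40% full when the storage server becomes unreachable.

def free_cache_gb(partition_gb, used_fraction):
    """Cache space still available, in GB."""
    return partition_gb * (1.0 - used_fraction)

def hours_of_headroom(partition_gb, used_fraction, write_rate_gb_per_hour):
    """How long a client can keep writing before the cache is full."""
    return free_cache_gb(partition_gb, used_fraction) / write_rate_gb_per_hour

def implied_write_rate(partition_gb, used_fraction, hours):
    """Write rate (GB/hour) at which the free space lasts exactly `hours`."""
    return free_cache_gb(partition_gb, used_fraction) / hours

if __name__ == "__main__":
    for size_gb in (50, 140):          # smallest and largest partition sizes
        free = free_cache_gb(size_gb, 0.40)
        for hours in (1, 2):           # the "hour or two" from the estimate
            rate = implied_write_rate(size_gb, 0.40, hours)
            print(f"{size_gb}GB cache: {free:.0f}GB free lasts {hours}h "
                  f"at about {rate:.0f}GB/hour of new data")
```

In other words, the estimate holds for the smallest (50GB) machines only if each one produces no more than roughly 15-30GB of new data per hour while disconnected; the larger machines have proportionally more room.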

The second problem would be, again, solved or at least mitigated because 
the Coda clients would have "emergency backup" storage.  The temporary 
files would be written to /coda and go in the cache.  Since those files 
only have lifetimes of a few minutes and are never requested by other 
clients in the farm, they should never need to go over the network 
unless the cache fills up.  If the cache does fill up, which it will at 
least once a month, performance will degrade (significantly) but at 
least the requests will still be processed.  Once the backlog gets 
handled (takes about a day) the caches will clear out and everything 
will go back to normal.  If the storage array fills up to the point 
where we're running out of disk space again, we can add another (much 
smaller) one just for this temporary storage.  This way we could avoid 
dropping huge sums on a quad-CPU box, which is what we're doing now, 
just to add storage capacity.

Now after that lengthy story and rationale, my first question is 
obvious: Is my reasoning correct?  Is this something Coda can do, even 
though most people aren't using it quite like I want to?

Second question should also be pretty obvious and it's a FAQ: Is Coda 
reliable enough to be used for this?  I know the FAQ says "no" but our 
current solution is already terribly unreliable (we lose a ton of 
reports every month due to the disk space problems alone; I have no hard 
figures but I'd estimate up to an eighth of the monthly requests are 
lost, forcing our customers to request their reports manually a day or 
week later).  As long as Coda can more-or-less guarantee that the 
archived reports on the storage array won't get trashed, that would be 
good enough for us.  If the major reliability concern is having to 
restart Coda processes occasionally, we can do that.

Third question is related to the first two: If Coda is not, for whatever 
reason, appropriate for this, is there something similar which is?  I've 
looked at Lustre, and probably will continue looking at it, but it seems 
geared towards much larger clusters than ours.  It may also be much less 
of a "drop-in" replacement than Coda, which is by now a standard part of 
the free Unix-like OSes.

Thanks in advance for any tips or pointers, and to anyone who actually 
read this whole thing, I admire your stamina!
