Version 1 (modified by jaharkes, 10 years ago)

--

Breaking the Coda directory size limits

I have been thinking about this problem for a long time now and am leaning towards the following approach.

I am pretty sure we want to store the actual directory contents on disk in container files. That way fetching a directory uses the same code paths as fetching a file, and we can even push it as far as teaching the kernel to return readdir results directly from this container file. The read path, all the way from stable storage on the server to the kernel on the client, would then have no translations between formats and could handle whatever sizes we already support for files. Initially we would probably still keep the last venus.cache -> kernel container translation step.
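To make the "no translation" idea concrete, here is a minimal sketch of a directory stored as a flat sequence of records in a container file and read back readdir-style from an arbitrary offset. The record layout (vnode, unique, name length, name) and the names write_entries and read_entries are assumptions made for illustration only; this is not the existing Coda directory format.

```python
import io
import struct

# Hypothetical on-disk record: 4-byte vnode, 4-byte unique,
# 2-byte name length, followed by the name bytes.
_HEADER = struct.Struct("<IIH")


def write_entries(f, entries):
    """Append (name, vnode, unique) records to a container file object."""
    for name, vnode, unique in entries:
        raw = name.encode("utf-8")
        f.write(_HEADER.pack(vnode, unique, len(raw)))
        f.write(raw)


def read_entries(f, offset=0):
    """Yield (next_offset, name, vnode, unique) starting at offset.

    next_offset can be handed back later to resume the listing,
    much like a readdir cookie.
    """
    f.seek(offset)
    while True:
        header = f.read(_HEADER.size)
        if len(header) < _HEADER.size:
            return  # end of container, or a truncated trailing record
        vnode, unique, name_len = _HEADER.unpack(header)
        raw = f.read(name_len)
        if len(raw) < name_len:
            return  # truncated name, stop rather than return garbage
        offset += _HEADER.size + name_len
        yield offset, raw.decode("utf-8"), vnode, unique
```

Because the reader only needs a byte offset to resume, the same bytes could in principle be served by the server, cached by venus and scanned by the kernel without being repacked, and the directory size is bounded only by the file size.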

In addition to this we may want more efficient lookup operations than just walking the directory. The current directory format embeds a hash-based lookup structure. Just storing 128 pointers to directory pages would only require 512 bytes; the remaining 1.5KB of the top-level directory page is used up by free block bitmaps and various hash table structures (I haven't looked at this closely recently and my memory is a bit hazy on this part, but there definitely is additional overhead to administer the hash table). Such hash tables could instead be implemented as a cache, where we only keep in-memory hash tables for directories in which we recently did a lookup. Possibly it would be more like a path lookup cache, where we just store hit and miss information for previously searched paths.
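A minimal sketch of the "hash table as a cache" variant: an in-memory table is built only for directories that were recently searched, and the least recently used table is dropped when too many are held. The class name DirLookupCache, the load_entries callback and the max_dirs limit are hypothetical and only illustrate the idea.

```python
from collections import OrderedDict


class DirLookupCache:
    """LRU cache of per-directory name tables, built lazily on first lookup."""

    def __init__(self, load_entries, max_dirs=64):
        # load_entries is a hypothetical callback: dir_id -> iterable of
        # (name, fid) pairs obtained by walking the directory once.
        self._load = load_entries
        self._max = max_dirs
        self._tables = OrderedDict()  # dir_id -> {name: fid}

    def lookup(self, dir_id, name):
        if dir_id in self._tables:
            self._tables.move_to_end(dir_id)  # mark as recently used
            table = self._tables[dir_id]
        else:
            # First recent lookup in this directory: walk it and index it.
            table = dict(self._load(dir_id))
            self._tables[dir_id] = table
            if len(self._tables) > self._max:
                self._tables.popitem(last=False)  # evict least recently used
        # None is a cached miss: the name is known not to exist.
        return table.get(name)

    def invalidate(self, dir_id):
        """Drop the table after the directory has been modified."""
        self._tables.pop(dir_id, None)
```

Since the table is only a cache, nothing about it has to be stored in the directory itself or in RVM; it can always be rebuilt by walking the container file.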

The final part is to get the appropriate persistence and recoverability for directory updates. I am pretty sure that RVM can actually handle multiple data files and is able to map them at different places in the address space. I don't know how many files we can map at a given time, but when we make changes to a directory we can map it as a new RVM segment and use that to update the directory, then unmap it after the log has been flushed and truncated. Again this could be done as an LRU-style cache, and if we have too many modified directories mapped we can force a flush/truncate. I wonder if an rvm_unmap implicitly forces a truncate, in which case we can just rvm_map and rvm_unmap directory containers.
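The LRU policy for mapped directories could look roughly like the sketch below. The three callbacks are hypothetical stand-ins for whatever RVM calls end up mapping a container, unmapping it, and flushing and truncating the log; no real RVM signatures are implied. The sketch also takes the conservative assumption that unmapping does not force a truncate on its own, which is exactly the open question above.

```python
from collections import OrderedDict


class MappedDirCache:
    """Keep at most max_mapped directory containers mapped for update."""

    def __init__(self, map_fn, unmap_fn, flush_truncate_fn, max_mapped=8):
        self._map = map_fn                        # dir_id -> segment handle
        self._unmap = unmap_fn                    # (dir_id, segment) -> None
        self._flush_truncate = flush_truncate_fn  # () -> None
        self._max = max_mapped
        self._mapped = OrderedDict()              # dir_id -> segment handle

    def get_for_update(self, dir_id):
        """Return the mapped segment for dir_id, mapping it if necessary."""
        if dir_id in self._mapped:
            self._mapped.move_to_end(dir_id)  # mark as recently used
            return self._mapped[dir_id]
        if len(self._mapped) >= self._max:
            self._evict_oldest()
        seg = self._map(dir_id)
        self._mapped[dir_id] = seg
        return seg

    def _evict_oldest(self):
        # Logged changes have to reach the container file before the
        # segment goes away, so flush and truncate first, then unmap.
        self._flush_truncate()
        old_id, old_seg = self._mapped.popitem(last=False)
        self._unmap(old_id, old_seg)
```

If unmapping turns out to imply a truncate, the explicit flush/truncate step in the eviction path can simply be dropped.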

Moving the bulk of the directory data out of RVM removes 50% of the current RVM usage, so it definitely improves server scalability at the same time.