Questions whose answers require sophisticated processing of huge data sets come up increasingly often in our networked, interlinked, and (increasingly) DNA-sequenced world. Attacking such problems with traditional techniques, such as loading data into memory for processing or querying a relational database, is cumbersome and inefficient. Data sizes are growing inexorably, while disk-based data structures and the applications relying on them, optimized for sequential retrievals and relational joins, often prove inadequate for running complex algorithms. Therefore, sophisticated processing of huge, complex data sets requires us to rethink the relationship between disk-based storage and main-memory processing.
Some features of modern systems, namely 64-bit architectures, memory-mapped sparse files, virtual memory, and copy-on-write support, allow us to process our data with readable and efficient RAM-based algorithms, using slow disks and filesystems only for their large capacity and to secure the data's persistence. I demonstrate this approach through a series of C++ programs that run on Wikipedia's data, looking for matching words and links between unrelated entries. Through these programs I will show how we can use STL containers, iterators, and algorithms to access disk-based data without performing any system calls. Although RAM-based processing reopens many problems that database systems have already solved, I will argue that it is the right move, because it provides us with a unified programming and performance model for all our data operations, irrespective of where the data resides.
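To make the idea concrete before diving into the Wikipedia programs, here is a minimal sketch of the underlying technique, assuming a POSIX system and a hypothetical plain-text file named titles.txt containing one entry title per line. Once mmap(2) establishes the mapping, the STL algorithms walk the file's bytes through ordinary pointers acting as random-access iterators; the virtual memory system pages data in on demand, and no further system calls are issued for the accesses themselves.

```cpp
// Sketch: treat a memory-mapped file as an STL range.
// Assumes a POSIX system and a non-empty file "titles.txt" (hypothetical name).
#include <algorithm>
#include <iostream>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char *path = "titles.txt";            // hypothetical input file
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        perror(path);
        return 1;
    }

    struct stat st;
    if (fstat(fd, &st) == -1) {
        perror("fstat");
        return 1;
    }

    // Map the whole file read-only into our address space.
    void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    const char *begin = static_cast<const char *>(map);
    const char *end = begin + st.st_size;

    // The mapped bytes form an ordinary iterator range: count the entries...
    std::cout << std::count(begin, end, '\n') << " titles\n";

    // ...and search for a word; pages are faulted in as the scan touches them.
    const std::string word = "Wikipedia";
    if (std::search(begin, end, word.begin(), word.end()) != end)
        std::cout << "found \"" << word << "\"\n";

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```

The point of the sketch is that, after the single mmap call, the same std::count and std::search we would apply to an in-memory container operate directly on the disk-resident data, which is exactly the unified model the rest of this column builds on.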