Blog - Gianni Gambetti

A Journal Of History

An Adventure in Performance Tuning

Once upon a time I decided to start keeping a journal (aka diary). It started simply enough: a plain text flat file which could be edited with almost any software. At some point this became a nuisance, I had to do too much work by hand, and adding entries was annoying.

A Brief History

The journal has gone through approximately 5 fairly different revisions, each with their own advantages and disadvantages. It started life as a simple flat text file, eventually became a SQLite database, turned back into a simple text store, and finally into a wiki. Each format migration was catalyzed by a different performance problem I encountered while using it.

At one point, it was using SQLite as the data-store, and doing nothing at all fancy to communicate between the journal and SQLite. It turns out that by using the ‘sqlite’ command line interface I was seriously penalizing the performance of a few of the operations that I was frequently using. Rather than learn the SQLite C api and replace the bash script I had been using, I decided to write a customized data-store that would be perfectly tailored to my purpose. And so, chronos was born.

The API

The API that I would require for my journal proved to be very straight forward. There were only a few operations that were needed, but they had to be fast. The operations that would be supported are find <key>, iterate <print-format>, and append. The biggest bottleneck that I was running into while using the bash script was iterating over the entire journal to run data aggregation across all of the entries.

The Design

Chronos would be the simplest possible time stamped data-store that I could design. This included some fairly simple design constraints:

Chronos would have to work through the command line, accepting data from standard input, and writing to standard output.
Chronos would only store a time-stamp and date. It would not be a fancy Key-Value Store.
Chronos would be append only, and not allow for editing of any of it’s data.

After some thought on the implementation of chronos I decided that it would have to live in a directory, and have two separate files: one for the entry index, and another containing the data. This makes it much easier to traverse the entries, over say a linked list file format.

Building It And Profiling

The majority of the work with chronos was fairly straight forward: consisting largely of file io, with a small amount of data structure modification. Profiling the code was fairly easy, Valgrind has a wonderful tool called Callgrind that spits out run summaries to be viewed in KCachegrind.

Synchronization

Since file io takes a non-zero amount of time, added to the list of problems is cross process synchronization (also required for historical reasons). Introduce the linux flock(2) system call, it allows applications to have mutual exclusion locks on file system entries. The easiest way to do locking (initially) is to have exclusive locks for write operations (append), and use shared locks on all read operations (get, and iterate).

The optimization of flock(2) synchronization usage was to drop all shared locks, allowing read operations to access both the index and data store without synchronization. This calls into question the ability of chronos to ensure a correct read against any of it’s data. If we reason through the way in which chronos appends data, there isn’t much work to be done to ensure that it fails in a safe manner.

The strategy for data integrity is two fold: 1) appending data still uses an exclusive lock (1 write at a time), and 2) when appending data the index is the second file written to. Since the size of writes between the data store and index are expected to be very different (data store > > > index) writing to the index second should ensure that any failure only adds garbage data to the data store, but never produces an incorrect index entry.

File io

Since chronos is always doing io (as either: accepting input from standard input, or writing output to standard output), it requires a lot of io buffering. Most of the io buffering is very simple, and occurs in a read(2) -> write(2) while loop. In addition to operate safely as a pseudo-database, chronos uses the fsync(2) to ensure that data is propagated to physical storage before exiting.

There is no nice optimization for fsync(2), instead chronos builds a variant that disables fsync(2) usage in case it is unlikely to be an issue for the user (this is generally the case). This turns out to be one of the largest optimizations that chronos can perform, changing the amount of time taken by almost an order of magnitude (after all other optimizations are performed).

The second file io optimization that is performed is the usage of splice(2): this system call moves data between two file descriptors without it ever entering user space. This allows us to pass (almost) any amount of data through chronos without incurring context switch penalties because of buffering the data through our application.

Overhead

One of the things noticed while profiling chronos with Callgrind was the unreasonable amount of time that was spent initializing the standard C library. More than half of the program execution time was spent in routines that would have no effect on the programs execution. This is a difficult optimization problem, since rewriting the standard library was not something I was willing to do.

There are two optimizations that can be applied, the first is using a different implementation of the standard library (see performance and size comparisons), and the second is to amortize this cost over multiple append operations. Changing the standard library is fairly straight forward, usually modifying the last one or two build steps of an application (assuming you use someone elses rewrite).

Attempting to amortize this time (and conveniently the process forking overhead) we required that chronos be able to handle multiple append operations. To do this, chronos has a second mode of operation (invoked using the chronosd binary) which opens a fifo(7) for communication. A quick note about fifos: fifos pretend to be files, but instead connect a pair of open operations (read/write) which are likely from different applications. Since a fifo is a file descriptor that is compatible with splice(2), there is very little overhead in the io loop using the fifo (open, append, close). chronosd operation only requires that an application write their data in full to the fifo, and then close it, after the close chronos time-stamps the data, and finishes the append.

Conclusions

The best advice I could give to anyone attempting to optimize every last cycle out of their application run-time would be to start with optimization in mind, and attempt to align data structures and usage patterns to this goal. While the narrative presented here appears to be a reasonable progression of optimization attempts, in reality they were not as nicely separated, and often punctuated by experiments of fairly large rewrites.

Previously:

RSS