My reading notes from the book: Martin Kleppmann. Designing Data-Intensive Applications. O’Reilly, 2017.
In the previous post, we introduced log-structured databases and discussed hash indexes as one kind of index for them.
In this post, we discuss Log-Structured Merge-Trees, or LSM-Trees, which have several advantages over hash-index-based databases.
The Log-Structured Merge-Tree was described as an indexing structure by Patrick O'Neil et al. [1] in 1996.
An LSM-Tree is at the heart of storage engines that are based on the principle of merging and compacting sorted files. Such storage engines are typically called LSM storage engines.
LSM-Trees maintain data in two or more separate data structures, each of which is optimized for handling specific operations and for its respective underlying storage medium.
SSTable
LSM-Trees use SSTables as segment files.
The term SSTable was introduced in Google’s Bigtable paper [2]. SSTable stands for Sorted Strings Table.
An SSTable is a file containing key-value string pairs sorted by key. This is in contrast to hash indexes, where the key-value pairs have no particular ordering. SSTables also require that each key appear only once within each merged segment file (the compaction process ensures this).
The sorted nature of SSTables means that writing new key-value pairs is not as simple as appending new entries in the segment and thus, seems to break our ability to use sequential writes. Enter, memtable.
Memtable
Maintaining a sorted structure on disk is possible (B-Trees), but maintaining it in memory is much easier.
To provide high write throughput and to take advantage of sequential writes, LSM-Trees maintain a second data structure called the memtable: an in-memory tree, such as a red-black tree or an AVL tree. With this data structure, you can insert keys in any order and read them back in sorted order.
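The memtable is essentially an in-memory sorted map. Python's standard library has no red-black or AVL tree, so the minimal sketch below (class name `Memtable` is mine, not from the book) stands in a sorted key list maintained with binary search; the interface, not the tree implementation, is the point:

```python
import bisect

class Memtable:
    """In-memory sorted map. Real engines use a balanced tree
    (red-black/AVL) or a skip list; a sorted list kept ordered
    with binary search stands in for one here."""

    def __init__(self):
        self._keys = []    # sorted list of keys
        self._values = {}  # key -> latest value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value           # later writes overwrite earlier ones

    def get(self, key):
        return self._values.get(key)

    def items(self):
        # read entries back in sorted key order, regardless of insert order
        return [(k, self._values[k]) for k in self._keys]
```

Keys inserted in any order (`b`, `a`, `c`) come back sorted (`a`, `b`, `c`), which is exactly the property the flush to an SSTable relies on.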
The term memtable was also introduced in Google’s Bigtable paper.
Handling writes
Writes are handled by memtables. When the memtable gets bigger than some threshold, it is written out to disk as an SSTable file. The new SSTable becomes the most recent segment of the database. While the SSTable is being written, writes can continue to a new memtable instance.
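The flush step can be sketched as follows (function name and the line-oriented `key\tvalue` file format are illustrative assumptions; real SSTables use a binary block format with checksums and compression):

```python
import os

def flush_memtable(memtable: dict, path: str) -> None:
    """Write the memtable's entries to disk as a new SSTable:
    one 'key\tvalue' line per entry, sorted by key, so the whole
    file is produced with one sequential write."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        for key in sorted(memtable):  # sorted run = sequential write
            f.write(f"{key}\t{memtable[key]}\n")
        f.flush()
        os.fsync(f.fileno())          # make sure the data reaches disk
    os.rename(tmp, path)              # atomically publish the new segment
```

Writing to a temporary file and renaming it means a crash mid-flush never leaves a half-written segment visible.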
The one problem with this scheme is that, in the event of a database crash, the most recent writes, which are in the memtable but not yet written out to disk, are lost.
To ensure durability, LSM-Trees keep a write-ahead log to restore the memtable.
Write-ahead Log (WAL)
In order to recover a lost memtable that was not yet written out to an SSTable, LSM-Trees can keep a separate log on disk to which every write is immediately appended, just as with hash indexes.
The log is not in sorted order, but it doesn’t matter since its only purpose is to restore the memtable after a crash. Every time a memtable is written out to an SSTable, the corresponding log can be discarded.
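A minimal WAL sketch, assuming the same line-oriented format as above (function names are mine; fsync-per-write is the simplest durability policy, and real engines offer configurable trade-offs here):

```python
import os

def wal_append(path: str, key: str, value: str) -> None:
    """Append one write to the log and force it to disk before
    the write is acknowledged."""
    with open(path, "a") as f:
        f.write(f"{key}\t{value}\n")
        f.flush()
        os.fsync(f.fileno())

def wal_replay(path: str) -> dict:
    """Rebuild the memtable after a crash by replaying the log in
    append order; later entries for a key overwrite earlier ones."""
    memtable = {}
    with open(path) as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("\t")
            memtable[key] = value
    return memtable
```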
Handling reads
In order to serve a read request, we first try to find the key in the memtable, then in the most recent SSTable, then in the next older SSTable and so on.
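This newest-to-oldest lookup order can be sketched as below, with plain dicts standing in for on-disk segments (and ignoring deletion tombstones for simplicity):

```python
def read(key, memtable, sstables):
    """Look up a key: memtable first, then each SSTable from
    newest to oldest. `sstables` is ordered newest-first; each
    dict stands in for one on-disk segment."""
    if key in memtable:
        return memtable[key]
    for segment in sstables:       # newest to oldest
        if key in segment:
            return segment[key]    # first hit wins: it is the newest value
    return None                    # key not in the database
```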
Merging and Compaction
Old segments are merged and compacted in the background, as with hash indexes, to combine segment files and to discard overwritten or deleted values.
Advantages of SSTables
Why go through all this hassle to maintain sorted keys in segments? SSTables have several big advantages over log segments with hash indexes:
- Merging segments is simple and efficient, even if the files are bigger than available memory. This is basically the merge step of the mergesort algorithm. When multiple segments contain the same key, we can keep the value from the most recent segment and discard the values in older segments.
- In order to find a particular key in a file, we no longer need to keep an index of all the keys in memory. We can instead keep a sparse index that records the offsets for only some of the keys. We can exploit the sorted order to find keys that are not in the index by scanning the SSTable between two indexed keys. For example, if we are looking for the key cat, which is not in the index, but we do know the offsets for cab and caterpillar, we know that cat must appear between these two. So we can jump to the offset for cab and scan from there until cat is found (or not, if it is not present in the file).
- Since read requests need to scan over several key-value pairs in the requested range anyway, we can group those records into a block and compress it before writing to disk, thus saving disk space and I/O bandwidth use.
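The sparse-index lookup from the second point above can be sketched like this, assuming the line-oriented `key\tvalue` SSTable format used earlier (the function name and index representation are mine):

```python
import bisect

def sstable_get(path, sparse_index, key):
    """Look up `key` in an SSTable file using a sparse index:
    a sorted list of (key, byte_offset) pairs for some of the keys.
    Seek to the greatest indexed key <= `key`, then scan forward;
    the sort order lets us stop as soon as we pass where the key
    would be."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    start = sparse_index[i][1] if i >= 0 else 0
    with open(path, "rb") as f:
        f.seek(start)
        for raw in f:
            k, _, v = raw.decode().rstrip("\n").partition("\t")
            if k == key:
                return v
            if k > key:        # we have passed where key would sort
                return None
    return None
```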
Performance Optimizations
LSM-Trees can be slow when looking up keys that do not exist in the database. We have to check the memtable, then the segments all the way back to the oldest before we can be sure that the key doesn’t exist.
In order to optimize this kind of access, storage engines often use Bloom filters. A Bloom filter is a memory-efficient data structure for approximating the contents of a set and can tell if a key does not appear in the database.
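A minimal Bloom filter sketch (sizing and hash construction here are illustrative, not what any particular engine uses): `add` sets k bit positions per key, and `might_contain` can answer "definitely not present" with no disk access at all.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter. No false negatives: if might_contain()
    returns False, the key was never added. False positives are
    possible, so a True answer still requires checking the segments."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # derive k bit positions by salting one cryptographic hash
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

In an LSM engine, each SSTable typically carries its own Bloom filter, so a lookup skips every segment whose filter rules the key out.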
There are also different strategies to determine the order and timing of how SSTables are compacted and merged. The most common options are size-tiered and leveled compaction.
In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables.
In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space.
Summary
The basic idea of LSM-Trees, keeping a cascade of SSTables that are merged and compacted in the background, is simple and effective. It scales well with the size of the dataset. Since data is stored in sorted order, range queries are efficient, and because writes are sequential, LSM-Trees can support high write throughput.
Several well-known storage engines are based on the same principles as LSM-Trees. Notable examples are LevelDB, RocksDB, Cassandra, HBase, and Lucene.
References
- [1] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica 33, 351–385, 1996.
- [2] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al. Bigtable: A Distributed Storage System for Structured Data. 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
