
Database Data Structures: Log-Structured Hash Table


My reading notes from the book: Martin Kleppmann. Designing Data-Intensive Applications. O’Reilly, 2017.

In a previous post, we discussed what data-intensive applications are and what reliability, scalability, and maintainability mean in the context of such applications.

This post talks about data from the database’s point of view: how it can store the data that it is given and how it can find the data again when asked for it. Specifically, we look at a particular type of database and an index that powers it.

Log-structured Database

Many databases internally use a log, which is an append-only sequence of records. These are log-structured databases. Adding data to a log is generally very efficient because it only requires appending data to the end of the sequence. However, searching for data has terrible performance if the search requires scanning the entire database file from beginning to end (O(n)).

In order to efficiently find data, a different data structure is used: an index. An index is an additional data structure that is derived from the primary data. Many databases allow you to add and remove indexes.

However, maintaining additional structures incurs overhead, especially on writes. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.

Well-chosen indexes speed up read queries, but every index slows down writes. Hence, databases don’t usually index everything by default, but require the application developer or database administrator to choose indexes manually, using the knowledge of the application’s typical query patterns.

Hash Indexes

Hash indexes are used to index key-value data.

Key-value stores are similar to dictionary types found in most programming languages, typically implemented as hash maps. In-memory hash maps can also be used to index key-value data. For example, if the data storage consists only of appending to a file (log), the strategy is to keep an in-memory hash map where every key is mapped to a byte offset in the data file.
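As a minimal sketch of this strategy (all names here are mine, not from any real engine, and keys/values are assumed to be strings without commas or newlines), the store appends records to a file and keeps only byte offsets in memory:

```python
class LogKVStore:
    """Minimal sketch: an append-only log file plus an in-memory
    hash map from key to the byte offset of its latest record."""

    def __init__(self, path):
        self.index = {}               # key -> byte offset of latest record
        self.log = open(path, "a+b")  # append-only data file

    def set(self, key, value):
        self.log.seek(0, 2)                 # position at end of file
        self.index[key] = self.log.tell()   # remember where this record starts
        self.log.write(f"{key},{value}\n".encode())
        self.log.flush()

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)               # a single seek to reach the value
        line = self.log.read().split(b"\n", 1)[0]
        return line.split(b",", 1)[1].decode()
```

Updating a key simply appends a new record and moves the offset in the hash map; the old record stays in the file until compaction removes it.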

Such a strategy is used in Bitcask [1], the default storage engine in Riak [2]. Bitcask offers high-performance reads and writes subject to the requirement that all the keys fit in the available RAM, since the hash map is kept in memory. Values can use more space than available memory, since they can be loaded from disk with just one disk seek. Such a storage engine is well suited to situations where the value for each key is updated frequently.

To avoid eventually running out of disk space in an append-only log, we break the log into segments of a certain size by opening a new segment file when the previous file reaches that size. Compaction is then performed on these segments: duplicate keys are thrown away, keeping only the most recent update for each key, after which several segments (which are now smaller) can be merged together.

Segments are never modified after they are written, so the compacted and merged segment is written to a new file.

Merging and compaction can be done in a background thread, switching read/write requests to the new segment after it is ready. Old segments are deleted.

Each segment has its own in-memory hash table, mapping keys to file offsets. In order to find the value of a key, lookup is performed on all segments, most recent segment first followed by older segments. We rely on the merging process to keep the number of segments, and hence the number of lookups, small.
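Compaction, merging, and newest-first lookup can be sketched as follows. This is a simplification of my own (segments are modeled as in-memory lists of (key, value) records, oldest record first, rather than files):

```python
def compact(segment):
    """Keep only the most recent value for each key in one segment."""
    latest = {}
    for key, value in segment:
        latest[key] = value            # later records overwrite earlier ones
    return list(latest.items())

def merge(*segments):
    """Merge several segments (oldest segment first) into one new,
    compacted segment."""
    merged = []
    for segment in segments:
        merged.extend(segment)
    return compact(merged)

def lookup(key, segments):
    """Check the most recent segment's hash table first, then older ones."""
    for segment in reversed(segments):   # newest segment is last in the list
        index = dict(segment)            # per-segment in-memory hash table
        if key in index:
            return index[key]
    return None
```

Because merging produces a brand-new segment, readers can keep using the old segment files until the switch-over, matching the background-merge behavior described above.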

File Format
It is faster and simpler to use a binary format.
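One common binary layout (this particular encoding is my illustration, not Bitcask's exact on-disk format) length-prefixes the key and value so no delimiters or escaping are needed:

```python
import struct

def encode_record(key: bytes, value: bytes) -> bytes:
    """Prefix key and value with their lengths as big-endian 32-bit ints."""
    return struct.pack(">II", len(key), len(value)) + key + value

def decode_record(buf: bytes, offset: int = 0):
    """Return (key, value, next_offset) for the record starting at offset."""
    key_len, val_len = struct.unpack_from(">II", buf, offset)
    start = offset + 8                      # skip the two length fields
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + val_len]
    return key, value, start + key_len + val_len
```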

Deleting records
Deletion appends a special tombstone record corresponding to the key. When log segments are merged, the merging process discards values for any tombstoned key.
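A hypothetical sketch of tombstone handling (the sentinel and function names are mine), extending the merge logic to drop deleted keys:

```python
TOMBSTONE = object()   # sentinel record marking a deleted key

def delete(segment, key):
    segment.append((key, TOMBSTONE))   # deletion is just another append

def merge_discarding_tombstones(*segments):
    """Merge segments oldest-first; drop any key whose most recent
    record is a tombstone."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not TOMBSTONE]
```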

Crash recovery
In-memory hash maps are lost if the database is restarted. Hash maps can be recreated by reading all the records, but this is slow. Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory on restarts.

Partially written records
Checksums can be used to detect and discard corrupt records.
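For example, a record can carry a CRC32 checksum alongside its length; a record whose bytes were only partially written before a crash then fails verification and is skipped (a sketch under my own framing, not a real engine's format):

```python
import struct
import zlib

def append_with_checksum(buf: bytearray, payload: bytes) -> None:
    """Store the payload's length and CRC32 checksum, then the payload."""
    buf += struct.pack(">II", len(payload), zlib.crc32(payload)) + payload

def read_verified(buf: bytes, offset: int = 0):
    """Return (payload, next_offset), or None if the record at offset is
    truncated or corrupt, e.g. after a crash mid-write."""
    if offset + 8 > len(buf):
        return None                            # header itself is truncated
    length, checksum = struct.unpack_from(">II", buf, offset)
    payload = buf[offset + 8:offset + 8 + length]
    if len(payload) < length or zlib.crc32(payload) != checksum:
        return None                            # partially written or corrupted
    return payload, offset + 8 + length
```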

Concurrency
Since writes are strictly sequential appends, a common implementation is to have a single writer thread. Multiple threads can read concurrently, since data files are immutable.

Advantages of append-only log

  • Appending and segment merging are sequential write operations, which are usually faster than random writes.
  • Simpler concurrency and crash recovery, because we don’t have to worry about overwriting values.
  • Merging old segments makes sure data files are not fragmented.

Limitations of append-only log

  • Hash table must fit in memory. So it is not suited to situations where you have a very large number of keys.
  • Range queries (scanning over all keys between X and Y) are not efficient.

In the next post, we look at yet another data structure that is used in storage engines and does not suffer from the above limitations.

References