My reading notes from the book: Martin Kleppmann. Designing Data-Intensive Applications. O’Reilly, 2017.
Reliable, Scalable and Maintainable Applications
A data-intensive application is one for which raw CPU power is rarely a limiting factor — it has bigger concerns over the amount of data, complexity of data and the speed at which the data is changing.
Building blocks for data-intensive applications are usually as follows:
- Store data (databases)
- Remember the result of an expensive operation (caches)
- Allow users to search data (search indexes)
- Send a message to another process, to be handled asynchronously (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
Knowing what a data-intensive application is, we can now focus on the important concerns in most software systems:
Reliability
The system should continue to work correctly (performing the correct function at the desired level of performance) in the face of adversity (hardware or software faults, and even human error).
Scalability
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Maintainability
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
Reliability
Things that can go wrong are called “faults”. A system that anticipates faults and can cope with them is called fault-tolerant or resilient.
Fault v/s Failure
A fault is defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
It is impossible to reduce the probability of faults to zero, hence, we should design fault-tolerance mechanisms that prevent faults from causing failures.
In fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately, so that the fault-tolerance machinery is continually exercised and tested. Example: Netflix Chaos Monkey.
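A toy sketch of this idea in Python (all names here are hypothetical, not Chaos Monkey's actual API): a worker that faults at random, and a caller whose retry machinery keeps those faults from becoming failures.

```python
import random

def flaky_worker(x, fail_rate=0.5):
    """Deliberately faults about half the time, mimicking injected faults."""
    if random.random() < fail_rate:
        raise RuntimeError("injected fault")
    return x * 2

def fault_tolerant_call(x, retries=40):
    """Retries until a worker attempt succeeds, so a single fault
    does not become a user-visible failure."""
    for _ in range(retries):
        try:
            return flaky_worker(x)
        except RuntimeError:
            continue  # tolerate the fault and try again
    raise RuntimeError("all retries exhausted: a failure, not just a fault")

print(fault_tolerant_call(21))  # 42, despite roughly half the attempts faulting
```

Because the faults fire constantly, any gap in the retry logic would surface immediately instead of lying dormant until a real outage.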
Hardware Faults
When there are a lot of machines (like in a large datacenter), hardware faults occur all the time.
Add redundancy to the individual hardware components in order to reduce the failure rate of a machine. For example, RAID configuration for disks, dual power supplies for servers, hot-swappable CPUs. Such redundancy makes a single machine more reliable.
There is a move towards systems that can tolerate loss of entire machines, by using software fault tolerance techniques in preference or in addition to hardware redundancy. Such systems have operational advantage too, since such a system can be patched one node at a time without downtime of the system (rolling upgrades).
Software Errors
These are systematic errors. Harder to anticipate, they tend to be correlated across nodes and hence, cause many more system failures than uncorrelated hardware faults.
Examples: software bug that crashes processes on a particular input, memory leaks, a slow or failed service that the system depends on, etc.
No quick solution. Careful design; thorough testing; process isolation; measuring, monitoring, triggering alarms on system behaviors and anomalies.
Human Errors
Humans are known to be unreliable and will make mistakes. To make systems reliable, even when they are developed, maintained and operated by unreliable humans, we can combine several approaches:
- Minimize opportunities for error. Example: well-designed APIs
- Provide non-production sandboxes
- Test thoroughly. Make good use of unit, integration and manual tests.
- Allow quick and easy recovery to minimize impact. Fast rollbacks, gradual rollouts.
- Good telemetry
- Good management practices
Scalability
A system’s ability to cope with increased load.
Describing Load
Load can be described with a few numbers called the load parameters. Example: requests per second, ratio of reads to writes in a database, hit rate in a cache, etc.
In some systems the average case is what matters, while in others the bottleneck is dominated by a small number of extreme cases.
Twitter as an example system.
Two operations:
- Post tweet
- Home timeline
Two ways of implementing:
- Posting a tweet inserts the new tweet into a global collection of tweets. To get a user’s home timeline, look up all people they follow and get all their tweets and merge them.
- Maintain a cache of each user’s timeline. When a user posts a tweet, insert the new tweet into the timeline caches of each follower.
Approach 1 struggled to keep up with the load of home timeline queries, so Twitter switched to approach 2. Approach 2 works better because the average rate of home timeline reads is two orders of magnitude higher than the average rate of tweet writes. But approach 2 involves a lot of extra work when posting a tweet: doing this fan-out in a timely fashion for users (like celebrities) who have millions of followers is challenging.
A hybrid approach is to use approach 1 for celebrities and approach 2 for everyone else. Most users’ tweets are fanned out to their followers at write time, but celebrities’ tweets are not fanned out; they are fetched separately at read time and merged into the timeline.
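A minimal in-memory sketch of the hybrid scheme (all names and the tiny `FANOUT_LIMIT` threshold are made up for illustration): ordinary users' tweets are pushed into follower timeline caches at write time, while celebrities' tweets are pulled and merged at read time.

```python
from collections import defaultdict

FANOUT_LIMIT = 2               # toy threshold; a real "celebrity" cutoff is far larger

followers = defaultdict(set)   # user -> set of followers
tweets = defaultdict(list)     # user -> tweets they posted (read path for celebrities)
timelines = defaultdict(list)  # user -> precomputed home timeline cache

def is_celebrity(user):
    return len(followers[user]) > FANOUT_LIMIT

def post_tweet(user, text):
    tweets[user].append(text)
    if not is_celebrity(user):
        # approach 2: fan out on write to each follower's timeline cache
        for f in followers[user]:
            timelines[f].append((user, text))

def home_timeline(user, following):
    # start from the precomputed cache ...
    result = list(timelines[user])
    # ... and merge in celebrity tweets at read time (approach 1)
    for u in following:
        if is_celebrity(u):
            result.extend((u, t) for t in tweets[u])
    return result
```

Usage: if `bob` follows `alice` (few followers) and `celeb` (many), alice's tweet lands in bob's cache immediately, while celeb's tweets only cost work when bob actually reads his timeline.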
Key load parameter: distribution of followers per user (weighted by how often they tweet), since it determines fan-out.
Describing Performance
- When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
Latency v/s Response time
Response time is what the client sees. It includes networking delays, queueing delays and the actual service time.
Latency is the duration that a request is waiting to be handled - during which it is latent, awaiting service.
Mean response time is commonly reported. However, it may not be a good metric, since it doesn’t tell you how many users actually experienced that delay. Therefore, percentiles are used.
Look at the higher percentiles to figure out how bad the outliers are: the 95th, 99th and 99.9th percentiles are common (abbreviated p95, p99, p99.9). They are the response time thresholds below which 95%, 99% or 99.9% of requests complete.
High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service.
Percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs).
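A quick sketch of why percentiles beat the mean, using the simple nearest-rank definition of a percentile (the sample response times are made up):

```python
import math

def percentile(sorted_times, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of observations are <= it."""
    k = math.ceil(p / 100 * len(sorted_times))
    return sorted_times[max(k - 1, 0)]

# ten hypothetical response times in milliseconds, one extreme outlier
response_times_ms = sorted([30, 32, 35, 40, 42, 45, 50, 120, 300, 1500])

mean = sum(response_times_ms) / len(response_times_ms)
print(f"mean = {mean} ms")                               # 219.4 -- skewed by the outlier
print(f"p50  = {percentile(response_times_ms, 50)} ms")  # 42 -- what a typical user sees
print(f"p99  = {percentile(response_times_ms, 99)} ms")  # 1500 -- the tail
```

The mean (219.4 ms) describes the experience of no actual user here; the median and p99 separate the typical case from the outliers.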
Head-of-line blocking: a small number of slow requests holds up the processing of subsequent requests, which sit behind them in the queue waiting.
The client sees an overall slow response time due to the queueing delay - hence, it is important to measure response times on the client side.
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements.
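Sketched with Python threads (`send_request` is a stand-in that just sleeps; a real client would issue an actual request over the network), an open-loop generator sends on a fixed schedule instead of waiting for responses:

```python
import threading
import time

def send_request(i):
    time.sleep(0.05)  # pretend the request takes ~50 ms end to end

def open_loop_load(rate_per_sec, duration_sec):
    """Fire requests on a fixed schedule, NOT waiting for the previous
    response -- a closed-loop client would slow down whenever the server
    slows down, artificially keeping queues short during the test."""
    interval = 1.0 / rate_per_sec
    threads = []
    deadline = time.monotonic() + duration_sec
    sent = 0
    while time.monotonic() < deadline:
        t = threading.Thread(target=send_request, args=(sent,))
        t.start()            # each request runs independently
        threads.append(t)
        sent += 1
        time.sleep(interval) # next send depends only on the clock
    for t in threads:
        t.join()
    return sent

sent = open_loop_load(rate_per_sec=20, duration_sec=0.5)
```

The key design point is that `time.sleep(interval)` paces the sends, so slow responses pile up in the server's queue exactly as they would under real load.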
Tail latency amplification: if an end-user request requires several backend calls, it has a higher chance of being slow than any single backend call does. Even if the calls are made in parallel, the end-user request must wait for the slowest of the parallel calls, so it takes just one slow backend call to make the entire end-user request slow.
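Assuming each backend call is slow independently with probability p (say p = 0.01, i.e. slower than its own p99), the chance that a fan-out request is slow grows quickly with the number of parallel calls:

```python
def p_slow_request(n_backend_calls, p_slow_call=0.01):
    """Probability that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** n_backend_calls

for n in (1, 10, 100):
    print(n, round(p_slow_request(n), 3))
# with 1 backend call,    ~1%  of user requests are slow
# with 100 backend calls, ~63% of user requests are slow
```

The independence assumption is a simplification (real backend slowness is often correlated), but it shows why per-call p99s understate the user-visible tail.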
Approaches for Coping with Load
Scaling up
Vertical scaling or moving to a more powerful machine.
Scaling out
Horizontal scaling, or distributing load across multiple smaller machines. Also called shared-nothing architecture.
Good architectures usually involve pragmatic mixture of these approaches.
Stateless services are easier to distribute across multiple machines.
The architecture of systems that operate at large-scale is usually highly specific to the application. No generic, one-size-fits-all scalable architecture exists. Different applications have different problems to solve - volume of reads/writes, volume of data, response time requirements, complexity of data, access patterns, etc.
In an early-stage startup or an unproven product, it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load.
Maintainability
Three design principles to minimize pain during maintenance and to avoid creating legacy software:
Operability
Make it easy for operations team to keep the system running smoothly.
Simplicity
Make it easy for new engineers to understand the system, by removing as much complexity as possible.
Evolvability
Make it easy for engineers to make changes to the system in the future. Also known as extensibility, modifiability or plasticity.
Operability
Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities.
- Providing visibility into the runtime behavior and internals of the system, with good monitoring
- Providing good support for automation and integration with standard tools
- Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)
- Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
- Providing good default behavior, but also giving administrators the freedom to override defaults when needed
- Self-healing where appropriate, but also giving administrators manual control over the system state when needed
- Exhibiting predictable behavior, minimizing surprises
Simplicity
Making a system simpler does not necessarily mean reducing its functionality.
It can also mean removing accidental complexity. This kind of complexity arises only from the implementation and is not inherent to the problem that the software solves (as seen by the users).
Abstraction
One of the best tools to remove accidental complexity.
A good abstraction can hide implementation details behind a clean, simple-to-understand facade. It also promotes software reuse, because the same abstraction can be used by many different applications.
Finding good abstractions is very hard. It is not clear how to package distributed systems algorithms into abstractions that help keep the system complexity at a manageable level.
Evolvability
In terms of organization processes, Agile working patterns provide a framework for adapting to change.
Techniques developed by the Agile community like test-driven development (TDD) and refactoring are helpful when developing software in a changing environment. Here refactoring does not only refer to source code refactoring, but also to refactoring an architecture from one design to another.
The ease with which a system can evolve and adapt to changing requirements is closely linked to its simplicity and its abstractions.
