This post is a summary of the paper:
Epidemic algorithms for replicated database maintenance, Demers, et al., 1988
Introduction
Published in the late 80’s, this paper lays out early ideas on gossip-based replication algorithms and focuses on eventual consistency, in contrast to the traditional ACID model. The idea of eventual consistency (or BASE) is well known these days; in the late 80’s, however, it was an important novel idea.
The problem that the authors, while working at Xerox, were trying to solve was to maintain mutual consistency among several different replicas of a database located at different “sites”. A site may receive updates independently of other sites and these updates need to be propagated throughout the system to maintain mutual consistency.
The distinguishing features of the algorithms presented in the paper are as follows:
- The algorithms are simple and require very few guarantees from the underlying communication system.
- The algorithms use randomization and their cost and performance are tuned by choosing appropriate distributions in the randomization step.
- The algorithms aim for eventual consistency and not atomicity.
Epidemic Algorithms
Epidemic algorithms, also known as gossip protocols, are procedures that are based on how epidemics spread. However, in contrast to spreading disease in a biological population, the goal is to disseminate information reliably in transient groups of machines, sites, users or processes.
The authors present three different algorithms, two of which are examples of epidemic processes.
- Direct Mail: Each update is immediately sent from entry site to all other sites. This is not an epidemic process.
- Anti-entropy: Every site regularly chooses another site at random, exchanges database contents with it and resolves the differences. Anti-entropy is highly reliable, however, it is expensive and updates propagate slowly.
- Rumor mongering: A newly received update becomes a “hot rumor”, and a site holding a hot rumor periodically chooses a site at random to share the rumor with. A site stops spreading the rumor once it has tried to share it with too many sites that have already seen it; we say that the site has stopped treating the rumor as “hot”. Rumor mongering is faster and less resource intensive than anti-entropy, but there is a non-zero probability that an update will not reach all sites.
Of the above, anti-entropy and rumor mongering are epidemic processes. Further adopting the terminology of the epidemiology literature, a site is classified as “infective” if it holds an update and is willing to share it, “susceptible” if it hasn’t yet received the update or “removed” if it has received the update but is no longer willing to share it. Anti-entropy is an example of a simple epidemic where the sites are always either infective or susceptible.
Direct Mail
The Direct Mail strategy attempts to send an update to all other sites soon after it occurs. Messages are queued so the sender isn’t delayed and the queues are kept on stable storage at the mail server so they are unaffected by server crashes. Even though it is quite reliable, it can fail because queues can overflow, the destinations might be inaccessible for long durations or the sender might not have an accurate knowledge of all the sites. Such failure events in small systems can be handled by humans administering the network; however, a reliable distributed failure handling mechanism is needed in large systems.
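The direct mail strategy can be sketched as a toy model; this is not the paper's implementation, and names like `DirectMailSite` and `site_list` are illustrative. The sender queues one message per known site, so an incomplete `site_list` or a lost message translates directly into the failure modes described above.

```python
class DirectMailSite:
    """Toy model of direct mail: each update is immediately queued for every
    other known site. Class and attribute names are illustrative."""

    def __init__(self, name, site_list):
        self.name = name
        self.site_list = site_list   # possibly incomplete view of the network
        self.db = {}                 # key -> (value, timestamp)
        self.outbox = []             # stands in for the stable mail queue

    def local_update(self, key, value, timestamp):
        self.db[key] = (value, timestamp)
        # One message per known site; delivery still fails if a site is
        # unreachable or missing from site_list -- the failure modes above.
        for peer in self.site_list:
            if peer is not self:
                self.outbox.append((peer, key, value, timestamp))

    def deliver_all(self):
        while self.outbox:
            peer, key, value, ts = self.outbox.pop(0)
            current = peer.db.get(key)
            if current is None or current[1] < ts:   # newer timestamp wins
                peer.db[key] = (value, ts)
```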
Anti-entropy
Anti-entropy has been proposed and used as a mechanism to recover automatically from failures in Direct Mail. According to the theory of epidemics, a simple epidemic eventually infects the entire population. Hence, in theory, anti-entropy handles any Direct Mail failure by eventually distributing a failed update throughout the network. However, a naive anti-entropy algorithm is very expensive, since it involves sending a complete copy of the database over the network and running a comparison between two copies of the database.
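A naive round of anti-entropy might look like the following sketch, where each database is just a dict mapping keys to (value, timestamp) pairs and the newer timestamp wins; the function names are mine, not the paper's. Exchanging the full key sets like this is exactly the cost the optimizations that follow try to avoid.

```python
import random

def resolve(db_a, db_b):
    """Naive anti-entropy exchange between two sites: compare the full
    contents of both databases and keep the entry with the newer timestamp
    on both sides (push-pull resolution)."""
    for key in set(db_a) | set(db_b):
        ea, eb = db_a.get(key), db_b.get(key)
        newer = max((e for e in (ea, eb) if e is not None), key=lambda e: e[1])
        db_a[key] = newer
        db_b[key] = newer

def anti_entropy_round(databases):
    """Every site picks a random partner and resolves differences with it."""
    for i in range(len(databases)):
        j = random.choice([k for k in range(len(databases)) if k != i])
        resolve(databases[i], databases[j])
```

A few rounds suffice to spread an update everywhere, illustrating the "simple epidemic infects the entire population" claim, at the price of shipping whole databases on every exchange.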
Optimization
A possible performance improvement is for each site to maintain a checksum of its database contents, recomputing checksums incrementally on each update and comparing the entire database only if the checksums disagree between two partner sites performing anti-entropy. This method, although it saves a lot of wasted effort, starts becoming less useful as the time required for an update to be sent to all sites increases (which is a function of network size).
Optimization
To make the above approach scalable, one approach is to define a time window t > expected time to send an update to all sites. Sites also maintain a recent update list that contains all entries whose ages are less than t along with checksums. Two sites, when performing anti-entropy, first exchange their update lists, then update their databases and checksums and finally compare checksums. If the checksums disagree, complete database contents are compared and differences are resolved.
Note that the choice of t to exceed the expected distribution time for an update is critical; if t is chosen poorly, or if growth of the network drives the expected update distribution time above t, checksum comparisons will usually fail and network traffic will rise to a level slightly higher than what would be produced by anti-entropy without checksums.
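Under the stated assumption that checksums can be maintained incrementally, the windowed scheme might be sketched as follows; the helper names and the XOR-of-hashes checksum are illustrative choices, not the paper's.

```python
import hashlib

def db_checksum(db):
    """Order-independent checksum over (key, value, timestamp) entries; in the
    paper each site maintains this incrementally rather than recomputing it."""
    acc = 0
    for key, (value, ts) in db.items():
        digest = hashlib.sha256(repr((key, value, ts)).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")   # XOR makes updates incremental
    return acc

def windowed_anti_entropy(db_a, db_b, now, window):
    """Exchange only updates younger than `window`, then compare checksums;
    fall back to a full database comparison only if they still disagree."""
    recent_a = {k: e for k, e in db_a.items() if now - e[1] < window}
    recent_b = {k: e for k, e in db_b.items() if now - e[1] < window}
    for key, entry in recent_a.items():
        if key not in db_b or db_b[key][1] < entry[1]:
            db_b[key] = entry
    for key, entry in recent_b.items():
        if key not in db_a or db_a[key][1] < entry[1]:
            db_a[key] = entry
    if db_checksum(db_a) != db_checksum(db_b):
        # Expensive path: full comparison, as in naive anti-entropy.
        for key in set(db_a) | set(db_b):
            ea, eb = db_a.get(key), db_b.get(key)
            newer = max((e for e in (ea, eb) if e is not None),
                        key=lambda e: e[1])
            db_a[key] = db_b[key] = newer
    return db_checksum(db_a) == db_checksum(db_b)
```

The fragility of choosing t is visible here: when the window misses updates still in flight, every exchange falls through to the expensive full-comparison path.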
Optimization
To avoid the overhead of choosing the right value of time window t, each site can maintain an inverted index of its database by timestamp. Two sites perform anti-entropy by exchanging updates in reverse timestamp order, incrementally recomputing their checksums, until the checksums agree. This scheme has the disadvantage of additional expense of maintaining an inverted index at each site.
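A sketch of the inverted-index variant, under the simplifying assumption that instead of comparing incrementally recomputed checksums we compare the remaining (older) suffixes of the two indexes directly; function and variable names are mine.

```python
def timestamp_anti_entropy(db_a, db_b):
    """Exchange updates in reverse timestamp order until the untransmitted
    remainders of the two inverted indexes agree. Databases map
    key -> (value, timestamp)."""
    # Inverted index by timestamp: entries as (ts, key, value), newest first.
    index_a = sorted(((ts, k, v) for k, (v, ts) in db_a.items()), reverse=True)
    index_b = sorted(((ts, k, v) for k, (v, ts) in db_b.items()), reverse=True)
    i = j = 0
    # Peel off newer entries from whichever side holds the newest one; a real
    # implementation compares incrementally maintained checksums instead of
    # the suffixes themselves.
    while index_a[i:] != index_b[j:]:
        if j >= len(index_b) or (i < len(index_a) and index_a[i] >= index_b[j]):
            ts, key, value = index_a[i]; i += 1
            if key not in db_b or db_b[key][1] < ts:
                db_b[key] = (value, ts)
        else:
            ts, key, value = index_b[j]; j += 1
            if key not in db_a or db_a[key][1] < ts:
                db_a[key] = (value, ts)
```

No time window is needed: the exchange naturally stops as soon as the two sites agree on everything older than the updates just shipped.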
Rumor mongering
The rumor mongering algorithm starts with N susceptible sites. One of the sites becomes infective and starts spreading the update (rumor) at random to other sites. Each site that receives the update becomes infective and starts spreading the rumor. When a site tries to share the rumor with a site that has already seen the rumor, the sender site becomes removed with probability 1/k. The authors show that this setup can be modeled mathematically and prove that increasing the value of k results in a rapid decrease in residue (number of susceptible sites when there are no more infective sites).
Several variations exist. The infective site can become removed with a probability of 1/k if the recipient has already seen the update (feedback variation) or it can become removed regardless of the recipient (blind variation). Similarly, instead of becoming removed with a probability of 1/k (coin variation), the site can become removed only after k unnecessary contacts (counter variation).
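The basic setup (the feedback and coin variations together) can be simulated with a short Monte-Carlo sketch; `rumor_mongering` and its parameter names are mine. Raising k makes a sender more persistent in the face of redundant contacts, and the measured residue drops rapidly, matching the authors' analysis.

```python
import random

def rumor_mongering(n, k, seed=None):
    """Simulate feedback/coin rumor mongering with n sites: a sender becomes
    removed with probability 1/k whenever its chosen partner has already seen
    the rumor. Returns the residue: the fraction of sites still susceptible
    when no infective sites remain."""
    rng = random.Random(seed)
    SUSCEPTIBLE, INFECTIVE, REMOVED = 0, 1, 2
    state = [SUSCEPTIBLE] * n
    state[0] = INFECTIVE                    # the update's entry site
    infective = {0}
    while infective:
        site = rng.choice(sorted(infective))
        partner = rng.randrange(n)
        if partner == site:
            continue
        if state[partner] == SUSCEPTIBLE:
            state[partner] = INFECTIVE
            infective.add(partner)
        elif rng.random() < 1.0 / k:        # partner already knew the rumor
            state[site] = REMOVED
            infective.discard(site)
    return state.count(SUSCEPTIBLE) / n
```

The blind and counter variations are one-line changes: drop the "partner already knew" condition, or count unnecessary contacts per site and remove after k of them.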
Such a “complex” epidemic algorithm can spread updates rapidly with very low network traffic. Unfortunately, unlike a simple epidemic like anti-entropy, a complex epidemic can fail to propagate the update to all sites. Even though the probability of such an event can be made extremely small, the event leaves the system in a stable yet inconsistent state. To eliminate this possibility, anti-entropy can be run infrequently to back up a complex epidemic.
Handling Deletions
Deletion of an item in a distributed database is tricky because it cannot be handled simply by removing the item locally where the update is received. Instead of spreading the absence of the item throughout the system, the propagation mechanism will spread old copies of the item from other sites to the site where the item is deleted, effectively “resurrecting” the deleted item. This situation is remedied by replacing deleted items with death certificates which carry timestamps and spread like ordinary data. However, the system still needs a mechanism to delete death certificates eventually, otherwise they will consume all available storage.
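A minimal sketch of the death-certificate rule, assuming entries of the form (value, timestamp, is_deleted); the helper names are illustrative. Because the certificate carries a timestamp and is merged like any other update, a stale copy arriving later cannot resurrect the item.

```python
def delete_item(db, key, timestamp):
    """Replace the item with a death certificate instead of removing it, so
    the deletion propagates like any other timestamped update."""
    db[key] = (None, timestamp, True)

def merge_entry(db, key, incoming):
    """Timestamp order decides between an item and a death certificate, so an
    old copy arriving from another site cannot win over a newer deletion."""
    current = db.get(key)
    if current is None or current[1] < incoming[1]:
        db[key] = incoming
```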
One strategy is to delete the certificate when it can be determined that every site has received it. Several protocols have been described in earlier papers for this purpose; however, the authors claim that in a large distributed system, there is a high probability that these protocols cannot always be completed because of sites that may be down or unreachable.
A simpler strategy is to hold the death certificates for a fixed time and then discard them. The tradeoff here is between the amount of space devoted to death certificates and the risk of obsolete items being resurrected. The authors present ideas to extend this time threshold much further than the space on any server would permit by using a scheme called dormant death certificates. Very old death certificates are deleted at most sites, retaining dormant copies at only a few sites. When an obsolete update encounters a dormant death certificate, the death certificate can be “awakened” and propagated again to all sites.
If anti-entropy is used for distributing updates, dormant death certificates should not normally be propagated during anti-entropy exchanges; otherwise they would spread back to every site, defeating the purpose of retaining them at only a few.
Spatial Distribution
The cost of sending an update to a nearby site is lower than the cost of sending the same update to a distant site. Therefore, a completely random choice while selecting partner sites to exchange information may not be optimal. The authors show that letting each site independently choose connections according to a non-uniform spatial distribution can significantly reduce traffic generated by anti-entropy. An example is a distribution that is a function of the cumulative number of sites at distance d or less. Using such spatial distributions along with anti-entropy can reduce traffic on links that would otherwise be “hotspots” if a uniform distribution is used. Similarly, non-uniform spatial distributions can produce improvements in the performance of rumor mongering, particularly the traffic on critical links.
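As an illustration, a site might weight partner choice by an inverse power of distance; the paper's actual distributions are parameterized by the cumulative number of sites within distance d, so the simple 1/d^a weighting below is a stand-in, and all names are mine.

```python
import random

def choose_partner(site, distances, a=2.0, rng=random):
    """Pick an anti-entropy or rumor-mongering partner with probability
    proportional to d^-a, so nearby sites are strongly favored. `distances`
    maps peer name -> distance from `site`."""
    peers = [j for j in distances if j != site]
    weights = [distances[j] ** -a for j in peers]
    return rng.choices(peers, weights=weights, k=1)[0]
```

With a tuned exponent, distant links are exercised rarely, which is what relieves the “hotspot” links that a uniform choice would saturate.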
