Deep Dive Research
24 Jun 2022
Consensus protocols have been the most intricate aspect of blockchain technology, and of all other large-scale industrial distributed systems, since the advent of distributed computing and transactions. However, over the past fifty years, consensus has witnessed only two distinct schools of thought: classical consensus and Nakamoto consensus.
Classical consensus protocols, which rely on all-to-all communication, can achieve low latency and high throughput, but they cannot scale and are not robust in the presence of membership changes, resulting in deployments that are typically static and permissioned. In other words, classical consensus protocols have a fixed set of nodes (participants), meaning that any alteration of the population must be approved either by consensus or by a centralized authority, the latter being much easier to execute.
Nakamoto consensus presents a more robust method for achieving consensus among a fluid network of participants by using proof-of-work mining coupled with the longest-chain rule, which states that the iteration of the blockchain with the longest history (the highest number of preceding blocks, and therefore the most “work” put into it) is the valid one. This allows consensus on the state of the network to persist in the midst of participant inflows and outflows, but it also results in high confirmation latency, low throughput, and enormous energy expenditure to maintain a secure state.
The Snowman consensus protocol – often abbreviated as SoA – utilized in the Avalanche network combines favorable properties of both the classical and Nakamoto consensus models. Through repeated random sub-sampling, the Snowman protocol, composed of three single-purpose protocols, achieves low latency and high throughput without requiring agreement on the precise membership of the system. In other words, the Snowman protocol does not need to know how many participants in the system agree or disagree with a single transaction (or change in the state of the network); it simply performs an iterative voting process among randomly selected subsets of network participants to determine the valid network state.
As mentioned just prior, the Snowman protocol operates by repeated, random sampling of the network. When a validator node seeks to determine whether a change in the state of the network (i.e. a transaction) should be accepted or rejected, it polls a small, constant-sized, randomly chosen set of neighboring validator nodes for their decision regarding that change: whether they accept it, reject it, or prefer a conflicting state change. If a sufficiently large portion (alpha) of the queried validators wish to accept the change, then the validator prefers to accept it as well; the same goes for rejection. This process is repeated until alpha of the queried validators reply the same way for beta consecutive rounds, at which point consensus is achieved. See below.
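This sampling loop can be sketched in a few lines of Python. The function name, the peer-as-callable model, and the parameter values below are illustrative assumptions, not part of the Avalanche specification:

```python
import random

def snowman_decide(preference, peers, k=20, alpha=0.7, beta=15):
    """Illustrative sketch: poll k random peers until an alpha-majority
    backs the same choice for beta consecutive rounds."""
    consecutive = 0
    while consecutive < beta:
        sample = random.sample(peers, k)          # small, constant-sized sample
        votes = [peer(preference) for peer in sample]
        top = max(set(votes), key=votes.count)    # most common response
        if votes.count(top) >= alpha * k:         # alpha-majority reached
            if top == preference:
                consecutive += 1                  # streak continues
            else:
                preference, consecutive = top, 0  # adopt majority, restart streak
        else:
            consecutive = 0                       # inconclusive round
    return preference                             # decision is final
```

With every peer answering “accept”, a validator that initially preferred “reject” flips on the first round and then finalizes after beta consecutive confirming rounds.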
When transactions present no conflicts, finalization happens incredibly quickly. In the face of conflicting transactions, honest validators quickly cluster around one of them and indicate their preferred transaction. This creates a positive feedback loop in which the acceptance of one transaction necessitates the rejection of the other, resulting in a much quicker resolution of the conflict than in any other blockchain to date.
SoA is actually a family of three protocols that build on top of one another. Each is a single-decree consensus protocol, and each is more robust than the last. The first of these is a non-BFT protocol named Slush, which is designed to introduce metastability, or an aversion to a balanced, undecided equilibrium.
The simplest protocol in the Snowman family, Slush is not tolerant of Byzantine faults, only of crash faults (CFT). In other words, Slush is capable of achieving network consensus if components (nodes) fail to indicate a state change, but it is not capable of combating devious nodes. Inspired by gossip algorithms, Slush is responsible for coloring all nodes in an identical manner with very high probability. For the sake of simplicity, Slush can be explained by discussing nodes with two possible states: Red or Blue.
Initially, nodes begin “colorless,” meaning that they have no state or definable quality. When a client (or transactor) that is “red” provides transaction information, the receiving node will adopt the client’s color, “red.” The node then initiates a query, selecting a small, constant-sized sample (k) uniformly at random from the network.
Of the sample population (k), all uncolored nodes will update their color to match that of the query, respond, and then initiate their own queries. Any colored nodes that receive the query will respond simply with their own color (i.e. Blue).
When k responses are collected, a protocol parameter α > 0.5 is employed to ensure that the number of responses reporting the same color is greater than or equal to α·k. In other words, after responses are recorded from all of the sampled nodes, if the fraction of nodes sharing one color exceeds 50% of the sample population, then the threshold is met. At this point, the initial node flips its color if it differs from the sampled nodes’ color. Finally, it repeats the query step, initiating subsequent rounds of queries for m rounds to finalize the color.
In the case where k responses are not collected, the node will select an additional sample uniformly at random from the rest of the nodes and repeat its query process.
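The query-and-flip steps above might look like the following sketch. The class and method names are hypothetical; Slush itself is defined over an abstract network, not any particular API:

```python
import random

class SlushNode:
    """Minimal Slush node: adopts a color on first contact, then
    flips whenever an alpha-fraction of a k-sample disagrees."""
    def __init__(self):
        self.color = None                 # nodes start "colorless"

    def on_query(self, incoming_color):
        if self.color is None:            # uncolored: adopt the query's color
            self.color = incoming_color
        return self.color                 # colored: simply report own color

    def run_round(self, network, k, alpha):
        if self.color is None:
            return
        sample = random.sample(network, k)
        replies = [peer.on_query(self.color) for peer in sample]
        for other in set(replies):
            if other != self.color and replies.count(other) >= alpha * k:
                self.color = other        # threshold met: flip
```

Here a red node surrounded by blue peers flips after a single round, since every reply in its sample carries the opposing color.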
Slush, by design, presents key properties that allow consensus on the network to occur more efficiently and securely.
1. Memoryless – Nodes retain no state between rounds other than their current “color” and maintain no history of interaction with their peers. This means that consensus occurs without bias at all times.
2. Random Sampling – Rounds involve the sampling of small, constant-sized portions of the network at random, instead of the entire network as in traditional consensus protocols. This results in a more efficient method of polling and decreases time-to-finality.
3. Penchant for Metastability – Even in a fully bivalent state, or a 50/50 split between “colors”, Slush is able to achieve consensus, since random perturbations in sampling will cause one “color” to gain an edge, resulting in a cascading effect in which repeated samplings amplify that “advantage”.
4. High Probability of Confidence – Given a high enough number of rounds m, Slush ensures that all nodes will be “colored” identically with high probability. Each node incurs a constant, predictable communication overhead per round; as such, the total communication overhead grows logarithmically, rather than exponentially, with the number of nodes polled, n.
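The metastability property is easy to check empirically. The toy simulation below (all parameters and names are illustrative) starts from a perfect 50/50 split and runs synchronous rounds of k-sampling; random perturbations reliably tip the whole population to one color:

```python
import random

def simulate_slush(n=100, k=10, alpha=0.5, max_rounds=500, seed=7):
    """Toy Slush simulation: every node re-samples k peers each round
    and flips to any color seen in a strict alpha-majority of the sample."""
    random.seed(seed)
    colors = ["red"] * (n // 2) + ["blue"] * (n // 2)   # fully bivalent start
    for rounds in range(1, max_rounds + 1):
        new_colors = []
        for i in range(n):
            sample = [colors[j] for j in random.sample(range(n), k)]
            reds = sample.count("red")
            if reds > alpha * k:
                new_colors.append("red")
            elif (k - reds) > alpha * k:
                new_colors.append("blue")
            else:
                new_colors.append(colors[i])            # no majority: keep color
        colors = new_colors
        if len(set(colors)) == 1:                       # unanimity reached
            return colors[0], rounds
    return None, max_rounds
```

Which color wins depends on the random seed, but unanimity is typically reached within a handful of rounds, illustrating the cascading “advantage” described above.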
Snowflake is an augmentation of the Slush protocol that serves to discern Byzantine nodes among the network. Slush, which is only crash-fault tolerant, cannot account for “colored” or “adversarial” nodes that have the potential to switch their purported “color” or state. Additionally, such nodes can nondeterministically affect the number of rounds m, thus influencing Slush in unpredictable ways. Snowflake implements a mechanism that quantifies the conviction of a node and only accepts a node’s state once it reaches a prespecified threshold.
Snowflake adds a single counter to the Slush protocol, whose sole purpose is to capture the strength of a node’s conviction, or loyalty, to its current “color” or state. In other words, a node’s color will not be accepted until its internal counter, which tracks the number of consecutive samples in which at least α of the queried nodes reported that same color, exceeds a security threshold, β. In the case that a color change occurs, the node resets its counter to 0.
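In code, the change from Slush is just this counter and threshold check. A hedged sketch, processing one round of k replies at a time (the names and the handling of inconclusive rounds are my assumptions):

```python
class SnowflakeNode:
    """Snowflake conviction counter: accept a color only after
    beta consecutive alpha-majority samples agree with it."""
    def __init__(self, color, alpha, beta):
        self.color, self.alpha, self.beta = color, alpha, beta
        self.counter = 0                  # consecutive agreeing samples
        self.accepted = False

    def on_sample(self, replies):
        k = len(replies)
        top = max(set(replies), key=replies.count)
        if replies.count(top) >= self.alpha * k:
            if top != self.color:
                self.color = top          # color change...
                self.counter = 0          # ...resets the counter
            self.counter += 1             # count this agreeing sample
            if self.counter >= self.beta:
                self.accepted = True      # security threshold beta reached
        else:
            self.counter = 0              # no alpha-majority: streak broken
```

A node that sees three unanimous “red” samples in a row with β = 3 accepts red; a single overwhelming “blue” sample flips the color and restarts the count from that sample.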
Snowflake, by design, targets Byzantine actors or nodes in the network and requires an additional measure to elicit an accepted state from them.
1. Byzantine Fault Tolerance – Snowflake essentially quantifies the degree of consistency that the network demonstrates in its state across a number of rounds of querying. This consistency, if and only if it is maintained, provides nodes with what can best be described as loyalty to the majority, thus reducing the degree of uncertainty – or “Byzantine-ness” – regarding their colors.
2. Safety & Liveness – When the protocol is correctly parameterized for a given threshold of Byzantine nodes and desired ε-guarantee, safety and liveness are guaranteed (see below).
Snowball provides an additional security measure for Snowflake, addressing the fact that Snowflake’s notion of state is ephemeral. As previously mentioned, in Snowflake, whenever a node changes its color, the counter of consecutive same-color samples is reset to zero. In other words, Snowflake can introduce an inclination toward “stickiness” among nodes in the network, but it cannot ensure that sampled nodes that report the same color will do so again in subsequent queries.
Put simply, SoA thus far cannot quantify or measure the likelihood that nodes will stay true to their color in subsequent sampling. Snowball achieves this by implementing per-color confidence counters that record the number of queries that have yielded an α-majority result for the corresponding color.
Snowball requires that every node increment its confidence counter for a given color after every successful query, that is, whenever ≥ α of the sampled nodes respond with that color. A node then changes its preference based on total accrued confidence rather than on the latest sample alone: it switches colors only when the confidence in its current color falls below the confidence in the new color. As in Snowflake, once a node’s preferred color wins β consecutive successful queries, the node finalizes that color.
1. Confidence – The confidence counter functions effectively as a limited form of history, with which nodes can be “informed” of their likelihood to remain a particular color over the course of multiple samplings. As a result, the effect of random perturbations in network samples is significantly reduced.
Avalanche is the culmination of the SoA protocol, consisting of multiple single-decree Snowball instances executed simultaneously as a multi-decree protocol, recorded in a dynamic, append-only directed acyclic graph (DAG) of all known transactions. In layman’s terms, Avalanche is the final data structure within the SoA protocol that records all of the transactions and the votes of all polled nodes.
When a client creates a transaction, it names one or more parents, which are included inseparably in the transaction and form the edges of the DAG. These parent-child relationships, referred to as an ancestor set, do not have to reflect any correlation at the application level, though they can. In other words, a child or “progeny” transaction does not have to spend, or have any relation to, the funds received by its parent transaction; the relationship functions purely to account for the level of confidence attributed to transactions or state changes during the Snowball protocol phase.
These ancestor sets bundle together all transactions – both ancestors and progeny – along the edges of the DAG. When Avalanche initiates Snowball instances, it takes advantage of these ancestor sets so that a single transaction T represents all of its associated transactions when queried. Every path through the DAG leads back to a single sink known as the genesis vertex, so a vote for any transaction T automatically and implicitly generates a vote for all of the ancestors of T. In short, Avalanche creates families of transactions that are represented on the DAG and can be voted for as single entities.
Whenever enough nodes respond positively to a query regarding T and its ancestry, surpassing the α threshold, that transaction collects what is known as a chit, which represents a node’s record of the majority response it received during its query. Nodes then compute the confidence of T as the total number of chits in T’s progeny.
In the end, this means that nodes will only have to query a transaction once, and can then rely on new vertices in its progeny to build up its confidence.
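The ancestor/progeny bookkeeping can be sketched with a tiny DAG. The `Tx` class, field names, and traversals below are illustrative, not AvalancheGo’s actual structures: a vote for a vertex implicitly endorses its ancestors, while its confidence sums chits over its progeny.

```python
class Tx:
    """Illustrative DAG vertex; parent edges point toward the genesis vertex."""
    def __init__(self, name, parents=()):
        self.name, self.parents = name, list(parents)
        self.chit = 0                     # 1 once a query met the alpha threshold
        self.children = []
        for p in self.parents:
            p.children.append(self)

def ancestors(tx):
    """Every transaction on the paths from tx back to the genesis vertex."""
    seen, stack = set(), [tx]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(t.parents)
    return seen

def confidence(tx):
    """Confidence of tx = total chits in its progeny (tx included)."""
    seen, stack, total = set(), [tx], 0
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            total += t.chit
            stack.extend(t.children)
    return total
```

With a chain genesis ← a ← {b, c}, a vote on b implicitly covers a and the genesis vertex, while the confidence of a grows as its children b and c collect chits.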
1. Efficiency – A single vote on a vertex of the DAG implicitly creates votes for all the transactions on the path to the genesis vertex.
2. Security – Since transactions of the same ancestry are intertwined along the vertices of the DAG, their reversal is difficult to accomplish without the approval of the correct nodes.