(Aspnes & Herlihy)
Consensus problems arise in many distributed applications, for example, when it is necessary to agree whether to commit or abort a transaction in a distributed database. A distributed consensus protocol is an algorithm for ensuring that a collection of distributed processes, which start with some initial value (1 or 2) supplied by their environment, eventually terminate agreeing on the same value. Typical requirements for such a protocol are:
A number of solutions to the consensus problem exist. If the processes can exhibit stopping failures then the Termination requirement is too strong and must be replaced by:
Unfortunately, the fundamental impossibility result of [FLP85] demonstrates that there is no deterministic algorithm for achieving wait-free agreement in the asynchronous distributed model with communication via shared read/write variables even in the presence of at most one stopping failure. One solution is to use randomisation, which necessitates a weaker termination guarantee:
The algorithm we consider in this paper is due to Aspnes & Herlihy [AH90]. It is a complex algorithm which uses a sophisticated shared coin-flipping protocol. In addition to Validity and Agreement, it guarantees Probabilistic wait-free termination within polynomial expected time for the asynchronous distributed model with communication via shared read/write variables in the presence of stopping failures.
The randomised consensus protocol of Aspnes and Herlihy [AH90] consists of N asynchronous processes that communicate via read/write shared registers. The processes proceed through possibly unboundedly many rounds; at each round, they read the status of all other processes and attempt to agree. The agreement attempt involves a distributed random walk (a Markov decision process, i.e. a combination of nondeterministic and probabilistic choices): when the processes disagree, they use a shared coin-flipping protocol to decide their next preferred value. Achieving polynomial expected time depends in an essential way on ensuring that the probability that all non-failed processes draw the same value is above an appropriate bound.
One possible approach to analyse the Aspnes and Herlihy algorithm is to model and verify it using a probabilistic model checker such as PRISM. However, there are a number of problems with this approach:
Therefore, we adopt a different approach, introduced by Pogosyants, Segala and Lynch [PSL00]: we separate the algorithm into two communicating components, one non-probabilistic (asynchronous parallel composition of processes which periodically request the outcome of a coin protocol) and the other probabilistic (a coin-flipping protocol shared by the processes). For the non-probabilistic part we use the proof assistant Cadence SMV, which enables us to verify the non-probabilistic requirements for all N and for all rounds by applying the reasoning introduced in [MQS00]. The shared coin-flipping protocol is modelled and verified using the probabilistic model checker PRISM. For a finite number of processes (up to 10) we are able to mechanically calculate the exact (minimum) probability of the processes drawing the same value, as opposed to a lower bound established analytically in [AH90] using random walk theory. The correctness of the full protocol (for the finite configurations mentioned above) follows from the separately proved properties.
Recall that in order to verify the above protocol correct we need to establish three properties: Validity, Agreement and Probabilistic wait-free termination. The first two are independent of the actual values of probabilities. Therefore, we can verify these properties by conventional model checking methods, having replaced the probabilistic choices with nondeterministic ones.
We are left to consider Probabilistic wait-free termination. This property depends on the probabilistic properties of the coin-flipping protocol. However, there are several probabilistic progress properties which do not depend on any probabilistic assumptions. Similarly to the approach of [PSL00] we analyse such properties in the non-probabilistic variant of the model, except we use Cadence SMV, thus confining the probabilistic arguments to a limited section of the analysis.
Non-probabilistic Verification: the verification of Validity and Agreement and the non-probabilistic arguments for proving Probabilistic wait-free termination using Cadence SMV can be found here.
Probabilistic Verification: the verification of the probabilistic arguments for proving Probabilistic wait-free termination using PRISM can be found here.