(von Neumann)

**Related publications**: [NPKS05]

The case study concerns NAND multiplexing, a technique for constructing reliable computation from unreliable devices. As we move from deep submicron technology to nanotechnology for device manufacture, the need for architectures to be defect-tolerant is gaining importance. This is because, at the nanoscale, devices will be prone to errors due to manufacturing defects, ageing, and transient faults. Micro-architects will be required to design their logic around defect tolerance through redundancy.

In 1952, von Neumann studied the problem of constructing reliable computation from unreliable devices (due to unreliable valve based computers at that time), introducing a redundancy technique called NAND multiplexing [vN56]. He showed that, if the failure probabilities of the gates are sufficiently small and failures are statistically independent, then computations may be done with a high probability of correctness. Later, in [DO77], it was shown that a logarithmic redundancy is necessary for some Boolean function computations, and sufficient for all Boolean functions. In [Pip88], Pippenger showed that von Neumann's construction works only when the probability of failure per gate has a limit strictly less than 1/2, and that computation in the presence of noise (which can be seen as the presence of defect), requires more layers of redundancy. In [HJ02, NSF01], von Neumann's NAND multiplexing was compared to other techniques for fault-tolerance (such as R-fold Modular Redundancy and Reconfiguration) and some theoretical calculations were done to show that the redundancy level must be quite high to obtain acceptable levels of reliability.

The basic technique of multiplexing is to replace a processing unit by multiplexed units which
have **N** copies of every input and output of the unit. In each multiplexing unit, there are **N** devices
which in parallel process the copies of the inputs to give **N** outputs which represent the output of the
original processing unit. If the inputs and devices are reliable, then
each element of the output set should be identical and equal to the output of the corresponding processing
unit. However, in the case when there are errors
in the inputs and errors due to faulty devices, the outputs will not all take the same value.
Instead, after defining some critical level Δ ∈ (0,0.5),
the output of the multiplexing unit is considered to be stimulated
(taking logical value **true** or `**1**') if at least (1 - Δ) ⋅ **N**
of the outputs are stimulated and non-stimulated (taking logical value **false** or `**0**') if no more than Δ ⋅ **N** lines are stimulated.
In cases where the number of stimulated outputs does not meet either of these criteria, that is, the number
of stimulated outputs is in the interval (Δ ⋅ **N** , (1 - Δ)⋅ **N**), then the output is undecided,
and hence a malfunction will occur.

The basic design of a multiplexing unit consists of two stages:
the *executive* stage and the *restorative* stage.
The executive stage carries out the basic function of the performance unit to be replaced,
while the restorative stage is used to reduce the degradation in the executive stage caused by errors
in both the inputs and the faulty devices.

We now consider multiplexing in the case when the processing unit is a single NAND gate.
We therefore replace both the inputs and output of the gate with **N** copies
and in the executive stage duplicate the NAND gate **N** times, as shown below.

The unit **U** represents a *random permutation* of the input signals, that is, each signal of the first input is randomly
paired with a signal from the second input to form an input pair for one of the copies of the NAND gate.
Also shown in the figure is the restorative stage which is made using the same technique as the executive stage
duplicating the outputs of the executive stage to use as its inputs.
Note that, applying this approach only once will invert the result, therefore two steps are required.
To give a more effective restoration mechanism the restorative stage can be iterated [vN56].

In [vN56], von Neumann concluded that, for extremely large **N**, the number of stimulated outputs of the executive
stage is a stochastic variable, approximately
normally distributed, and he gave an upper bound of 0.0107 for the probability of gate failure that can be tolerated.
In other words, according to von Neumann, if the failure probability is
greater than this threshold, then the probability of the NAND multiplexing system failing will be
larger than a fixed, positive lower bound, no matter how large a bundle size is used.
Recently, it was shown that, if each NAND gate fails independently, the tolerable threshold probability
of each gate will be 0.08856 [EP98].

We now explain the PRISM model of von Neumann's NAND multiplexing system.
We first note that it was during this phase that we noticed the error made by [HJ02] in
modelling the random permutation made by the unit **U**.
In the analysis technique of [HJ02] the random permutation made by **U** is instead modelled
by a random choice *with* replacement.
More precisely, in the approach of [HJ02],
if in the previous stage there are **k** stimulated outputs, then after passing through the unit **U**
the probability that any one of the resulting inputs of the current stage
is stimulated is **k/N**.

As our analysis will demonstrate, these different approaches to modelling the unit **U** will lead to different conclusions
about the reliability of the system under study.
We should note however, that, as the bundle size increases, the differences between the results obtained
by these different approaches
will converge and, in fact, if the bundle sizes were infinite then these approaches would be equivalent.
This can be seen in our results where the difference between the two approaches when the bundle size 20
is greater than when the bundle size is 40.

The first approach to modelling NAND multiplexing in PRISM was to directly model the system given in the figure above:
for each stage construct a PRISM module for each of the **N** NAND gates in the stage and then combine
these modules through synchronous parallel composition.
However, this approach leads to the well know state space explosion problem. For example,
if I/O bundle size equals 20,
then just modelling the executive stage of the NAND multiplexing unit required more than 1.0e+14 states.

The important observation, which allowed us to overcome this problem, was
that the actual value of each input and output is not important: instead one need only store the total number of
stimulated (and non-stimulated) inputs and outputs. Furthermore, to allow us to compute these values,
without having to store all the outputs of the NAND gates in each stage, we replace the set of **N** NAND gates working in
*parallel* with **N** NAND gates working in *sequence*. This allows us to fold space into time, or in other words reuse the same NAND gate over time
rather than making redundancy over space. We also apply the same methodology to the stages, that is, reuse
the same module for each of the stages while keeping a record of the outputs from the previous stage.

Following these observations, the PRISM description of the NAND multiplexing system, for the case when the bundle size equals 20, is given in below. Note that taking this approach does not influence the performance of the system since each NAND gates works independently and the probability of each NAND gate failing is also independent.

// nand multiplex system // gxn/dxp 20/03/03 // U (correctly) performs a random permutation of the outputs of the previous stage dtmc const int N; // number of inputs in each bundle const int K; // number of restorative stages const int M = 2*K+1; // total number of multiplexing units // parameters taken from the following paper // A system architecture solution for unreliable nanoelectric devices // J. Han & P. Jonker // IEEEE trans. on nanotechnology vol 1(4) 2002 const double perr = 0.02; // probability nand works correctly const double prob1 = 0.9; // probability initial inputs are stimulated // model whole system as a single module by resuing variables // to decrease the state space module multiplex u : [1..M]; // number of stages c : [0..N]; // counter (number of copies of the nand done) s : [0..4]; // local state // 0 - initial state // 1 - set x inputs // 2 - set y inputs // 3 - set outputs // 4 - done z : [0..N]; // number of new outputs equal to 1 zx : [0..N]; // number of old outputs equal to 1 zy : [0..N]; // need second copy for y // initially 9 since initially probability of stimulated state is 0.9 x : [0..1]; // value of first input y : [0..1]; // value of second input [] s=0 & (c<N) -> (s'=1); // do next nand if have not done N yet [] s=0 & (c=N) & (u<M) -> (s'=1) & (zx'=z) & (zy'=z) & (z'=0) & (u'=u+1) & (c'=0); // move on to next u if not finished [] s=0 & (c=N) & (u=M) -> (s'=4) & (zx'=0) & (zy'=0) & (x'=0) & (y'=0); // finished (so reset variables not needed to reduce state space) // choose x permute selection (have zx stimulated inputs) // note only need y to be random [] s=1 & u=1 -> prob1 : (x'=1) & (s'=2) + (1-prob1) : (x'=0) & (s'=2); // initially random [] s=1 & u>1 & zx>0 -> (x'=1) & (s'=2) & (zx'=zx-1); [] s=1 & u>1 & zx=0 -> (x'=0) & (s'=2); // choose x randomly from selection (have zy stimulated inputs) [] s=2 & u=1 -> prob1 : (y'=1) & (s'=3) + (1-prob1) : (y'=0) & (s'=3); // initially random [] s=2 & u>1 & zy<(N-c) & zy>0 -> zy/(N-c) : (y'=1) & (s'=3) & (zy'=zy-1) + 1-(zy/(N-c)) : (y'=0) & (s'=3); [] s=2 & u>1 & zy=(N-c) & c<N -> 1 : (y'=1) & (s'=3) & (zy'=zy-1); [] s=2 & u>1 & zy=0 -> 1 : (y'=0) & (s'=3); // use nand gate [] s=3 & z<N & c<N -> (1-perr) : (z'=z+(1-x*y)) & (s'=0) & (c'=c+1) & (x'=0) & (y'=0) // not faulty + perr : (z'=z+(x*y)) & (s'=0) & (c'=c+1) & (x'=0) & (y'=0); // von neumann fault // [] s=3 & z<N -> (1-perr) : (z'=z+(1-x*y)) & (s'=0) & (c'=c+1) & (x'=0) & (y'=0) // not faulty // + perr : (z'=z+(x*y)) & (s'=0) & (c'=c+1) & (x'=0) & (y'=0); // von neumann fault [] s=4 -> true; endmodule // rewards: final value of gate rewards [] s=0 & (c=N) & (u=M) : z/N; endrewards

In the PRISM model given above we have assumed that the inputs **X** and **Y** are identically distributed (with probability 0.9
of being stimulated), and the NAND gates suffer from a von Neumann fault (inverted output) with probability 0.02.
Note that, in the executive stage, a random permutation of the inputs cannot be performed as we only know
the probability of each individual input being stimulated. However, for a fixed initial input,
one can easily modify the PRISM model so that the system performs a random permutation of the input.
To change the number of restorative stages, bundle size, input probabilities or probability of
the NAND gates failing requires only modification of the constants given at the start of the description.

To model the approach of [HJ02], the only modifications that need to be made are to the probabilities with which the variables x and y are set (when the local state s equals 1 and 2 respectively), that is, replace the random permutation of the inputs with a random choice with replacement.

The tables below shows statistics and construction times of the models when the probability of gate failure equals 0.02. The tables include:

- the number of
**states**in the DTMC representing the model; - both the number of
**nodes**and**leaves**of the MTBDD which represents the model; - the
**construct time**which equals the time taken for the system description to be parsed and converted to the MTBDD representing it, and the time to perform reachability, identify the non-reachable states and filter MTBDD accordingly; - the number of
**iterations**required to find the reachable states (which is performed via a BDD based fixpoint algorithm).

We also give the same statistics when the random permutation performed by the unit **U**
is replaced by a random choice with replacement as used by [HJ02].
The results corresponding to this case
are referenced 'UR', and we use 'U' to reference the results corresponding to the case when
**U** does indeed perform a random permutation.

bundle size (

**N**) equals 20Restorative stages: U (random permutation) UR (random choice with replacement) States: Nodes: Leaves: Construction: States: Nodes: Leaves: Construction: Time (s): Iter.s: Time (s): Iter.s: 1 78,311 20,838 131 1.373 241 69,741 1,730 23 0.21 241 2 154,921 20,847 131 2.53 401 137,781 1,739 23 0.32 401 3 231,531 20,848 131 3.706 561 205,821 1,740 23 0.44 561 4 308,141 20,856 131 4.846 721 273,861 1,748 23 0.56 721 5 384,751 20,856 131 6.093 881 341,901 1,748 23 0.67 881 6 461,361 20,857 131 7.279 1041 409,941 1,749 23 0.81 1041 7 537,971 20,857 131 8.511 1201 477,981 1,749 23 0.90 1201 8 614,581 20,865 131 9.598 1361 546,021 1,757 23 1.07 1361 9 691,191 20,865 131 10.94 1521 614,061 1,757 23 1.19 1521 10 767,801 20,865 131 11.98 1681 682,101 1,757 23 1.27 1681 11 844,411 20,865 131 13.44 1841 750,141 1,757 23 1.45 1841 12 921,021 20,866 131 14.43 2001 818,181 1,758 23 1.53 2001 13 997,631 20,866 131 16.01 2161 886,221 1,758 23 1.65 2161 bundle size (

**N**) equals 40Restorative stages: U (random permutation) UR (random choice with replacement) States: Nodes: Leaves: Construction: States: Nodes: Leaves: Construction: Time (s): Iter.s: Time (s): Iter.s: 1 1,004,821 46,221 898 5.42 481 534,681 3,262 43 0.61 481 2 2,003,041 46,230 898 9.97 801 1,062,761 3,271 43 1.00 801 3 3,001,261 46,231 898 16.4 1,121 1,590,841 3,272 43 1.45 1,121 4 3,999,481 46,239 898 20.5 1,441 2,118,921 3,280 43 2.03 1,441 5 4,997,701 46,239 898 24.2 1,761 2,647,001 3,280 43 2.44 1,761 6 5,995,921 46,240 898 30.7 2,081 3,175,081 3,281 43 2.69 2,081 7 6,994,141 46,240 898 35.1 2,401 3,703,161 3,281 43 3.96 2,401 bundle size (

**N**) equals 60Restorative stages: U (random permutation) UR (random choice with replacement) States: Nodes: Leaves: Construction: States: Nodes: Leaves: Construction: Time (s): Iter.s: Time (s): Iter.s: 1 4,717,531 100,443 2,421 11.7 721 1,778,821 4,328 63 1.91 721 2 9,420,361 100,452 2,421 27.9 1,201 3,542,941 4,337 63 2.15 1,201 3 14,123,191 100,453 2,421 37.4 1,681 5,307,061 4,338 63 2.59 1,681 4 18,826,021 100,461 2,421 52.8 2,161 7,071,181 4,346 63 3.58 2,161 5 23,528,851 100,461 2,421 62.0 2,641 8,835,301 4,346 63 4.62 2,641 6 28,231,681 100,462 2,421 78.8 3,121 10,599,421 4,347 63 4.96 3,121 7 32,934,511 100,462 2,421 99.2 3,601 12,363,541 4,347 63 5.74 3,601

The properties we consider are the probability of there being k stimulated outputs (which in terms of the PRISM model presented in above, corresponds to the probability of reaching the state where z=k, u=3 and c=20), for k=0,...,N where N is the system's I/O bundle size. By performing this analysis we have in fact computed the output distribution of the system, and hence any measure of reliability can be calculated from these results. Note that PRISM can be used directly for computing these measures of reliability, for example, the probability of errors being less than than k% and the expected number of incorrect outputs of the system.

The model checking statistics given below are for calculating the probability of there being 0 stimulated outputs when the probability of gate failure equals 0.01 using PRISm's hybrid engine.

There are two precomputation steps in the model checking procedure involving
BDD based fixpoint algorithms: **Prob1** and **Prob0**
which finds those states such that the probability
of satisfying the formula is 1 and 0 respectively.

In the main algorithm we stop iterating when the *relative
error* between subsequent iteration vectors is less than 1e-6,
that is:

bundle size (

**N**) equals 20Restorative stages: U (random permutation) UR (random choice with replacement) Time: Iterations: Time: Iterations: Prob1: Prob0: Main: Prob1: Prob0: Main: 1 3.29 246 241 2 0.411 246 241 2 2 5.30 406 401 2 0.779 406 401 2 3 7.61 566 561 2 1.091 566 561 2 4 9.71 726 721 2 1.521 726 721 2 5 12.3 886 881 2 1.862 886 881 2 6 14.6 1,046 1,041 2 2.19 1,046 1,041 2 7 16.0 1,206 1,201 2 2.42 1,206 1,201 2 8 19.2 1,366 1,361 2 2.72 1,366 1,361 2 9 21.1 1,526 1,521 2 3.11 1,526 1,521 2 10 23.9 1,686 1,681 2 3.44 1,686 1,681 2 11 27.8 1,846 1,841 2 4.00 1,846 1,841 2 12 28.9 2,006 2,001 2 4.39 2,006 2,001 2 13 32.0 2,166 2,161 2 4.55 2,166 2,161 2 bundle size (

**N**) equals 40Restorative stages: U (random permutation) UR (random choice with replacement) Time: Iterations: Time: Iterations: Prob1: Prob0: Main: Prob1: Prob0: Main: 1 28.4 486 481 2 2.78 486 481 2 2 46.1 806 801 2 4.53 806 801 2 3 64.3 1,126 1,121 2 5.90 1,126 1,121 2 4 76.1 1,446 1,441 2 7.19 1,446 1,441 2 5 88.3 1,766 1,761 2 9.93 1,766 1,761 2 6 118 2,086 2,081 2 12.4 2,086 2,081 2 7 130 2,406 2,401 2 13.6 2,406 2,401 2 bundle size (

**N**) equals 40Restorative stages: U (random permutation) UR (random choice with replacement) Time: Iterations: Time: Iterations: Prob1: Prob0: Main: Prob1: Prob0: Main: 1 48.7 726 721 2 3.88 726 721 2 2 97.1 1,206 1,201 2 10.6 1,206 1,201 2 3 140 1,686 1,681 2 13.6 1,686 1,681 2 4 170 2,166 2,161 2 17.4 2,166 2,161 2 5 254 2,646 2,641 2 20.8 2,646 2,641 2 6 427 3,126 3,121 2 22.7 3,126 3,121 2 7 608 3,606 3,601 2 24.2 3,606 3,601 2

We now study the performance of NAND multiplexing systems both when the I/O bundles are of size 20, 40 and 60.
In all the experiments, we assume that the inputs **X** and **Y** are identical
(this is often true in circuits containing similar devices) and that initially 90% of the inputs are
stimulated (correct). We suppose that the gate failure in each NAND is a von Neumann fault,
that is, when a gate fails, the value of its output is inverted.

Our analysis of the reliability of the NAND multiplexing system using probabilistic model checking concentrates on the effects of the failure probabilities of the NAND gates and of the number of restorative stages. Recall that, to change either of these in the PRISM language description of the system, one needs only change the parameter probz or the parameter M. The results we present show:

- the shape of the output distribution as the probability of gate failure varies;
- an analysis of reliability, in terms of the probability that at most 10% of the outputs are incorrect, as the probability of gates failure varies;
- for different probabilities of gate failure, the resulting change in shape of the output distribution when additional restorative stages are added;
- how, in the case when the probability of gate failure is very small, the reliability can be improved by increasing the number of restorative stages;
- by comparing the probability that at most 10% of the outputs are incorrect and the expected percentage of incorrect outputs for different numbers of restorative stages, the maximum probability of gate failure allowed for the system to function reliably.

In graphs below we present, for systems with a different numbers of restorative stages and both in the case when the unit **U** performs a random
permutation and when it performs a random choice with replacement,
the output distribution when the probability of gate failure equals 0.1, 0.04, 0.02 and 0.0001.

Note that the system is supposed to model a correctly functioning NAND gate and we assume that the inputs are correct when stimulated. Hence, the more outputs that are non-stimulated, the greater the reliability of the system.

**1 restorative stage**(random permutation)**1 restorative stage**(random choice with replacement)**4 restorative stages**(random permutation)**4 restorative stages**(random choice with replacement)**7 restorative stages**(random permutation)**7 restorative stages**(random choice with replacement)

In graphs below we present, for systems with a different numbers of stages and both in the case when the unit **U** performs a random
permutation and when it performs a random choice with replacement,
the output distribution when the probability of gate failure equals 0.1, 0.04, 0.02 and 0.0001.

Note that the system is supposed to model a correctly functioning NAND gate and we assume that the inputs are correct when stimulated. Hence, the more outputs that are non-stimulated, the greater the reliability of the system.

**1 restorative stage**(random permutation)**1 restorative stage**(random choice with replacement)**4 restorative stages**(random permutation)**4 restorative stages**(random choice with replacement)**7 restorative stages**(random permutation)**7 restorative stages**(random choice with replacement)

In graphs below we present, for systems with a different numbers of stages and both in the case when the unit **U** performs a random
permutation and when it performs a random choice with replacement,
the output distribution when the probability of gate failure equals 0.1, 0.04, 0.02 and 0.0001.

Note that the system is supposed to model a correctly functioning NAND gate and we assume that the inputs are correct when stimulated. Hence, the more outputs that are non-stimulated, the greater the reliability of the system.

**1 restorative stage**(random permutation)**1 restorative stage**(random choice with replacement)**4 restorative stages**(random permutation)**4 restorative stages**(random choice with replacement)**7 restorative stages**(random permutation)**7 restorative stages**(random choice with replacement)

The following graph plots the probability that at most 10% of the outputs of the overall system are incorrect (stimulated) against the failure probability of the gates.

bundle size (

**N**) equals 20bundle size (

**N**) equals 40bundle size (

**N**) equals 60

The following graph plots the probability that at most 10% of the outputs of the overall system are incorrect (stimulated) against the number of stages of the system.

bundle size (

**N**) equals 20bundle size (

**N**) equals 40bundle size (

**N**) equals 60

The following graph plots the expected percentage of incorrect (stimulated) inputs against the number of stages of the system.

bundle size

**N**equals 20bundle size

**N**equals 40bundle size

**N**equals 60