www.prismmodelchecker.org

Workstation Cluster

(Haverkort, Hermanns & Katoen)

Contents

Introduction

This case study is based on a cluster of workstations, taken from [HHK00]. The system comprises two sub-clusters with N workstations in each, connected in a star topology. The switches connecting each sub-cluster are joined by a central backbone. All components can break down and there is a single repair unit to service all components.

Workstation cluster

We define two levels of Quality of Service (QoS):

  • minimum QoS: at least 3N/4 workstations are operational and connected via switches and backbone;
  • premium QoS: at least N workstations are operational and connected via switches and backbone.

The model is represented as a CTMC. The PRISM source code is given below.

// Workstation cluster [HHK00]
// dxp/gxn 11/01/00

ctmc

const int N; // Number of workstations in each cluster
const int left_mx = N; // Number of work stations in left cluster
const int right_mx = N; // Number of work stations in right cluster

// Minimum QoS requires 3/4 connected workstations operational
const int k = floor(0.75*N);
formula minimum = (left_n>=k & toleft_n) | 
                  (right_n>=k & toright_n) | 
                  ((left_n+right_n)>=k & toleft_n & line_n & toright_n);

// Failure rates
const double ws_fail = 1/500; // Single workstation: average time to fail = 500 hrs
const double switch_fail = 1/4000; // Switch: average time to fail = 4000 hrs
const double line_fail = 1/5000; // Backbone: average time to fail = 5000 hrs

// Left cluster
module Left 
	
	left_n : [0..left_mx] init left_mx; // Number of workstations operational
	left : bool; // Being repaired?
	
	[startLeft] !left & (left_n<left_mx) -> 1 : (left'=true);
	[repairLeft] left & (left_n<left_mx) -> 1 : (left'=false) & (left_n'=left_n+1);
	[] (left_n>0) -> ws_fail*left_n : (left_n'=left_n-1);
	
endmodule

// Right cluster
module Right = Left[left_n=right_n,
                    left=right,
                    left_mx=right_mx,
                    startLeft=startRight,
                    repairLeft=repairRight ]
endmodule

// Repair unit
module Repairman
	
	r : bool; // Repairing?
	
	[startLeft]    !r -> 10 : (r'=true); // Inspect Left 
	[startRight]   !r -> 10 : (r'=true); // Inspect Right 
	[startToLeft]  !r -> 10 : (r'=true); // Inspect ToLeft
	[startToRight] !r -> 10 : (r'=true); // Inspect ToRight 
	[startLine]    !r -> 10 : (r'=true); // Inspect Line 
	
	[repairLeft]    r -> 2     : (r'=false); // Repair Left 
	[repairRight]   r -> 2     : (r'=false); // Repair Right
	[repairToLeft]  r -> 0.25  : (r'=false); // Repair ToLeft
	[repairToRight] r -> 0.25  : (r'=false); // Repair ToRight
	[repairLine]    r -> 0.125 : (r'=false); // Repair Line
	
endmodule

// Line/backbone
module Line 
	
	line :   bool; // Being repaired?
	line_n : bool init true; // Working?
	
	[startLine] !line & !line_n -> 1 : (line'=true);
	[repairLine] line & !line_n -> 1 : (line'=false) & (line_n'=true);
	[] line_n -> line_fail : (line_n'=false);
	
endmodule

// Left switch
module ToLeft = Line[line=toleft,
                     line_n=toleft_n,
                     line_fail=switch_fail,
                     startLine=startToLeft,
                     repairLine=repairToLeft ]
endmodule

// Right switch
module ToRight = Line[line=toright,
                      line_n=toright_n,
                      line_fail=switch_fail,
                      startLine=startToRight,
                      repairLine=repairToRight ]
endmodule

// Reward structures

// Percentage of operational workstations stations
rewards "percent_op"
	true : 100*(left_n+right_n)/(2*N);
endrewards

// Time that the system is not delivering at least minimum QoS
rewards "time_not_min"
	!minimum : 1; 
endrewards

// Number of repairs
rewards "num_repairs"
	[repairLeft]    true : 1;
	[repairRight]   true : 1;
	[repairToLeft]  true : 1;
	[repairToRight] true : 1;
	[repairLine]    true : 1;
endrewards
View: printable version          Download: cluster.sm

Model Statistics

The table below shows statistics for each model we have built. The first section gives information about the CTMC representing the model: the number of states and the number of non zeros (transitions). The second part refers to the MTBDD which represents this CTMC: the total number of nodes, the number of leaves (terminal nodes), and the amount of memory needed to store it. The last column gives the amount of memory a sparse matrix would need to represent the same CTMC.

N:   Model:   MTBDD:   Sparse:
States: NNZ: Nodes: Leaves: Kb: Kb:
2 2761,120 33196.46 16.4
4 8203,616 5351110.5 51.9
8 2,77212,832 9681518.9 183
16 10,13248,160 1,8992337.1 683
32 38,676186,400 3,9063976.3 2,637
64 151,060733,216 8,22571160 10,362
128 597,0122,908,192 17,481134341 41,076
256 2,373,6521.15e+7 37,272262728 162,581
512 9,465,8764.6e+7 79,3995181,550 649,990

Model Construction Times

The table below gives the times taken to construct the models. This process involves first building a CTMC (represented as an MTBDD) from the system description (in our module based language), and then computing the reachable states using a BDD based fixpoint algorithm. The total time required and the number of fixpoint iterations are given below.

N:   Construction:
Time (s): Iter.s:
20.149
40.1513
80.2021
160.0537
320.1269
641.65133
1286.20261
25623.43517
512100.961,029

Steady-State Probability Computation

The table below gives the times taken to compute the steady-state probabilities using the JOR method. The number of iterations required and the time taken using the MTBDD, sparse and hybrid engines is as follows.

N:   Time per iteration: (s) Iterations:
MTBDD Sparse Hybrid
2 0.0280.000260.0002739
4 0.0520.000310.0006364
8 0.21 0.000850.0022 82
16 0.78 0.0035 0.0072 141
32 2.81 0.014 0.025 260
64 9.38 0.14 0.16 540
128 32.6 0.39 0.49 1,276
256 - - 2.16 3,634
512 - - 8.99 30,476

The steady-state probabilities allows use to compute properties such as:

  • S=? [ "premium" ]: the long-run probability that premium QoS will be delivered.
  • S=? [ !"minimum" ]: the long-run probability that QoS is below minimum.

Model Checking

We have model checked the following properties:

  • "The system will always be able to offer premium QoS at some point in the future." The corresponding CSL formula is given by:

    P>=1 [ true U "premium" ]

    Model checking times:

    N:   Prob1 precomputation:
    Time (s): Iterations:
    2 0.06318
    4 0.07526
    8 0.07342
    16 0.19274
    32 0.360138
    64 1.27 266
    128 3.30 522
    256 12.1 1034
    512 44.3 2058

    Conclusion: the property holds in all states
  • "The chance that the QoS drops below minimum quality within T time units." The corresponding CSL formula is given by

    P=? [ true U<=T !"minimum" ]

    In the graph below we have plotted these probabilities from a range of values of T and N.

    plot: the chance that the QoS drops below minimum quality within T time units
  • "When facing insufficient QoS, the (maximum) probability of facing the same problem after T time units." The corresponding CSL formula is given by

    P=? [ true U[T,T] !"minimum" {!"minimum"}{max} ]

    In the graph below we have plotted these probabilities from a range of values of T and N.

    plot: when facing insufficient QoS, the (maximum) probability of facing the same problem after T time units
  • "The probability of going from minimum QoS to premium QoS within T time units." The corresponding CSL formula is given by

    P=? [ true U<=T "premium" {"minimum"}{min} ]

    In the graph below we have plotted these probabilities from a range of values of T and N.

    plot: the probability of going from minimum QoS to premium QoS within T time units
  • "The probability of going from minimum QoS to premium QoS within T time units without violating the minimum QoS constraint along the way." The corresponding CSL formula is given by

    P=? [ "minimum" U<=T "premium" {"minimum"}{min}]

    In the graph below we have plotted these probabilities from a range of values of T and N.

    plot: the probability of going from minimum QoS to premium QoS within T time units without violating the minimum QoS constraint along the way
  • "The probability that it takes more than T time units to recover from insufficient QoS." The corresponding CSL formula is given by

    P=? [ !"minimum" U>=T "minimum" {!"minimum"}{max} ]

    In the graph below we have plotted these probabilities from a range of values of T and N.

    plot: the probability that it takes more than T time units to recover from insufficient QoS
  • "The minimum percentage of operational workstations at time T when starting from below minimum QoS." The corresponding CSL formula is given by

    R{"percent_op"}=?[ I=T {!"minimum"}{min} ]

    In the graph below we have plotted these results from a range of values of T and N.

    plot: the minimum percentage of operational workstations at time T when starting from below minimum QoS
  • "The expected time that the system is below minimum QoS until time T." The corresponding CSL formula is given by

    R{"time_not_min"}=?[ C<=T ]

    In the graph below we have plotted these results from a range of values of T and N.

    plot: the expected time that the system is below minimum QoS until time T (small time scale)
    plot: the expected time that the system is below minimum QoS until time T (large time scale)
  • "The expected number of repairs by time T." The corresponding CSL formula is given by

    R{"num_repairs"}=?[ C<=T ]

    In the graph below we have plotted these results from a range of values of T and N.

    plot: the expected number of repairs by time T

Case Studies