Quantitative risk management
Quantitative risk illuminates the levers that link decisions to value.
This article illustrates the construction of a simple stochastic model with a straightforward example that captures the key categories of risk.
My daily cycle is circumscribed by uncertainty. What sort of form am I in on the day? Which way is the wind blowing and how strong? Will I catch the lights? Will I bump into someone I know? Will the bike break down?
The parallels to project management are all too apparent. Project teams and their governance can be in better or worse form. Projects encounter headwinds and stop lights and they break down.
My goal is to get to work on time and the project wants to deliver on schedule. The analysis is analogous.
Goal-oriented & actionable
The first step in the construction of any meaningful model is to identify the objectives we are trying to achieve and the decisions we can make in order to achieve them.
Safety first: my primary objective is to make it to work in one piece. I would also like to make it to work on time, but this being a stochastic model, I can do better than that: I would like to be late to work, on average, less than once a week. I will, cravenly, forgo a discussion of the trade-off between the probability of an accident and savings in time and cost, which is a subject for an article in itself, and resolve to cycle carefully.
My decision space is dominated by choosing my departure time. In a longer article, we might also look at bike maintenance and service and even the capital investment of buying a new bike.
Predictive and stochastic
Paraphrasing ISO31000, “Risk management is the coordination of activities to direct and control an organization with regard to the effect of uncertainty on objectives.” Our model aims to explicate exactly the relationships between decisions, objectives and the unruly turmoil of uncertainty that lies in between.
These relations are, and must be, causal; it is not enough to describe historical contingencies. We want to understand how decisions propagate through causal chains and interact with uncertainties to affect outcomes. And the model is born stochastic; we don’t begin with a deterministic model and look at potential deviations. If risk is the effect of uncertainty on objectives, then we’d better place uncertainty at the centre of our analysis of risk.
The causal map
A causal map serves three ends. First, it is the basis of the quantitative model on which the decisions will be based. Second, and far more importantly, it provides common ground for analysts and decision makers to meet. Causality captures the experience and insight of experts because our intuition about causes is remarkably well developed, in contrast to our intuition about probability, which is equally remarkably underdeveloped. Causal maps bring all stakeholders as close to the model as they need to approach, without getting them snarled in a mathematical tangle. Finally, the causal map shows the relationship between the fundamental elements of the model and the data we will use to condition it.
The causal map starts with decisions and objectives. These are the poles of the axis around which the map is built. The simplest meaningful map, with a decision, an objective and the uncertainty we’re trying to understand, is shown above. Indeed, in many respects, this map may be good enough. We understand the basic relationship (arrival time = departure time + travel time) and, with enough data on travel times, we have a simple, powerful decision model.
Virtuous and vicious simplicity
Virtuous simplicity contains its complexity. The figure above is a virtuous simplification because the simplified parameter — the travel time — can be unpacked and explicated, either for the sake of better modelling or for the sake of introducing more controls. We will do this now. Vicious simplification ignores complexity. It would stop here.
Hopping quickly over the intermediary steps in the mapping process, the full causal map for the cycling model is shown here. The decision to include notes on modelling (which distributions we use) and data is a question of taste and audience, though it’s common practice to note at least the data sources we use.
The anatomy of a causal map
Primary nodes, those that do not have an arrow coming into them, are either primary uncertainties (in which case they are described in the model by a probability distribution) or decisions. In a causal sense, the other nodes are parameters that are affected by the nodes whose arrows point towards them. Mathematically, the arrows denote functional dependence: an intermediary node is simply a function of the nodes that point to it. It is a matter of taste whether all the primary uncertainties are included on the map. One might, for example, choose not to map traffic lights, but just to absorb them into the travel time calculation.
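Because an intermediary node is just a function of its parents, a causal map can be evaluated mechanically in topological order once the primary nodes have values. The Python sketch below assumes several things that are not in the article. The node names, the hypothetical 8 km route and the linear speed relation are all stand-ins. Delays are also treated as a primary node here, purely for brevity.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical fragment of the cycling map: intermediary node -> (parents, function).
# Primary nodes (uncertainties and decisions) have no entry; their values are supplied.
NODES = {
    # Placeholder linear relation: speed (km/h) falls with headwind (km/h).
    "speed": (["form", "wind"], lambda form, wind: form - 0.3 * wind),
    # Hypothetical 8 km route; time in minutes.
    "undelayed_time": (["speed"], lambda speed: 60 * 8.0 / speed),
    "travel_time": (["undelayed_time", "delays"], lambda t, d: t + d),
    "arrival": (["departure", "travel_time"], lambda dep, tt: dep + tt),
}

def evaluate(primary):
    """Evaluate every intermediary node from the values of the primary nodes."""
    graph = {name: set(parents) for name, (parents, _) in NODES.items()}
    values = dict(primary)
    for name in TopologicalSorter(graph).static_order():
        if name in values:
            continue  # primary node: value supplied by the caller
        parents, fn = NODES[name]
        values[name] = fn(*(values[p] for p in parents))
    return values

print(evaluate({"form": 24.0, "wind": 0.0, "delays": 2.0, "departure": 0.0}))
```

With this structure, adding or unpacking a node is a local change to the map. The evaluation order never has to be written by hand.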
This structure, apart from formalizing stakeholders’ intuition with respect to influences and effects, makes the final translation of the causal model to a mathematical model simple and systematic. It also ensures that the stakeholders retain a high level of ownership over the final mathematical model, allowing analysts to background the mathematical detail and focus attention on the key drivers.
I’ve put the mathematical details of the cycling example in an appendix. Briefly:
- Travel time is made up of an undelayed travel time, calculated from an average speed, and delays.
- Delays are made up of traffic lights and extraordinary delays, notably failures (punctures, breakdowns) and chance encounters.
- Speed is determined by wind (speed and direction) and form, effectively a constant related to power.
- Dimensional analysis gives the speed as a function of the wind speed in the direction of travel and the power constant.
- The power constant is fitted to data in the cases where 1) I’m fresh, 2) I haven’t slept well (lots of data: I have three small kids) and 3) I’m hungover (very few data, for the same reason).
The analogy with project schedule risk is direct. Undelayed travel time is the time you would expect to complete a project (or phase), and speed is analogous to the efficiency of the project team, which may be a function of resources (wind), form (experience) and so on. Discrete project schedule risks are modelled in the same way as my delays.
Speed (efficiency) is conditioned on experience after adjusting for delays; delays are conditioned directly on experience. We don’t need a huge amount of data to get started.
Once we have built our model, we can combine the uncertainties defined from the data sources we have identified, using the relations we have established. The easiest, but not the only, way to do this is with a Monte Carlo simulation. All these simulations were carried out in the free version of VoseSoftware’s ModelRisk, which allows you to carry out Monte Carlo simulations in an Excel spreadsheet.
A Monte Carlo simulation is essentially a large number of experiments. In each experiment, each of the primary variables is chosen at random, but in such a way that the statistics of each primary variable across all the experiments honours the prescribed probability distribution. The variables are combined according to the model and the outcome — in this case the travel time — is calculated. After we have carried out the large number of experiments (10,000 in this case), we can then look at the statistics of the outcome.
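The experiment loop described above is short enough to write out. This is a minimal Python sketch, not the author's ModelRisk setup. The two primary variables and their distributions are hypothetical placeholders, chosen only to make the mechanics visible.

```python
import random
import statistics

def simulate(model, samplers, n=10_000, seed=1):
    """Run n experiments: draw each primary variable from its sampler, then evaluate the model."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n):
        draws = {name: sample(rng) for name, sample in samplers.items()}
        outcomes.append(model(**draws))
    return outcomes

# Hypothetical primary variables (minutes); placeholders, not the article's fitted distributions.
samplers = {
    "undelayed": lambda rng: rng.gauss(20.0, 2.0),
    "lights": lambda rng: rng.uniform(0.0, 3.0),
}

def travel_time(undelayed, lights):
    return undelayed + lights

times = simulate(travel_time, samplers)
print(f"mean {statistics.mean(times):.1f} min, sd {statistics.stdev(times):.1f} min")
```

Fixing the seed makes a run reproducible. This matters when comparing policies, because differences then come from the model and not from sampling noise.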
Looking at the results here, we see that 80% of cases come in with a travel time under 26 minutes. This matters to me because I don’t want to be late more than once a week on average, and with five commutes a week that means a probability of being late of at most 20% on any given day. If I leave 26 minutes before the time I need to be at work, my probability of being late is 20%. That doesn’t guarantee I’ll be late exactly once a week, but it fulfils my criterion in the long run.
Wind speed and wind direction are simulated in the results above, but a quick look at my phone before I leave gives me the wind speed and direction for the day. It is then a simple matter to run the model with those uncertainties removed. The range narrows, but because I have a headwind today, the distribution shifts right and I now need to leave 30 minutes before work to maintain my punctuality statistics.
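Choosing the departure time amounts to reading an empirical quantile off the simulated travel times. With a tolerance of one late arrival in five commutes, the buffer is the 80th percentile. The Python sketch below assumes normally distributed travel times purely as a placeholder for simulation output.

```python
import math
import random

def departure_buffer(travel_times, p_late):
    """Smallest simulated buffer (minutes before start) with P(travel time > buffer) <= p_late."""
    ordered = sorted(travel_times)
    k = math.ceil((1.0 - p_late) * len(ordered)) - 1  # index of the (1 - p_late) empirical quantile
    return ordered[max(k, 0)]

def prob_late(travel_times, buffer):
    """Share of simulated trips that take longer than the buffer."""
    return sum(t > buffer for t in travel_times) / len(travel_times)

# Late at most once in five working days -> p_late = 1/5.
rng = random.Random(7)
times = [rng.gauss(22.0, 4.0) for _ in range(10_000)]  # placeholder travel times (minutes)
buffer = departure_buffer(times, p_late=1 / 5)
print(f"leave {buffer:.1f} min early; P(late) = {prob_late(times, buffer):.3f}")
```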
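Removing an uncertainty once it is observed simply means fixing that primary variable instead of sampling it. The toy Python model below illustrates this; its linear wind effect, its distributions and the 5 m/s forecast headwind are all hypothetical. It shows the same two effects as the text: the spread narrows, and a headwind shifts the 80th percentile to the right.

```python
import random

def travel_time(rng, headwind=None):
    """One hypothetical commute in minutes; headwind in m/s is sampled unless supplied."""
    if headwind is None:
        headwind = rng.gauss(0.0, 4.0)  # placeholder: unknown wind, negative = tailwind
    lights = rng.uniform(0.0, 3.0)
    return 20.0 + 0.8 * headwind + lights

def buffer_80(samples):
    """Approximate 80th percentile of the simulated travel times."""
    return sorted(samples)[int(0.8 * len(samples))]

rng = random.Random(3)
unconditional = [travel_time(rng) for _ in range(10_000)]
todays_wind = 5.0  # hypothetical headwind read off the phone
conditional = [travel_time(rng, headwind=todays_wind) for _ in range(10_000)]
print(f"buffer: {buffer_80(unconditional):.1f} -> {buffer_80(conditional):.1f} min")
```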
Perhaps recalculating is too much work. I can introduce heuristics that adjust travel time for wind speed according to some simple algorithm and design policies that adjust for information made available through time. In this case, the range narrows so I’m less likely to turn up frustratingly early or embarrassingly late.
By looking at the travel time in the lowest and highest 10% (say) of instances of each primary variable, I can see the relative effect of each primary uncertainty on the total travel time. Wind is clearly dominant here. I can’t avoid meeting people and will always stop if I do, but this analysis tells me that breakdowns and punctures also make a significant contribution.
This kind of sensitivity analysis tells me what good and bad look like, but not how often things are good or bad. For that we need a different kind of sensitivity analysis, though we can always simply model an intervention, for example introducing maintenance, and see the effect.
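The decile sensitivity described above (often drawn as a tornado chart) can be computed from stored Monte Carlo runs. For each input, average the output over the runs where that input sits in its bottom or top 10%. The Python sketch below uses two hypothetical inputs and a placeholder linear model.

```python
import random
import statistics

def tornado(inputs, output, frac=0.10):
    """For each input: mean output over runs where that input is in its lowest / highest frac."""
    n = len(output)
    k = max(1, int(frac * n))
    bars = {}
    for name, xs in inputs.items():
        order = sorted(range(n), key=lambda i: xs[i])  # run indices ranked by this input
        low = statistics.mean(output[i] for i in order[:k])
        high = statistics.mean(output[i] for i in order[-k:])
        bars[name] = (low, high)
    # Widest bar first, as in a tornado chart.
    return dict(sorted(bars.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True))

# Hypothetical runs: headwind (m/s) and light delays (minutes) feeding a placeholder model.
rng = random.Random(11)
n = 10_000
wind = [rng.gauss(0.0, 4.0) for _ in range(n)]
lights = [rng.uniform(0.0, 3.0) for _ in range(n)]
times = [20.0 + 0.8 * w + l for w, l in zip(wind, lights)]
bars = tornado({"wind": wind, "lights": lights}, times)
for name, (low, high) in bars.items():
    print(f"{name:>6}: {low:.1f} to {high:.1f} min")
```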
On the face of it, introducing maintenance has very little effect, but in fact it halves my probability of being more than 15 minutes late.
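An intervention is assessed by running the model with and without it and comparing a tail probability, such as the chance of being more than 15 minutes late. The Python sketch below assumes several things not in the article. The breakdown probabilities are hypothetical, and so is the assumption that maintenance halves them. The delay distributions are placeholders. The sketch also uses common random numbers: both scenarios share a seed and make the same draws each trip, so the comparison is not blurred by sampling noise.

```python
import random

def lateness(rng, breakdown_prob):
    """Minutes late for one hypothetical commute; every draw is made on every trip."""
    ordinary = rng.gauss(-3.0, 3.0)     # placeholder: usual variation around the buffer
    u = rng.random()                    # decides whether a breakdown happens
    repair = rng.expovariate(1 / 20.0)  # placeholder breakdown delay, mean 20 min
    return ordinary + (repair if u < breakdown_prob else 0.0)

def p_exceed(breakdown_prob, threshold=15.0, n=100_000, seed=5):
    rng = random.Random(seed)  # same seed in both scenarios: common random numbers
    return sum(lateness(rng, breakdown_prob) > threshold for _ in range(n)) / n

without = p_exceed(breakdown_prob=0.02)
with_maintenance = p_exceed(breakdown_prob=0.01)  # assumed effect of maintenance
print(f"P(>15 min late): {without:.4f} -> {with_maintenance:.4f}")
```

Here, breakdowns dominate the tail, so halving their rate roughly halves the exceedance probability. They barely move the average travel time, though.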
Even the simplest quantitative models provide enormous power and versatility for addressing the effects of uncertainty on objectives, and for directly assessing the controls and policies put in place to manage those effects.
The manifold miseries of matrices
The cycling example also furnishes good examples of the problems of approaching risk management through risk matrices or heat maps. The heat map here is based on a real heat map used by a large multinational company (no longer with us). The “probability” scale is unchanged, but the severity has been rescaled to this particular problem.
My first significant challenge (not including the simple semantic issues of identifying what constitutes a risk versus a trigger, a cause or a consequence — a problem neatly circumvented by a causal map) is ambiguity with respect to where to place the risks on the matrix.
Wind can delay me by up to 20 minutes. A five-minute delay, which puts me in the third severity column, occurs about a third of the time, so comfortably in the 25% to 75% range. But a nine-minute delay (same column) is relatively rare: certainly less than 10%, but not less than 5%. A small delay occurs about half the time. So depending on how I parse the question, I can place the Wind risk in three different boxes. Different interpretations give different ambiguities. For example, starting with the probability that there’s a headwind (about half) and then setting the severity (anything from zero to 20 minutes) gives an entry in every column of the third row.
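The placement ambiguity can be made concrete. A single delay distribution spreads its probability across several severity columns, and each column's share maps to its own probability row. In the Python sketch below, the band edges and the wind-delay distribution are hypothetical. The only constraint taken from the text is that 5 and 9 minutes land in the third column.

```python
import random
from bisect import bisect_right

# Hypothetical heat-map bands.
SEVERITY_EDGES = [2, 5, 10, 15]              # columns (minutes): <2, 2-5, 5-10, 10-15, >=15
PROB_EDGES = [0.01, 0.05, 0.10, 0.25, 0.75]  # rows: <1%, 1-5%, 5-10%, 10-25%, 25-75%, >=75%

def cells(delays):
    """For each severity column: (column index, probability, probability-row index)."""
    n = len(delays)
    counts = [0] * (len(SEVERITY_EDGES) + 1)
    for d in delays:
        counts[bisect_right(SEVERITY_EDGES, d)] += 1
    return [(col, c / n, bisect_right(PROB_EDGES, c / n)) for col, c in enumerate(counts)]

rng = random.Random(2)
wind_delay = [max(0.0, rng.gauss(2.0, 4.0)) for _ in range(10_000)]  # placeholder delays
for col, p, row in cells(wind_delay):
    print(f"severity column {col}: P = {p:.3f} -> probability row {row}")
```

One risk produces several candidate boxes. The matrix forces a choice that the distribution itself does not make.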
Form is actually quite clear. I fail to sleep well about 20% of the time and its impact is almost invariably between 1 and 5 minutes. But the problem returns with a vengeance with encounters. The probability of a chance encounter is 5%, the severity anything from 0 to around 30 minutes, with a steadily decreasing probability the longer the duration. Not, however, decreasing fast enough that the probability drops under 1% before I get to the worst severity column. Similar issues plague the puncture.
The crudity of the calculus of colour carries over into our assessment of controls and mitigations. I’ve taken the gross risk as that implied by leaving an average cycling time before work starts. With the first interpretation above, that puts us in one of the three boxes shown in the figure. Leaving early is the obvious mitigation or policy to address this, but leaving even substantially earlier will only reduce my probability score by one row. So yellow stays yellow, even though leaving early is clearly a magnificent mitigation for this risk.
With an appropriate use of drugs, I may be able to ensure a good night’s sleep, and I can resolve not to go out on the town on a school night. Doing so reduces my chance of a loss of form to under 1%, so yellow becomes green. But we can see form has the smallest effect on travel time. Similarly, we know maintenance doesn’t make that much difference away from the extreme events, but the mitigation manages to reduce it from an ambiguous yellow / green to an unambiguous green.
In summary, the heat map has communicated that the most effective mitigation has no effect and the two least effective mitigations have materially affected our risk profile.
- Models must be goal-oriented and actionable
  - They take their point of departure in decisions and objectives
- Models should be predictive and stochastic
  - They are couched in the language of intervention and explicitly embrace uncertainty
- Causal mapping is intuitive and accessible
  - It exploits causal intuition and provides an accessible basis for rigorous mathematical modelling
- Causal mapping is (virtuously) simple, scalable and fit-for-purpose
  - Models can be rolled up or folded out as required for decisions (and modelling detail)
- Causal maps provide a direct link to data and are verifiable
  - Models are built around available data
  - Models are checked and refined against prediction
Appendix: Details of the cycling model
Meeting someone is a Bernoulli event (I don’t stop twice) and the “probability” (really a frequency, but they’re nearly the same when they’re small) is just set to the historical rate at which I have met people.
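A Bernoulli encounter with a random stop length takes only a few lines. The Python sketch below assumes hypothetical history counts (12 meetings in 240 trips, giving the 5% quoted earlier). The Weibull scale and shape are also hypothetical. Python's `random.weibullvariate(alpha, beta)` takes the scale first and the shape second.

```python
import random

def encounter_probability(trips, trips_with_meeting):
    """Bernoulli probability set to the historical frequency of meeting someone."""
    return trips_with_meeting / trips

def encounter_delay(rng, p, scale, shape):
    """At most one stop per trip (Bernoulli); stop length is Weibull(scale, shape) minutes."""
    if rng.random() < p:
        return rng.weibullvariate(scale, shape)
    return 0.0

p = encounter_probability(trips=240, trips_with_meeting=12)  # hypothetical history
rng = random.Random(8)
delays = [encounter_delay(rng, p, scale=6.0, shape=1.2) for _ in range(20_000)]
share_stopped = sum(d > 0 for d in delays) / len(delays)
print(f"p = {p:.3f}, simulated share of trips with a stop = {share_stopped:.3f}")
```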
Breakdowns are Poisson events (I can break down more than once). The rate is set to the historical rate at which I have broken down.
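Python's standard library has no Poisson sampler. For the small rates involved here, Knuth's multiplication method is a simple, exact alternative. In the sketch below, the rate of two breakdowns in 400 trips is hypothetical. Unlike the encounter model, a trip can produce more than one breakdown.

```python
import math
import random

def poisson(rng, lam):
    """Poisson draw by Knuth's method: count uniforms multiplied before the product drops to exp(-lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

rate = 2 / 400  # hypothetical: two breakdowns in 400 recorded trips
rng = random.Random(9)
breakdowns = [poisson(rng, rate) for _ in range(10_000)]
print(f"trips with at least one breakdown: {sum(b > 0 for b in breakdowns)}")
```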
The severity of both encounters and breakdowns is modelled by a Weibull distribution. Both are parameterized by their mean delay, together with the standard deviation in the case of meetings and a fudgy guess for the variance of breakdowns, because I only have two breakdowns in my dataset (a puncture and ripping my chainwheel off).
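Matching a Weibull to a mean and a standard deviation is a method-of-moments fit. The coefficient of variation depends only on the shape k, and it falls as k rises, so k can be found by bisection. The scale then follows from the mean:

mean = λ Γ(1 + 1/k),  sd² = λ² [Γ(1 + 2/k) − Γ(1 + 1/k)²]

The Python sketch below is one way to do this, not necessarily how ModelRisk parameterizes it. The example figures are hypothetical. For breakdowns, the "fudgy guess" would simply be passed in as the standard deviation.

```python
import math
import random

def weibull_cv(k):
    """Coefficient of variation of a Weibull distribution with shape k."""
    g1 = math.gamma(1 + 1 / k)
    g2 = math.gamma(1 + 2 / k)
    return math.sqrt(g2 / g1**2 - 1)

def weibull_from_moments(mean, sd, lo=0.1, hi=50.0):
    """Return (scale, shape) whose mean and standard deviation match the inputs."""
    target = sd / mean
    if not weibull_cv(hi) < target < weibull_cv(lo):
        raise ValueError("coefficient of variation outside the bracketed shape range")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if weibull_cv(mid) > target:
            lo = mid  # CV falls as k rises, so the root lies above mid
        else:
            hi = mid
    k = 0.5 * (lo + hi)
    return mean / math.gamma(1 + 1 / k), k

scale, shape = weibull_from_moments(mean=8.0, sd=4.0)  # hypothetical meeting delays (minutes)
rng = random.Random(6)
sample = [rng.weibullvariate(scale, shape) for _ in range(10_000)]
print(f"scale {scale:.2f}, shape {shape:.2f}, sample mean {sum(sample) / len(sample):.2f}")
```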
Traffic lights are modelled by Bernoulli distributions for whether they are red and uniform distributions for waiting time if they are. Probability is the proportion of time they’re red and the width of the uniform distribution is the time for which they are red. There are three lights and they are assumed independent. I’m pretty sure they are.
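Each light contributes a Bernoulli (is it red?) multiplied by a uniform wait over the red phase, and the three lights are summed independently. In the Python sketch below, the red fractions and red durations are hypothetical. Under these assumptions the expected total wait is the sum over lights of p × red/2.

```python
import random

# Hypothetical lights: (fraction of time red, red duration in minutes).
LIGHTS = [(0.4, 1.0), (0.5, 1.5), (0.3, 0.75)]

def light_delay(rng, lights=LIGHTS):
    """Total wait: for each independent light, Bernoulli(red) times Uniform(0, red duration)."""
    total = 0.0
    for p_red, red_len in lights:
        if rng.random() < p_red:
            total += rng.uniform(0.0, red_len)
    return total

rng = random.Random(10)
waits = [light_delay(rng) for _ in range(50_000)]
expected = sum(p * red / 2 for p, red in LIGHTS)
print(f"simulated mean {sum(waits) / len(waits):.3f} min vs expected {expected:.4f} min")
```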
Total travel time is an undelayed travel time plus the delays sampled with the distributions above.
Undelayed travel time is calculated from an average bike speed, which is calculated from a power model P = k (V-U)² V. This makes a one-parameter (P/k) model in terms of bike speed V and wind speed U.
In the forward problem, I calculate a zero-wind cycling speed from the power parameter (which is to be fitted). The resolved wind speed is calculated from the wind speed (Weibull distribution, data direct from a wind atlas) and the wind direction (a mixture of two von Mises distributions, with means on the prevailing wind directions and spread and mixing parameters eyeballed into place). I then use the power formula to give me a corrected speed, which is then used to calculate the undelayed travel time.
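With c = P/k, zero wind gives c = V₀³, so V₀ = c^(1/3). With wind, the speed solves (V − U)² V = c. The left-hand side increases for V > max(U, 0), so bisection on that interval is safe. The Python sketch below assumes U is the wind component along the direction of travel, positive for a tailwind, so that V − U is the speed relative to the air. The parameter values are hypothetical.

```python
def ground_speed(c, u):
    """Solve c = (V - u)**2 * V for V > max(u, 0) by bisection.

    c is the single power parameter P/k (m^3/s^3); u is the along-route wind in m/s
    (positive = tailwind)."""
    v0 = c ** (1 / 3)       # zero-wind speed: c = V0**3
    lo = max(u, 0.0)
    hi = lo + v0 + 1.0      # f(lo) < 0 < f(hi); the +1.0 pads against round-off
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (mid - u) ** 2 * mid < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def undelayed_minutes(c, u, distance_km):
    """Undelayed travel time for a route of distance_km at the wind-corrected speed."""
    speed_kmh = ground_speed(c, u) * 3.6
    return 60.0 * distance_km / speed_kmh

for u in (-3.0, 0.0, 3.0):  # hypothetical headwind, calm, tailwind
    print(u, round(undelayed_minutes(343.0, u, 6.0), 1))
```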
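The wind sampling step can be sketched with Python's standard library, which provides `random.weibullvariate` and `random.vonmisesvariate`. Everything numeric below is hypothetical: the Weibull parameters, the two mixture components, and the eastward route bearing. The sketch also assumes the meteorological convention that direction is where the wind blows from. Under that convention, a wind from behind the rider resolves to a positive (tail) component.

```python
import math
import random

# Hypothetical parameters.
SPEED_SCALE, SPEED_SHAPE = 6.0, 2.0  # Weibull wind speed (m/s)
MIXTURE = [(0.6, math.radians(240), 2.0), (0.4, math.radians(60), 3.0)]  # (weight, mean, kappa)
ROUTE_BEARING = math.radians(90)     # commute heads due east

def sample_direction(rng):
    """Direction the wind blows FROM, drawn from a two-component von Mises mixture."""
    r = rng.random()
    for weight, mu, kappa in MIXTURE:
        if r < weight:
            return rng.vonmisesvariate(mu, kappa)
        r -= weight
    return rng.vonmisesvariate(MIXTURE[-1][1], MIXTURE[-1][2])  # guards float round-off

def along_route(speed, from_dir, bearing):
    """Wind component along the route, positive = tailwind (wind blows towards from_dir + pi)."""
    return -speed * math.cos(from_dir - bearing)

def resolved_wind(rng):
    speed = rng.weibullvariate(SPEED_SCALE, SPEED_SHAPE)
    return along_route(speed, sample_direction(rng), ROUTE_BEARING)

rng = random.Random(4)
print([round(resolved_wind(rng), 1) for _ in range(5)])
```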
The power parameters are fitted to data for the cases where I am hungover, tired or neither (I don’t have data for hungover and well slept and assume it doesn’t happen). In the fitting, the lights are not controlled for.
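One simple way to fit the power parameter inverts the power formula trip by trip. Each recorded trip gives V = distance / duration, and so implies c = (V − U)² V. The fitted value for each condition (fresh, tired, hungover) then summarizes those implied values. The Python sketch below takes the median as that summary. This is an assumption on our part, not the author's stated method. The median is somewhat robust to trips slowed by the lights that the fitting does not control for. The trip records are hypothetical.

```python
import statistics
from collections import defaultdict

def fit_power(trips):
    """trips: iterable of (condition, distance_m, duration_s, along_route_wind_ms).

    Each trip implies c = (V - U)**2 * V with V = distance / duration; the fitted
    parameter for a condition is the median of its implied values."""
    implied = defaultdict(list)
    for condition, distance, duration, u in trips:
        v = distance / duration
        implied[condition].append((v - u) ** 2 * v)
    return {cond: statistics.median(cs) for cond, cs in implied.items()}

# Hypothetical trip records on a 6 km route.
trips = [("fresh", 6000, 600, 0.0), ("fresh", 6000, 620, 1.0), ("tired", 6000, 750, 0.0)]
print(fit_power(trips))
```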