Probability audit
How do you check that your assessments of probability are any good?
If you tell me that there is a 30% chance of succeeding and you succeed, were you wrong? After all, you said you were less likely to succeed than to fail.
The problem is that both outcomes are quite plausible. There is simply too much uncertainty in the outcome for it to say anything clearly about the faithfulness of the assessment of probability.
Aggregation to reduce uncertainty
Now suppose you’re assessing the probability of 20 different events, with 20 different probabilities, ranging from 10% to 40% with an average of around 30%. (You may be drilling a sequence of exploration wells, for example.)
In terms of the number of successes, there are now 21 different outcomes and at least some of them are pretty unlikely. The probability distribution is shown here.
If you only had two successes, for example, you might feel your assessments had been a little rosy. The probability of so few is well under 10%; it could be bad luck, but you would have to have been wretchedly unlucky.
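The distribution of the number of successes can be computed directly. The sketch below assumes the 20 events are independent, and the list of probabilities is a made-up example chosen only to match the description (values between 10% and 40%, averaging 30%); it is not the data behind the figure.

```python
import numpy as np

def success_count_distribution(probs):
    """Probability of 0, 1, ..., n successes among n independent events."""
    dist = np.array([1.0])  # with no events, zero successes is certain
    for p in probs:
        # Fold in one more event: it fails with 1 - p, succeeds with p.
        dist = np.convolve(dist, [1.0 - p, p])
    return dist

# Hypothetical assessments (assumption): 20 values from 10% to 40%, mean 30%.
probs = np.array([0.10, 0.15, 0.20, 0.20, 0.25, 0.25, 0.30, 0.30, 0.30, 0.30,
                  0.30, 0.35, 0.35, 0.35, 0.35, 0.35, 0.40, 0.40, 0.40, 0.40])

dist = success_count_distribution(probs)
p_two_or_fewer = dist[:3].sum()  # probability of 0, 1 or 2 successes
print(f"P(two or fewer successes) = {p_two_or_fewer:.3f}")
```

The array has 21 entries, one for each possible number of successes, and the tail sum is the "how unlucky would I have to be" figure for a given result.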
By looking at the sum of successes, we were able to reduce the uncertainty around the predicted outcome and open up the possibility of observing an outcome sufficiently unlikely that we could say something a little stronger about the quality of the assessments that made up the prediction.
Note, however, that we do this at the cost of being able to say anything meaningful about any one prediction. We can only say something about systematic inconsistencies across the whole sequence.
Baselining and polarization
But what if the number of successes lies comfortably in the range of outcomes you predict? Should you be asking for a raise?
Not necessarily.
If landing in the predicted range were all that mattered, at least as far as an audit is concerned, we may as well assess every probability at the average success rate. This is baselining. It is fine, indeed good practice, if the events whose probability we are assessing are identical, but as a way of assessing the relative likelihood of different events, it’s throwing in the towel. What we would really like is for each probability to indicate our strength of belief in the success of that particular event.
We can easily imagine a situation where we are seduced by our apparent understanding of the case at hand, so that when we believe it’s likely, we really think it’s very very likely, but when it’s bad, it’s never going to happen. This is polarization: Low probabilities are pushed further down and higher probabilities are pushed up. Our predictions are poorer, but on average these two effects cancel each other out, so you don’t necessarily land far from the middle of the distribution.
The traditional method for catching baselining and polarization is to assess ranges of probability separately in a probability interval plot.
The probability interval plot
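The idea here is to take all the probabilities between, say, 0 and 20% and to predict the average success rate for that group alone. Then we do the same for the probabilities between 20% and 40%, then 40% to 60% and so on, and compare each prediction with the success rate actually delivered in that interval.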
Such a plot is shown here; the expected average success rate is shown by the red curve, the delivered success rate by the blue columns. As in the example above, there is a substantial range of outcomes that might be perceived as reasonable statistical deviation. It is critically important that this range is indicated on the plot, otherwise you can’t know whether your deviations are due to assessment or probabilistic whimsy. The range shown here runs from the 10th to the 90th percentile of the predicted outcome.
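The numbers behind such a plot can be sketched as follows. This is a minimal illustration, assuming independent events and five equal-width intervals; the function names, the interval edges and the simulated data are assumptions made for the example, not taken from the plot above.

```python
import numpy as np

def count_distribution(probs):
    """Probability of 0..n successes among n independent events."""
    dist = np.array([1.0])
    for p in probs:
        dist = np.convolve(dist, [1.0 - p, p])
    return dist

def interval_audit(probs, outcomes, edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Expected rate, delivered rate and 10th/90th percentile rates per interval."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if hi == edges[-1]:
            in_bin |= probs == hi  # the last interval also includes its upper edge
        n = int(in_bin.sum())
        if n == 0:
            continue  # nothing was assessed in this interval
        cdf = np.cumsum(count_distribution(probs[in_bin]))
        rows.append({
            "interval": (lo, hi),
            "n": n,
            "expected_rate": float(probs[in_bin].mean()),
            "delivered_rate": float(outcomes[in_bin].mean()),
            # Smallest success count whose cumulative probability reaches 10% / 90%.
            "p10_rate": int(np.searchsorted(cdf, 0.10)) / n,
            "p90_rate": int(np.searchsorted(cdf, 0.90)) / n,
        })
    return rows

# Simulated, perfectly calibrated assessor (hypothetical data for illustration).
rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, size=100)
outcomes = rng.random(100) < probs
for row in interval_audit(probs, outcomes):
    print(row)
```

Each row gives one column of the plot: the expected rate is the red curve, the delivered rate is the blue column, and the two percentile rates bound the range of outcomes that luck alone could plausibly produce.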
If the probabilities have been subject to a baselining bias (an unwillingness to budge from a baseline), then the probabilities of poor prospects will tend to be too high, leading to under-delivery, and the probabilities of good prospects will be too low, leading to over-delivery. With, say, 100 probability assessments to analyze, this can be seen fairly clearly, as in the example here. Note that every delivered rate lies within its 80% confidence interval; it is the consistency of the pattern across all ranges that reveals the systematic baselining bias.
Where baselining steepens the relationship between success rate and probability, polarization has the opposite effect. Unlikely events tend to be under-predicted, leading to over-delivery. Good prospects tend to be over-predicted, leading to under-delivery. Again, with a lot of prospects, this can be seen, though again it is the consistency of the pattern across all probabilities that reveals the bias.
The challenge here is that even with 100 prospects, once they’re divided into these intervals, there aren’t that many in each interval to test the predictions against. The uncertainty distribution shown above would be typical and, again, the uncertainty is too great to make conclusive statements.
Can we do better? The Bayesian bias footprint
We need a lot of assessments to catch polarization and baselining in a probability interval plot. Mathematically, it feels rather crude: chopping the sequence into chunks necessarily increases the uncertainty in each chunk, and it is in any case the consistency of the behaviour across all chunks that tells us what we need to know. It turns out we can do better if we allow ourselves to assume the form of the bias we’re trying to catch.
By building a simple model of bias with two parameters, one for simple optimism and pessimism and one for baselining and polarization, we can turn the problem of revealing that bias into a problem of finding a probability distribution for those parameters.
The result of such an analysis is a Bayesian bias footprint. The horizontal axis runs left to right from pessimism to optimism, and the vertical axis runs bottom to top from baselining to polarization. The contours show the likelihood of the data for the various parameter values. We can see the maximum likelihood sits at a point with modest optimism and polarization, but we can also see the extent to which that conclusion is actually warranted by the data.
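The article does not spell out the bias model, so the sketch below uses one plausible two-parameter form purely as an assumption: the assessed log-odds are a shifted and scaled version of the true log-odds. A positive `shift` stands for optimism, a `scale` above 1 for polarization and a `scale` below 1 for baselining (here towards 50%, which is itself a simplification). The function names, the grid and the simulated data are all hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood(shift, scale, assessed, outcomes):
    """Log-likelihood of the outcomes under the assumed bias model
    logit(assessed) = shift + scale * logit(true)."""
    assessed = np.clip(np.asarray(assessed, dtype=float), 1e-9, 1.0 - 1e-9)
    outcomes = np.asarray(outcomes, dtype=float)
    # Invert the model to get the log-odds of the implied true probability.
    z = (logit(assessed) - shift) / scale
    # log(sigmoid(z)) and log(1 - sigmoid(z)), written in a numerically stable form.
    return np.sum(-outcomes * np.logaddexp(0.0, -z)
                  - (1.0 - outcomes) * np.logaddexp(0.0, z))

def bias_footprint(assessed, outcomes, shifts, scales):
    """Relative likelihood on a grid: rows follow scales, columns follow shifts."""
    grid = np.array([[log_likelihood(a, b, assessed, outcomes) for a in shifts]
                     for b in scales])
    return np.exp(grid - grid.max())  # equals 1.0 at the maximum-likelihood cell

# Simulated optimistic, polarizing assessor (hypothetical data for illustration).
rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.60, size=100)
assessed = expit(0.5 + 1.5 * logit(true_p))
outcomes = rng.random(100) < true_p

shifts = np.linspace(-2.0, 2.0, 41)   # pessimism (negative) to optimism (positive)
scales = np.linspace(0.25, 3.0, 56)   # baselining (<1) to polarization (>1)
footprint = bias_footprint(assessed, outcomes, shifts, scales)
row, col = np.unravel_index(np.argmax(footprint), footprint.shape)
print(f"maximum likelihood near shift={shifts[col]:.2f}, scale={scales[row]:.2f}")
```

With a flat prior over the grid, the relative likelihood is proportional to the posterior, so contouring `footprint` gives a picture of the same kind as the one described above. How wide the contours are shows how far the data really pin down the bias.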
Risk Awareness Week 2020
For a full description of this model and how to make such a footprint, sign up to my presentation at Risk Awareness Week 2020.