Do you know what you know?

The Quiz

The quiz below explores your ability to quantify your epistemic uncertainty. You are given a series of questions, each of which has either A or B as its answer. Hopefully you will not know the answer to too many, and you will not cheat. You choose your preferred answer, and say how confident you are in this choice, on a scale from 50% to 100%. When the answer is revealed, you are assigned a score: positive if your choice was correct, negative if incorrect, 0 if you said 50:50.

What is epistemic uncertainty?

Probability is usually identified with ideas of chance or randomness - essential unpredictability about the future. For example, suppose you are about to flip a coin that is assumed to be nice and balanced. It would generally be agreed that the chance of it coming up heads is $\frac{1}{2}$ (although I generally carry a two-headed coin just to show that even this is an assumption that could be wrong).

Then you flip the coin, but cover it up before you see the result. What is the probability that it is a head? There is no longer any randomness - the coin is either heads or tails, but you don't know which. The Bayesian interpretation of probability claims that you can still say that your probability is $\frac{1}{2}$, which expresses your own personal uncertainty about the outcome. This is no longer a property of the coin, but of your knowledge about the coin - this is clearly shown if I look and see it is a head, so that my probability is now 1, whereas your probability for heads stays at $\frac{1}{2}$ (provided you don't peek). Note that, strictly speaking, it is better to say the probability 'for heads' rather than the probability 'of heads', as the probability is an expression of your opinion, and not a property of the coin.

Before the coin is flipped, when we can't know the answer, our uncertainty is known as aleatory - essential chance, randomness, unpredictability. Afterwards, when we don't know the answer, our uncertainty is known as epistemic - essential lack of knowledge or ignorance.

The scoring rule

The 'scoring rule' is as follows:

Scoring rule for quiz.

| Your ‘confidence’ in your answer | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|
| Score if you are right | 0 | 9 | 16 | 21 | 24 | 25 |
| Score if you are wrong | 0 | -11 | -24 | -39 | -56 | -75 |

It is important that you know the scoring rule before you say how confident you are. You should choose the level of confidence at which you are happy to accept the corresponding gain if you are right and loss if you are wrong - we can then interpret the level of confidence as your probability for the favoured answer.

We note that the scoring rule is non-linear and asymmetric in gains and losses - a confident opinion that turns out to be incorrect loses far more than it would have gained had it been right. It's rather like a nasty teacher who punishes failure harshly but only grudgingly rewards success. However, this property is not arbitrary - the rule is carefully designed to reward honest expression of opinion, and to discourage over-stating one's confidence. The mathematics of this is given below.

At the conclusion of the test, the average score is provided, as well as an explanation of how you lost points. For example, you may score an average of 7 points per question, compared to a maximum possible of 25 were you to confidently and correctly answer all the questions. You therefore lost an average of 18 points a question, and this might be divided into say 11 points lost for 'not knowing' and 7 lost for 'not knowing what you know'.

You lose points for 'not knowing' when your success rates, whatever the confidence you gave, are below 100% - this is also known as 'lack of discrimination/resolution'. You lose points for 'not knowing what you know' when your success rates don't match the confidence levels - e.g. when you say you are 90% confident but only 70% of your answers are correct - this is known as 'lack of reliability/calibration'. Lack of reliability can be shown graphically by plotting the actual success rate against the claimed confidence - any deviation from the line of identity is penalised as lack of reliability. See the Appendix for the details of how these scores are calculated.

Why do these tests?

The aim is to reward both 'knowing a lot' and 'knowing what you know'. The strongest penalty is reserved for those who don't know, but think they do: a lethal mixture of cockiness and ignorance that nobody wants in an advisor or predictor. (Anecdotally, young males appear particularly prone to this behaviour.)

These scoring procedures are used to train and assess those who give probabilistic forecasts, such as weather forecasters and those charged with making assessments about the risks of natural disasters. When the questions relate to the assessors' presumed area of expertise, their performance in such tests can be used as a basis for weighting their opinions in making real predictions.

It has also been proposed that similar methods should be used instead of standard multiple-choice questions in which a single answer is permitted - so-called 'certainty based marking'.

Appendix: Discrimination and reliability

We have 6 categories of answer, $j = 5, 6, \ldots, 10$, where the $j$th category assigns probability (confidence) $p_j$ to the favoured answer $F$, where $F$ might be $A$ or $B$, and $p_j = j/10$ (that is, $10j$%).

If we select $p_j$, the scoring is as follows:

  • If $F$ is the correct answer, score $25 - 100(1-p_j)^2$
  • If $F$ is incorrect, score $25 - 100 p_j^2$

So if we assign 70% confidence to $B$ ($p_j = 0.7$), we score 16 if we are correct and -24 if we are wrong. For $p_5 = 0.5$, we always score 0.
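As a quick check, the two formulas above can be evaluated for each of the six confidence levels to reproduce the table of scores. The sketch below is only illustrative: the function name `quiz_score` and the choice to work in whole percentages are assumptions made here, not part of the quiz itself.

```python
ALLOWED_LEVELS = (50, 60, 70, 80, 90, 100)

def quiz_score(confidence_pct: int, correct: bool) -> int:
    """Score for one answer, given the stated confidence in percent."""
    if confidence_pct not in ALLOWED_LEVELS:
        raise ValueError("confidence must be one of 50, 60, ..., 100")
    # Probability (in percent) that was placed on the answer that turned out wrong.
    miss = 100 - confidence_pct if correct else confidence_pct
    # 25 - 100 * (miss / 100)**2; exact in integers for multiples of 10.
    return 25 - miss * miss // 100

for level in ALLOWED_LEVELS:
    print(level, quiz_score(level, True), quiz_score(level, False))
```

The integer division is exact only because every allowed level is a multiple of 10; a version accepting arbitrary confidences would need fractional arithmetic.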

Suppose that after answering $N$ questions, we have assigned confidence $p_j$ a total of $n_j$ times, of which $m_j$ were 'correct' answers and $n_j-m_j$ were 'incorrect' answers. For $p_5 = 0.5$, we can arbitrarily set $m_5 = n_5/2$.

Then a proportion $n_j/N$ have been assigned confidence $p_j$, and of these a proportion $\tilde{p}_j = m_j/n_j$ had the correct answer. For a 'calibrated' forecast, we would expect $\tilde{p}_j$ to be near $p_j$. The $\tilde{p}_j$ can be thought of as the 're-calibrated' forecasts - i.e. the forecast probabilities that someone should have given were they aware of their lack of calibration. Note that $\tilde{p}_5 = 0.5$ by construction.

It is straightforward to show that the average score $T$ can be expressed as a simple function of two components $D$ and $R$, where
$$ T = 25 -100( D + R) $$
The component definitions are

  • Lack of discrimination: $D = \frac{1}{N} \sum_j n_j \tilde{p}_j(1- \tilde{p}_j) $
  • Lack of reliability: $R = \frac{1}{N} \sum_j n_j(\tilde{p}_j - p_j)^2$
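One way to see where this decomposition comes from is to sum the scores category by category, using the scoring formulas and the definitions of $n_j$, $m_j$ and $\tilde{p}_j$ given above. The steps below are a sketch of that algebra, and rely on $\sum_j n_j = N$:

$$ T = \frac{1}{N} \sum_j \left[ m_j \left(25 - 100(1-p_j)^2\right) + (n_j - m_j)\left(25 - 100 p_j^2\right) \right] = 25 - \frac{100}{N} \sum_j n_j \left[ \tilde{p}_j (1-p_j)^2 + (1-\tilde{p}_j) p_j^2 \right] $$

Expanding the term in square brackets gives

$$ \tilde{p}_j (1-p_j)^2 + (1-\tilde{p}_j) p_j^2 = \tilde{p}_j - 2\tilde{p}_j p_j + p_j^2 = \tilde{p}_j(1-\tilde{p}_j) + (\tilde{p}_j - p_j)^2 $$

and summing the two pieces separately yields $D$ and $R$. For the 50% category both scores are 0, so the arbitrary choice of success rate $\frac{1}{2}$ there leaves the identity intact.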

The lack of discrimination $D$ is a positive term that penalises those whose success rates - the $\tilde{p}_j $s - are near 0.5. $D$ is low when the $\tilde{p}_j $s are near 0 or 1, and so rewards those with extreme success or failure rates: i.e. those who 'know something' (even if they are utterly wrong). $D$ does not depend at all on the claimed confidence levels $p_j$.

The lack of reliability $R$ is a positive term that is high when people are badly calibrated. $R$ is low when the actual success rates $\tilde{p}_j $ are close to the claimed confidence levels $p_j$. A low $R$ rewards those who 'know how much they know'.
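The decomposition can also be checked numerically. The sketch below computes $T$ directly from the per-question scores and compares it with $25 - 100(D + R)$. The function name `decompose` and the tallies in `example_counts` are invented here purely for illustration; they are not results from the quiz.

```python
from fractions import Fraction

def decompose(counts):
    """counts maps stated confidence (percent) -> (n_j, m_j)."""
    N = sum(n for n, _ in counts.values())
    total = D = R = Fraction(0)
    for pct, (n, m) in counts.items():
        if n == 0:
            continue  # an unused confidence level contributes nothing
        p = Fraction(pct, 100)
        # Convention from the text: the 50% category is given success rate 1/2.
        p_tilde = Fraction(1, 2) if pct == 50 else Fraction(m, n)
        D += n * p_tilde * (1 - p_tilde)
        R += n * (p_tilde - p) ** 2
        # Sum of the actual per-question scores in this category.
        total += m * (25 - 100 * (1 - p) ** 2) + (n - m) * (25 - 100 * p ** 2)
    return total / N, D / N, R / N

# Hypothetical tallies: confidence -> (times chosen, times correct).
example_counts = {50: (4, 2), 70: (10, 6), 90: (6, 5)}
T, D, R = decompose(example_counts)
assert T == 25 - 100 * (D + R)  # the identity holds exactly
print(float(T), float(D), float(R))
```

Exact fractions are used so that the identity can be asserted with `==` rather than a floating-point tolerance.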

We note that other decompositions of the total score feature a term for 'overall calibration', i.e. how the expected proportion of $B$'s differs from the observed proportion. This is inappropriate in our context since there is no special status of $A$ or $B$.

Why use this scoring rule?

The scoring rule is a transformed version of the Brier scoring rule developed to train and evaluate the probabilistic predictions of weather forecasters. It is an example of what is known as a proper scoring rule, which encourages people to honestly express their beliefs, in the sense that their expected score, calculated with respect to their 'true' probability, is maximised by choosing their level of confidence to match their true probability.

For example, suppose my honest probability for $B$ was 70%, and so I chose 70% as my confidence level. Then I judge that I have a 70% probability of gaining 16, and a 30% probability of losing 24, and so my expected score is 0.7 x 16 - 0.3 x 24 = 4.0. But suppose I was arrogant and chose to exaggerate and claim 100% confidence. Then my expected score is 0.7 x 25 - 0.3 x 75 = -5.0, which is lower than if I had chosen to express my true opinion. So although I could be lucky in this instance, on average it will pay me to be honest.
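The same calculation can be repeated for every confidence level on offer, to see that the honest choice gives the highest expected score. This is a minimal sketch, assuming a true probability of 70% as in the example; the function name `expected_quadratic` is introduced here for illustration.

```python
from fractions import Fraction

LEVELS = [50, 60, 70, 80, 90, 100]

def expected_quadratic(true_pct, stated_pct):
    """Expected quiz score when the true probability and stated confidence differ."""
    p = Fraction(true_pct, 100)    # honest probability for the favoured answer
    q = Fraction(stated_pct, 100)  # confidence actually declared
    score_if_right = 25 - 100 * (1 - q) ** 2
    score_if_wrong = 25 - 100 * q ** 2
    return p * score_if_right + (1 - p) * score_if_wrong

expected = {q: expected_quadratic(70, q) for q in LEVELS}
best = max(expected, key=expected.get)
for q in LEVELS:
    print(q, float(expected[q]))
print("best stated confidence:", best)
```

Algebraically, the expected score is $25 - 100p(1-p) - 100(q-p)^2$ for true probability $p$ and stated confidence $q$, so it falls off quadratically as the stated confidence moves away from the true probability in either direction.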

However, suppose we used the following rule, which is linear and symmetric.

An inappropriate scoring rule that encourages exaggerated confidence.

| Your ‘confidence’ in your answer | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|
| Score if you are right | 0 | 5 | 10 | 15 | 20 | 25 |
| Score if you are wrong | 0 | -5 | -10 | -15 | -20 | -25 |

This rule may superficially seem reasonable, as it essentially penalises by the distance from the correct answer.

With this rule, and the same honest probability of 70%, my expected score from being honest is 0.7 x 10 - 0.3 x 10 = 4.0, whereas my expected score if I exaggerate to 100% is 0.7 x 25 - 0.3 x 25 = 10.0. So this rule, although apparently reasonable, actively encourages people to lie about their uncertainty. Unfortunately this rule has been used in some studies.
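Tabulating the expected score under the linear rule for each confidence level makes the problem plain: for anyone whose true probability is above 50%, the expected score keeps rising as the stated confidence increases. The sketch below again assumes a true probability of 70%; `expected_linear` is a name introduced here for illustration.

```python
from fractions import Fraction

LEVELS = [50, 60, 70, 80, 90, 100]

def expected_linear(true_pct, stated_pct):
    """Expected score under the linear, symmetric rule in the table above."""
    p = Fraction(true_pct, 100)
    stake = Fraction(stated_pct - 50, 2)  # 0, 5, ..., 25: won if right, lost if wrong
    return p * stake - (1 - p) * stake

expected = {q: expected_linear(70, q) for q in LEVELS}
best = max(expected, key=expected.get)
for q in LEVELS:
    print(q, float(expected[q]))
print("best stated confidence:", best)
```

Because the expected score is $(2p - 1)$ times the stake, it is maximised by the largest available stake whenever $p > \frac{1}{2}$, regardless of how uncertain the person really is.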
