---
title: "Improving Forecasting for Foreign Policy^[A summary of the forecasting tournament and its results is available on the [project website](https://corrod3.github.io/SecurityPolicyForecastingTournament/). Additional background information and methodological details can be accessed via the [online appendix](https://corrod3.github.io/SecurityPolicyForecastingTournament/appendix.html). The R scripts and the raw data are available on the author's [GitHub account](https://github.com/Corrod3/SecurityPolicyForecastingTournament). This thesis was supervised by Prof. Mark Kayser and submitted in partial fulfilment of the requirements for the Master of Public Policy at Hertie School of Governance.]"
subtitle: "Identifying drivers of accuracy in a forecasting tournament"
author: "Alexander Sacharow^[Corresponding address: [email protected]]"
header-includes:
- \usepackage{fancyhdr}
- \usepackage{float}
- \pagestyle{fancy}
- \fancyhead[LO,LE]{Improving Forecasting for Foreign Policy}
- \fancyfoot[LO,LE]{Master thesis}
- \fancyfoot[RE,RO]{Alexander Sacharow}
- \usepackage{setspace}
- \onehalfspacing
- \newcommand*{\secref}[1]{Section~\ref{#1}}
date: "April 28, 2017"
abstract: In this paper, I explore the drivers of forecasting accuracy for geopolitical events with the help of a forecasting tournament. The paper analyses the responses of the participants in order to test and evaluate several explanations of successful forecasting. More specifically, it looks at (1) measurable characteristics of forecasters, (2) the decision environment in which a forecast is made and (3) minimal interventions aiming at improved forecasting judgements. My findings are that intelligence is a good indicator of forecasting success, in line with prior findings on geopolitical forecasting. There is also some evidence that moral judgment competency is related to forecasting accuracy, but further research is needed. Regarding the decision context, the forecasting tournament showed that more forecasting time is related to more accuracy, but with diminishing returns. Finally, the forecasting competition did not find evidence that small interventions in the form of analytical guides improve forecasting judgements. These insights can be used to compute improved crowd forecasts and inform policy makers engaged in forecasting. The results were derived from a security policy forecasting tournament which took place from February to April 2017 and which had more than 200 participants, comprising university students with a strong interest in the field, paid online respondents and voluntary online users.
output:
  pdf_document:
    toc: true
    number_sections: true
fontfamily: mathpazo
fontsize: 12pt
urlcolor: blue
bibliography:
- literature.bib
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Clear the global environment
rm(list = ls())
# Set the working directory (machine-specific path; fails silently elsewhere)
try(setwd("D:/Eigene Datein/Dokumente/Uni/Hertie/Materials/Master thesis/SecurityPolicyForecastingTournament"), silent = TRUE)
source("main.R")
# Packages needed for the paper
packages <- c("stargazer")
# Install any package that is not yet installed, then attach it
for (p in packages) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, repos = "https://cran.rstudio.com", dependencies = TRUE)
  }
  library(p, character.only = TRUE)
}
rm(p, packages)
```
# Introduction
<!-- Why does it matter -->
Foreign policy makers, just like other decision makers, constantly have to think about the future and how it may unfold. They need an idea of possible outcomes and their likelihood in order to make decisions. For this, they rely mainly on explicit or implicit forecasting judgements, whether their own or those of advisors or outsiders. However, forecasting geopolitical trajectories in an uncertain world has proven to be a challenge. Too often forecasts are flawed and therefore unreliable for decision makers. In order to improve future-oriented foreign policy making, a better understanding of successful forecasting is crucial.
<!-- Motivation/ Why is it a policy problem -->
There are several ways in which such an understanding can help. First, a deeper understanding leads to better forecasts, which are a prerequisite for proactive policy. Foreign policy is often described as dominated by reactive policy making. In order to break this pattern, more reliable methods for forecasting possible futures are essential. Second, it draws attention to ill-conceived assumptions underlying today’s decision-making and thereby offers a chance to address them appropriately. Third, a better understanding of forecasting is necessary to transform institutions tasked with forecasting. Knowledge about individual differences between forecasters can be used to staff and structure such agencies. A better understanding of the decision environment makes it possible to change this environment so that it becomes more suitable for forecasting. Likewise, tested decision aids can be used to counter common decision-making flaws of forecasters. This research focuses primarily on the third point, but indirectly it also contributes to the other two.
<!-- The basic question / justification forecasting tournament -->
In this paper, a forecasting tournament is used to identify drivers behind accurate forecasting. The tournament is used both to check prior findings in the literature and to test new hypotheses. Compared to other forecasting methods, a tournament forces forecasters to make precise predictions. Participants are evaluated and ranked based on the accuracy of their forecasts. This reduces the ambiguity that is common in many other forms of forecasting and measures the quality of forecasts against what they actually attempt to do.
<!--limitations -->
However, this also highlights the limitations of this research: A forecasting tournament defines successful forecasting in terms of accuracy. Participants have to specify the likelihood of a particular event in a given time frame and they perform well if they choose probabilities close to the truth. Other indicators of successful forecasting are sidelined, e.g. identifying relevant possible future events or specifying the impact of certain future events.
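What "probabilities close to the truth" can mean in practice may be illustrated with a minimal sketch. It assumes the Brier score in its common binary form, i.e. the mean squared difference between the stated probability and the outcome (1 if the event occurred, 0 otherwise). This section does not name the scoring rule used in the tournament, so the choice of rule, the function name `brier_score` and the example numbers are assumptions made for illustration only.

```r
# Brier score for binary events (assumed scoring rule, see lead-in):
# mean squared difference between forecast probabilities and outcomes
# (0 = event did not occur, 1 = event occurred). Lower is better;
# 0 is a perfect score, 0.25 is what always answering 50% yields.
brier_score <- function(prob, outcome) {
  stopifnot(length(prob) == length(outcome),
            all(prob >= 0 & prob <= 1),
            all(outcome %in% c(0, 1)))
  mean((prob - outcome)^2)
}

# Hypothetical forecasts for three questions (illustrative numbers only)
prob    <- c(0.9, 0.2, 0.5)
outcome <- c(1, 0, 1)
brier_score(prob, outcome)
```

A forecaster who is confident and right is rewarded, while one who is confident and wrong is penalized heavily, which is what makes such a rule suitable for ranking participants.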
<!-- Structure -->
The paper starts by reviewing and critically discussing the literature on forecasting and in particular forecasting competitions. In section three the hypotheses of this research and their theoretical as well as empirical background are presented. Then the set-up of the research design is explained and some crucial design choices are reviewed. In section five the results of the research are presented and interpreted. These results are then used in section six to aggregate the individual forecasts, and the resulting aggregate is compared to other aggregation approaches. Finally, the research and its limitations are discussed and some policy implications are highlighted.
# Literature Critique
<!-- Possible vs. Luck -->
In order to forecast, one has to make a basic assumption: Forecasting is possible at all. Not everyone, however, agrees with this. Forecasting sceptics emphasize the fundamental uncertainty of the future and assume successful predictions about world politics are ultimately grounded in luck [@Almond.1977; @Beyerchen.1992; @Taleb.2007]. They don’t claim that forecasting in general is impossible, but in their view the fundamental problems of foreseeing the future are particularly strong in the field of international politics. And they have a point, as foreign policy has many conditions which have proven to be unfavorable for forecasting: The environment is dynamic, most events are essentially unique, feedback on forecasts has long delays, there is a lack of empirically tested decision aids and a strong reliance on subjective judgments [@Shanteau.1992]. There are even indications that the quality of judgements gets worse with professionalization, as intelligence experts specialized in forecasting seem to exhibit even more decision-making biases than college students [@Reyna.2014].^[This has been attributed to bad habits developed in their working environment.]
<!-- Dispositional factors -->
But other scholars are more optimistic about the prospects of forecasting in the field. In their view, forecasting foreign policy is still in its infancy, which is partially attributed to the dominant role explanation enjoyed in the past among international relations scholars [@Ward.2016]. This is, however, gradually changing as a result of more forecasting-oriented research. On the individual level, research has shown large differences between individual forecasters, making forecasting not only a question of how to forecast but also of who is forecasting [@Tetlock.2005; @BuenodeMesquita.2009; @Mellers.2015], and that forecasting can be further improved by appropriate training [@Mellers.2014]. There has also been a rise in the number of methods available and the data used to generate more sophisticated forecasts for world politics [@Dhami.2015; @Ward.2016].
This research is based on the presumption that forecasting is possible, but to a limited extent. Forecasting will never produce a fully certain prediction of the future and some events and aspects will remain beyond the forecastable. However, there are aspects which can be improved and the level of uncertainty can be reduced in a systematic and reliable manner.
<!-- Here maybe something on different time-horizons -->
<!-- [Choice of forecasting method] -->
The question is how to improve foreign policy forecasting in a meaningful way and what method to use for it. Generally, a wide variety of approaches is available:
<!-- [intuitive predictions] -->
The simplest and most common one is intuitive prediction. Intuitive predictions are statements people make off the top of their head without recourse to a systematic methodology. The good thing about these predictions is that they can be applied to most topics at almost no cost. Unfortunately, they have a record of being inaccurate. Take for example affective predictions, which rarely match the actual experience [@Wilson.2005; @Schkade.1998], or probability judgments, which have been shown to be susceptible to biases [@Kahneman.1974]. Even the simplest statistical models have been shown to outperform intuitive predictions in various domains like university admission or parole violations [@Dawes.1989; @Swets.2000]. More recent research has demonstrated that expert predictions exhibit the same problems. In the political sphere, by tracking expert statements for more than 20 years, @Tetlock.2005 has argued that expert predictions are often only as accurate as a “dart-throwing chimpanzee”. Similar results can be found with experts in other fields, e.g. climate science [@Green.2007]. One of the reasons for this discrepancy is that forecasting and explanation require different sets of skills and there is no reason to assume that experts have both of them. Another reason is that experts tend not to apply the same rigor to forecasting statements, when asked about them, as to their written work. This does not mean expert opinions should be disregarded, but for forecasting purposes they should be treated with care.
<!-- Statistical models -->
One approach to solving the problem of inaccurate forecasting is to rely more on quantitative methods, as is common in other disciplines like meteorology or economics. Statistical forecasting has long been discussed in the sphere of international relations [@Choucri.1974] and the field is clearly on the rise [@Ward.2016]. Well-known applications are election forecasting models [e.g. @LewisBeck.2005; @Norpoth.2010] or the work of the Political Instability Task Force [@Goldstone.2010], which led to the Integrated Crisis Early Warning System ([ICEWS](http://www.lockheedmartin.com/us/products/W-ICEWS/W-ICEWS_overview.html)) Project. The forecasts are generally based on measurable input variables ranging from economic and media to political indicators. The models are calibrated with past data and then used for extrapolating into the future.
<!-- [limitations]-->
But there are also severe limitations to this approach. First, statistical models are largely limited to quantifiable events which can be grouped by their similarity. Hence, forecasts are possible, for example, about the outbreak and scale of violence or protests. Many events in foreign policy, however, are at least to some degree unique. Take for instance a ruling of the International Court of Justice on a specific matter. It is hard, if not impossible, to build a statistical model for such types of events. Hence, often either no statistical model is available to forecast relevant events, or important information has to be neglected in order to make events predictable with quantitative models. Second, many statistical models have focused on finding significant relationships instead of useful predictive indicators [@Ward.2010]. As a result, they do not produce precise predictions and therefore lack the external validity to be useful for actual forecasting. A well-known example is the flu forecast developed by Google, which could not produce accurate predictions after its initial introduction [@Lazer.2014]. Third, in many cases the necessary input data for quantitative models is scarce, not available or too expensive to gather. This limits the usefulness of quantitative models for policy makers. However, where statistical forecasting models with a good track record are available, using them is surely a promising approach. The focus of this research is on geopolitical events for which quantitative models so far haven’t been able to produce useful forecasts.
<!-- [Prediction markets] -->
Prediction markets are another approach for generating forecasts which has been discussed in the literature [see @Wolfers.2004]. Economists see them as an efficient aggregator of information because they encourage market participants to use various information sources and exploit the wisdom of the crowd effect. This effect was famously described by @Galton.1907, who observed that the average of all estimates of an ox's weight at an exhibition was much more precise than any individual estimate. Markets essentially attempt to incorporate this effect. A well-known prediction market example is the [Iowa Electronic Market](http://tippie.biz.uiowa.edu/iem/), which has been used mostly to forecast elections in the U.S. But markets have also been used to forecast other political events, e.g. the outcome of referendums or even terrorism.^[There was a DARPA research project on a Policy Analysis Market (PAM); however, it was stopped after public criticism.]
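The wisdom of the crowd effect described above can be illustrated with a small simulation. This is only a sketch under strong assumptions: the true value, the number of guesses and the noise level are made up, and the guesses are assumed to be unbiased and independent, which real crowds need not be.

```r
# Simulated illustration of the wisdom-of-the-crowd effect.
# All numbers are hypothetical: a made-up true weight and noisy guesses.
set.seed(1)
true_weight <- 1200                                   # hypothetical true value
guesses <- rnorm(500, mean = true_weight, sd = 100)   # unbiased, noisy guesses

crowd_error      <- abs(mean(guesses) - true_weight)  # error of the crowd average
individual_error <- mean(abs(guesses - true_weight))  # average individual error

c(crowd = crowd_error, individual = individual_error)
```

The error of the average can never exceed the average individual error, and with independent, unbiased guesses it is typically far smaller because individual errors cancel out.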
For several reasons, prediction markets never really took off and are still of limited utility for policy making. On the one hand, this is due to the way prediction markets operate. One problem with market predictions is that they have been shown to exhibit a favorite-longshot bias: Small probabilities are overvalued and near certainties undervalued. This has been extensively discussed in the case of horse races, where people disproportionately bet on underdog horses [@Thaler.1988]. It was also shown to be a problem in financial markets [@Bates.1991; @Rubenstein.1994]. Other operational problems of prediction markets are trading according to desires [e.g. @Forsythe.1999] and speculative bubbles. However, these problems tend to decrease with market size and sophistication.
More problematic for prediction markets are their practical limitations: Markets are not always desirable or feasible. Take for instance the case of terrorism, where betting on such events would create perverse incentives for conducting attacks.^[A sad example of this was the attack on the soccer team of Borussia Dortmund on 11 April 2017, where the attacker wanted to profit financially from the market reaction.] Another case is asymmetric information, where strong insider knowledge exists and outsiders are effectively discouraged from participating in the market. This is, for example, the case for many government decisions, where relevant forecasting information is only available to a small circle of individuals. Moreover, prediction markets in the past often lacked the liquidity and the number of market participants necessary to produce meaningful predictions.
<!-- [forecasting tournaments] -->
Forecasting competitions are a relatively new approach to forecasting international politics. In such a tournament, participating individuals or teams are asked about the likelihood of future events. After the end of the forecasting horizon, these forecasts are compared and evaluated. Forecasting competitions have some similarity to citizen election forecasts, where voters are asked which candidate or party they consider most likely to win their constituency [@Murr.2011]. Forecasting competitions combine the advantages of the approaches discussed above. Like intuitive predictions, they are hardly limited in their scope.^[The only serious restriction is that the events under consideration have to be measurable in order to make them subject to a forecasting competition.] They can incorporate reliable statistical models, if these are available. And like prediction markets, forecasting competitions operate on the idea of exploiting the wisdom of the crowd and aggregating different information sources in order to generate forecasts.
In the field of international politics, this approach became prominent through the [IARPA tournament](https://www.iarpa.gov/index.php/research-programs/ace), a geopolitical forecasting competition started in 2011 by the U.S. intelligence community. Different groups of academics were invited to participate in the project and compete against each other in providing forecasts about geopolitical events. In consecutive years, the winning team demonstrated that more accurate forecasts can be achieved by using two strategies: First, the exploitation of individual differences in forecasting skills and judgements [@Atanasov.2016; @Mellers.2015; @Satopaa.2017]. Prior research on this strategy has primarily looked at different predictors of forecasting success [@Mellers.2015; @Poore.2014]. This paper adds to this research by testing the replicability of prior findings and exploring new hypotheses. Second, the extremization of individual judgements [@Baron.2014; @Satopaa.2015b]. The first strategy will be briefly illustrated in section six.
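The idea of extremization can be sketched with a simple transformation that pushes a probability away from 0.5. The functional form below is one that is commonly used for this purpose. The exponent `a = 2`, the function name `extremize`, the choice to apply it to the crowd average rather than to each individual judgement, and the example forecasts are all assumptions for illustration, not the procedure of the winning team.

```r
# Extremizing transformation: pushes a probability away from 0.5.
# a > 1 extremizes, a = 1 leaves p unchanged, a < 1 moderates.
# The default exponent is an arbitrary, illustrative choice.
extremize <- function(p, a = 2) {
  stopifnot(all(p >= 0 & p <= 1), a > 0)
  p^a / (p^a + (1 - p)^a)
}

# Hypothetical individual forecasts for one question
forecasts <- c(0.6, 0.7, 0.65, 0.8)
crowd     <- mean(forecasts)   # simple unweighted average
extremize(crowd, a = 2)        # extremized aggregate, further from 0.5
```

The intuition is that each forecaster holds only part of the available information and is therefore individually too cautious, so the pooled forecast can justifiably be more confident than the plain average.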
<!-- advantage of competitions in comparison to markets -->
The possibility of calibrating aggregated forecasts is a major advantage over prediction markets. There, market participants are usually anonymous and the individual leverage on the aggregated forecast is determined by the capital available to forecasters, which is often unrelated to their ability to forecast geopolitical events accurately. This is probably a major reason why wisely aggregated forecasting competitions outperform markets [@Atanasov.2016].
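How a tournament organizer might give skilled forecasters more leverage than a market would can be sketched as a weighted average. The weighting rule (inverse of a past accuracy score, where lower scores are better), the function name `weighted_forecast` and all numbers are assumptions for illustration. The aggregation approach actually used in this thesis is described in section six.

```r
# Weighted crowd forecast: forecasters with a better (lower) past error
# score get more weight. Inverse-error weighting is an illustrative
# assumption, not the aggregation rule used in the thesis.
weighted_forecast <- function(prob, past_error) {
  stopifnot(length(prob) == length(past_error), all(past_error > 0))
  w <- 1 / past_error
  sum(w * prob) / sum(w)
}

# Hypothetical example: the first forecaster has the better track record
prob       <- c(0.8, 0.4)
past_error <- c(0.1, 0.3)
weighted_forecast(prob, past_error)  # closer to 0.8 than the plain mean
mean(prob)                           # unweighted benchmark
```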
<!-- [limits of forecasting tournaments] -->
Before I turn to the discussion of hypotheses, I want to touch upon the limitations of forecasting competitions. Limiting the understanding of successful forecasting to accuracy is their main weakness. On the one hand, forecasting is about more than just accurately describing the likelihood of events. It also requires exploring the unknown and identifying possible future events. In the forecasting competitions discussed here, this is not part of the tournament itself.^[Here this is done by the tournament facilitator, which is discussed in more detail in \secref{sec:tournament}.] It ultimately requires recourse to other methods, e.g. expert opinion or strategic foresight [@Popper.2009; @Kosow.2008; @Bergheim.2009]. On the other hand, accuracy might not be the most relevant quality of forecasts for policy decision making. Here, other dimensions like the impact of an event or the possibility of early action might be more important.
However, accuracy should be considered of key importance for forecasting. The likelihood of future events is essential for decision making from a normative point of view. Decision theory under uncertainty, in particular its most used approaches, expected utility theory [@Neumann.1944] and subjective expected utility [@Savage.1972], presupposes that the decision maker has an idea of the likelihood of the outcomes. But it also matters in practice: Policy makers tend to ask for a likelihood assessment when presented with possible future scenarios.^[Observation made by scenario and foresight experts which they expressed to me in a private conversation.] Another reason for improving accuracy is the positive effect of the process on other aspects of forecasting [@Tetlock.2015]. By reducing ambiguity in forecasts, encouraging learning from mistakes and forcing organizers of a forecasting competition to explore the “unknown”, it creates good incentives for better forecasts in a broader sense. Finally, despite popular claims to the contrary, more accuracy in foreign policy forecasting is possible. Analysts can achieve it consistently, e.g. by aggregating more information into single forecasts or discussing the implications of single pieces of information [@Friedman.2016].
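Why expected utility theory presupposes likelihoods can be made concrete with a toy calculation: the decision maker weights the utility of each outcome by its probability and picks the action with the highest expected utility. The actions, outcomes, probabilities and utilities below are entirely hypothetical and only serve to show the mechanics.

```r
# Expected utility of two actions under two possible outcomes.
# Rows = actions, columns = outcomes; all numbers are hypothetical.
probs <- c(event = 0.3, no_event = 0.7)   # forecast probabilities
utility <- rbind(
  act_early = c(event = 5,   no_event = -1),
  wait      = c(event = -10, no_event = 0)
)

# Probability-weighted sum of utilities for each action
expected_utility <- drop(utility %*% probs)
expected_utility

# Action with the highest expected utility
names(which.max(expected_utility))
```

Changing the forecast probability changes the ranking of the actions, which is why an inaccurate likelihood assessment can translate directly into a worse decision.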
# Theory and Hypothesis
<!-- [Intro / Justification] -->
Forecasting competitions can be used to improve forecasts by identifying good forecasters, the environment in which they make good decisions and decision aids which can improve the quality of the judgement. Following this idea, the factors discussed in this paper can be divided into three categories: dispositional, environment-related and intervention. Dispositional factors are characteristics of the forecaster, such as abilities. They don’t have to be fixed, but they should be at least stable in the short term and measurable by a third party. Environment-related factors refer to the situation in which the decision is made. Intervention refers to treatments by outsiders which aim at improving the quality of a decision.
## Individual Dispositions
<!-- [Dispositional factors] -->
Accepting the idea of individual differences in forecasting skills implies that there are individual dispositions which have direct implications for forecasting. In order to identify good forecasters, it is therefore essential to know which characteristics can predict successful forecasting.
<!-- [intelligence and why] -->
A good starting point for this is intelligence. It has been shown to be a good predictor for many other things like job performance [@Ree.1992; @Schmidt.2004], socio-economic status [@Strenze.2007], academic achievement [@Furnham.2009] and decision competence [@DelMissier.2012; @Parker.2005]. Therefore, it can also be assumed to be a valuable predictor for successful forecasting.
<!-- [what is intelligence] -->
Intelligence can be defined in different ways. Generally, it is conceptualized as either one-dimensional or multi-dimensional. In the one-dimensional approach, a meaningful intelligence measure can be collapsed into one variable, while in the multi-dimensional approach intelligence is understood as a collection of different abilities.
<!-- [prior findings on intelligence / forecasting] -->
@Mellers.2015 have argued that there are three relevant aspects of intelligence for geopolitical forecasting. They are (1) inductive reasoning (e.g. linking a current problem to a historical analogy), (2) cognitive control (e.g. overriding seemingly obvious but incorrect responses and engaging in more prolonged and deeper thought) and (3) numeric reasoning (understanding the mathematical dimension of a problem). In their research, the different aspects were correlated with more accurate forecasts and with each other. The first finding shows that intelligence measures can be used for predicting forecasting success, and the second supports the view that a one-dimensional intelligence concept is sufficient in this context. Similar results were obtained by @Poore.2014, who found a strong correlation between various measures of analytical abilities and forecasting accuracy. Therefore, it is reasonable to assume these findings will be confirmed by the security policy forecasting tournament:
**Hypothesis 1a: More intelligent individuals are more accurate forecasters**
<!-- [how is it measured] -->
To test the hypothesis, it needs to be clarified how intelligence is measured. Various psychometric measures are available. @Mellers.2015, for example, use three different methods: Raven's Advanced Progressive Matrices [@Bors.1998], the Cognitive Reflection Test (CRT) by @Frederick.2005 and a combined numeracy scale from @Lipkus.2001 and @Peters.2006. @Poore.2014 used self-reported SAT scores, sample GRE/SAT questions, subjective numeracy [@Fagerlin.2007] and, like @Mellers.2015, the CRT.
<!-- [test used in this paper] -->
This paper uses a different test which has not yet been used to predict forecasting accuracy: The Berlin Numeracy Test (BNT) by @Cokely.2012. It is a relatively new psychometric scale that was developed specifically to assess statistical numeracy and risk literacy. This makes the test particularly suitable for forecasting decisions involving probability judgements. Since it is especially suited to differentiating between individuals with higher education, it is also likely to perform better than the numeracy scale used by @Mellers.2015, which lacked differentiation. Another reason for choosing the BNT is its length: The test consists of only four questions, making it suitable for a one-time forecasting tournament.
<!-- [Moral Judgement] [transition to second dispositional factor]-->
Intelligence is a common measure, but far from the only dispositional factor of interest. Other factors considered in this context include personality and cognitive styles [@Mellers.2015; @Poore.2014]. Factors like openness to new experiences and active open-mindedness have been found to be good predictors of forecasting success. To keep the forecasting tournament within reasonable time limits for the participants, this research did not attempt to replicate these results. Instead, the paper scrutinizes a factor which has not been tested before, but was raised by @Tetlock.2015 [p. 226] in a side note: The interference of moral judgements with analytical judgments. In his book, Tetlock claims that superforecasters, unlike other forecasters, can separate analytical judgements from moral judgements. This is particularly important when forecasters make judgements about events on which they hold a strong moral position. In such cases many people conflate the desirability of an outcome with its likelihood. For example, if a person holds a strong moral opinion on the Syrian government and has a strong desire for it to be replaced by a more human-rights-oriented government, Tetlock expects this person to skew his or her probability judgment on the fall of the government towards the desired outcome.
This idea is similar to the social desirability bias where individuals over-report characteristics about themselves which they consider socially desirable [@Dalton.2011] as well as to trading according to desires in prediction markets, where personal political preferences affect buying decisions [@Forsythe.1999]. Analogously, people are expected by Tetlock to overrate the likelihood of events they deem desirable.
<!-- Strategies to test this: 1. Asking respondents about probability and desirability -->
But is this true? There are several ways this could be tested. First, the straightforward approach would be to ask respondents not only about the probability of an event but also about its desirability. The information could then be used to see whether it has any relationship to forecasting accuracy. But this approach is not helpful here. On the one hand, the desirability of the events in question is unknown for this tournament, although it could in principle have been asked. On the other hand, for practical and normative reasons the accuracy of a forecaster should be predictable without knowing her or his moral opinion on the subject. Practically, it would be infeasible, as one would always have to ask forecasters about both the probability and the desirability of the events they forecast. This would increase the workload and induce fatigue. Normatively, it is undesirable to force forecasters to reveal their personal opinions, as many of them are likely to work in highly politicized environments.
<!-- 2. Ranking questions by their morality -->
Another way would be to rank the questions according to their morality and see whether forecasters' accuracy is related to the morality level of a question. However, I am not aware of any morality scale which could be applied to forecasting questions and which is sufficiently universal. On the contrary, it is plausible that people disagree about the morality of geopolitical events. For example, many in the West would like Assad to lose control of the Syrian government, while a Syrian Alawite is likely to have a very different view on this. Moreover, assigning morality levels to questions introduces a measurement problem: It cannot be distinguished whether an inaccuracy stems from the uncertainty of the event in question or from its morality.
<!--3. Psychometric measure: -->
Hence, we might have to rely on a less direct test of the relationship between moral and analytical judgement. The (psychological) theory of moral judgement might be a promising starting point for this. According to @Kohlberg.1958, the moral development of humans can be ordered on a scale. This scale reflects how sophisticated the moral justifications of judgements are. It ranges from self-interest through rule-based justifications to justifications based on universal norms. In this context, the concept of moral competency was developed. It describes how well people can disentangle moral decisions along this scale and consistently see the differences between them. This has some similarity to the context of forecasting questions. If morally competent individuals are better able to differentiate between different moral justifications, we might also assume that they can better separate their moral and their analytical judgement in the context of a forecasting question. In order to test this, the following hypothesis is used:
**Hypothesis 1b: More morally competent individuals are more accurate forecasters**
<!-- [moral-judgment competence] -->
More precisely, moral competency can be defined as “the ability of a subject to accept or reject arguments on a particular moral issue consistently in regard to their moral quality even though they oppose the subject's stance on that issue” [@Lind.2008, p. 200]. It can be contrasted with opinionated judgements, which are intuitive and emotional reactions to the content at hand and which correspond to much of what Tetlock understands by moral assessment. Moral competency captures the idea that individuals are willing to consider counterviews despite their own, potentially strong, view on the issue.
In this regard moral competency has similarities to the active open-mindedness measure [@Baron.2007]. That measure asks respondents how they would consider other opinions and has been shown to be a good predictor of forecasting success [@Mellers.2015]. The moral competency measure goes a step further: instead of relying on a self-assessment, it actually tests whether different arguments are considered in the face of a moral issue.
<!-- how to measure it -->
Moral competency can be measured with the moral competency test (MCT). The test confronts respondents with two moral dilemma situations. For each story the participants have to answer 12 questions on different justifications for the described acts. The answers are used to compute a competency score for each respondent. The score is not based on correct and wrong answers, but reflects a ratio between different parts of the answers.
\footnote{Basically, a multivariate analysis of variance (MANOVA) is used here, which measures how individuals disaggregate different moral stages. Based on the instructions of the test (Lind, 2008) I reengineered the underlying formula for the score: \begin{equation} c = \frac{\frac{1}{4}\sum^{6}_{i = 1}\left(\sum^{4}_{j=1}x_{ij}\right)^2 - \frac{1}{24}\left(\sum^{6}_{i=1}\sum^{4}_{j=1}x_{ij}\right)^2}{\sum^{6}_{i = 1}\sum^{4}_{j=1}x_{ij}^2 - \frac{1}{24}\left(\sum^{6}_{i=1}\sum^{4}_{j=1}x_{ij}\right)^2}\cdot 100 \end{equation} where $i \in \{1,...,6\}$ stands for the moral stage and $j \in \{1,...,4\}$ for the section (pro-Worker, contra-Worker, pro-Doctor, contra-Doctor). The score is between 0 and 100 and is sometimes categorized as follows: very low (1-9), low (10-19), medium (20-29), high (30-39), very high (40-49) and extraordinary high (above 50) (Lind, 2008, p. 200). For further information check the \href{https://www.uni-konstanz.de/ag-moral/mut/mjt-engl.htm}{associated website}.}
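
The score described in the footnote can be sketched in a few lines of R. This is only an illustration under stated assumptions: the function name `mct_c_index`, the 6 x 4 input layout (rows are moral stages, columns are the four argument sets) and the example ratings are hypothetical, and the correction term is taken to be the usual analysis-of-variance term, the squared sum of all 24 ratings divided by 24.

```r
# Hypothetical helper: competency score from a 6 x 4 matrix of ratings
# (rows = moral stages 1..6, columns = the four argument sets).
mct_c_index <- function(x) {
  stopifnot(is.matrix(x), all(dim(x) == c(6, 4)))
  ss_mean  <- sum(x)^2 / length(x)                    # (sum of all ratings)^2 / 24
  ss_total <- sum(x^2) - ss_mean                      # total deviation sum of squares
  ss_stage <- sum(rowSums(x)^2) / ncol(x) - ss_mean   # variation attributable to stage
  100 * ss_stage / ss_total
}

# Made-up ratings: every stage receives the same rating in all four sets,
# so the stage of an argument explains all variation in the answers.
consistent <- matrix(rep(c(-4, -2, 0, 1, 3, 4), times = 4), nrow = 6)

# Made-up ratings that depend only on the argument set, not on the stage.
opinionated <- matrix(rep(c(-4, -1, 2, 4), each = 6), nrow = 6)

mct_c_index(consistent)
mct_c_index(opinionated)
```

The two made-up respondents mark the extremes of the scale: ratings driven purely by the moral stage of an argument give the maximum score, while ratings driven purely by which side an argument supports give the minimum.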
## Decision Environment
<!-- [Changing environment of decision making] -->
Individual differences in forecasting accuracy are also determined by differences in the decision environment. Contrary to what early decision theory assumed, empirical research has shown that context matters [e.g. @Kahneman.1974]. The list of possible factors is long, ranging from the number of forecasters and the incentives to the opportunity for deliberate practice [@Arkes.2001; @Ericsson.1993; @Kahneman.2009]. In this paper only the simplest environmental factor will be tested: the time used for answering the forecasting questions.
<!-- [Time] -->
The time used for forecasting is ultimately a choice of the forecaster. It can, however, be influenced by explicitly making time slots available or freeing forecasters from other tasks. To justify such choices, it would be necessary to know the relationship between time and forecasting accuracy. There are good reasons to believe that more time will also lead to more accurate forecasts. First of all, time is necessary in order to go beyond intuitive thinking and to engage in analytical thinking about the question at hand. The two different ways of thinking are often described as 'system 1' and 'system 2', where 'system 1' stands for fast intuitive judgements and 'system 2' for slow reflective thinking [@Evans.2013; @Kahneman.2013]. There is also a second reason why more time might lead to more accurate forecasts: spending time on a question allows the forecaster to gather more information. Hence, the use of time should indicate whether a decision was informed or not.
However, there are two reasons why this might not be the case. First, the forecasters might use the time for other things. @Haran.2013, for example, have argued that acquiring new information depends on other characteristics like active open-mindedness. But this is less of a problem in this forecasting tournament, as participants had no reason to linger on the forecasting questions instead of moving on to the next section of the tournament. Second, the forecasters might have different levels of prior knowledge. Some participants might need more time to grasp the context of questions, while others can rely on their extensive prior political knowledge. As will be described later in more detail, this was counteracted by ensuring a wide span of questions. Moreover, going back to the first explanation for the link between time and accuracy: even individuals with prior knowledge will have to switch between the two mental modes, which should again be captured by the time spent on the questions.
But the relationship between time and accuracy is unlikely to be linear. From the view of a mental system shift, there is no theoretical reason why more time should increase accuracy once the shift of mental systems has taken place. From the informational point of view, the value of new information decreases over time, as it will have fewer and fewer implications for the judgement, while the costs of gathering it increase, as it becomes harder to find additional information. For this reason, it is reasonable to assume that the marginal value of more time decreases:
**Hypothesis 2: The marginal added value of time spent on forecasting is positive and decreases over time**
In order to test the hypothesis, the time participants used to answer the forecasting questions is measured. The validity of the measurement is ensured by placing all other questions asked over the course of the forecasting competition in separate sections, which excludes them from the time measurement for this hypothesis. Moreover, the time used for forecasting is cross-checked with self-reported time use.
## Intervention
<!-- [intervention] Discussion: Treatment effects on judgement in general -->
Finally, differences between individual forecasters can be the result of outside interventions. In forecasting this could be achieved by treatments improving analytical judgement [@Soll.2015; @Larrick.2004]. Possible interventions include minor decision aids [@Kretz.2015], feedback [@Benson.1992], exposure to multiple perspectives [@Ariely.2000; @Herzog.2009], exposure to historical analogies [@Lovallo.2012], decomposition of problems into subsets [@Fischhoff.1978], explicit consideration of contradictory evidence [@Koriat.1980] and probabilistic training [@Mellers.2014]. Such interventions can be tested in a forecasting tournament. @Mellers.2014, for example, tested the effect of scenario and probabilistic training and found both to have a long-term positive effect on forecasting accuracy.
However, for the forecasting tournament in this paper the focus will be on minimal interventions as the scale of the tournament is rather small. It is therefore more suited to test mild interventions aiming at debiasing. This could, for example, be achieved by providing forecasters with analytical tools. @Kretz.2015 has done some research on mild decision aid interventions and found that most analysts tend to disregard decision aids which require some effort, at least after some time. Kretz sees the reason for this in the additional mental capacities needed to apply the decision aids, which distracts the analysts from the actual problem at hand. In his research only the mildest intervention proved to have a significant effect on the judgement quality.
<!-- Following this idea, the paper tests a minimal intervention in the context of forecasting and tests the following: -->
Following this idea, I decided to use a decision guide as treatment. The guide is inspired by Tetlock's description of how "superforecasters" approach forecasting questions [@Mellers.2014]. Theoretically speaking, it advises the forecaster to find a reference class for the forecasting question and then use Bayesian updating to adjust the base probability with other information.^[The reference class raises fundamental issues about interpreting probabilities, as for single events the classical frequency view of probability does not work [e.g. @Popper.1959]. Hence, the forecasters have to pick a reference class based on some similarity criteria.] Practically, this means forecasters were advised to identify a base rate for the event in the forecasting question (outside view) and to incrementally add or subtract probability points depending on the nature of the available information (inside view). Basically, this is a decision heuristic. The decision guide might improve forecasting accuracy because it is based on standard theory for decision under uncertainty. This theory reflects the view most decision theorists have on how such questions should be addressed. Moreover, the decision heuristic of outside and inside view has been shown empirically to be successful for other types of forecasting decisions, e.g. for company revenues [@Lovallo.2012].
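
The step from an outside-view base rate to an adjusted forecast can be illustrated with Bayes' rule in odds form. The following sketch is not taken from the decision guide itself: the function name `update_probability` and all numbers are hypothetical, and the likelihood ratio stands in for the forecaster's own judgement of how much more likely the case-specific evidence is if the event happens than if it does not.

```r
# Hypothetical illustration of outside view (base rate) plus inside view
# (case-specific evidence), using Bayes' rule in odds form.
update_probability <- function(base_rate, likelihood_ratio) {
  stopifnot(base_rate > 0, base_rate < 1, likelihood_ratio > 0)
  prior_odds     <- base_rate / (1 - base_rate)
  posterior_odds <- prior_odds * likelihood_ratio
  posterior_odds / (1 + posterior_odds)   # convert odds back to a probability
}

# Assumed base rate of 20% for the reference class; the new evidence is judged
# to be twice as likely if the event happens than if it does not.
update_probability(base_rate = 0.20, likelihood_ratio = 2)

# Uninformative evidence (ratio of 1) leaves the base rate unchanged.
update_probability(base_rate = 0.20, likelihood_ratio = 1)
```

Adding or subtracting a few probability points, as the guide suggests, can be read as a rough approximation of this update that is easier to do in one's head.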
**Hypothesis 3: A decision guide increases the accuracy of forecasting**
In order to test this, all participants were assigned randomly, in about equal shares, to a treatment or a control group. The treatment group was provided with a short (ca. 150 words) decision guide which describes the idea of outside and inside view in simple words and illustrates it with an example.^[The full text of the guide is available in the [online appendix](https://corrod3.github.io/SecurityPolicyForecastingTournament/appendix.html)] At the end of the guide, they were asked to tick a small check box to confirm that they had read the guide.
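
A balanced random assignment of this kind can be sketched as follows. The sketch is an assumption about the mechanics, not a record of how the assignment was actually carried out in the survey software; the seed and the object names are hypothetical, and 231 is the number of participants reported for the tournament.

```r
set.seed(2017)         # arbitrary seed, only to make the sketch reproducible
n_participants <- 231  # participants before exclusions

# Build a vector with alternating labels and shuffle it, so that the two
# groups differ in size by at most one participant.
group <- sample(rep(c("treatment", "control"), length.out = n_participants))

table(group)
```

Shuffling a pre-balanced vector guarantees near-equal group sizes, whereas drawing each label independently with probability 0.5 would only balance the groups on average.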
# The Forecasting Tournament {#sec:tournament}
<!-- [When, where, Who] -->
The forecasting tournament took place from February 6th to 12th, 2017, and forecasters were asked to consider possible events happening between February 12th and April 24th, 2017. The forecasters came either from the Master of International Relations program at Hertie School of Governance Berlin or were recruited externally via mailing lists of relevant study programs, associations working in the field of international relations or by word of mouth. The students had to complete the tournament as homework, while the others participated voluntarily. The group was further complemented by forecasters from Amazon Mechanical Turk, who were paid for their participation. In total, 214 forecasters provided valid answers; they had an average age of `r round(mean(SPFT$age),1)` and `r round(length(SPFT$sex[SPFT$sex == "Female"])/length(SPFT$sex),3)*100` percent of them were female.
<!-- [Design/how] -->
The forecasting tournament was conducted with an online survey. The survey consisted of three parts: First, the forecasters were asked question batteries of psychometric measures on intelligence and moral judgement. In the second part, the participants answered 24 forecasting questions on various security policy related events.^[A full list of the questions is available in the [online appendix](https://corrod3.github.io/SecurityPolicyForecastingTournament/appendix.html)] In each forecasting question the participants were asked to judge how likely they thought the event was, expressed as a probability. Finally, in the third part, forecasters reflected upon their forecasting and provided some demographic information about themselves.^[The survey was implemented with Qualtrics.]
The forecasters were informed that their background information would be handled confidentially and, in the case of the university group, that their forecasting judgments would be accessible to their fellow students. The second notice was intended to incentivize the students to give serious consideration to their answers by creating a competitive environment. In the case of the voluntary participants this was less important, as their participation already indicated intrinsic motivation. For the Mechanical Turk users the survey used attention checks, which they had to pass in order to receive the payout. All participants were also explicitly told to use all information sources they deemed relevant and to spend as much time on the questions as they needed. The participants were not instructed whether to work individually or in teams, as this lies outside of the control of the research design, but `r round(100*length(SPFT$team[SPFT$team != "Individually"])/length(SPFT$team), 1)` percent said they did not answer the questions alone. In total 231 individuals participated in the security policy forecasting tournament. However, only 214 responses are used, as some participants showed signs of not taking the survey seriously (failing attention checks, unrealistically short time used for participation) or submitted their forecasts after February 12th, 2017.
## Questions
<!-- [Questions] -->
The questions were all related to security policy and selected in a multi-stage procedure. In the first step, I selected conflict regions which might be subject to changes in the short time horizon under consideration. Then I drafted questions by surveying reports from international organizations, think tanks, governments, NGOs and media outlets on recent developments in the conflict regions. Among these sources were reports by the International Crisis Group, the German Institute for International and Security Affairs (SWP Berlin), the German Institute of Global and Area Studies (GIGA), the Brookings Institution and the Carnegie Center. Finally, the draft questions were sent to a few researchers and forecasting experts for feedback and their recommendations were integrated in the final forecasting tournament.
The questions were all binary, with “yes” and “no” as the possible answers. Most questions covered possible events in the whole time period between February 12th and April 24th, 2017. For each question the forecasters had to specify a probability to indicate how likely they expected the event to be. One question was, for example, “Will IS claim responsibility for another attack with a truck inside the European Union by 24. April 2017?”. The questions have to be precise and measurable. How difficult this is can be seen from this very question. On March 22nd, 2017 the Westminster attack happened in the UK. The attacker used an SUV to attack and kill several people in London. The question was intended to capture such events, but the term ‘truck’, literally understood, does not include SUVs. This is a fundamental problem of describing events which have not happened yet: there will be aspects which were not anticipated correctly, in this case the type of vehicle. How should one respond in such cases? Here the event was nevertheless counted as a ‘yes’ resolution of the question. First, the question was intended to capture such events and the use of an SUV instead of a truck does not make it fundamentally different. Second, if one asked the participants whether their expectation explicitly excluded the case of an SUV being used, the likely answer would be no. However, there is no fixed rule for these borderline cases and they need to be decided on a case-by-case basis.
Selecting the questions illustrates the distinction between two challenges in forecasting: Sampling and accuracy. In this research design sampling is done by the organizer of the forecasting competition while participants are solely dealing with the issue of accuracy. Ideally, one would also include sampling into a competition and testable format, but samples of different possible future events are hard to compare and therefore they cannot easily be made part of a competition.
<!-- 1. criteria question: neither remote nor almost certain-->
As a good sample of possible geopolitical events is crucial for the forecasting tournament, the forecasting questions had to satisfy a number of criteria. First, the questions should concern neither events which have almost no chance of happening nor almost certain events. It is difficult to select a highly unlikely event non-arbitrarily, as such events are numerous and can have all kinds of realizations. An example of such an event is the start of the Arab Uprising of 2011 after Mohamed Bouazizi set himself on fire. Neither should almost certain events be subject to a forecasting competition. An example of such a question would be whether the German federal elections will take place in September 2017. Such a question is just the flipside of the highly unlikely event, but without specifying what the interruptive event could be. Again, choosing a relevant almost certain event for a forecasting competition would be an arbitrary choice. Moreover, remote and almost certain events will cause the forecasts of the participants to cluster at the extreme values. This would make it harder to distinguish successful forecasters from unsuccessful ones.
<!-- 2. criteria: different regions -->
Second, the questions should cover various regions. The results might be biased if individual participants have special knowledge about a region which is overrepresented in the competition. To further reduce the effects of narrow expertise, similar questions were also avoided. Ideally, the questions should also cover a wide range of policy fields. However, as the security policy forecasting tournament was conducted in collaboration with a university course at Hertie School of Governance, the topics were restricted to the content of this course. This should not be a problem, as the considerations mentioned above already introduce diversity into the questions and even within the field of security policy there is a wide range of possible topics.
<!-- 3. Relevance for policy making -->
Third, events for the forecasting competition should be relevant for policy makers and a large group of people. Relevance implies that the event has an impact on policy makers. The impact dimension excludes forecasting questions like the music played at the inauguration of a head of state. Relevance in this research was ensured by selecting events or indicators which would be discussed or considered by international organizations, governments and policy-oriented research institutions.
<!-- Limits on choosing the questions -->
Even though these criteria guided the selection of the questions, a few limitations had to be taken into account. First of all, language restricted the range of possible events. Only events which would be reported in English were selected for the competition. This reduces the problem of forecasters benefiting from the knowledge of certain languages. This could, for example, be the case with Spanish, as most information on some events in Latin America would be in Spanish. However, the more significant an event is, the more likely it is that sufficient information is also available in English. Second, the events were chosen on the basis of their potential to receive international media attention. On the one hand, this lowers the barrier for participants, as it limits the scope of the questions to topics they might be at least generally familiar with. On the other hand, it keeps the workload for tracking questions reasonable. However, this also excludes many possible questions. For example, funding decisions in international organizations are of policy relevance and might have severe implications, but information on them is hardly available. Third, reliable information should be available for the events. In the field of security policy this can be difficult, as information is inherently subject to the conflict dynamics and in many conflict areas reliable information is hard to come by. Take for instance the conflict in the Democratic Republic of Congo or even Syria, where smaller incidents are rarely reported and, even if they are, cannot be independently confirmed.
## Measuring forecasting success
<!-- Brier score: General -->
In order to assess the quality of forecasts, they have to be scored. A common method is based on the Brier score, which was originally proposed in the context of weather forecasting [@Brier.1950]. Generally speaking, the Brier score indicates the distance of the forecast to the truth. More precisely, the Brier score is the squared error of a probabilistic forecast. To calculate it, the forecasts are expressed on the range between 0 (0%) and 1 (100%). The realized events are coded either 0 (if the event did not happen) or 1 (if the event did happen). For each answer option, the difference between the forecast and the correct answer is squared, and the squared differences are summed. It can be expressed as:
\begin{equation}
\frac{1}{N} \sum^{N}_{i=1}\sum^{R}_{k=1}(p_{ik}-o_{ik})^2
\end{equation}
<!-- Brier Score: Details + Scoring Board rules-->
$N$ stands for the number of events, $R$ is the number of possible realizations the event can have, $p$ is the probability forecast and $o$ the realized outcome. The Brier score can evaluate questions with more than two possible outcomes ($R>2$), but in this paper only binary events are considered ($R=2$).^[As the competition only includes binary events, a simpler version of the Brier score (which is equivalent to the squared error) would also be sufficient. But to make comparison to the Good Judgement Project easier the multinomial version of the Brier score is used here. The simple Brier score can easily be computed by dividing the multinomial Brier score by two.] The participants' Brier scores are computed by averaging the Brier scores across the questions.^[In contrast to @Mellers.2015 the scores don't have to be normalized as the participants had to answer all questions and could not self-select the questions they thought to be easiest.] The best (lowest) possible Brier score is 0, and the worst (highest) possible Brier score is 2. The Brier score is a proper scoring rule, which means that participants cannot improve their expected score by reporting a probability different from their actual belief.
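
For binary questions the scoring rule above reduces to a short computation. The following sketch is an illustration only; the function name `brier_score` and the example forecasts are hypothetical and not taken from the tournament data.

```r
# Multinomial Brier score for binary questions (R = 2).
# p: forecast probabilities that the event happens, between 0 and 1
# o: realized outcomes, 1 = event happened, 0 = event did not happen
brier_score <- function(p, o) {
  stopifnot(length(p) == length(o), all(p >= 0 & p <= 1), all(o %in% c(0, 1)))
  # Two answer options per question: "yes" (p against o) and
  # "no" (1 - p against 1 - o); squared errors are summed over both options
  # and averaged over the questions.
  mean((p - o)^2 + ((1 - p) - (1 - o))^2)
}

# Made-up forecasts for three questions and their outcomes
brier_score(p = c(0.8, 0.3, 0.5), o = c(1, 0, 1))

# Boundary cases: a perfect forecast, the worst possible forecast, a 50% guess
brier_score(1, 1)
brier_score(0, 1)
brier_score(0.5, 1)
```

The boundary cases reproduce the range stated in the text: 0 for a perfect forecast, 2 for a maximally wrong one, and 0.5 for a 50% guess.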
# Results
<!-- [Brier score distribution] -->
The research design aims at identifying different factors of individual forecasting success. For this, measurable differences between the individuals are a prerequisite. In the case of a forecasting tournament, this can be verified by looking at the distribution of Brier scores (Figure \ref{fig:brier}). Since the distribution ranges from `r round(min(SPFT$brier.avg),2)` to `r round(max(SPFT$brier.avg),2)` with a mean score of `r round(mean(SPFT$brier.avg), 2)` we have enough variance for further testing.
```{r echo = FALSE, fig.cap="\\label{fig:brier} Brier score distribution", out.width=c('300px', '140px'), fig.align='center', fig.pos = 'H'}
brier.plot
```
<!-- Skill vs. Luck / Pre-Hypotheses / T Test -->
Before turning to the hypotheses, it makes sense to see how the participants performed in comparison to a simple statistical benchmark. This gives us a picture of the overall forecasting ability of the participants. A standard benchmark is the comparison of the realized forecasting scores to a situation with uniformly distributed random outcomes [@Mellers.2015]. The average Brier score with random events is `r round(mean(brier.exp.fq),digits=2)`, while the actual average Brier score of the participants was `r round(mean(SPFT$brier.avg),digits=2)` (`r t.test.against.random`).^[In this context a one-sided t-test is used to see whether the forecasters performed better: $H_0: \bar b = b_{rand}$ and $H_A: \bar b < b_{rand}$. The Brier score for random events was obtained by computing the expected Brier score for each question and individual and taking the average: $\frac{1}{24 \cdot 214} \sum^{24}_{q = 1} \sum^{214}_{i = 1} (p_{qi}^2 + (1-p_{qi})^2)$ with probability forecast $p \in [0,1]$, individuals $i \in \{1,..., 214 \}$ and questions $q \in \{1, ...,24\}$.]
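
The benchmark comparison can be sketched as follows. All data in this sketch are simulated and all object names are hypothetical; it only shows the mechanics of the expected score under random outcomes and of the one-sided t-test, not the tournament results.

```r
set.seed(1)  # arbitrary seed for the simulated data

# Simulated forecasts: 214 forecasters (rows) x 24 questions (columns)
p <- matrix(runif(214 * 24), nrow = 214)
# Simulated outcomes of the 24 questions (1 = happened, 0 = did not happen)
o <- rbinom(24, size = 1, prob = 0.5)

# Expected multinomial Brier score of a forecast p if the outcome is a fair
# coin flip: 0.5 * 2 * (1 - p)^2 + 0.5 * 2 * p^2 = p^2 + (1 - p)^2
expected_brier_random <- function(p) p^2 + (1 - p)^2
b_rand <- mean(expected_brier_random(p))

# Realized average Brier score per forecaster
o_mat     <- matrix(o, nrow = nrow(p), ncol = ncol(p), byrow = TRUE)
brier_avg <- rowMeans(2 * (p - o_mat)^2)

# One-sided test of H0: mean score = b_rand against HA: mean score < b_rand
result <- t.test(brier_avg, mu = b_rand, alternative = "less")
result
```

Because the simulated forecasts carry no information about the simulated outcomes, this example should usually not reject the null hypothesis; with informative forecasts the realized scores fall below the benchmark.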
<!-- [evaluation results] -->
Hence, the forecasting crowd performed significantly better than the benchmark. This supports the view of forecasting optimists: To some degree forecasting seems possible.
<!-- ^[However, if we compare the performance of the crowd to a simple 50% guess for each question (brier score = 0.5) the forecasting crowd on average did not perform better.] CHECK-->
<!-- The alternative *b. The result shows that on average forecasters did not perform better than a random quess[, they even performed worse]. This is not a surprise as it is in line with the literature on intuitive and expert judgements. Rather than looking at the group average we should look at successful forecasters and understand why they outperform the group average.* -->
<!--[Alternative measure: Proportions of forecasts on the correct side of 50%] -->
A second and more intuitive measurement of the overall forecasting accuracy of the participants is the proportion of forecasts which were on the correct side of 50% [@Mellers.2015, p. 6]. The measure counts the forecasts above 50% for events which happened and the forecasts below 50% for events which did not happen and divides them by the total number of forecasts. The perfect score would be 100%, while a score of 50% corresponds to chance. The forecasting average of the tournament participants was `r round(mean(SB.CS$cs.avg)*100, 1)`%. Again, we can use the t-test to assess whether the difference to the chance score is significant (`r t.test.correct.side`).
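
The correct-side measure is straightforward to compute. In this sketch the function name `correct_side` and the example values are hypothetical; it also assumes that a forecast of exactly 50% is not counted as correct, which the text does not specify.

```r
# Share of forecasts on the correct side of 50%.
# p: forecast probabilities, o: outcomes (1 = happened, 0 = did not happen)
correct_side <- function(p, o) {
  stopifnot(length(p) == length(o))
  # Correct: above 0.5 for an event that happened, or below 0.5 for an event
  # that did not happen. A forecast of exactly 0.5 is on neither side
  # (an assumption of this sketch).
  mean((p > 0.5 & o == 1) | (p < 0.5 & o == 0))
}

# Made-up example: two forecasts on the correct side, one on the wrong side,
# and one exactly at 50%
correct_side(p = c(0.8, 0.3, 0.6, 0.5), o = c(1, 0, 0, 1))
```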
\begin{table}
\centering
```{r results = "asis", echo = FALSE}
stargazer(select(SPFT, brier.avg, bnt.s, mct.c, time.fq.sec, Duration.min, age),
          title = "Descriptive Statistics",
          covariate.labels = c("Brier score", "BNT Score", "MCT Score",
                               "Forecasting time in min", "Total time in min",
                               "Age"),
          header = FALSE, float=F, digits = 2)
```
\caption[desc]{Descriptive Statistics}
\label{tab:desc}
\end{table}
```{r results="asis", echo = F}
cor.plot
```
<!-- [evaluation] -->
As with the first benchmark, this measurement indicates that forecasters performed better than random guessing. However, it also illustrates that on average the forecasters were just slightly better than chance. For comparison: In the Good Judgement Project [@Mellers.2015, p. 6] forecasting competition the share was 75%, indicating that their crowd of forecasters performed better.^[A third measure for the overall performance of the crowd would be to compare the average score of the crowd to the performance of a 50% guess for each question, which would basically assume the decision maker ignores all information. A 50% guess for each question is equivalent to a Brier score of 0.5. Hence, according to this measurement the crowd actually performed worse than a simple 50% guessing strategy and the crowd's performance looks less favorable compared to the other measures. But this is not a problem, as the primary focus here is to understand the individual differences between forecasters and why some managed to outperform the rest.] <!-- check all of this after 24.04.-->
<!-- The correlation between brier scores and the correct side measurement could be computed in order to show that both measures capture accuracy
Alternative: *b. Like the first benchmark, the alternative measure shows that forecasters perform on average [like / worse] than random guessing.* Since the measure tries to capture the same thing, this is not surprising. -->
## Hypothesis 1a: Intelligence
<!-- [Desciptives for Hypothesis 1a] -->
Having established the performance of the crowd, the focus can now turn to factors behind the forecasting success of individual forecasters, starting with the first dispositional factor: intelligence. The forecasters' mean score on the Berlin Numeracy Test (BNT) was `r round(mean(SPFT$bnt.s),2) ` (on a range from 0 to 4), which is slightly above the 1.6 average score @Cokely.2012 found for Berlin university students.^[The forecasting tournament used the four item 'paper and pencil' version of the test. For this version there is no general population score available for comparison. However, in principle a comparison would be possible by adjusting the score to the more commonly used adaptive test format. For further details see @Cokely.2012.] As the distribution of BNT scores illustrates (Figure \ref{fig:bnt}), the test is able to discriminate between the participants. It assigned the forecasters to five score levels, each of which is roughly equal in size.
```{r echo = FALSE, fig.cap="\\label{fig:bnt} Berlin Numeracy Test score distribution", out.width=c('240px', '150px'), fig.align='center', fig.pos = 'H'}
bnt.plot
```
<!-- evaluation of hypothesis 1a-->
In order to understand the relationship between intelligence and forecasting accuracy, the Pearson correlation coefficient is informative. It indicates the direction of the relation through its sign (+ or -) and the strength of the relation through its magnitude, on a range from 0 to 1. The correlation between the BNT score and the Brier score is: `r cor.brier.bnt`. With the t-test it can be further assessed whether the result is significantly different from no relationship between both variables ($r = 0$).
\footnote{Here the following t-test statistic is used: $t = r\sqrt{\frac{n-2}{1-r^2}}$. Note that the Pearson correlation test treats the BNT score as a continuous variable, implying that the intelligence difference between the score levels is about equal.}
As expected, the correlation is negative, which implies that more intelligent individuals tended to be more accurate forecasters (expressed in a lower Brier score). This supports the first hypothesis (1a) in line with prior findings in the literature. However, the relationship is not very strong<!-- neither strong nor clearly significant-->. There can be several reasons for this: First, the relation between intelligence and forecasting accuracy might be less strong than previously argued. This is, however, rather unlikely, as intelligence was a strong predictor of forecasting success in more sophisticated research designs [@Mellers.2015; @Poore.2014]. Second, the BNT might not be valid and therefore not measure what it claims to measure. Again, this is rather unlikely, as it has been extensively tested in various settings [@Cokely.2012]. Nevertheless, one might argue that it measures an intelligence dimension which is less relevant for forecasting than other aspects of intelligence. However, BNT scores are correlated with other intelligence measures [@Cokely.2012] and there is no plausible reason why risk literacy should not matter for forecasting while other intelligence measures do. Third and most likely: The forecasting success in the tournament reflects a mix of skill and luck, which is skewed towards luck. This was already indicated by the rather moderate performance of the crowd against standard benchmarks. The reason is probably the one-off nature of the forecasting tournament. In contrast to the forecasting projects of @Mellers.2015 and @Poore.2014, participants had little chance to incorporate feedback and improve their forecasting skills under these conditions. If this holds true, it is likely to be reflected in the tests of the remaining hypotheses.
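
The correlation test used for hypothesis 1a can be reproduced with base R. The data below are simulated and the object names are hypothetical; the sketch only shows that the t statistic given in the footnote is the one reported by `cor.test`.

```r
set.seed(42)  # arbitrary seed for the simulated data

n     <- 214
bnt   <- sample(0:4, n, replace = TRUE)             # simulated BNT scores
brier <- 0.6 - 0.02 * bnt + rnorm(n, sd = 0.15)     # simulated Brier scores

# Pearson correlation and the t statistic t = r * sqrt((n - 2) / (1 - r^2))
r        <- cor(bnt, brier)
t_manual <- r * sqrt((n - 2) / (1 - r^2))

# Built-in test of H0: r = 0, with n - 2 degrees of freedom
test <- cor.test(bnt, brier, method = "pearson")

# Both routes give the same statistic
all.equal(unname(test$statistic), t_manual)
```

If treating the five BNT levels as equally spaced is a concern, a rank-based alternative such as `cor.test(..., method = "spearman")` could serve as a robustness check.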
## Hypothesis 1b: Moral Competency
With this in mind, I can now turn to the second dispositional factor under scrutiny: moral competency. The Moral Competency Test (MCT) resulted in scores ranging from `r round(min(SPFT$mct.c, na.rm = TRUE),2)` to `r round(max(SPFT$mct.c, na.rm = TRUE),2)`. Among the participants there are more individuals with very high scores (mct > 0.5) than in other comparable groups [@Lind.2008]. The moral competency score is not available for all participants, since completing the moral competency test questions was not obligatory. However, the missing values are largely due to participants forgetting to answer a sub-question and not to categorical non-replies.^[There is no procedure for calculating the MCT score with missing answers.] Hence, the scores should not suffer from systematic non-response bias. In order to assess the hypothesis, the Pearson coefficient is used to evaluate the relationship between forecasting accuracy and moral competency: `r cor.brier.mct`.
```{r echo=FALSE, fig.cap="Scatterplot Moral Competency and Brier Score", out.width=c('300px', '150px'), fig.align='center', fig.pos = 'H'}
cor.brier.mct.plot
```
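How such a coefficient behaves when some MCT scores are missing can be illustrated with a minimal sketch. The values below are invented toy data, not tournament data, and the column names `brier` and `mct` are hypothetical; the scripts behind the reported coefficient may proceed differently.

```r
# Toy data (hypothetical values, NOT the tournament data).
# NA marks a participant without a computable MCT score.
toy <- data.frame(
  brier = c(0.52, 0.47, 0.40, 0.44, 0.35, 0.50, 0.38),
  mct   = c(0.10, 0.25,   NA, 0.30, 0.55,   NA, 0.45)
)

# Pearson correlation on complete pairs only
r_complete <- cor(toy$brier, toy$mct, use = "complete.obs")

# cor.test() also drops incomplete pairs and adds a significance test
ct <- cor.test(toy$brier, toy$mct, method = "pearson")

# Number of participants that actually enter the calculation
n_used <- sum(complete.cases(toy))
```

In this sketch participants with a missing score are simply left out, which is harmless only if the scores are missing unsystematically, as argued above.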
<!-- evaluation hypothesis 1b -->
The correlation between moral competency and forecasting accuracy indicates that individuals with a higher moral competency score forecasted slightly more accurately. However, the effect is<!--very--> weak and less significant than for intelligence<!-- and clearly not significant-->. It is not possible to draw a solid conclusion on the relationship between moral competency and forecasting accuracy from this. As with the intelligence hypothesis, the role of luck versus skill might skew the result. In the case of moral competency, however, another major reason might also play a role: the validity of the MCT for the underlying question. The test is a valid measure of whether people consistently evaluate moral questions on the basis of their moral quality when faced with difficult decisions [@Lind.2008, p. 200]. In other words, it examines whether individuals consider different moral reasons for actions despite holding a moral position of their own. In the context of forecasting, however, we are interested in whether the moral opinions of forecasters affect their analytic judgements. This is not exactly the same thing. Still, the two tasks may be sufficiently similar: in both cases, decision makers have to consider contrary views and incorporate them into their judgement. Hence, they have to be willing and able to incorporate new information while holding a personal stance on the issue. The link between moral and analytical judgements can, however, only be disentangled with further research.
## Hypothesis 2: Time
<!-- Descripitves Hypothesis 2 -->
Hypothesis 2 concerns the role of decision time in forecasting accuracy. The median participant spent `r round(median(SPFT$Duration.min),1)` minutes on the survey as a whole and `r round(median(SPFT$time.fq.sec, na.rm = TRUE),1)` minutes on answering the forecasting questions. Hence, a large share of the participants did not spend much time on the individual questions but relied on their intuitive judgement.^[Overall, `r round(intu.share,1)`% of the participants said they used only or mostly intuition to answer the questions.] This is not true for all participants, however, as some spent a considerable amount of time on forecasting. Hence, there is sufficient variance among the participants to test the hypothesis. Since hypothesis 2 assumes that the marginal benefit of time decreases, the logarithmized time for answering the forecasting questions is used instead of the actual time. The Pearson correlation between the logarithmized time and forecasting accuracy is: `r cor.brier.time`.^[Two extreme outliers were excluded as the recorded time was implausibly high, likely as a result of interrupting the survey in order to do something else.]
<!-- Discussion of second method to test the relevance of time: Divide participants into two groups depending on the time they spend in the questions and compare group means. The exact devision time is however arbitrary.-->
<!-- evaluation of Hypothesis 2 -->
The negative correlation coefficient indicates that more time spent on answering the questions is associated with higher forecasting accuracy. Moreover, when we compare the correlations for linear and logarithmized time, the latter has a higher correlation coefficient in absolute terms and is clearly significant (| $r_{log} =$ `r round(cor(SPFT$brier.avg, SPFT$time.fq.sec.log, use="complete.obs"),2)`| > | $r_{linear} =$ `r round(cor(SPFT$brier.avg, SPFT$time.fq.sec, use="complete.obs"),2)`|). This supports the view of a decreasing marginal return on forecasting time, as expected in hypothesis 2. To conclude: forecasters who used more time performed better, but the added accuracy decreased as they spent more time on the forecasting questions.
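The logic of comparing the two specifications can be sketched with toy data. The numbers below are invented and constructed so that accuracy improves linearly in log(time), which is the diminishing-returns pattern assumed by hypothesis 2; they say nothing about the tournament data themselves, and the variable names are hypothetical.

```r
# Toy data (hypothetical, NOT the tournament data): Brier scores are
# constructed to fall linearly in log(time), i.e. with diminishing returns.
time_min <- c(1, 2, 4, 8, 16, 32, 64)
brier    <- 0.5 - 0.05 * log(time_min)

r_linear <- cor(brier, time_min)       # correlation with raw time
r_log    <- cor(brier, log(time_min))  # correlation with log(time)

# Under diminishing returns the log specification fits more tightly
abs(r_log) > abs(r_linear)
```

If the returns to time were constant instead, the raw time would fit at least as well as its logarithm, so the comparison of the two coefficients carries information about the shape of the relationship.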
<!-- ALTERNATIVE However, the relationship is not significant]
*The positive correlation is contrary to the time hypothesis. This might [again] be an indicator that forecasting success in this forecasting tournament is rather due to luck than to other factors. It can also mean that other aspects, e.g. prior political knowledge, offset the effects of time for forecasting. This could only be investigated in further research.* -->
```{r echo = FALSE, fig.pos='H', fig.cap= "Forecasting time - Brier score scatterplot", out.width=c('240px', '120px'), fig.align='center'}
cor.brier.time.plot
```
```{r echo = FALSE, fig.pos='H', fig.cap= "Log (forecasting time) - Brier score scatterplot", out.width=c('240px', '120px'), fig.align='center'}
cor.brier.time.log.plot
```
## Hypothesis 3: Decision guide
<!-- Hypothesis 3 -->
Finally, hypothesis 3 concerns the effect of a decision aid intervention. The participants were randomly assigned to a treatment group (n = `r plyr::count(SPFT$Group[SPFT$Group == "Treatment"])[2]`) or a control group (n = `r plyr::count(SPFT$Group[SPFT$Group == "Control"])[2]`). The treatment group was presented with a decision guide, while the control group was not. To test whether the analytic guide had any impact on forecasting accuracy, the mean Brier scores of both groups are compared. The mean Brier score of the treatment group is `r round(t.test.intervention[[5]][1],2)` and that of the control group is `r round(t.test.intervention[[5]][2],2)`.
<!-- evaluation hyptothesis 3 -->
There is hardly any difference between the two groups (`r t.test.intervention.result`). Hence, it is unlikely that the intervention had any impact on forecasting accuracy and there is no support for the intervention hypothesis (3). The decision guide might have failed for several reasons:
<!-- first reason: ignored decision guide -->
First, participants might have ignored the decision guide. This might be true in particular for forecasters who spent only a few minutes on answering the questions. Moreover, applying a decision methodology requires cognitive effort from decision makers [@Kretz.2015, p. 68ff] and disrupts the train of thought [@Hernandez.2013]. When faced with difficult analytical tasks, such as forecasting questions, decision makers might have used their mental capacities for information processing rather than for their methodological approach.
The intervention tried to account for this with a minimal preventative measure: the forecasters had to indicate with a check box whether they had read the guide. Check boxes have been shown to be an effective nudge for analysts to increase their attention and are commonly used, e.g. by airline pilots or marine crews, see @Kretz.2015 [p. 33ff.]. All participants in the treatment group indicated that they had read the guide.
So what happened? To see whether the treatment led to any behavioral change, we can check whether it had any effect on the time used for answering the forecasting questions. This is the case: the treatment group used on average `r round(mean(SPFT$time.fq.sec[SPFT$Group == "Treatment"], na.rm = TRUE),1)` min and the control group `r round(mean(SPFT$time.fq.sec[SPFT$Group == "Control"], na.rm = TRUE),1)` min. But these means might be the result of a few extreme outliers. To assess whether the difference is more than a random occurrence, the logarithmized times can be used.^[Visual inspection of the data shows that the time in minutes used by the forecasters clearly does not follow a Gaussian distribution, whereas the logarithmized distribution resembles the classical Gaussian curve.] A t-test shows that the treatment group used more time, but the finding is only significant at the five percent level (`r t.test.intervention.time`). Even though this is not a clear result, it also does not rule out a small behavioral change induced by the decision guide.
```{r echo = FALSE, fig.pos='H', fig.cap= "Distribution forecasting log(time) for treatment and control group ", out.width=c('300px', '200px'), fig.align='center'}
hypo3.time.plot
```
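The comparison of logarithmized times between the two groups can be sketched as follows. The data are invented, the column names are hypothetical, and the sketch assumes a Welch two-sample t-test, which is the default of `t.test()` in base R; the test actually reported above may have been specified differently.

```r
# Toy data (hypothetical, NOT the tournament data): minutes spent on the
# forecasting questions, with one large outlier in the treatment group.
toy <- data.frame(
  group    = rep(c("Control", "Treatment"), each = 6),
  time_min = c(3, 4, 5, 6, 7, 8,      # control
               4, 6, 8, 10, 12, 60)   # treatment
)

# Raw group means: the treatment mean is pulled upwards by the outlier
raw_means <- tapply(toy$time_min, toy$group, mean)

# Welch two-sample t-test on the log-transformed times
tt <- t.test(log(time_min) ~ group, data = toy)
tt$estimate  # group means of log(time)
tt$p.value   # two-sided p-value
```

Taking logs before testing dampens the influence of single very long response times, which is the motivation given above for not comparing the raw means directly.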
But if the forecasters did use the decision methodology, the lack of higher accuracy might also stem from the irrelevance of the proposed methodology. This would contradict the research results of @Tetlock.2015 and his so-called superforecasters, who performed rather well with the methodology of the decision guide in the Good Judgment Project. However, the problem might instead result from inappropriate application of the decision heuristic. In other words: to achieve measurable results in terms of forecasting accuracy, forecasters need sufficient training in the methodology, and a simple decision guide does not provide enough learning experience. This is the most plausible explanation for the failure of the intervention, as for example @Mellers.2014 showed how a one-hour probabilistic training, which included the inside-outside view methodology, could improve forecasting accuracy in the long term.
<!-- general conclusion -->
To conclude, similar to the 'list the hypotheses' or 'map the evidence' methodology [@Kretz.2015], a simple decision guide can be added to the list of decision aids with no considerable effect on analytical decisions.
# Forecast aggregation
<!-- 1. simple mean probability forecast -->
So far, the focus of the paper has been on the quality of individual forecasting decisions. However, having a group of forecasters enables us to generate forecasts by using the wisdom of the crowd effect. The simplest approach, following @Galton.1907 in his example of predicting the weight of an ox, is to use the average probability forecast of the participants. Compared to the average Brier score among the participants (`r round(mean(SPFT$brier.avg),2)`), the score from averaging the forecasts for each question is `r bs.mean`.<!--check aggregated result --> The crowd forecast thus performed significantly better than the average participant in the tournament (`r t.test.bs.avg.mean`).
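The difference between the average individual score and the score of the averaged forecast can be made concrete with a small sketch. The forecasts and outcomes are invented, the helper `brier_score` is hypothetical, and the sketch assumes the single-category form of the Brier score (mean squared difference between forecast and outcome); the two-category form would simply double all values.

```r
# Brier score for binary questions: mean squared difference between
# forecast probability and outcome (0 = did not occur, 1 = occurred).
# Assumption: single-category form; the two-category form is twice this.
brier_score <- function(p, outcome) mean((p - outcome)^2)

# Toy data (hypothetical): rows are forecasters, columns are questions
forecasts <- rbind(
  c(0.9, 0.2, 0.6),
  c(0.6, 0.4, 0.3),
  c(0.7, 0.1, 0.8)
)
outcome <- c(1, 0, 1)

# Average of the individual Brier scores
bs_individual <- apply(forecasts, 1, brier_score, outcome = outcome)
bs_avg <- mean(bs_individual)

# Brier score of the crowd forecast (mean probability per question)
crowd <- colMeans(forecasts)
bs_crowd <- brier_score(crowd, outcome)
```

Because the squared error is convex, the score of the averaged forecast can never be worse than the average of the individual scores; how much better it is depends on how strongly the forecasters disagree.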
<!-- 2. intro: smart aggregation -->
But if we take into account what we know about individual forecasting decisions, the crowd wisdom can be aggregated in a more effective way. On the one hand, we can use individual predictors of forecasting success for combining the forecasts. On the other hand, extremizing can correct biases in aggregated probability forecasts [@Wallsten.1997; @Zhang.2012].
<!-- sub group aggregating-->
The individual differences between forecasters can be exploited in different ways when aggregating forecasts. A simple approach is to combine only the forecasts of individuals who are expected to perform well. The forecast of this subgroup will then reflect both individual forecasting skill and the wisdom of the crowd effect. The selection can be based on the past performance of forecasters, as done for example by @Mellers.2014. Since the past performance of forecasters is unknown in the security policy forecasting tournament, this is not an option here. However, we can use the predictors of individual forecasting success to select a subgroup. As discussed in the previous section, these are, in particular, the BNT score and the time used for forecasting. The choice of the exact cut-off point is somewhat arbitrary. Here the subgroup is composed of the individuals who are in the better half of the BNT score range and who spent sufficient time on forecasting.^[More precisely: individuals who had a BNT score of at least 3 and spent more time on the questions than the median forecaster.] Taking the average forecast of this selected subgroup results in a Brier score of `r bs.cutoff.mean`, which is clearly better than the `r bs.mean` score of the whole group. Theoretically, we could improve the result further by finding an ideal cut-off point for the subgroup. But then we would risk overfitting the data, and for illustration it is sufficient to show that the Brier score improves even with an arbitrary cut-off point.
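A minimal sketch of this subgroup aggregation, using the cut-off described in the footnote, could look as follows. All values are invented and all object names are hypothetical; the improvement in the toy output is built into the invented numbers and is not evidence about the tournament.

```r
# Toy data (hypothetical, NOT the tournament data)
forecasters <- data.frame(
  bnt      = c(4, 2, 3, 1, 4, 3),
  time_min = c(12, 3, 9, 15, 4, 20)
)
# Rows are forecasters (same order as above), columns are questions
forecasts <- rbind(
  c(0.8, 0.2), c(0.5, 0.5), c(0.7, 0.3),
  c(0.4, 0.6), c(0.6, 0.5), c(0.9, 0.1)
)
outcome <- c(1, 0)

# Single-category Brier score (an assumption of this sketch)
brier_score <- function(p, outcome) mean((p - outcome)^2)

# Cut-off: BNT score of at least 3 and more time than the median forecaster
selected <- forecasters$bnt >= 3 &
  forecasters$time_min > median(forecasters$time_min)

bs_all    <- brier_score(colMeans(forecasts), outcome)
bs_subset <- brier_score(colMeans(forecasts[selected, , drop = FALSE]), outcome)
```

The `drop = FALSE` argument keeps the selection a matrix even if only one forecaster passes the cut-off, so that the column means remain defined.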
<!-- weighted forecasts -->
Another approach is to weight the forecasts by the drivers of forecasting success. This approach has the advantage of avoiding a cut-off point. The weights can be constructed in various ways, but it seems plausible to assume that intelligence and the time used interact. In other words: a more intelligent forecaster should also be able to utilize the time more effectively. Therefore the weights are constructed by multiplying the BNT score and the logarithmized time ($w_i = bnt_i \cdot log(time_i)$).^[The logarithmized time is used to reduce the impact of outliers. It also better describes the relation between forecasting accuracy and time (Hypothesis 2).] As with the subgroup aggregation, the weighted forecast performs better (`r bs.mean.w.bnt.time`) than the unweighted average (`r bs.mean`), but the improvement is rather small.
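The weighting scheme can be sketched in a few lines. The data are invented and the object names are hypothetical; the sketch normalizes the weights implicitly through `weighted.mean()`, which is one plausible reading of the construction described above.

```r
# Toy data (hypothetical, NOT the tournament data)
bnt      <- c(4, 2, 3, 1)
time_min <- c(12, 3, 9, 15)
forecasts <- rbind(          # rows: forecasters, columns: questions
  c(0.8, 0.2),
  c(0.5, 0.5),
  c(0.7, 0.3),
  c(0.4, 0.6)
)

# Weights as in the text: w_i = bnt_i * log(time_i).
# Caveat: log(time) is negative for times below one unit, so such values
# would need special handling before being used as weights.
w <- bnt * log(time_min)

# Weighted mean forecast per question, next to the simple average
weighted_crowd <- apply(forecasts, 2, weighted.mean, w = w)
simple_crowd   <- colMeans(forecasts)
```

In the toy data the first forecaster has both the highest BNT score and a long response time, so the weighted forecast moves towards that forecaster's judgement relative to the simple average.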
<!-- debiasing / extremizing -->
Apart from exploiting the individual differences, extremizing also improves the aggregated forecasts. There are several reasons for this [@Baron.2014; @Satopaa.2014]: First, the random errors are compressed at the ends of the 0 to 1 probability scale, which pushes the average forecast towards 0.5. Second, individual forecasters draw on different sources. The diversity of information sources allows us to be more confident than the simple average forecast suggests, since the aggregate is actually based on a broader informational base. This would not be the case if all forecasters used the same information for their judgement.^[A commonly discussed example to illustrate this is President Obama's decision on whether to start a special operation to kill Osama bin Laden in Abbottabad. If his advisors used the same information sources, he would be best advised to average their estimates. If the estimates are based on different sources, extremizing is advisable [e.g. @Tetlock.2015].] Third, forecasts are underconfident because they only utilize partial information. Theoretically speaking, a forecaster would start from a 50% guess and incrementally adjust the probability by including information.^[Whether it is really rational to start from a 0.5 prior when having no information has recently attracted some criticism, see e.g. @Gilboa.2009.] This information should on average point in the direction of the best-informed probability forecast. But in reality the best-informed forecast is not reached, because most forecasters stop before they have used all available information.
<!-- how to extremize -->
In this paper the extremization is implemented with a simple logit model, as proposed by @Satopaa.2014. The logit transformation is convenient for extremizing as it maps the probabilities to a continuous domain where a 50% probability is equivalent to a log-odds value of 0. The systematic bias, which we want to correct for with extremizing, can be described with a single variable $a \in [0,\infty)$. The best-informed forecast is described by $a = 1$. An underconfident forecast would be expressed with $a > 1$. In principle, this representation is also applicable to overconfidence ($a < 1$). But former research has shown that forecasting crowds as a whole tend to be underconfident [@Baron.2014]. Mathematically, the relation between the best-informed forecast ($p$) and the individual forecasts ($p_i$) can be described as follows:
\begin{equation}
y_i = \log\Big(\frac{p_i}{1-p_i}\Big) = \frac{1}{a}\log\Big(\frac{p}{1-p}\Big) + \epsilon_i
\end{equation}
In order to illustrate the potential of extremizing, we have to compute the corrected probabilities $p_E$ for each question. This can be done with:^[This is a maximum likelihood estimator. For a more in-depth explanation see @Satopaa.2014.]
\begin{equation}
\hat p_E(a) = \frac{\Big[\prod^N_{i = 1}\big(\frac{p_i}{1-p_i}\big)^{\frac{1}{N}}\Big]^{a}}{1 + \Big[\prod^N_{i = 1}\big(\frac{p_i}{1-p_i}\big)^{\frac{1}{N}}\Big]^{a}}
\end{equation}
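The step from the model to the estimator can be sketched as follows, assuming that the errors $\epsilon_i$ have mean zero so that they roughly cancel when the individual log-odds are averaged (the formal maximum likelihood argument is given in @Satopaa.2014). The symbol $\bar y$, the mean of the individual log-odds, is notation introduced only for this sketch:

\begin{equation}
\bar y = \frac{1}{N}\sum^N_{i = 1} y_i = \log\Big[\prod^N_{i = 1}\Big(\frac{p_i}{1-p_i}\Big)^{\frac{1}{N}}\Big] \approx \frac{1}{a}\log\Big(\frac{p}{1-p}\Big)
\quad \Rightarrow \quad
\hat p_E(a) = \frac{\exp(a \bar y)}{1 + \exp(a \bar y)}
\end{equation}

Since $\exp(\bar y)$ is the geometric mean of the individual odds, $\exp(a \bar y)$ is this geometric mean raised to the power $a$, which is the term in square brackets in the estimator.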
<!-- results of smart crowd aggregation -->
To illustrate the potential of extremizing, we can look for the level of bias correction ($a$) which minimizes the Brier score. For the security policy forecasting tournament the estimated bias is $a^*$ = `r round(bias.a$minimum,2)`. Hence, as expected, the forecaster crowd is on average underconfident. Extremizing improves the forecasting accuracy to `r round(bias.a$objective,3)` from `r bs.mean`. Extremizing can thus lead to accuracy gains similar to those from exploiting the individual differences between forecasters.
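The search for $a^*$ can be sketched with base R. The result fields `minimum` and `objective` referred to above match what `optimize()` returns, so a sketch along these lines seems plausible, but this is an assumption; the forecasts, outcomes, search interval and helper functions below are invented for illustration and use the single-category Brier score.

```r
# Single-category Brier score (an assumption of this sketch)
brier_score <- function(p, outcome) mean((p - outcome)^2)

# Extremized crowd forecast per question: mean log-odds (the log of the
# geometric mean of the odds), scaled by a, mapped back to a probability.
# Forecasts of exactly 0 or 1 give infinite log-odds and would need
# clipping first.
extremize <- function(forecasts, a) {
  log_odds <- log(forecasts / (1 - forecasts))
  plogis(a * colMeans(log_odds))
}

# Toy data (hypothetical): rows are forecasters, columns are questions
forecasts <- rbind(
  c(0.60, 0.70, 0.40, 0.60),
  c(0.70, 0.60, 0.30, 0.50),
  c(0.50, 0.65, 0.45, 0.60)
)
outcome <- c(1, 1, 0, 0)

# Search for the a that minimizes the Brier score on these data
bias_a <- optimize(
  function(a) brier_score(extremize(forecasts, a), outcome),
  interval = c(0, 10)
)
bias_a$minimum    # estimated a*
bias_a$objective  # Brier score at a*
```

In the toy data the crowd leans in the right direction on three of four questions, so some extremizing helps, while the fourth question limits how far the forecasts can be pushed. Because $a$ is tuned on the same outcomes that are used for scoring, the resulting score illustrates the potential of extremizing rather than out-of-sample performance.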
# Discussion
<!-- General findings -->
To sum up, this research has found the forecasting crowd to perform slightly better than random guessing but not as accurately as crowds in other comparable forecasting competitions. The forecasting tournament supports the idea that intelligence and forecasting time, if diminishing returns are considered for the latter, are useful predictors of forecasting accuracy. Moreover, the data provide indicative support that moral competency is positively related to forecasting accuracy, but the findings are too weak to draw a final conclusion. Finally, there is no support for a measurable effect of a decision guide intervention on forecasting accuracy. These insights can be used when aggregating individual predictions to compute more accurate forecasts.
<!-- Limitations of the design -->
These results should, however, be interpreted with care. First, the size of the tournament was relatively small compared to other similar projects. More questions and forecasters would make the results less sensitive to outliers and reduce the impact of any single interpretation of the forecasting questions, as the occurrence of some events could only be established after some deliberation.^[The answers to the individual questions are accessible on the [project website](https://corrod3.github.io/SecurityPolicyForecastingTournament/).] With more questions, the implications of a single such judgement call would likely balance out on average. Second, the research design allows only limited claims. As the overall performance of the crowd indicates, luck was also a driver of the results, apart from skill. In a long-term setting the effect of forecasting skill would likely be larger, as the participants would get used to the format and learn (differently) from their performance. Additionally, the other hypotheses could be tested more rigorously, as was done for the intervention hypothesis. For example, in the case of the time hypothesis participants could be randomly assigned to groups with different time allocations. Such designs might, however, not be feasible in purely online-based forecasting formats. Third, to understand the relation between moral and analytical judgements in the context of forecasting, the inquiry has to be supplemented with other measurement tools. While these are limitations of this study, they can be addressed fruitfully in future research.
<!-- Policy relevance -->
So what can a policy maker take from all of this? First of all: It matters *who* is forecasting. Hence, when tasking people or institutions with looking at the future, selecting the right persons is key. Risk literacy and numeracy are useful indicators for this. The forecasters should also be provided with an appropriate working environment. Available time for forecasting plays a vital role. In the everyday work of bureaucracies, employees concerned with foreign policy rarely have dedicated time to think analytically about their (implicit) forecasts. Finally, policy makers should use analytical decision guides which have been empirically proven to be effective. This is not the case for the decision guide used here, but there are decision tools which are useful. These recommendations sound commonsensical, but they are often forgotten when it comes to implementation.
# References