Is it possible to make useful forecasts about social phenomena such as the alarm over predictions of dangerous manmade global warming? Kesten Green and Scott Armstrong reasoned that it should be possible to do so using the structured analogies method. (In their 2007 paper, Green and Armstrong found that structured analogies provided forecasts that were more accurate than experts’ judgments when applied to the difficult problem of forecasting decisions in conflicts.) The Global Warming Analogies Forecasting Project page is now up on the Public Policy Forecasting SIG pages (PublicPolicyForecasting.com). The page includes a link to the Project working paper and lists 26 situations that have been identified as analogous to the current alarm; descriptions of six of them are available so far. Kesten and Scott welcome evidence and analysis, especially if it contradicts their own efforts to date.
Can schoolgirls who do not recognize potential candidates do a better job than polls of making an early prediction of who will win an election? A paper by Scott Armstrong, Kesten Green, Randy Jones, and Malcolm Wright, now forthcoming in the International Journal of Public Opinion Research, presents evidence that they can, drawing on the authors’ research predicting the outcome of the previous U.S. presidential election and on previous studies.
The New Zealand schoolgirls and other research participants were shown only pictures of the potential candidates’ faces and were asked to rate their competence. The candidate with the highest facial competence rating in each of the two parties received the highest popular vote. For the Republicans, McCain had the highest facial competence rating at a time when Giuliani was well ahead in the polls. For the Democrats, Clinton had the highest rating and Obama was in second place; both had higher ratings than McCain. Clinton received slightly more popular votes but lost the nomination to Obama in the vote by party delegates.
A working paper version of the paper is available here.
(We have already made changes in response to feedback, on 18 March 2008 and again on 21 March.) Here is the proposed restatement of the principle in full:
13.29 Do not use measures of statistical significance to assess a forecasting method or model.
Description: Even when correctly applied, significance tests are dangerous. Statistical significance tests calculate the probability, assuming the analyst’s null hypothesis is true, that relationships apparent in a sample of data are the result of chance variations that arose in selecting the sample. The probability that is calculated is affected by the size of the sample and the choice of null hypothesis. With large samples, even small differences from what would be expected in the data if the null hypothesis were true will be “statistically significant.” Choosing a different null hypothesis can change the conclusion. Statistical significance tests do not provide useful information on material significance or importance. Moreover, the tests are blind to common problems such as non-response error, response error, and misspecification of relationships. The proper approach to analyzing and communicating findings from empirical studies is to (1) calculate and report effect sizes; (2) estimate the range within which the actual effect size is likely to lie by taking account of prior knowledge and all potential sources of error in measuring the effect; and (3) conduct replications, extensions, and meta-analyses.
Purpose: To avoid the selection of invalid models or methods, and the rejection of valid ones.
Conditions: There are no empirically demonstrated conditions on this principle. Statistical significance tests should not be used unless it can be shown that the measures provide a net benefit in the situation under consideration.
Strength of evidence: Strong logical support and non-experimental evidence. There are many examples showing how significance testing has harmed decision-making. Despite repeated appeals for evidence that statistical significance tests can improve decisions, none has been forthcoming. Tests of statistical significance run contrary to the proper purpose of statistics—which is to help users make sense of data. Experimental studies are needed to identify the conditions, if any, under which tests of statistical significance can improve decision-making.
Source of evidence:
Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23, 321-336, with commentary and a reply.
Hauer, E. (2004). The harm done by tests of statistical significance. Accident Analysis and Prevention, 36, 495-500.
Hubbard, R. & Armstrong, J. S. (2006). Why we don't really know what ‘statistical significance’ means: A major educational failure. Journal of Marketing Education, 28, 114-120.
Hunter, J.E. & Schmidt, F. L. (1996). Cumulative research knowledge and social policy formulation: The critical role of meta-analysis. Psychology, Public Policy, and Law, 2, 324-347.
Ziliak, S. T. & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor, MI: University of Michigan Press.
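The Description’s point that, with large samples, even trivial differences become “statistically significant” while the effect size stays unchanged can be illustrated with a short sketch. Here is a minimal Python example; the means, standard deviation, and sample sizes are hypothetical, and the p-value uses a normal approximation rather than an exact t-test:

```python
import math

def two_sided_p_from_z(z):
    # Two-sided p-value under the normal approximation
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def cohens_d(mean_a, mean_b, sd):
    # Standardized effect size: difference in means per standard deviation
    return (mean_a - mean_b) / sd

# A trivially small difference: 0.02 standard deviations
mean_a, mean_b, sd = 100.2, 100.0, 10.0
d = cohens_d(mean_a, mean_b, sd)  # effect size stays at 0.02 regardless of n

for n in (100, 1_000_000):  # per-group sample sizes
    se = sd * math.sqrt(2 / n)   # standard error of the difference in means
    z = (mean_a - mean_b) / se
    p = two_sided_p_from_z(z)
    print(f"n={n:>9}: d={d:.3f}, p={p:.4f}")
```

The same 0.02-standard-deviation effect is nowhere near “significant” at n = 100 but is overwhelmingly so at n = 1,000,000, which is why the principle recommends reporting the effect size and its plausible range instead of the test result.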
Have you ever been told you should "stand in the other person's shoes" in order to predict the decisions they will make? This is a common and plausible recommendation in popular business books and everyday life but, to date, there has been no experimental evidence on its usefulness. When Kesten Green and Scott Armstrong formalized this advice as a method they call "role thinking", they found that the resulting forecasts were little more accurate than guessing what the decisions might be. This finding is consistent with earlier findings that situations involving interactions between people with different roles are too complicated for experts to predict usefully by thinking through what will happen. The group forecasting method of simulated interaction, on the other hand, allows realistic representations of group interactions and does provide accurate forecasts. Green and Armstrong's paper has been accepted for publication in a special issue of the International Journal of Forecasting on group forecasting. A copy of their working paper is available here.
Andreas Graefe and Scott Armstrong report on results from an experiment on the relative accuracy of three structured approaches compared to traditional face-to-face meetings. The four methods were compared on a quantitative judgment task that did not involve widely dispersed information among participants.
Overall, Delphi performed best, followed by nominal groups, prediction markets and unstructured meetings. Of the three structured approaches, only Delphi outperformed a simple average of participants' prior individual estimates.
The authors also report participants' ratings of the group methods. Participants preferred methods with personal interaction, such as meetings and nominal groups. Prediction markets were rated least favorably.
The pre-print version of the paper, which will be published by the International Journal of Forecasting, is available here.
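The simple average of participants' prior individual estimates, which only Delphi outperformed, is a demanding benchmark: averaging independent judgments cancels out some of the individual errors. A small sketch of the idea; the true value and the five estimates are made up purely for illustration:

```python
from statistics import mean

# Hypothetical quantitative judgment task: true answer and five prior estimates
truth = 100.0
estimates = [120.0, 95.0, 110.0, 130.0, 85.0]

group_average = mean(estimates)             # the simple-average benchmark
avg_error = abs(group_average - truth)      # error of the averaged forecast
mean_individual_error = mean(abs(e - truth) for e in estimates)

print(avg_error, mean_individual_error)     # -> 8.0 16.0
```

In this made-up example the averaged forecast errs by half as much as the typical individual, which is why a group method has to add real value before it beats the average.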