Statistical reasoning (SR) is a form of reasoning with probabilistic features, applicable to inference and decision making in the presence of an uncertainty that cannot be expressed in terms of known and agreed chance probabilities. Thus SR is not relevant to games of pure chance, such as backgammon with well-engineered dice, but is likely to be involved in guessing the voting intentions of an electorate and fixing an advantageous polling date.
Its application is usually mediated by some standard statistical method (SM) whose prestige and convenience, especially if computerized, can induce a neglect of the associated SR. Even when explicitly formulated, SR may be plausible (or not) in appearance and efficacious (or not) in its ultimate influence. The evolutionary theory of SR (Campbell 1974) postulates that it is a genetically controlled mental activity justified by survival advantage. A related black-box view of the efficacy of SR may be useful in deciding between the claims of different SR schools that their respective nostrums are found to work in practice. We shall concentrate here, however, on the plausibility of the types of SR usually associated with particular statistical methods, and go on to consider principles that may assist in the continually required discrimination in favour of good SR. Our ith example of method is denoted by SMi and the jth example of possible reasoning for it is denoted by SRij. Undefined terms will be supposed to have their ordinary interpretations.
SM1: The incorporation of an element of objective random sampling in any observations on a population of identifiable items, that ensures for each item a specified, non-negligible probability of being included in the sample.
SR11: Without the element of random sampling, it is impossible for the sampler to justify the selection of items without reference to some systematic, comprehensive theory, which may be erroneous or, worse, subject to undeclared or subconscious bias.
SR12: With the element, it is maintainable by probabilistic argument that the unobserved items should not be systematically different from those observed. This permits tests of hypotheses about the population as a whole.
SR13: The power of such tests may be enhanced by the device of restricted randomization which excludes in advance the selection of samples that would only weakly discriminate among alternative hypotheses.
SM2: Random manipulation of controllable independent variables in the treatment of experimental units, and the analysis of the effect of this manipulation on dependent variables.
SR21: If the effect referred to were reliably established, this could be described as causal, operating either directly or through the agency of other variables. The use of an isolated random manipulator—uninfluenceable and influential only through controllable independent variables—is necessary in order to rule out the hypothesis of spurious correlation between the dependent variables and naturally occurring variation of the independent variables. As a bonus, it also rules out the possibility of the experimenter using ‘inside knowledge’ to produce such a correlation by unconscious or deliberate choice of the values of the control variables.
SR22: The extent to which such causal inference is possible in non-experimental investigation depends on the extent to which changes in the independent variables are induced by factors judged to be equivalent to an isolated random manipulator, as in quasi-experimental studies (Blalock 1972).
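The contrast between SR21 and SR22 can be illustrated by a small simulation. The sketch below is hypothetical throughout: the hidden factor z, the sample size and the function name simulate are assumptions made for illustration, and the outcome y is deliberately constructed to have no causal dependence on the treatment x.

```python
import random

def simulate(randomized, n=20_000, seed=0):
    """Difference in mean outcome between treated and untreated units."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)              # hidden, naturally varying factor (assumed)
        if randomized:
            x = rng.random() < 0.5       # isolated random manipulator
        else:
            x = z > 0                    # natural variation: treatment follows z
        y = z + rng.gauss(0, 1)          # outcome depends on z only, never on x
        (treated if x else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)

diff_randomized = simulate(randomized=True)
diff_natural = simulate(randomized=False)
print("randomized assignment:", round(diff_randomized, 3))
print("natural assignment:   ", round(diff_natural, 3))
```

Under natural assignment the treated and untreated groups differ markedly although the treatment does nothing; under random assignment the difference should be close to zero.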
SM3: Evaluation of the achieved significance level P for the observed value t of a test statistic T whose (null) distribution is specified by a (null) hypothesis H0, i.e.
P=Pr(T≥t|H0).
SR31: When it is small, P provides a standardized interpretable encoding of the deviation of t from the values of T that would be expected if H0 were true. Increasing values of t are encoded as decreasing values of P which induce increasing dissatisfaction with H0. A small value of P forces the simple dichotomy: either H0 is true and a rare event has occurred, or H0 does not describe the actual distribution of T.
SR32: P is not the ‘probability that H0 is true’, which probability is not definable in the set-up of SM3.
SR33: The ‘dissatisfaction’ in SR31 increases smoothly: there is no critical value, 0.05 for example, at which P suddenly becomes scientifically important.
SR34: The provenance of T should be taken into account in the calculation of P when, for example, T has been selected as a result of a search for any interesting feature of the data.
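As a minimal sketch of the calculation in SM3, suppose (purely for illustration) that T is the number of successes in 20 independent trials and that H0 specifies a success probability of 1/2. The function name achieved_significance and the numerical values are assumptions, not part of the method as stated.

```python
import math

def achieved_significance(t, n, p0):
    """P = Pr(T >= t | H0) when T is Binomial(n, p0) under H0."""
    return sum(math.comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(t, n + 1))

n, p0 = 20, 0.5                      # hypothetical null model
for t in (12, 15, 18):               # increasing t gives decreasing P
    print(t, achieved_significance(t, n, p0))
```

The printed values decrease steadily as t increases, in line with SR31 and SR33.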
SM4: Calculation, from the data x, of a 95 per cent confidence interval (l(x), u(x)) for a real-valued parameter θ in a statistical model defined as a set {Prθ}, indexed by θ, of probability distributions of X, the random generic of x.
SR41: The particular interval (l(x), u(x)) is regarded as relevant to inference about the true value θ because of the coverage property
Pr(l(X)≤θ≤u(X))=0.95.
SR42: The value 0.95 is not the ‘probability that θ lies in the particular interval (l(x), u(x))’, which probability is not definable in the set-up of SM4.
SR43: Can the ‘relevance’ mentioned in SR41 be reasonably maintained when, as may happen, the calculated interval turns out to be the whole real line, or the empty set, or when it may be logically established that the interval contains θ? Such counter-examples to SR41 do not arise in the commoner applications of the confidence interval method.
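The coverage property is a statement about repeated sampling, as SR42 stresses, and can be checked by simulation. The sketch below assumes one of the commoner applications: a normal mean with known unit standard deviation, for which the interval is the sample mean plus or minus 1.96/√n. The true value 3.0, the sample size and the function names are illustrative assumptions.

```python
import math
import random

def interval(xs, sigma=1.0, z=1.96):
    """(l(x), u(x)) for a normal mean when sigma is known."""
    mean = sum(xs) / len(xs)
    half_width = z * sigma / math.sqrt(len(xs))
    return mean - half_width, mean + half_width

def coverage(theta, n=25, reps=10_000, seed=0):
    """Proportion of simulated intervals that contain the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        lower, upper = interval(xs)
        hits += lower <= theta <= upper
    return hits / reps

cov = coverage(theta=3.0)
print("estimated coverage:", cov)
```

The estimate should be close to 0.95 whatever value of θ is used, yet nothing in the calculation assigns a probability to any one realized interval.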
SM5: Given data x for a statistical model indexed by a parameter θ, a posterior probability distribution for θ is calculated by the Bayesian formula
posterior density ∝ prior density × Prθ(x)
and used freely for purposes of inference and decision.
SR51: There are now several nearly equivalent formulations of the Bayesian logic (Fishburn 1970) whose upshot, roughly, is that any individual, willing to accept a few qualitative axioms about ‘probability’ and to give expression to them in a rich enough context, will discover that she or he has a subjective probability distribution over everything—or at least over everything related to x. The formula in SM5 is particularly convenient if the first fruits of the introspective process for determining this distribution are not only the assignment of probability 1 to the assertion that data x was indeed randomly generated by the statistical model but also the probability distribution of θ which is the ‘prior density’.
SR52: If the ‘process’ in SR51 were faithfully undertaken by a very large number of Bayesians in a range of contexts, then, if the statistical models to which unit probability is assigned were indeed correct, it would be a consequence of the supposed randomness in the models that the data x would, with high probability, show significant departure from its associated model in a specifiable proportion of cases. This would be so, even if the Bayesians were fully aware of the features of their data at the time of their probability assignments.
It may therefore be necessary to defend the rights of Bayesians to use statistical models that would be rejected by other statistical methods.
SR53: The difficulty for the Bayesian approach just described may be overcome by the assignment of a probability of 1−ε rather than unity to the statistical model: awkward data can then be accommodated by reserving the prior probability ε for any ad hoc models.
SR54: A similar loophole may be employed in dealing with the paradox created by data that simultaneously deviates highly significantly from what is expected under a sharp sub-hypothesis, θ=θ0, say, of the model, while increasing the odds in favour of θ0 (Lindley 1957). For example, suppose a ‘psychic’ correctly predicts 50,500 out of 100,000 tosses of a fair coin and the statistical model is that the number of correct guesses is binomially distributed with probability θ. For the prior that puts prior probability 1/2 at θ=1/2 and spreads the remaining 1/2 uniformly over the interval (0, 1), the posterior odds in favour of θ=1/2 are 1.7/1, although the outcome has an achieved significance level of 0.0008.
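The figures in the ‘psychic’ example can be reproduced with a short calculation. The sketch assumes that the sharp value is θ0 = 1/2, that the prior places probability 1/2 on it and spreads the rest uniformly over (0, 1), and that the significance level is one-sided and obtained from the normal approximation; the variable names are illustrative.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

n, x = 100_000, 50_500

# Probability of the data under the sharp hypothesis theta = 1/2.
log_m0 = log_binom_pmf(x, n, 0.5)
# Probability of the data averaged over a uniform prior on (0, 1):
# the integral of C(n, x) theta^x (1 - theta)^(n - x) is exactly 1/(n + 1).
log_m1 = -math.log(n + 1)

# Prior odds are 1 (mass 1/2 on each side), so posterior odds = Bayes factor.
posterior_odds = math.exp(log_m0 - log_m1)

# One-sided achieved significance level (normal approximation, assumed).
z = (x - n / 2) / math.sqrt(n / 4)
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print("posterior odds for theta = 1/2:", round(posterior_odds, 2))
print("achieved significance level:   ", round(p_value, 4))
```

The same data thus raise the odds on θ0 while lying more than three standard deviations from what θ0 predicts.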
SR55: Bayesians claim that all probabilities are subjective with the possible exception of the quantum theoretic sort. At best, subjective probability distributions may agree to assign unit probability to the same statistical model but, even then, the posterior distributions would differ, reflecting individual priors. Such differences have not succumbed to extensive but largely abortive efforts to promulgate objective priors (Zellner 1980), just as attempts to formalize the apparently reasonable slogan ‘Let the data speak for themselves!’ have proved nugatory.
SM6: Given are
1. a statistical model {Prθ};
2. a set {d} of possible decisions;
3. a loss function L(d, θ), the loss if decision d is taken when θ is true;
4. a set {δ} of decision rules, each of which individually specifies the decision to be taken for each possible x.
Deducible are the risk functions of θ, one for each δ, defined by the expectation under Prθ of the randomly determined loss L(δ(X), θ). The method, not completely specified, consists in selecting a decision rule from {δ} that has a risk function with some optimal character.
SR61: The ambiguity of choice of T for the ‘achieved significance level’ method (SM3), coupled with that method’s lack of concern about its performance when H0 does not hold, led Neyman and Pearson to treat testing a hypothesis as what may now be viewed as a special case of SM6. This has, simply, {d} = {Accept H0, Reject H0}, L=0 or 1 according as d is right or wrong and, as a consequence, a risk function equivalent to a statement of the probabilities of error ‘size’ and ‘1 − power’.
SR62: A difficulty with the risk function approach to inference that is implicit in the Neyman-Pearson treatment of hypothesis testing was pointed out by Cox (1958). It can be illustrated with a simple story. Two pollsters A and B wanted to test the hypothesis that no more than half the electors of a large city, willing to respond to a particular Yes-No question, would do so affirmatively (Cohen 1969). Pollster A suggested that the poll would require only 100 randomly chosen respondents, whereas B wanted to get 10,000 responses. They agreed, first, to toss a fair coin to decide the sample size and, second, to employ the 5 per cent hypothesis test, most powerful in detecting a Yes:No ratio of 2:1, with probabilities of error defined before the outcome of the toss is known. They checked with a statistician that this test would have an overall power of 99 per cent for the alternative hypothesis that the proportion of yeses was 2/3. In the event, the sample size was 10,000 and the number of yeses was 5,678. Both A and B were astonished when advised that this number was too small to reject the hypothesis by the agreed test, even though, had it been obtained in a survey with a non-random choice of the sample size 10,000, it would have had an achieved significance level (SM3) of less than 1 in a million!
The reason for this behaviour is that the Neyman-Pearson lemma, justifying the test, ignores all possibilities other than the null and alternative hypotheses, under both of which any outcome in the region of 5,678 yeses has only the remotest possibility of occurring.
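The pollsters’ predicament can be checked numerically. The sketch below is one reading of the story, under stated assumptions: the null hypothesis is taken at its boundary (proportion 1/2), the most powerful test rejects when the likelihood ratio for 2/3 against 1/2 exceeds a single threshold applied to whichever sample size the coin selects, and randomization on the boundary is ignored, so the size is slightly under 5 per cent. All function names are illustrative.

```python
import math

P0, P1 = 0.5, 2 / 3
A = math.log(P1 / P0)                # log likelihood ratio gained per yes
B = math.log((1 - P1) / (1 - P0))    # log likelihood ratio gained per no

def log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k yeses."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def upper_tail(c, n, p):
    """Pr(X >= c) for X distributed as Binomial(n, p)."""
    return sum(math.exp(log_pmf(k, n, p)) for k in range(c, n + 1))

def log_lr(x, n):
    """Log likelihood ratio of P1 against P0 for x yeses among n."""
    return x * A + (n - x) * B

def cutoff(log_k, n):
    """Smallest number of yeses whose log likelihood ratio reaches log_k."""
    return max(0, math.ceil((log_k - n * B) / (A - B)))

def mixture_test(n_small=100, n_big=10_000, alpha=0.05):
    """Least strict common threshold giving overall size <= alpha."""
    for c_small in range(n_small + 1):
        c_big = cutoff(log_lr(c_small, n_small), n_big)
        size = 0.5 * (upper_tail(c_small, n_small, P0)
                      + upper_tail(c_big, n_big, P0))
        if size <= alpha:
            power = 0.5 * (upper_tail(c_small, n_small, P1)
                           + upper_tail(c_big, n_big, P1))
            return c_small, c_big, size, power
    raise ValueError("no test of the requested size")

c_small, c_big, size, power = mixture_test()
print("reject with 100 respondents if yeses >=", c_small)
print("reject with 10,000 respondents if yeses >=", c_big)
print("overall size:", round(size, 4), " overall power:", round(power, 4))
print("5,678 yeses rejects:", 5678 >= c_big)
print("fixed-n significance level:", upper_tail(5678, 10_000, P0))
```

On this reading the rejection point for the larger sample lies well above 5,678, because at that count the likelihood ratio actually favours 1/2 over 2/3.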
SR63: Another apparent difficulty for risk functions arose in connection with the widespread use of least squares estimates for normal models. Taking risk as mean square error, James and Stein (1961) found that improvements could be made, whatever the true values of the parameters, by means of a special estimator even when this combined the data of quite unrelated problems. This striking phenomenon may be regarded as providing a criticism of least squares estimation viewed as a form of restriction on {δ}: a Bayesian approach whose prior insists that the problems are indeed unrelated will not allow any pooling of information—but will also not produce least squares estimates.
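The James and Stein phenomenon is easily seen in simulation. The sketch below assumes the simplest setting: ten unrelated normal means, each observed once with known unit variance, so that the least squares estimate is the observation itself. The shrinkage form used is the usual James-Stein estimator for this setting; the particular true values and the function names are arbitrary illustrative choices.

```python
import random

def james_stein(x):
    """Shrink the whole vector of observations towards the origin."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1 - (p - 2) / norm_sq
    return [shrink * v for v in x]

def mean_square_errors(theta, reps=20_000, seed=1):
    """Simulated total mean square error: (least squares, James-Stein)."""
    rng = random.Random(seed)
    ls_total = js_total = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]   # one observation per problem
        js = james_stein(x)
        ls_total += sum((a - t) ** 2 for a, t in zip(x, theta))
        js_total += sum((a - t) ** 2 for a, t in zip(js, theta))
    return ls_total / reps, js_total / reps

# Hypothetical true values for ten unrelated problems.
theta = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, -2.0, 0.3, 0.7]
mse_ls, mse_js = mean_square_errors(theta)
print("least squares:", round(mse_ls, 2), " James-Stein:", round(mse_js, 2))
```

The gain is in total mean square error over the ten problems; an individual component may be estimated worse.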
The above examples of SR were elicited in response to statements of representative statistical methods and are of a somewhat ad hoc, fragmentary character. Are there no general principles that can be brought to bear on any problem of statistical methodology of whatever size and shape? The answer depends very much on the extent to which the ‘uncertainty’ in the problem has been crystallized in the form of an agreed statistical model {Prθ}. Given the latter, the ideas of Birnbaum (1969) and Dawid (1977) deserve wider appreciation.
In Dawid’s terminology, an ‘inference pattern’ is any specified function I(ξ, x) of the two arguments: a ‘potential experiment’ ξ and associated potential data x (the value of variable X). For each ξ in a specified class, a statistical model {Prθ} is provided for X, where the parameter θ indexes the supposed common uncertainty in all the potential experiments considered. These are the defining conditions under which a number of principles require that I be the same for data x in ξ and data x′ in ξ′:
Principle: Conditions for I(ξ, x)=I(ξ′, x′)
‘Distribution’: ξ and ξ′ have the same {Prθ} and x=x′
‘Transformation’: ξ′ is given by a 1–1 transformation t of the data in ξ and x′=t(x)
‘Reduction’: I(ξ, x) is a function of r(x), ξ′ is given by reporting the value of r, and x′=r(x)
‘Ancillarity’: a(X) has a constant distribution (independent of θ), ξ′ is the experiment whose statistical model is the set of probability distributions of X given a(X)=a, a(x)=a and x′=x
‘Sufficiency’: ξ′ reports the value of a sufficient statistic t(x) and x′=t(x)
‘Likelihood’: the likelihood functions of θ, given x in ξ and given x′ in ξ′, are proportional
There are implications among such principles so that if one accepts the weaker looking ones, one is then obliged to accept the stronger ones—such as the likelihood principle. Very many statistical methods violate the latter.
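A standard textbook illustration, not taken from the sources cited here, shows how the achieved significance level of SM3 can violate the likelihood principle. Suppose a coin gives 9 heads and 3 tails, and H0 is that the coin is fair. If 12 tosses were fixed in advance the model is binomial; if tossing continued until the third tail it is negative binomial. Both likelihoods are proportional to θ⁹(1−θ)³, yet the two significance levels differ. The function names are illustrative.

```python
import math

def p_fixed_tosses(heads, n):
    """Pr(at least `heads` heads in n tosses of a fair coin)."""
    return sum(math.comb(n, k) for k in range(heads, n + 1)) / 2**n

def p_fixed_tails(heads, tails):
    """Pr(at least `heads` heads before the `tails`-th tail), fair coin."""
    # Same event: at most tails - 1 tails among the first heads + tails - 1 tosses.
    m = heads + tails - 1
    return sum(math.comb(m, j) for j in range(tails)) / 2**m

p_binomial = p_fixed_tosses(9, 12)
p_negative_binomial = p_fixed_tails(9, 3)
print("12 tosses fixed in advance:", round(p_binomial, 4))
print("toss until the third tail: ", round(p_negative_binomial, 4))
```

The two values fall on opposite sides of 0.05, so an inference pattern based on them depends on the stopping rule and not only on the likelihood function.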
When there is no agreed statistical model, however, SR cannot receive the (occasionally doubtful) benefit of mathematical support. Perhaps as a consequence, it has not received much attention in the literature, except in the popular texts excellently represented by Huff (1973) or the occasional philosophical article (most philosophical discussions of SR are implicitly model-dependent). At this pre-modelling level, there is a broad consensus among the statistically minded as to what constitutes poor SR: it is much more difficult to characterize good SR. The latter is required to avoid the elementary logical pitfalls but has to go well beyond that in constructive directions. A paradoxical snag in statistical thinking about some problems is how to recognize that the data are inadequate to support such thinking: imaginative SR is often needed to specify the kind of data needed to support the embryonic inferences being formulated.
Pre-modelling SR stands to gain much from the recent advances in ‘descriptive statistics’ largely associated with the work of Tukey (1977). The techniques of ‘exploratory data analysis’ and ‘computer-intensive methods’ (Diaconis and Efron 1983) extend the range of statistical activity ultimately subject to SR scrutiny but, at the same time, they enhance the risks that SR will be neglected by methodologists fascinated by the complexity of such techniques.
M.Stone
University of London
References
Birnbaum, A. (1969) ‘Concepts of statistical evidence’, in S.Morgenbesser et al. (eds) Philosophy, Science and Method: Essays in Honor of E.Nagel, New York.
Blalock, H.M. (1972) Causal Models in the Social Sciences, London.
Campbell, D.T. (1974) ‘Evolutionary epistemology’, in P.A.Schilpp (ed.) The Philosophy of Karl Popper, La Salle, IL.
Cohen, J. (1969) Statistical Power Analysis for the Behavioral Sciences, New York.
Cox, D.R. (1958) ‘Some problems connected with statistical inference’, Annals of Mathematical Statistics 29.
Dawid, A.P. (1977) ‘Conformity of inference patterns’, in J.R.Barra et al. (eds) Recent Developments in Statistics, Amsterdam.
Diaconis, P. and Efron, B. (1983) ‘Computer-intensive methods in statistics’, Scientific American 248.
Fishburn, P.C. (1970) Utility Theory for Decision Making, Publications in Operations Research 18, New York.
Huff, D. (1973) How to Lie with Statistics, Harmondsworth.
James, W. and Stein, C. (1961) ‘Estimation with quadratic loss’, Proceedings of the 4th Berkeley Symposium of Mathematical Statistics and Probability 1.
Lindley, D.V. (1957) ‘A statistical paradox’, Biometrika 44.
Tukey, J.W. (1977) Exploratory Data Analysis, Reading, MA.
Zellner, A. (1980) Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys, Amsterdam.