As commonly applied, the criticism that studies are inconsistent has several implications, all of them interpreted as suggesting that the hypothesized association is not present: (1) No association is present, but random error or unmeasured biases have generated the appearance of an association in some but not all studies. (2) No conclusions whatsoever can be drawn from the set of studies regarding the presence or absence of an association. (3) The literature is methodologically weak and pervasive methodologic problems are the source of the disparate study findings. An equally tenable explanation is that the studies vary in quality and that the strongest of the studies correctly identify the presence of an association and the methodologically weaker ones do not, or vice versa. Unfortunately, the observation of inconsistent results per se, without information on the characteristics of the studies that generated the results and the nature of the inconsistency, conveys very little information about the quality of the literature, whether inferences are warranted, and what those inferences should be. Inconsistencies across studies can arise for so many reasons that without further scrutiny the observation has little meaning.
Random error alone inevitably produces inconsistency in the exact measures of effect across studies. If the overall association is strong, then such deviations may not detract from the overall appearance of consistency. For example, if a series of studies of tobacco use and lung cancer generate risk ratios of 7.0, 8.2, and 10.0, we may legitimately interpret the results as consistent. In contrast, in a range of associations much closer to the null value, or truly null associations, fluctuation of equal magnitude might well convey the impression of inconsistency. Risk ratios of 0.8, 1.1, and 1.5 could well be viewed as inconsistent, with one positive and two negative studies, yet the studies may be estimating the same parameter, differing only due to random error. When the precision of one or more of the studies is limited, the potential for random error to create the impression of inconsistency is enhanced. While the pursuit of substantive explanations for inconsistent findings is worth undertaking, the less intellectually satisfying but often plausible explanation of random error should also be seriously entertained. Results that fluctuate within a relatively modest range do not suggest that the studies are flawed, but rather may simply suggest that the true measure of the association is somewhere toward the middle of the observed range and the scatter reflects random error. Conversely, substantial variability in findings across studies should not immediately be assumed to result from random error, but random error should be included among the candidate contributors, particularly when confidence intervals are wide.
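The point can be made concrete with a minimal sketch. The cell counts below are hypothetical, chosen only so that three cohort studies yield risk ratios of exactly 0.8, 1.1, and 1.5; the Wald confidence interval for the risk ratio then shows that all three intervals cover a common value near the middle of the range, so the apparent inconsistency is compatible with random error alone.

```python
import math

def rr_ci(a, n1, c, n0, z=1.96):
    """Risk ratio and Wald 95% CI for a cohort study:
    a cases among n1 exposed, c cases among n0 unexposed."""
    p1, p0 = a / n1, c / n0
    rr = p1 / p0
    se = math.sqrt((1 - p1) / a + (1 - p0) / c)  # SE of ln(RR)
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical studies yielding risk ratios of 0.8, 1.1, and 1.5
studies = [(40, 500, 50, 500), (55, 500, 50, 500), (75, 500, 50, 500)]
for a, n1, c, n0 in studies:
    rr, lo, hi = rr_ci(a, n1, c, n0)
    print(f"RR = {rr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# Each of the three intervals contains 1.1, consistent with all
# three studies estimating the same underlying parameter.
```

Examining the intervals rather than the point estimates is what reveals the compatibility of the three results.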
Those who compile study results will sometimes tally the proportion of the studies that generate positive or negative associations, or count the number of studies that produce statistically significant associations. While there are ways to infer whether the count of studies deviates from the expectation under the null (Poole, 1997), it is far preferable to examine the actual measures of effect and associated confidence intervals. Counting the proportion of studies with relative risks above or below the null sacrifices all information on the magnitude of effect and variation among the studies generating positive and inverse associations. A focus on how many were statistically significant hopelessly confounds magnitude of effect with precision. A series of studies with identical findings, for example, all yielding risk ratios of 1.5, could well yield inconsistent findings with regard to statistical significance due to varying study size alone. Variability in study size is one easily understood basis for inconsistency due to its effect on precision. As suggested in Chapter 10, statistical significance is of little value in interpreting the results of individual studies, and the problems with using it are compounded if applied to evaluating the consistency of a series of studies.
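A short sketch, again with hypothetical counts, illustrates how significance confounds magnitude with precision: two cohort studies with an identical risk ratio of 1.5 produce opposite verdicts under a 0.05 threshold purely because of study size.

```python
import math

def wald_p(a, n1, c, n0):
    """Two-sided Wald p-value for the null ln(RR) = 0 in a cohort
    study: a cases among n1 exposed, c cases among n0 unexposed."""
    p1, p0 = a / n1, c / n0
    log_rr = math.log(p1 / p0)
    se = math.sqrt((1 - p1) / a + (1 - p0) / c)
    z = abs(log_rr) / se
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail area

# Both studies observe RR = 1.5; only the sample sizes differ.
p_small = wald_p(15, 100, 10, 100)      # 200 participants
p_large = wald_p(150, 1000, 100, 1000)  # 2000 participants
print(f"small study: RR = 1.5, p = {p_small:.2f}")
print(f"large study: RR = 1.5, p = {p_large:.4f}")
# Only the large study is 'statistically significant', although the
# two studies agree exactly on the magnitude of the association.
```

A significance tally would score these as one positive and one negative study, when in fact they are perfectly consistent.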
Another mechanism by which a series of methodologically sound studies could yield inconsistent results is if the response to the agent in question truly differs across populations, i.e., there is effect measure modification. For example, in a series of studies of alcohol and breast cancer, one might find positive associations among premenopausal but not postmenopausal women, with both sets of findings consistent and valid. Some studies may include all or a preponderance of postmenopausal women and others predominantly premenopausal women. If the effect of alcohol varies by menopausal status, then the summary findings of those studies will differ as well. Whereas the understanding of breast cancer has evolved to the point that there is recognition of the potential for distinctive risk factors among premenopausal and postmenopausal women, for many other diseases the distinctiveness of risk factors in subgroups of the population is far less clear. Where sources of true heterogeneity are present, and the studies vary in the proportions of participants in those heterogeneous groups, the results will inevitably be inconsistent. All studies, however, may well be accurate in describing an effect that occurs only or to a greater extent in one subpopulation.
This differing pattern of impact across populations is one illustration of effect modification. In the above example, it is based on menopausal status. Analogous heterogeneity of results might occur as a function of baseline risk. For example, in studies of alcohol and breast cancer, Asian-American women, who generally have lower risk, might have a different vulnerability to the effects of alcohol compared to European-American women, who generally have higher risk. The prevalence of concomitant risk factors might modify the effect of the exposure of interest. If the frequency of delayed childbearing, which confers an increased risk of breast cancer, differed across study populations and modified the effect of alcohol, the results would be heterogeneous across populations that differed in their childbearing practices.
Where strong interaction is present, the potential for substantial heterogeneity in study results is enhanced. For example, in studies examining the effect of alcohol intake on oral cancers, the prevalence of tobacco use in the population will markedly influence the effect of alcohol. Because of the strong interaction between alcohol and tobacco in the etiology of oral cancer, the effect of alcohol intake will be stronger where tobacco use is greatest. If there were complete interaction, in which alcohol was influential only in the presence of tobacco use, alcohol would have no effect in a tobacco-free population, and a very strong effect in a population consisting of all smokers. Even with less extreme interaction and less extreme differences in the prevalence of tobacco use, there will be some degree of inconsistency across studies in the observed effects of alcohol use on oral cancer. If we were aware of this interaction, of course, we would examine the effects of alcohol within strata of tobacco use and determine whether there is consistency within those homogeneous risk strata. On the other hand, if unaware of the interaction and differing prevalence of tobacco use, we would simply observe a series of inconsistent findings.
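The complete-interaction scenario can be worked through numerically. The risks and prevalences below are hypothetical, chosen only for illustration: alcohol triples oral-cancer risk among smokers and has no effect among nonsmokers, and alcohol use is assumed independent of smoking within each population. The crude risk ratio for alcohol then differs across populations purely because their smoking prevalence differs, while the stratum-specific risk ratios are identical everywhere.

```python
# Hypothetical risk model: alcohol triples oral-cancer risk among
# smokers only; nonsmokers' risk is unchanged by alcohol.
R_NONSMOKER = 0.01
R_SMOKER = 0.02
RR_ALCOHOL_IN_SMOKERS = 3.0

def crude_rr(smoking_prev):
    """Crude RR for alcohol in a population with the given smoking
    prevalence, assuming alcohol use is independent of smoking."""
    risk_drinkers = (smoking_prev * R_SMOKER * RR_ALCOHOL_IN_SMOKERS
                     + (1 - smoking_prev) * R_NONSMOKER)
    risk_nondrinkers = (smoking_prev * R_SMOKER
                        + (1 - smoking_prev) * R_NONSMOKER)
    return risk_drinkers / risk_nondrinkers

print(f"crude RR for alcohol, 20% smokers: {crude_rr(0.2):.2f}")
print(f"crude RR for alcohol, 80% smokers: {crude_rr(0.8):.2f}")
# The crude results look inconsistent across the two populations,
# yet within strata the RR is 3.0 in smokers and 1.0 in nonsmokers
# in both, so the stratified results would be fully consistent.
```

Stratifying on tobacco use, as the text suggests, is what recovers the consistency that the crude comparisons obscure.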
There is growing interest in genetic markers of susceptibility, particularly in studies of cancer and other chronic diseases (Perera & Santella, 1993; Tockman et al., 1993; Khoury, 1998). These markers reflect differences among individuals in the manner in which they metabolize exogenous exposures, and should help to explain why some individuals and not others respond to exposure to the same agent. If the proportion that is genetically susceptible varies across populations, then the measured and actual effect of the exogenous agent will vary as well. These molecular markers of susceptibility are not conceptually different from markers like menopausal status, ethnicity, or tobacco use, although the measurement technology differs. All provide explanations for why a specific agent may have real but inconsistent effects across populations.
Until this point, we have considered only inconsistent results among a set of perfectly designed and conducted studies that differ from one another solely due to random error or true differences in the effect. Introducing methodological limitations and biases offers an additional set of potential explanations for inconsistent results. By definition, biases introduce error in the measure of effect. Among an array of studies of a particular topic, if the extent and mix of biases varies across studies, results will vary as well. That is, if some studies are free of a particular form of bias and other studies are plagued to a substantial degree by that bias, then results will be inconsistent across those sets of studies. Susceptibility to bias needs to be examined on a study-by-study basis, and considered among the candidate explanations for inconsistent results. In particular, if there is a pattern in which the findings from studies that are most susceptible to a potentially important bias differ from those of studies that are least susceptible, then the results will be inconsistent but highly informative. The studies that are least susceptible to the bias would provide a more accurate measure of the association.
In order to make an assessment of the role of bias in generating inconsistent results, the study methods must be carefully scrutinized, putting results aside. Depending on preconceptions about the true effect, there may be a temptation to view those studies that generate positive or null results as methodologically superior because they yielded the right answer. In fact, biases can distort results in either direction, so that unless truth is known in advance, the results themselves give little insight regarding the potential for bias in the study. Knowing that a set of studies contains mixed positive and null findings tells us nothing about which of them is more likely to be correct or whether all are valid or all are in error. In particular, there is no logical reason to conclude from such an array of results that the null findings are most likely to be correct, by default—mixed findings do not provide evidence to support the hypothesis of no effect. The demand on the interpreter of such evidence is to assess which are the stronger and weaker studies and examine the patterns of results in relation to those methodologic attributes.