There has been a dramatic rise in interest in, and methodology for, the formal, quantitative integration of evidence across studies, generally referred to as meta-analysis (Petitti, 1994; Greenland, 1987, 1998). In the biomedical literature, much of the motivation comes from a desire to integrate evidence across a series of small clinical trials. The perceived problem these tools were intended to address is that individual trials often lack sufficient statistical power to detect small benefits, whereas if the evidence could be integrated across studies, statistical power would be enhanced. If subjected to formal tests of statistical significance, which is the norm in assessing the outcome of a clinical trial, many individual trials are too small to detect clinically important benefits as statistically significant. When such non-significant tendencies are observed across repeated studies, there is an interest in assessing what the evidence says when aggregated. Note that the intended benefits are focused on reducing random error through aggregation of results, implicitly or explicitly assuming that the individual studies are otherwise compatible with regard to methods and free from other potential study biases.
The value of this effort to synthesize rather than merely describe the array of results presumes an emphasis on statistical hypothesis testing. A rigid interpretation of statistical testing can and does produce situations in which a series of small studies, all pointing in the same direction (for example, toward a small benefit of treatment), leads to the conclusion that each of the studies found no effect (based on significance testing). If the evidence from that same series of studies were combined and summarized with a pooled estimate of effect, evidence of a statistically significant benefit would generate a very different conclusion than the studies taken one at a time. Obviously, if a series of small studies shows similar benefit, those who are less bound by adherence to statistical testing may well infer that the treatment appears to confer a benefit without the need to assess the statistical significance of the array of results. Those who wish to compare the array of results to a critical p-value, however, are able to do so. In fact, as discussed below in the section on "Interpreting Consistency and Inconsistency," the consistency across studies with at least slightly different methods and the potential for different biases might actually provide greater confidence in a true benefit. Identically designed and conducted studies may share identical biases and show similar effects across the studies due to those shared errors.
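To see how a series of individually non-significant studies can yield a significant pooled result, consider a minimal sketch of standard fixed-effect (inverse-variance) pooling. The effect estimates and standard errors below are hypothetical, invented purely for illustration; each study alone falls short of the conventional |z| > 1.96 threshold, yet the weighted average does not.

```python
import math

# Hypothetical log risk ratios and standard errors from five small trials
# (illustrative values only). Each trial alone is non-significant.
log_rr = [-0.20, -0.25, -0.15, -0.22, -0.18]
se     = [ 0.15,  0.16,  0.14,  0.15,  0.16]

# Fixed-effect (inverse-variance) pooling: weight each study by 1/SE^2.
weights = [1.0 / s**2 for s in se]
pooled = sum(w * y for w, y in zip(weights, log_rr)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

z_individual = [y / s for y, s in zip(log_rr, se)]
z_pooled = pooled / pooled_se

print([round(z, 2) for z in z_individual])  # each |z| < 1.96
print(round(z_pooled, 2))                   # pooled |z| > 1.96
```

The pooled z-statistic crosses the significance threshold even though no single study does, which is precisely the aggregation benefit (and, as the text argues, precisely the step that presumes the studies are methodologically comparable).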
As discussed in Chapter 10, in well-designed and well-executed randomized trials, the focus on random error as the primary source of erroneous inferences may be justified. That is, if the principles of masked, objective assessment of outcome are followed, and an effective randomization procedure is employed to ensure that baseline risk does not differ across exposure groups, the major threat to generating valid results is a failure of the random allocation mechanism to yield groups that are comparable at baseline. Generating a p-value addresses the probability that the random allocation mechanism has generated an aberrant sample under the assumption that there is no true difference between the groups. Thus, repetition of the experiment under identical conditions can be used to address and reduce the possibility that there is no benefit of treatment and that the allocation of exposure to groups has, by chance, generated such a pattern of results. A series of small, identical randomized trials will yield a distribution of results, and the integration of results across those trials would provide the best estimate of the true effect. In the series of small studies, the randomization itself may not be effective, although the deviation in results from such randomization should be symmetrical around the true value. Integration of information across the studies should help to identify the true value around which the findings from individual studies cluster.
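The behavior described above can be demonstrated with a small simulation, a sketch under assumed values: true event risks of 0.10 (control) and 0.08 (treatment), so the true risk difference is -0.02, and 50 identically designed trials of 200 subjects per arm. Individual trial estimates scatter symmetrically around the truth; their aggregate lands much closer to it.

```python
import random
random.seed(1)

TRUE_P_TREAT, TRUE_P_CONTROL = 0.08, 0.10  # assumed true event risks
N_PER_ARM, N_TRIALS = 200, 50              # many small, identical trials

def run_trial():
    # Randomized allocation: each arm is an independent binomial sample.
    events_t = sum(random.random() < TRUE_P_TREAT for _ in range(N_PER_ARM))
    events_c = sum(random.random() < TRUE_P_CONTROL for _ in range(N_PER_ARM))
    return events_t / N_PER_ARM - events_c / N_PER_ARM  # risk difference

estimates = [run_trial() for _ in range(N_TRIALS)]

# Because the trials are identical in size and design, a simple average
# serves as the pooled estimate; it clusters near the true value (-0.02)
# even though single trials deviate substantially in either direction.
mean_estimate = sum(estimates) / N_TRIALS
print(round(mean_estimate, 3))
```

Any one small trial can, by chance, suggest harm or a large benefit; it is only the distribution across trials that centers on the true effect, which is the justification for aggregation under identical conditions.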
The randomized trial paradigm and assumptions have been articulated because the direct application of this reasoning to observational studies is often problematic, sometimes severely so. Just as the framework of statistical hypothesis testing has limited applicability to a single epidemiologic study, the framework of synthetic meta-analysis has limited applicability to a set of observational studies.
Observational studies are rarely if ever true replications of one another. The populations in which the studies are conducted differ, and thus the presence of potential effect-modifiers differs as well. The tools of measurement are rarely identical, even for relatively simple constructs such as assessment of tobacco use or occupation. Exact methods of selecting and recruiting subjects differ, and the extent and pattern of nonparticipation varies. Susceptibility to confounding will differ whenever the underlying mechanism of exposure assignment differs. Thus, the opportunity to simply integrate results across a series of methodologically identical studies does not exist in observational epidemiology. Glossing over these differing features of study design and conduct, and pretending that only random error accounts for variability among studies is more likely to generate misleading than helpful inferences.
Closely related to this concern is the central role assigned to statistical power and random error in the interpretation of study results. The fundamental goal of integrating results is to draw more valid conclusions by taking advantage of the evidence from having several studies of a given topic rather than a single large study. While enhanced precision from the larger number of subjects accrued in multiple studies is an asset, the more valuable source of insight is often the opportunity to understand the influence of design features on study results. This can only be achieved by having multiple studies of differing character and scrutinizing the pattern that emerges, not suppressing it through a single synthetic estimate. Imagine two situations, one with a single study of 5000 cases of disease in a cohort of 1,000,000 persons, and the other a series of 10 studies with 500 cases each from cohorts of 100,000 persons. The single, extremely precise study would offer limited opportunity to learn from the methodologic choices that were made since a single protocol would have been followed. Differing approaches to measurement of exposure and disease, control of confounding, and modification of the estimated effect by covariates would be limited because of the lack of diversity in study methods. In contrast, the variation in methodologic decisions among the 10 studies would provide an opportunity to assess the pattern of results in relation to methods. With variability in attributes across studies (viewed as a limitation or barrier to deriving a single estimate), one can gain an understanding of how those study features influence the results (an advantage in evaluating hypotheses of bias and causality).
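The statistical equivalence of the two scenarios in the paragraph above can be checked directly. As a rough index of precision, take the standard error of a simple proportion, sqrt(p(1-p)/n); inverse-variance pooling of ten identical studies divides the variance by ten, exactly recovering the precision of the single large study. This is a sketch using the hypothetical numbers from the text, not an analysis of any real cohort.

```python
import math

# Hypothetical scenario from the text: one cohort of 1,000,000 with
# 5,000 cases, versus ten cohorts of 100,000 with 500 cases each.
p = 5000 / 1_000_000  # cumulative incidence, 0.005 in both designs

# Precision of the single large study.
se_single = math.sqrt(p * (1 - p) / 1_000_000)

# Precision of one small study, then of the ten pooled: pooling k
# identical studies divides the variance by k.
se_small = math.sqrt(p * (1 - p) / 100_000)
se_pooled = math.sqrt(se_small**2 / 10)

print(se_single, se_pooled)  # equal: pooling restores the precision
```

Precision, in other words, is the one thing the ten-study design does not sacrifice; what it adds, as the paragraph argues, is the methodologic variability needed to probe how design choices shape results.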