Focusing on coherence between cases and controls emphasizes that, when the two groups cannot be integrated to yield valid measures of the association between exposure and disease, the fault lies not with either group alone but with their composition relative to one another. Thus, there is no such thing as poorly constituted cases or poorly constituted controls, only groups that are incoherent with one another. In practice, once one group, cases or controls, has been operationally defined, the desired attributes of the other are defined as well, and the challenge becomes the practical one of meeting that conceptual goal. Miettinen (1985) coined the terms primary study base and secondary study base. With a primary study base, the population-time experience that produces the cases is explicitly demarcated by calendar periods and geographic boundaries. In such instances, the challenge is to fully ascertain the cases that arise from within that study base and to accurately sample controls from it. A secondary base corresponds to a given set of cases identified more by convenience, such as those that appear and are diagnosed at a given hospital, posing the challenge of finding a means of properly sampling controls from the ill-defined study base.
In reality, there is a continuum of clarity in the definition of study bases, with the goal being the identification of a study base that lends itself to selection of coherent cases and controls. The scope chosen for the study base itself can make coherent case and control selection more or less difficult. It may be more useful to focus on identifying a coherent base for both cases and controls than to first make selection of one group as easy as possible and then worry about how to select the other. The ability to formally define the geographic and temporal scope of a study base is less critical than the practical ability to identify all the cases that are produced in a study base and to have some way to properly sample controls from it.
Coherence may sometimes be achieved by restricting the constitution of one of the groups to make the task easier. For example, in a study of pregnancy outcome based in prenatal care clinics, the case group may include women who began normal pregnancies and developed the disease of interest, e.g., pregnancy-induced hypertension, as well as women who began prenatal care elsewhere and were referred to the study clinic because they developed medical problems, including the one of interest. The source of referrals is very difficult to identify with clarity, since it depends on financial incentives, patient and physician preferences, etc. Therefore, one option would be to simply exclude those referred from other prenatal care providers from the case group and thereby from the study base itself, and instead study non-referred cases and a sample of patients who enrolled in prenatal care at the study settings without necessarily being at high risk. Note that the problem is not in identifying referred cases, which is straightforward, but rather in sampling from the ill-defined pool of pregnant women who would, if they had developed health problems, have been referred to the study clinics. Restricting cases and controls to women who began their care in the study clinics improves the ability to ensure that they are coherent.
In this example, as in most case-control studies, the burden is placed on proper selection of controls given a roster of cases. Selection bias is defined not as having chosen a set of cases from an ill-defined, intractable study base but as being unable to identify and sample controls from that base. Given a set of cases, we ask whether the chosen controls accurately reflect the prevalence of exposure in the study base that generated those cases, regardless of how complex the definition of that study base may be. One of the primary reasons to conduct case-control studies is the rarity of disease: a full roster of cases plus a sample from the study base is more efficient than consideration of the entire study base, as is done in cohort studies. Given the goal of including as many cases as possible to generate precise estimates, solutions that require omitting sizable numbers of cases, such as the one proposed for referrals to prenatal care clinics, are likely to be unattractive. Typically, all possible cases are sought, and the search is then made for suitable controls.
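The point that controls must reflect the exposure prevalence in the study base can be illustrated with simple arithmetic. The following is a minimal sketch, not a method taken from the text: the function names and every count and prevalence in it are hypothetical, chosen only to show how the exposure odds ratio shifts when the controls' exposure prevalence departs from that of the base.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Exposure odds ratio from a 2x2 case-control table."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def expected_controls(n_controls, exposure_prevalence):
    """Expected (exposed, unexposed) control counts for a given exposure prevalence."""
    exposed = n_controls * exposure_prevalence
    return exposed, n_controls - exposed


# Hypothetical roster of cases: 200 cases, 80 of them exposed.
exposed_cases, unexposed_cases = 80, 120

# Assumed exposure prevalence in the study base that produced the cases.
base_prevalence = 0.25

# Coherent controls: their exposure prevalence matches the study base.
coherent = expected_controls(400, base_prevalence)

# Incoherent controls: the sampling mechanism over-represents exposed
# persons (assumed prevalence of 0.40 among those actually sampled).
incoherent = expected_controls(400, 0.40)

or_coherent = odds_ratio(exposed_cases, unexposed_cases, *coherent)
or_incoherent = odds_ratio(exposed_cases, unexposed_cases, *incoherent)

print(f"OR with coherent controls:   {or_coherent:.2f}")
print(f"OR with incoherent controls: {or_incoherent:.2f}")
```

With these assumed numbers, the same set of cases yields an odds ratio of 2.0 against coherent controls and 1.0 against the incoherent ones. Nothing about the cases changed; only the relation of the controls to the study base did.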
The more idiosyncratic the control sampling method and the more it deviates from a formal, random sample from the study base, the more scrutiny it requires. Sometimes, we have a clearly defined, enumerated study base, as in case-control studies fully nested within a defined cohort. When cases of disease arise in a population enrolled in a health maintenance organization, there is often a data set that specifies when each individual joins and leaves the program. Those are precisely the individuals who would have been identified as cases if they had developed the condition of interest (aside from those persons who for some reason forgo the benefits of the plan and seek their care elsewhere). Medical care coverage affords the opportunity to comprehensively define the population at risk and thus sample from it. One of the primary strengths of epidemiologic studies in health maintenance organizations is the availability of a roster of persons who receive care and are thus clearly at risk of becoming identified as cases.
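When an enrollment roster with join and leave dates exists, sampling controls from the study base becomes a mechanical task. The sketch below shows one common way to do it, drawing controls at random from members enrolled on the day a case is diagnosed. It is an illustration under stated assumptions rather than a procedure prescribed by the text: the `Member` record, the day-count time scale, and the tiny roster are all hypothetical, and the sketch ignores refinements such as excluding members who have already become cases.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    member_id: str
    joined: int  # first day of enrollment (hypothetical day-count scale)
    left: int    # day of disenrollment; the member is no longer covered on this day


def enrolled_on(roster, day):
    """Members whose coverage includes the given day."""
    return [m for m in roster if m.joined <= day < m.left]


def sample_controls(roster, case_id, diagnosis_day, n_controls, rng):
    """Randomly sample controls from members enrolled on the case's diagnosis day."""
    eligible = [m for m in enrolled_on(roster, diagnosis_day) if m.member_id != case_id]
    return rng.sample(eligible, min(n_controls, len(eligible)))


# Hypothetical roster; every value is invented for illustration.
roster = [
    Member("A", 0, 365),
    Member("B", 100, 400),
    Member("C", 0, 50),     # left before the diagnosis day
    Member("D", 200, 365),  # joined after the diagnosis day
    Member("E", 30, 365),
]

rng = random.Random(0)  # seeded so the draw is reproducible
controls = sample_controls(roster, case_id="A", diagnosis_day=150, n_controls=2, rng=rng)
print(sorted(m.member_id for m in controls))
```

Only members B and E are enrolled alongside case A on day 150, so they are the only eligible controls. Members C and D are excluded because they were not part of the study base at the moment the case arose.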
As we move away from clear, enumerated sampling frames, the problems of control selection become more severe. Even in studies conducted within a defined geographic area, there is the challenge of identifying all cases of disease that occur in the area. Doing so is easier for some conditions than for others. Several diseases are fully enumerated by registries in defined geographic areas, most notably cancer and birth defects. Vital records provide a complete roster of births, and thus certain associated birth outcomes, and of deaths, including cause of death. For most conditions of interest, however, there is not a geographically based register in place. Chronic diseases such as diabetes, myocardial infarction, or osteoporosis require essentially developing one's own register to fully ascertain cases in a geographically defined population, tabulating information from all medical care providers, developing a systematic approach to defining cases, etc. Beyond the potential difficulties in identifying all cases from a given region in most countries, probability sampling from geographically defined populations is extremely difficult and becoming more difficult over time. As privacy restrictions increase, accessibility of such data sources as drivers' license rosters is becoming more limited. Furthermore, public wariness manifested by increased proportions of unlisted telephone numbers and use of telephone answering machines to screen calls has made telephone-based sampling more fallible. At their best, the available tools such as random-digit dialing telephone sampling, neighborhood canvassing, or use of drivers' license rosters are far from perfect, even before contending with the problems of non-response that follow. Conceptually, a geographically defined study base is attractive, but it may not be so on logistical grounds.
Sampling from the study base that generates patients for a particular hospital or medical practice raises even more profound concerns. The case group is chosen for convenience and constitutes the benchmark for coherent control sampling, but the mechanisms for identifying and sampling from the study base are daunting. Without being able to fully articulate the subtleties of medical care access, preference, and care-seeking behavior, diseased controls are often chosen on the assumption that they experience precisely the same selection forces as the cases of interest. To argue that choosing patients hospitalized for non-malignant gastrointestinal disease, for example, constitutes a random sample from the population that produced the cases of osteoporotic hip fracture may be unpersuasive on both theoretical and empirical grounds. Such strategies are rarely built on careful logic and there is no way to evaluate directly whether they have succeeded or failed, even though by good fortune they may yield valid results. Their potential value would be enhanced if it could be demonstrated that the other diseases are not related to the exposure of interest and that the sources of cases are truly identical.
Control selection by still more conceptually convoluted means, such as friend controls, is likewise not amenable to direct assurance that the controls represent the appropriate study base. We need to ask whether friend controls would have been enrolled as cases in the study had they developed the condition of interest and whether they constitute a random sample of such persons. Viewed from the perspective of sampling, it is not obvious that such methods will yield a representative sample with respect to exposure. When the procedure seems like an odd way of sampling the study base, attention should be focused on the ultimate question of whether the controls are odd in the only way that actually matters: do they reflect the exposure prevalence in the study base that produced the cases? That question is synonymous with "Is selection bias present?"