Assignment: Using Epidemiology to Evaluate Health Services
From Gordis, for Assignments (NOT Discussion): Review all problems and answers in Chapters 16, 17, and 18; there are no problems to submit. Instead of problems, post the following to Assignments (NOT Discussion): (1) From Gordis Chapters 17, 18, and 19: on review of each of these 3 chapters, post and discuss any 1 epidemiology concept from each chapter that you better understand (see the highlighted concepts in each chapter). (2) Include a possible use of your example (from each chapter) in future practice (primary care at a community health clinic). (3) Offer any concepts you plan to explore further.
Post as:
- CH 17: post and discuss any 1 epi concept in that chapter, with 1 future application of the concept.
- CH 18: post and discuss any 1 epi concept in that chapter, with 1 future application of the concept.
- CH 19: post and discuss any 1 epi concept in that chapter, with 1 future application of the concept.
Gordis CHAPTER 17
Using Epidemiology to Evaluate Health Services
Keywords
measures of process and outcome; efficacy, effectiveness, and efficiency; outcomes research; avoidable mortality; Healthy People 2020 health indicators
LEARNING OBJECTIVES
- To distinguish measures of process from measures of outcome, and to discuss some commonly used measures of outcome in health services research.
- To define efficacy, effectiveness, and efficiency in the context of health services.
- To compare and contrast epidemiologic studies of disease etiology with epidemiologic studies evaluating health services.
- To discuss outcomes research in the context of ecologic data, and to present some potential biases in epidemiologic studies that emerge when evaluating health services using group-level data.
- To describe some possible study designs that can be used to evaluate health services using individual-level data, including randomized and nonrandomized designs.
Perhaps the earliest example of an evaluation is the description of creation given in the book of Genesis 1:1–4, which is shown in the original Hebrew in Fig. 17.1. Translated, with the addition of a few subheadings, it reads as follows:
BASELINE DATA
In the beginning God created the heaven and the earth. And the earth was unformed and void and darkness was on the face of the deep.
IMPLEMENTATION OF THE PROGRAM
And God said, “Let there be light.” And there was light.
EVALUATION OF THE PROGRAM
And God saw the light, that it was good.
FURTHER PROGRAM ACTIVITIES
And God divided the light from the darkness.
This excerpt includes all of the basic components of the process of evaluation: baseline data, implementation of the program, evaluation of the program, and implementation of new program activities on the basis of the results of the evaluation. However, two problems arise in this description. First, we are not given the precise criteria that were used to determine whether or how the program was “good”; we are told only that God saw that it was good (which, in hindsight, may be sufficient). Second, this evaluation exemplifies a frequently observed problem: the program director is assessing his own program. Both conscious and subconscious biases can arise in evaluation. Furthermore, even if the program director administers the program superbly, he or she may not necessarily have the specific skills that are needed to conduct a methodologically rigorous evaluation of the program.
Dr. Wade Hampton Frost, a leader in epidemiology in the early part of the 20th century, addressed the use of epidemiology in the evaluation of public health programs in a presentation to the American Public Health Association in 1925. 1 He wrote, in part, as follows:
The health officer occupies the position of an agent to whom the public entrusts certain of its resources in public money and cooperation, to be so invested that they may yield the best returns in health; and in discharging the responsibilities of this position he is expected to follow the same general principles of procedure as would be a fiscal agent under like circumstances. …
Since his capital comes entirely from the public, it is reasonable to expect that he will be prepared to explain to the public his reasons for making each investment, and to give them some estimate of the returns which he expects. Nor can he consider it unreasonable if the public should wish to have an accounting from time to time, to know what returns are actually being received and how they check with the advance estimates which he has given them. Certainly any fiscal agent would expect to have his judgment thus checked and to gain or lose his clients’ confidence in proportion as his estimates were verified or not.
However, as to such accounting, the health officer finds himself in a difficult and possibly embarrassing position, for while he may give a fairly exact statement of how much money and effort he has put into each of his several activities, he can rarely if ever give an equally exact or simple accounting of the returns from these investments considered separately and individually. This, to be sure, is not altogether his fault. It is due primarily to the character of the dividends from public health endeavor, and the manner in which they are distributed. They are not received in separate installments of a uniform currency, each docketed as to its source and recorded as received; but come irregularly from day to day, distributed to unidentified individuals throughout the community, who are not individually conscious of having received them. They are positive benefits in added life and improved health, but the only record ordinarily kept in morbidity and mortality statistics is the partial and negative record of death and of illness from certain clearly defined types of disease, chiefly the more acute communicable diseases, which constitute only a fraction of the total morbidity. 1
Dr. Charles V. Chapin commented on Frost’s presentation:
Dr. Frost’s earnest demand that the procedures of preventive medicine be placed on a firm scientific basis is well timed. Indeed, it would have been opportune at any time during the past 40 years and, it is to be feared, will be equally needed for 40 years to come. 2
Chapin clearly underestimated the number of years; the need remains as critical today, some 90+ years later, as it was in 1925.
Studies of Process and Outcome
Avedis Donabedian is widely regarded as the author of the seminal work creating a framework for examining health services in relation to the quality of care. He identified three important factors simultaneously at play: (1) structure, (2) process, and (3) outcome. Structure relates to the physical locations where care is provided, as well as the personnel, equipment, and financing. We will restrict our discussion here to the remaining two components, process and outcome.
Studies of Process
At the outset, we should distinguish between process and outcome studies. Process means that we decide what constitutes the components of good care, services, or preventive actions. Such a decision may first be made by an expert panel. We can then assess a clinic or health care provider by reviewing relevant records or by direct observation, and determine to what extent the care provided meets established and accepted criteria. For example, in primary care we can determine what percentage of patients have had their blood pressure measured. The first problem with such process measures is that they do not indicate whether the patient is better off; for example, monitoring blood pressure does not ensure that the patient’s blood pressure is under control or that the patient will consistently take antihypertensive medications if they are prescribed. The second problem is that, because process assessments are often based on expert opinion, the criteria used in process evaluations may change over time as expert opinion changes. For example, in the 1940s, the accepted standard of care for premature infants required that such infants be placed in 100% oxygen, and incubators were monitored to be sure that such levels were maintained. However, when research demonstrated that high oxygen concentration played a major role in producing retrolental fibroplasia (a form of blindness in children who had been born prematurely), high concentrations of oxygen were subsequently deemed unacceptable.
Studies of Outcome
Given the limitations of process studies, the remainder of this chapter focuses on outcome measures. Outcome denotes whether or not a patient (or a community at large) benefits from the medical care provided. Health outcomes are frequently considered the domain of epidemiology. Although such measures have traditionally been mortality and morbidity, interest in outcomes research in recent years has expanded the measures of interest to include patient satisfaction, quality of life, degree of dependence and disability, and similar measures.
Efficacy, Effectiveness, and Efficiency
Three terms that are often encountered in the literature dealing with evaluation of health services are efficacy, effectiveness, and efficiency. These terms are often used in association with the findings from randomized trials.
Efficacy
Does the agent or intervention “work” under ideal “laboratory” conditions? We test a new drug in a group of patients who have agreed to be hospitalized and who are observed as they take their therapy. Or a vaccine is tested in a group of consenting subjects. Thus, efficacy is measured in a situation in which all conditions are controlled to maximize the effect of the agent. Generally, “ideal” conditions are those that occur when testing a new agent or intervention using a randomized trial.
Effectiveness
If we administer the agent in a “real-life” situation, is it effective? For example, when a vaccine is tested in a community, many individuals may not come in to be vaccinated. Or, an oral medication may have such an undesirable taste that no one will take it (so that it will prove ineffective), despite the fact that under controlled conditions, when compliance was ensured, the drug was shown to be efficacious.
Efficiency
If an agent is shown to be effective, what is the cost–benefit ratio? Is it possible to achieve our goals in a less expensive and better way? Cost includes not only money, but also discomfort, pain, absenteeism, disability, and social stigma.
If a health care measure has not been demonstrated to be effective, there is little point in looking at efficiency, for if it is not effective, the least expensive alternative is not to use it at all. At times, of course, political and societal pressures may drive a program even if it is not effective (an often-cited example is DARE—Drug Abuse Resistance Education, which has never been shown to have an impact on adolescent and young adult drug use). However, this chapter will focus only on the science of evaluation and specifically on the issue of effectiveness in evaluating health services.
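Although the chapter defines these three terms qualitatively, efficacy and effectiveness are commonly quantified the same way, as a relative risk reduction; the difference lies in the conditions under which the risks are measured. A minimal sketch in Python, with entirely hypothetical attack rates:

```python
def relative_risk_reduction(risk_treated: float, risk_untreated: float) -> float:
    """Relative risk reduction: 1 - (risk in treated / risk in untreated)."""
    return 1 - risk_treated / risk_untreated

# Hypothetical trial under ideal conditions (everyone takes the agent as directed):
efficacy = relative_risk_reduction(risk_treated=0.02, risk_untreated=0.10)

# Hypothetical community program (imperfect uptake and adherence dilute the effect):
effectiveness = relative_risk_reduction(risk_treated=0.06, risk_untreated=0.10)

print(f"Efficacy:      {efficacy:.0%}")       # 80%
print(f"Effectiveness: {effectiveness:.0%}")  # 40%
```

The same agent can thus be 80% efficacious in a controlled trial yet only 40% effective in the field, which is exactly the distinction drawn above.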
Measures of Outcome
If efficacy of a measure has been demonstrated—that is, if the methods of prevention and intervention that are of interest have been shown to work—we can then turn to evaluating effectiveness. What guidelines should we use in selecting an appropriate outcome measure to serve as an index of effectiveness? First, the measure must be clearly quantifiable; that is, we must be able to express its effect in quantitative terms. Second, the measure of outcome should be relatively easy to define and diagnose. If the measure is to be used in a population study, we would certainly not want to depend on an invasive procedure for assessing any benefits. Third, the measure selected should lend itself to standardization for study purposes. Fourth, the population served (and the comparison population) must be at risk for the same condition for which an intervention is being evaluated. For example, it would obviously make little sense to test the effectiveness of a sickle cell screening program in a white population in North America (as sickle cell disease primarily affects African Americans).
The type of health outcome end point that we select clearly should depend on the question that we are asking. Although this may seem self-evident, it is not always immediately apparent. Box 17.1 shows possible end points in evaluating the effectiveness of a vaccine program. Whatever outcome we select should be explicitly stated so that others reading the report of our findings will be able to make their own judgments regarding the appropriateness of the measure selected and the quality of the data. Whether the measure we have selected is indeed an appropriate one depends on clinical and public health aspects of the disease or health condition in question.
Box 17.1
Some Possible End Points for Measuring the Success of a Vaccine Program
- Number (or proportion) of people immunized
- Number (or proportion) of people at (high) risk who are immunized
- Number (or proportion) of people immunized who show serologic response
- Number (or proportion) of people immunized and later exposed in whom clinical disease does not develop
- Number (or proportion) of people immunized and later exposed in whom clinical or subclinical disease does not develop
Box 17.2 shows possible choices of measures for assessing the effectiveness of a throat culture program in children. Measures of volume of services provided, numbers of cultures taken, and number of clinic visits have been traditionally used because they are relatively easy to count and are helpful in justifying requests for budgetary increases for the program in the following year. However, such measures are all process measures and tell us nothing about the effectiveness of an intervention. We therefore move to other possibilities listed in this box. Again, the most appropriate measures should depend on the question being asked. The question must be specific. It is not enough just to ask how good the program is.
Box 17.2
Some Possible End Points for Measuring Success of a Throat Culture Program
- Number of cultures taken (symptomatic or asymptomatic)
- Number (or proportion) of cultures positive for streptococcal infection
- Number (or proportion) of persons with positive cultures for whom medical care is obtained
- Number (or proportion) of persons with positive cultures for whom proper treatment is prescribed and taken
- Number (or proportion) of positive cultures followed by a relapse
- Number (or proportion) of positive cultures followed by rheumatic fever
Comparing Epidemiologic Studies of Disease Etiology and Epidemiologic Research Evaluating Effectiveness of Health Services
In classic epidemiologic studies of disease etiology, we examine the possible relationship between a putative cause (the independent variable or “exposure”) and an adverse health effect or effects (the dependent variable or “outcome”). In doing so, we take into account other factors, including health care, that may modify the relationship or confound it (Fig. 17.2A). In health services research, we focus on the health service as the independent variable (the “exposure”), with a reduction in adverse health effects as the anticipated outcome (dependent variable) if the modality of care is effective. In this situation, environmental and other factors that may influence the relationship are also taken into account (see Fig. 17.2B). Thus, both etiologic epidemiologic research and health services research address the possible relationship between an independent variable and a dependent variable, and the influence of other factors on the relationship. Therefore, it is not surprising that many of the study designs discussed are common to both epidemiologic and health services research, as are the methodologic problems and potential biases that may characterize these types of studies.
FIG. 17.2 (A) Classic epidemiologic research into etiology, taking into account the possible influence of other factors, including health care. (B) Classic health services research into effectiveness, taking into account the possible influence of environmental and other factors.
Evaluation Using Group Data
Regularly available data, such as mortality data and hospitalization data, are often used in evaluation studies. Such data can be obtained from different sources, and such sources may differ in important ways. For example, Fig. 17.3 shows trends in the estimated proportion of the US population with influenza-like illness (ILI) over time, using three different data sources: sentinel surveillance sites overseen by the Centers for Disease Control and Prevention (CDC), Google Flu Trends, and Flu Near You. 3
FIG. 17.3 Estimated proportion of US population with influenza-like illness January 2011–13. CDC, Centers for Disease Control and Prevention. (From Butler D. When Google got flu wrong. Nature. 2013;494:155–156.)
Although the trends are fairly similar in this time period, we can see that Google Flu Trends estimated a higher proportion of the US population with ILI toward the end of 2012, nearly twice as high as the CDC estimates. This discrepancy is potentially attributable to the differing data collection methodology of each source. The CDC generates its data from over 2,700 health care centers that capture over 30 million patient visits each year. Google Flu Trends uses data mining and modeling based on the flu-related search terms entered in Google’s search engine. Flu Near You uses data entered by internet users, not necessarily physicians, who volunteer to report on a weekly basis whether they, or their family members, have ILI symptoms. It is possible that not all individuals who develop ILI symptoms will seek medical care, and hence are not captured by the CDC data, but they may perform a Google search for ways to alleviate ILI symptoms, for example. Since Flu Near You depends solely on voluntary self-report of ILI symptoms, it might well underestimate prevalence. In a recent flu season, New York State Governor Andrew M. Cuomo declared a Public Health Emergency in response to a severe flu season. It was suggested that this might have prompted numerous Google searches by individuals who were not actually suffering from ILI symptoms, which in turn could have triggered the spike that we see in the figure.
Outcomes Research
The term outcomes research has been increasingly used to denote studies comparing the effects of two or more health care interventions or modalities—such as treatments, forms of health care organization, or type and extent of insurance coverage and provider reimbursement—on health or economic outcomes. The health end points may include morbidity and mortality as well as measures of quality of life, functional status, and patient perceptions of their health status, including symptom recognition and patient-reported satisfaction. Economic measures may reflect direct or indirect costs, and can include hospitalization rates, rehospitalization for the same condition within 30 days of discharge, outpatient and emergency room visits, lost days of work, child care, and days of restricted activity. Consequently, epidemiology is one of several disciplines needed in outcomes research.
Outcomes research often uses data from large data sets that were derived from large populations. Although in recent years some of the large data sets have been developed from cohorts that were originally set up for different research purposes, many of the data sets used were often originally initiated for administrative or fiscal purposes, rather than for any research goals. Often several large data sets, each having information on different variables, may be combined or linked (resulting in “meta-data”) in order to have sufficient sample size to explore a question of interest.
With the advent of the electronic medical record (EMR), patient care data are increasingly available to the epidemiology and health services research communities. The purpose of the EMR is to provide health care providers all of the information pertaining to individual patients—findings from office visits, utilization of preventive services, prescribed medications, procedures, radiologic findings, laboratory test results—continuously over time (i.e., prospectively). However, the purpose of the EMR is not to serve as a research base but to direct patient care. Harnessing the EMR to evaluate health services research questions has great promise, but to date it has proven difficult to use and the methods to maximize its potential are still being developed and tested in the field.
The advantages of using large data sets (sometimes referred to as “big data”) are that the data refer to real-world populations, and the issue of “representativeness” or “generalizability” is minimized. In addition, since the data sets exist at the time the research is initiated, analysis can generally be completed and results generated relatively rapidly. Moreover, given the large data sets used, sample size is not usually a problem except when smaller subgroups are examined. Given these considerations, the costs of using existing data sets are generally lower than the costs of primary data collection.
The disadvantages are that, since the data were often initially gathered for fiscal, patient care, and administrative purposes, they may not be well suited for research purposes and for answering the specific research question addressed in the study. Even when the data were originally gathered for research, our knowledge of the area may now be more complete, and new research questions may have arisen that were not even conceived of when the original data collection was initiated. In general, data may be incomplete. Data on the independent and dependent variables may be very limited. Data may be missing on clinical details, including disease severity and the details of interventions, and diagnostic coding may be inconsistent across facilities and within facilities over time. Data relating to possible confounders may be inadequate or absent, since the research now being conducted was often not even possible when the data were originally generated. Because certain variables that today are considered relevant and important were not included in the original data set, investigators may at times create surrogate variables for the missing variables, using variables that are included in the data set but that may not directly reflect the variable of interest. However, such surrogate variables vary in the extent to which they adequately measure the missing variable of interest. For all these reasons, the validity of the conclusions reached may be in doubt.
Another important problem that may arise with large data sets is that because the necessary variables may be absent in the available data set, the investigator may consciously or subconsciously change from the question he or she had originally wanted to address to a question that is of less interest, but for which the variables that are needed for conducting the study are present in the data set. Thus, rather than the investigator deciding what research question should be addressed, the data set itself may end up determining what questions are asked in the study.
Finally, using large data sets, investigators become progressively more removed from the individuals being studied. Over the years, direct interviews and reviews of patient records have tended to be replaced by large computerized databases. Using these sources of data, many personal characteristics of the subjects are never explored and their relevance to the questions being asked is virtually never assessed.
One area in which existing sources of data are often used in evaluation studies is prenatal care. The problems discussed earlier are exemplified in the use of birth certificates. These documents are often used because they are easily accessible and provide certain medical care data, such as the trimester in which prenatal care was begun. However, birth certificates for women with high-risk pregnancies have missing data more often than those for women with low-risk pregnancies. The quality of the data provided on birth certificates also may differ regionally and internationally, and may complicate any comparisons that are made.
An example of outcomes research using large data sets is a study by Ikuta et al. of Medicare beneficiaries in the United States. 4 Since Medicare health coverage is provided to virtually all elderly (ages 65 years and older) individuals in the United States, it is assumed that if a study population is limited to those who have Medicare coverage, financial obstacles to care and other variables such as age, gender, or racial/ethnic subpopulation are held constant among different groups. However, wide disparities still remain between blacks and whites in the utilization of many Medicare services. The authors studied national trends in the use of pulmonary artery catheterization (PAC) among Medicare beneficiaries during the period 1999–2013. 4 PAC is a procedure in which a tube is inserted into one of the large veins of the body and then threaded through the heart to be ultimately placed in the pulmonary artery. This procedure used to be indicated as part of the routine management of heart failure and sepsis-related acute respiratory distress syndrome, among many other conditions. However, given the rising evidence that PAC did not improve patient outcomes, the clinical practice guidelines of the American College of Cardiology and the Society of Critical Care Medicine now recommend against the routine use of PAC. The authors studied inpatient claims data from the Centers for Medicare and Medicaid Services from 1999 to 2013 and estimated the rate of PAC use per 1,000 admissions, 30-day mortality, and length of stay. They found a statistically significant 67.8% relative reduction in PAC use (from 6.28 per 1,000 admissions in 1999 to 2.02 per 1,000 admissions in 2013), in addition to year-to-year reductions in in-hospital mortality, 30-day mortality, and length of stay. However, the findings also showed that such rates varied substantially by gender (Fig. 17.4), race (Fig. 17.5), and age (Fig. 17.6). These results showed the added benefits of restricting the use of PAC in some patients. At the same time, the authors acknowledged the limitations of administrative data sets and the inability to generalize to younger and uninsured individuals.
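The 67.8% figure is simply the relative change between the 1999 and 2013 rates, which is easy to verify:

```python
rate_1999 = 6.28  # PAC insertions per 1,000 admissions, 1999
rate_2013 = 2.02  # PAC insertions per 1,000 admissions, 2013

relative_reduction = (rate_1999 - rate_2013) / rate_1999
print(f"Relative reduction in PAC use: {relative_reduction:.1%}")  # 67.8%
```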
FIG. 17.4 Pulmonary artery catheter use rate per 1,000 admissions by gender between 1999 and 2013. (Modified from Ikuta K, Wang Y, Robinson A, et al. National trends in use and outcomes of pulmonary artery catheters among Medicare beneficiaries, 1999–2013. JAMA Cardiol. 2017;2:908–913.)
FIG. 17.5 Pulmonary artery catheter use rate per 1,000 admissions by race between 1999 and 2013. (Modified from Ikuta K, Wang Y, Robinson A, et al. National trends in use and outcomes of pulmonary artery catheters among Medicare beneficiaries, 1999–2013. JAMA Cardiol. 2017;2:908–913.)
FIG. 17.6 Pulmonary artery catheter use rate per 1,000 admissions by age group between 1999 and 2013. (Modified from Ikuta K, Wang Y, Robinson A, et al. National trends in use and outcomes of pulmonary artery catheters among Medicare beneficiaries, 1999–2013. JAMA Cardiol. 2017;2:908–913.)
Potential Biases in Evaluating Health Services Using Group Data
Studies evaluating health services using group data are susceptible to many of the biases that characterize etiologic studies, as discussed in Chapter 15. In addition, certain biases are particularly relevant for specific research areas and topics, and may be important depending on the specific epidemiologic design selected. For example, studies of the relationship of prenatal care to birth outcomes are prone to several important potential biases. In such studies, the question often addressed is whether prenatal care, as measured by the absolute number of prenatal visits, reduces the risk of prematurity and low birth weight. Several potential biases may be introduced into this type of analysis. For example, other things being equal, a woman who delivers prematurely will have fewer prenatal visits (i.e., the pregnancy was shorter so that there was less time in which it was possible for her to “be at risk” for prenatal visits). The result would be an artefactual relationship between fewer prenatal visits and prematurity, only because the gestation was shorter. However, bias can also operate in the other direction. A woman who begins prenatal care in the last trimester of pregnancy will likely not have an early premature delivery, as she has already carried the pregnancy into the last trimester. This would lead to an observed association of fewer prenatal visits with a reduced likelihood of early premature delivery. In addition, women who have had medical complications or a poor pregnancy outcome in a prior pregnancy may be so anxious that they come for more prenatal visits (where problems with the fetus may be detected early), and they may also be at greater risk for a poor outcome. Thus, the potential biases can run in one or both directions. If such women are at a risk that is not amenable to prevention, an apparent association of more prenatal visits with an adverse outcome may be observed.
Finally, prenatal outcome studies based on prenatal care are often biased by self-selection; that is, the women who choose to begin prenatal care early in pregnancy are often better educated and from a higher socioeconomic status with more positive attitudes toward health care. Thus, a population of women, who to begin with are at lower risk for adverse birth outcomes, select themselves for earlier prenatal care. The result is a potential for an apparent association of early prenatal care with lower risk of adverse pregnancy outcome, even if the care itself is without any true health benefit.
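This gestational-length artifact is easy to demonstrate with a small simulation. The sketch below (all parameters are hypothetical) builds in no effect of prenatal visits on prematurity, yet premature deliveries still average fewer visits simply because shorter pregnancies leave less time in which visits can occur:

```python
import random

random.seed(42)

records = []  # (number of prenatal visits, delivered prematurely?)
for _ in range(10_000):
    gestation = random.gauss(39, 2.5)    # gestational age at delivery, in weeks
    care_start = random.uniform(8, 16)   # week in which prenatal care begins
    # Visits accrue roughly every 4 weeks once care begins, so a shorter
    # pregnancy mechanically yields fewer visits. No causal effect of visits
    # on prematurity is built into this simulation.
    visits = max(0, int((gestation - care_start) / 4))
    records.append((visits, gestation < 37.0))

premature = [v for v, p in records if p]
term = [v for v, p in records if not p]

print(f"Mean visits, premature deliveries: {sum(premature) / len(premature):.1f}")
print(f"Mean visits, term deliveries:      {sum(term) / len(term):.1f}")
# Premature deliveries show fewer visits even though visits are causally inert
# here; conditioning on gestational age (e.g., visits per week of gestation)
# removes the artifact.
```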
Two Indices Used in Ecologic Studies of Health Services
One index in evaluating health services that uses ecologic studies is avoidable mortality. Avoidable mortality analyses assume that the rate of “avoidable deaths” should vary inversely with the availability, accessibility, and quality of medical care in different geographic regions. The UK Office for National Statistics defines avoidable mortality as:
Avoidable deaths are all those defined as preventable, amenable, or both, where each death is counted only once. Where a cause of death falls within both the preventable and amenable definition, all deaths from that cause are counted in both categories when they are presented separately. 5
Conditions include tuberculosis, hepatitis C, human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), selected malignant neoplasms, substance use disorders, cardiovascular and respiratory diseases, unintentional and intentional injuries, among others.
Ideally, avoidable mortality would serve as a measure of the accessibility, adequacy, and effectiveness of care in an area. Deaths from HIV/AIDS will be less frequent in communities with ample, friendly, and convenient HIV testing and counseling and high-quality AIDS service organizations, often found in urban areas. In rural areas, such services may be less accessible, and diagnoses may be made only when a patient presents with an AIDS-defining illness. Thus, patients in areas with poorer service coverage are more likely to die of causes they would not have died of had they lived in an urban environment. Changes over time could be plotted and comparisons made with other areas. Unfortunately, the necessary data for such an analysis are often lacking for many of the conditions suggested for avoidable mortality analyses. Moreover, data on confounders may not be available, and the resulting inferences may therefore be open to question.
A second approach is to use health indicators. With this approach, certain sentinel conditions are assumed to reflect the general level of health care, and changes in the incidence of these conditions are plotted over time and compared with data for other populations. The changes and differences that are found are then related to changes in the health service sector and are used to derive inferences about causation. However, it is difficult to know which criteria must be met for a given condition to be acceptable as a valid health indicator. A systematic process should be followed to identify and implement a valid health indicator. Each indicator should be valid, reliable, relevant, realistic, and measurable; should be well known; should be usable in continuous assessment; and should effectively measure success and failure. The first phase of developing an indicator is usually the identification of a proposed list of indicators by a group of experts in the area, followed by shortlisting to those indicators that fulfill most or all of the attributes outlined above. The second phase is pilot testing, which primarily tests the availability of the data and estimates the time, effort, and cost of collecting information on each indicator. The third phase is full testing of the indicators on a larger scale and tuning them based on feedback from health care personnel on their use. The fourth and final phase is full implementation of the mature indicators. At this stage, there should be a mandate for reporting the indicators and systems in place for data collection, tabulation, analysis, and interpretation, coupled with a feedback mechanism to the intermediate and peripheral levels of the health care system.
The CDC maintains 26 Leading Health Indicators (LHIs) under 12 topics. The Healthy People 2020 LHIs are given in Box 17.3 and can be accessed on the Healthy People website (healthypeople.gov).
Box 17.3
The Healthy People 2020 Leading Health Indicators Are Composed of 26 Indicators Organized Under 12 Topics
Access to Health Services
- Persons with medical insurance (AHS-1.1)
- Persons with a usual primary care provider (AHS-3)
Clinical Preventive Services
- Adults receiving colorectal cancer screening based on the most recent guidelines (C-16)
- Adults with hypertension whose blood pressure is under control (HDS-12)
- Persons with diagnosed diabetes whose A1c value is greater than 9% (D-5.1)
- Children receiving the recommended doses of DTaP, polio, MMR, Hib, HepB, varicella, and PCV vaccines by age 19–35 months (IID-8)
Environmental Quality
- Air Quality Index >100 (EH-1)
- Children exposed to secondhand smoke (TU-11.1)
Injury and Violence
- Injury deaths (IVP-1.1)
- Homicides (IVP-29)
Maternal, Infant, and Child Health
- All infant deaths (MICH-1.3)
- Total preterm live births (MICH-9.1)
Mental Health
- Suicide (MHMD-1)
- Adolescents with a major depressive episode in the past 12 months (MHMD-4.1)
Nutrition, Physical Activity, and Obesity
- Adults meeting aerobic physical activity and muscle-strengthening objectives (PA-2.4)
- Obesity among adults (NWS-9)
- Obesity among children and adolescents (NWS-10.4)
- Mean daily intake of total vegetables (NWS-15.1)
Oral Health
- Children, adolescents, and adults who visited the dentist in the past year (OH-7)
Reproductive and Sexual Health
- Sexually active females receiving reproductive health services (FP-7.1)
- Knowledge of serostatus among HIV-positive persons (HIV-13)
Social Determinants
- Students graduating from high school 4 years after starting ninth grade (AH-5.1)
Substance Abuse
- Adolescents using alcohol or illicit drugs in past 30 days (SA-13.1)
- Binge drinking in past month—adults (SA-14.3)
Tobacco
- Adult cigarette smoking (TU-1.1)
- Adolescent cigarette smoking in past 30 days (TU-2.2)
Evaluation Using Individual Data
Because of the limitations inherent in analyzing studies using grouped data (i.e., studies in which we do not have data on both health care [exposure] and particular health outcomes at the individual level), studies using individual data are generally preferable. If we wish to compare two populations, one receiving the care being evaluated (perhaps a new treatment) and one not receiving it (patients who are given “usual care”), we must ask the following two questions in order to be able to derive inferences about the effectiveness of care:
- Are the characteristics of the two groups comparable—demographically, medically, and in terms of factors relating to prognosis?
- Are the measurement methods comparable (e.g., diagnostic methods and the way disease is classified) in both groups?
Both issues have been discussed in earlier chapters because they also apply equally well to questions of etiology, prevention, and therapy, and they must therefore be considered in any type of study design.
An important issue in using epidemiology to study outcomes for the evaluation of health services is the need to address prognostic stratification. If a change in health outcome is observed after a certain type of care has been delivered, can we necessarily conclude that the change is due to the (new) health care provided, or could it be a result of differences in prognosis, whether due to comorbidity (preexisting disease that may or may not be specifically related to the disease being studied), to disease severity, or to any other associated conditions that bear on prognosis? To address these issues, medical outcome studies must carry out prognostic stratification by studying case mix and carefully characterizing the individuals studied on the basis of disease severity.
Let us now turn to some study designs used in the evaluation of health services.
Randomized Designs
Randomization eliminates the problem of selection bias that results from either self-selection by the patient or selection of the patient by the health care provider. Usually, study participants are assigned to receive one type of care versus another rather than to receive care versus no care (Fig. 17.7). For many reasons, both ethical and practical, randomizing patients to receive no care usually is not considered.
FIG. 17.7 Design of a randomized study comparing care A and care B.
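The allocation step in Fig. 17.7 can be sketched in a few lines; the essential point is that chance, rather than the patient or the provider, determines who receives care A and who receives care B (the participant IDs below are hypothetical):

```python
import random

random.seed(1)  # fixed seed so the example is reproducible

participants = [f"patient_{i:02d}" for i in range(1, 21)]
random.shuffle(participants)   # chance, not choice, orders the list
care_a = participants[:10]     # first half assigned to care A
care_b = participants[10:]     # second half assigned to care B

print("Care A:", sorted(care_a))
print("Care B:", sorted(care_b))
```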
Let us consider a study that used a randomized design to evaluate different approaches to health care for elderly patients who have had a stroke. Early, organized, hospital-based management has been strongly recommended for the care of patients with stroke. However, few data are available from well-conducted controlled studies to compare hospital care with specialized care at home (domiciliary care). An alternative to stroke units in the hospital is a specialized stroke team that can provide care anywhere in the hospital where stroke patients may be treated. It may not be possible for every hospital to offer care in a specialized unit to all patients who have a stroke because of space limitations and other administrative and financial issues, hence the formation of a “roaming” stroke team.
To identify the optimal organizational structure for the care of patients with stroke, Kalra and colleagues 7 conducted a randomized, controlled trial to compare the efficacy of three forms of care (Fig. 17.8). Patients were randomly assigned to one of the following groups: (1) care provided in a hospital stroke unit by a stroke physician and a multidisciplinary team; (2) care provided by a multidisciplinary stroke team with expertise in stroke management; or (3) care at home (domiciliary care) provided by a specialist team. The outcome was mortality or institutionalization, which was assessed at 3, 6, and 12 months after the onset of a stroke. Data were analyzed by intention-to-treat. At each of the three time points, patients treated in the hospital stroke unit were less likely to die or to be institutionalized than patients in the group treated by the stroke team or the group receiving domiciliary care. Cumulative survival in the three groups is shown in Fig. 17.9. The study supports the use of specialized stroke units for the care of patients with stroke.
FIG. 17.8 Profile of a randomized trial of strategies for stroke care. (a) Fifty-one patients in this group were admitted to the hospital within 2 weeks of randomization, but are included in the intention-to-treat analysis. (Modified from Kalra L, Evans A, Perez I, et al. Alternative strategies for stroke care: a prospective randomized controlled trial. Lancet. 2000;356:894–899.)
FIG. 17.9 Kaplan–Meier survival curves for different strategies of care after acute stroke. (From Kalra L, Evans A, Perez I, et al. Alternative strategies for stroke care: a prospective randomized controlled trial. Lancet. 2000;356:894–899.)
As seen in Fig. 17.9, an interesting and somewhat surprising finding in this study is that survival was better in patients who were randomized to receive domiciliary care (care at home) than in those randomized to receive care in the hospital by a stroke team.
A possible explanation for this observation is that patients in the domiciliary care group whose condition deteriorated or who had developed new problems were withdrawn from domiciliary care and admitted to a stroke unit. These patients were still analyzed with the domiciliary care group because an intention-to-treat analysis was used that analyzes outcome according to the original randomization. These patients may have benefited from care in the stroke unit, and if so, their outcome would tend to improve the outcome results for the domiciliary care group because of the intention-to-treat analysis.
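Intention-to-treat means that each patient is counted in the arm to which he or she was randomized, regardless of the care actually received. A minimal sketch with hypothetical records shows how the two analyses can diverge when domiciliary patients cross over to the stroke unit:

```python
# Each record: (assigned arm, arm actually received, died within follow-up)
patients = [
    ("domiciliary", "domiciliary", False),
    ("domiciliary", "stroke_unit", False),  # deteriorated and was admitted
    ("domiciliary", "stroke_unit", True),   # deteriorated and was admitted
    ("stroke_unit", "stroke_unit", False),
    ("stroke_unit", "stroke_unit", True),
]

def mortality(group_by: int) -> dict:
    """Mortality proportions grouped by assigned arm (0) or received arm (1)."""
    counts: dict = {}
    for record in patients:
        arm, died = record[group_by], record[2]
        deaths, total = counts.get(arm, (0, 0))
        counts[arm] = (deaths + died, total + 1)
    return {arm: deaths / total for arm, (deaths, total) in counts.items()}

print("Intention-to-treat:", mortality(0))  # by randomized assignment
print("As-treated:        ", mortality(1))  # by care actually received
```

Crossovers who may have benefited from the stroke unit remain counted with the domiciliary group, which is the mechanism suggested above for the domiciliary group’s better-than-expected survival.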
Drummond and colleagues 8 conducted a 10-year follow-up of a randomized, controlled trial of care in a stroke rehabilitation unit. They found that management in a stroke rehabilitation unit conferred survival benefits even 10 years after the stroke. The exact reasons are not clear, but the authors suggest that one explanation may be that long-term survival is related to early reduction in disability.
Nonrandomized Designs
Many health care interventions cannot be subjected to randomized trials for several reasons. First, such trials are often logistically complex and extremely expensive. Because so many different health care measures are in use at any one time, it may not be feasible to subject all of them to randomized evaluations. Second, ethical problems may be perceived to occur in health services evaluation studies. Specifically, randomization may be viewed as an unacceptable process by many patients and by their health care providers. Third, randomized trials often take a long time to complete; because health care programs and health problems change over time, when the results of the study are finally obtained and analyzed, they may no longer be entirely relevant. For these reasons, many health care researchers look for alternative approaches that may at least yield some information. One such approach discussed above—outcomes research—generally refers to the use of data from nonrandomized studies that often use large existing data sets (or so-called “big data”).
Before–After Design (Historical Controls)
If randomization is not possible or will not be used for any reason, one possible study design to evaluate a program is to compare people who received care before a program was established (or before the health care measure became available) with those who received care from the program after it was established (or after the measure became available). What are the problems with the before–after design? First, the data obtained in each of the two periods are frequently not comparable in terms of either quality or completeness. When a new form of health service delivery is developed, evaluators of the program may want to include people who were treated in the past, before the program began, as a comparison group. The data on people treated after the new program begins may be collected using a well-designed research instrument, whereas data for past patients may include only that which may be available from health care records that had been designed and used only for clinical or administrative purposes. If we find a difference in outcome, we may not know if the observed difference is a result of the effect of the program or of differences in the quality of data from the two time periods.
Second, if we see a difference—for example, mortality is lower after a program was initiated than before the program was initiated—we do not know whether the difference is due to the program itself or to other factors that may have changed over time, such as housing, nutrition, other aspects of lifestyle, or the use of other health services.
Third, a problem of selection exists. Often, it is difficult to know whether the population studied after a program was established is actually similar to that seen before the program was established in terms of other factors that might affect outcome.
Does this mean that before–after studies have no value? No, it does not. But it does mean that such studies can only provide a suggestion—and are rarely conclusive—in demonstrating the effectiveness of a new health service.
A before–after design was used in a study to assess the impact of the Medicare prospective payment system (PPS) in the United States on quality of care. 9 The study was stimulated by concern that the PPS, with its closely regulated length of hospital stays and incentives for cost-cutting, might have adversely affected the quality of care. The before–after design was selected because the PPS was instituted nationwide, so a prospective cohort design could not be used. Data for almost 17,000 Medicare patients who were hospitalized in 1981–82 before the PPS was instituted were compared with data for patients hospitalized in 1985–86 after the PPS was in place. Quality of care was evaluated for five diseases: (1) congestive heart failure, (2) myocardial infarction, (3) pneumonia, (4) cerebrovascular accident, and (5) hip fractures. Outcome findings were adjusted for level of patient sickness on admission to the hospital. Although PPS was not found to be associated with an increase in either 30-day mortality or 6-month mortality, an increase was observed in instability at discharge (defined as the presence of conditions at discharge that clinicians agree should be corrected before discharge or monitored after discharge, and that may result in poor outcomes if not corrected). 10 The authors point out that other factors may also have changed during the time before and after institution of the PPS. Although the before–after design was probably the only design possible for the issue addressed in this study, the study is nevertheless susceptible to some of the problems of this type of design, which were discussed earlier.
When the change in the risk of the outcome is dramatic, the before–after design is akin to the so-called natural experiment (see Chapter 14, section titled “Approaches to Etiology in Human Populations”). It would, for example, be difficult to explain the marked decline in the rates of hospitalization for diabetes and meningitis by reasons other than the introduction of insulin and streptomycin, respectively.
Simultaneous Nonrandomized Design (Program–No Program)
One option to avoid the problems of changes that occur over calendar time is to conduct a simultaneous comparison of two populations that are not randomized, in which one population is served by the program and the other is not. This type of design is, in effect, a cohort study in which the type of health care being studied represents the “exposure.” As in any cohort study, the problem arises as to how to select exposed and unexposed groups for study.
In recent years considerable interest has focused on whether higher hospital volume and higher surgeon volume are related to better patient outcomes and lower costs, and many studies have been carried out on these issues. An example of a simultaneous, nonrandomized study of hospital volume is one reported by Wallenstein and colleagues. 11 This study explored whether differences in patient outcomes at different hospitals were related to the volume of hospital procedures performed. The authors studied hospitalizations of patients who underwent laparoscopic hysterectomy, the most common (600,000 surgeries annually) major gynecologic procedure in the United States. They examined the relationship of in-hospital complications (intraoperative, surgical site, and medical), as well as length of stay and cost during the index hospitalization, to the volume of surgeries performed by physicians and overall in the hospital. 11 As seen in Table 17.1, a dose-response relationship was found: the highest rates of in-hospital complications, the longest lengths of stay, and the highest costs occurred in hospitals that had the lowest volume of hysterectomies per year. The finding that hospitals that perform more hysterectomies have shorter lengths of stay and lower costs has important potential policy implications and argues for the regionalization of gynecologic surgical services.
TABLE 17.1
Association Between Hospital Volume of Laparoscopic Hysterectomies Performed Per Year and Morbidity, Mortality, and Resource Utilization
| Outcome | <49.4 procedures/year | 49.4–105 procedures/year | >105 procedures/year |
| --- | --- | --- | --- |
| Any complication (%) | 5.8 | 5.0 | 4.7 |
| Intraoperative complications (%) | 2.4 | 2.2 | 2.1 |
| Surgical site complications (%) | 2.6 | 2.3 | 1.8 |
| Medical complications (%) | 1.4 | 1.1 | 1.2 |
| Length of stay longer than 2 days (%) | 10.0 | 7.8 | 5.3 |
| Cost (dollars) | $6,527.00 | $5,809.00 | $5,561.00 |
| Death (%) | 0.02 | 0.01 | 0.01 |
Modified from Wallenstein ME, Ananth CV, Kim JH, et al. Effect of surgical volume on outcomes for laparoscopic hysterectomy for benign indications. Obstet Gynecol. 2012;119:709–716.
It is possible that the findings relating higher hospital volumes to better patient outcomes might be due to higher volumes of procedures performed by the surgeons at these hospitals rather than to the overall volumes of procedures performed at these hospitals. Birkmeyer and colleagues addressed this issue. 12 Using Medicare claims data for 1998 and 1999, they examined mortality among all 474,108 patients who underwent one of four cardiovascular procedures or four cancer resection procedures (Fig. 17.10). They found that for most procedures the mortality rate was higher in patients operated on by low-volume surgeons than in patients operated on by high-volume surgeons. This relationship held regardless of the surgical volume of the hospital in which the surgery was performed.
FIG. 17.10 Adjusted operative mortality among Medicare patients in 1998 and 1999 according to level of surgeon volume for four cardiovascular procedures (A) and four cancer resection procedures (B). Operative mortality was defined as the rate of death before hospital discharge or within 30 days after the index procedure. Surgeon volume was based on the total number of procedures performed. (From Birkmeyer JD, Stukel TA, Siewers AE, et?al. Surgeon volume and operative mortality in the United States. N Engl J Med. 2003;349:2117–2127.)
Comparison of Utilizers and Nonutilizers
One approach for a simultaneous, nonrandomized study is to compare a group of people who use a health service with a group of people who do not (Fig. 17.11).
FIG. 17.11 Design of a nonrandomized cohort study comparing utilizers with nonutilizers of a program.
The problem of self-selection inherent in this type of design has long been recognized. Haruyama and colleagues studied the association between the personal utilization of general health checkups (GHCs) and medical expenditures (MEs) in a middle-aged Japanese population (Table 17.2). 13
TABLE 17.2
Odds Ratio (and 95% Confidence Intervals) of Any Medical Consultation (Defined as Seeing a Doctor in a Period of 1 Year) According to Subgroups of General Health Checkup Utilization in Middle-Aged Japanese Population, 2010
| Consultation | Nonutilizers | Low-Frequency Utilizers | High-Frequency Utilizers |
| --- | --- | --- | --- |
| Outpatient | 1.00 | 2.90 (2.61–3.22) | 4.37 (3.88–4.92) |
| Inpatient | 1.00 | 0.79 (0.71–0.88) | 0.75 (0.67–0.83) |

Modified from Haruyama Y, Yamazaki T, Endo M, et al. Personal status of general health checkups and medical expenditure: a large-scale community-based retrospective cohort study. J Epidemiol. 2017;27(5):209–214.
In this study, the authors recruited 33,417 residents of Soka City, Saitama Prefecture, Japan, and studied their GHC utilization from 2008 to 2010. Utilization of GHCs was divided into zero times (nonutilizers), one to three times (low-frequency utilizers), and four to six times (high-frequency utilizers). Compared with the nonutilizers, the high-frequency utilizers showed statistically significantly higher outpatient MEs. In addition, the low- and high-frequency utilizers showed statistically significantly lower inpatient MEs and total MEs than the nonutilizers. The authors concluded that outpatient MEs increase with the frequency of GHC attendance; the early diagnosis facilitated by early outpatient consultation is likely to lead to a slight increase in outpatient MEs but a decrease in inpatient MEs for serious diseases, resulting in a decrease in the total cost of health care.
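Table 17.2 reports odds ratios with 95% confidence intervals. For a 2 × 2 table, these follow from the standard formulas OR = ad/bc and exp(ln OR ± 1.96 × SE), where SE = sqrt(1/a + 1/b + 1/c + 1/d). A sketch with hypothetical counts (the study’s underlying cell counts are not given in the text):

```python
import math

def odds_ratio_ci(a: int, b: int, c: int, d: int) -> tuple:
    """OR and 95% CI for a 2x2 table:
                 consulted   not consulted
    exposed          a             b
    unexposed        c             d
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - 1.96 * se_log)
    hi = math.exp(math.log(or_) + 1.96 * se_log)
    return or_, lo, hi

# Hypothetical counts comparing high-frequency utilizers with nonutilizers:
or_, lo, hi = odds_ratio_ci(a=900, b=100, c=700, d=300)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```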
Another example of differences between the characteristics of groups under comparison is given by a study conducted by Gierisch et al. on nonadherence to breast cancer screening with periodic mammographic examinations. 14 In this study, the nonadherent women were more likely than adherent women to be aged 40 to 49 and to have fair or poor self-rated health, as well as difficulty in getting mammograms. As these variables are related to breast cancer (and all-cause mortality), they must be taken into consideration when using a nonrandomized design to examine the effectiveness of breast cancer screening.
Although we can try to address the selection problem by characterizing the prognostic profile of those who use care and those who do not, so long as the groups are not randomized, we are left with a gnawing uncertainty as to whether some unidentified factors might have differentiated utilizers from nonutilizers and thereby affected the health outcome.
Comparison of Eligible and Ineligible Populations
Because of the problem of possible selection biases in comparing groups of utilizers with nonutilizers, another approach compares persons who are eligible for the care being evaluated with a group of persons who are not eligible (Fig. 17.12).
FIG. 17.12 Design of a nonrandomized cohort study comparing people eligible with people not eligible for a program.
The assumption being made here is that eligibility or noneligibility is not related to either prognosis or outcome; therefore no selection bias is being introduced that might affect the inferences from the study. For example, eligibility criteria may include the type of employer or the census tract of residence. However, even with this design, one must be alert for factors that may introduce selection bias; census tract of residence, for example, clearly may relate to socioeconomic status. Finding an appropriate noneligible population for comparison may therefore be critical, and selecting ineligible persons from similar neighborhoods can help ensure comparability of socioeconomic status. In addition, because differences between eligible and ineligible individuals may also affect external validity, adjustment for the variables that differ between these individuals can on occasion improve external validity.
Combination Designs
Fig. 17.13 shows a hypothetical result from a nonrandomized study comparing the morbidity level in a group that has not received a health service (Group X, shown in red) with the morbidity level in a group that has received the health service (Group Y, shown in black). Because the observed level of morbidity is lower for Group Y than for Group X, we might be tempted to conclude from these results that the health service reduces morbidity. However, as seen in Fig. 17.13 (left of figure), in order to reach this conclusion, we must assume that the original levels of morbidity in the two groups were comparable at a time before the care was provided to Group Y. If the morbidity levels for X1 and Y1 were similar, we could interpret the finding of a lower level of morbidity in Group Y (Y2) than in Group X (X2) at a time after which care has been administered as likely to have resulted from the care provided.
FIG. 17.13 Two possible explanations that would result in an observed difference in morbidity between Group X and Group Y after Group Y (shown in black) has received a health care service.
However, as seen in Fig. 17.13 (right of figure), it is possible that the groups might have been originally different and their prognoses may have differed at that time even before any care was provided. If such were the case, any differences in morbidity observed after care (i.e., Y2 lower than X2) might only reflect the original differences at the time before care was administered, and would not necessarily shed any light on the effectiveness of the care provided. Without data on morbidity levels in the two groups before the administration of care (“baseline”), the latter explanation of the observations cannot be ruled out.
In view of this problem, another approach to program evaluation is to use a combination design, which combines a before–after design with a program–no program design. This approach is demonstrated in the following example, in which outpatient care for sore throats in children was evaluated.
The study is designed to assess the effectiveness of outpatient care for sore throats in children by determining whether children who are eligible for care experience lower rates of complications of untreated "strep" throat, such as glomerulonephritis (inflammation of the kidney) or pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS), such as tics, than do children who are not eligible. The rationale was as follows: "Strep" throats are common in children. Untreated "strep" throats can lead to complications such as glomerulonephritis. If "strep" throats are properly treated, these complications can be prevented. Therefore, if these programs are effective in treating "strep" throats, fewer cases of complications should occur in children who receive the treatment.
It is possible to identify and compare subgroups of children and adolescents and to compare their rates of complications from untreated “strep.” The groups could include residents of census tracts that meet eligibility criteria for comprehensive care and residents of census tracts that do not meet these eligibility criteria for comprehensive care. Both could then be compared to the city or town as a whole.
An historic example from Dr. Gordis’s research 15 shows another complication of “strep” throat—rheumatic fever, which was much more common in the past century than today. Fig. 17.14 shows a program–no program comparison of rheumatic fever rates in black children in Baltimore City. In children eligible for comprehensive care based on their census tracts, the rheumatic fever rate was 10.6 per 100,000, compared with 14.9 per 100,000 in those who were not eligible. Although the rate was lower in the eligible group in this simultaneous comparison, the difference was not dramatic.
FIG. 17.14 Comprehensive care and rheumatic fever incidence per 100,000, 1968–70; Baltimore, black population, aged 5 to 14 years. (Modified from Gordis L. Effectiveness of comprehensive-care programs in preventing rheumatic fever. N Engl J Med. 1973;289:331–335.)
The next analysis in this combination design examined changes in rheumatic fever rates over time in both eligible and noneligible populations.
As seen in Fig. 17.15, the rheumatic fever rate declined 60% in the eligible census tracts from 1960–64 (before the programs were established) to 1968–70 (after the programs were operating). In the noneligible tracts, rheumatic fever incidence was essentially unchanged (+2%). Thus both parts of the combination design are consistent with a decline related to the care available.
FIG. 17.15 Comprehensive care and changes in rheumatic fever incidence per 100,000, 1960–64 and 1968–70; Baltimore, black population, aged 5 to 14 years. (Modified from Gordis L. Effectiveness of comprehensive-care programs in preventing rheumatic fever. N Engl J Med. 1973;289:331–335.)
However, because many changes had occurred in Baltimore City during this time, it was not certain whether the care provided by the programs was indeed responsible for the decline in rheumatic fever. Another analysis was therefore carried out. In children, streptococcal throat infection can be either symptomatic or asymptomatic. Clearly, only a child with a symptomatic sore throat would have been brought to a clinic. If we hypothesize that the care in the clinic was responsible for the reduction in rheumatic fever incidence, we would expect the decline in incidence to be limited to children with symptomatic clinical sore throats who would have sought care, and not to have occurred in asymptomatic children who had no clinically apparent infections.
As seen in Fig. 17.16, the entire decline was limited to children with prior clinically overt infection; no change in rheumatic fever incidence occurred in those children with asymptomatic “strep” throat. These findings are therefore highly consistent with the suggestion that it was the medical care, or some factor closely associated with it, which was responsible for the decline in rheumatic fever incidence.
FIG. 17.16 Changes in the annual incidence of first attacks of rheumatic fever in relation to the presence or absence of a preceding clinically symptomatic sore throat. As seen in the figure, the entire decline in first attacks of rheumatic fever was due to a decline in first attacks that were preceded by clinically symptomatic sore throats. (Modified from Gordis L. Effectiveness of comprehensive-care programs in preventing rheumatic fever. N Engl J Med. 1973;289:331–335.)
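To see both comparisons of the combination design in one place, here is a minimal Python sketch using the rheumatic fever rates reported above. The 1960–64 baseline rates are back-calculated from the reported percent changes (10.6/0.40 and 14.9/1.02) and should therefore be read as approximate.

```python
# Illustrative sketch of the two comparisons in a combination design,
# using the rheumatic fever rates reported in the text (per 100,000).
# The 1960-64 baselines are back-calculated from the reported percent
# changes and are approximate.

rates = {
    "eligible":    {"1960-64": 26.5, "1968-70": 10.6},  # ~60% decline
    "noneligible": {"1960-64": 14.6, "1968-70": 14.9},  # ~+2% change
}

# Comparison 1: program vs. no program at a single time point (1968-70)
after_diff = rates["eligible"]["1968-70"] - rates["noneligible"]["1968-70"]
print(f"Program vs. no program (1968-70): {after_diff:+.1f} per 100,000")

# Comparison 2: before vs. after within each group
for group, r in rates.items():
    pct_change = 100 * (r["1968-70"] - r["1960-64"]) / r["1960-64"]
    print(f"{group}: {pct_change:+.0f}% change from 1960-64 to 1968-70")
```

Each comparison alone is vulnerable (baseline differences in one, secular trends in the other); the design is persuasive only because both point the same way.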
Case-Control Studies
The use of the case-control design for evaluating health services, including vaccines and other forms of prevention and screening programs, has elicited increasing interest in the field of public health. Although the case-control design has been applied primarily to etiologic studies, when appropriate data are obtainable, this design can serve as a useful, but limited, surrogate for randomized trials. However, because this design requires definition and specification of cases, it is most applicable to studies of prevention of specific diseases. The “exposure” is then the specific preventive or other health measure that is being assessed. As in most health services research, stratification by disease severity and by other possible prognostic factors is essential for appropriate interpretation of the findings. The methodologic problems associated with such studies (which are discussed extensively in Chapter 7) also arise when the case-control design is used for evaluating effectiveness. In particular, these studies need to address the selection of controls and issues associated with confounders.
Conclusion
This chapter has reviewed the application of basic epidemiologic study designs to the evaluation of health services. Many of the issues that arise are similar to those that arise in etiologic studies, although at times they present a different twist. In etiologic studies, we are primarily interested in the possible association of a potential causal factor and a specific disease, and factors such as health services accessibility often represent possible confounders that must be taken into account. For example, in the Multi-Ethnic Study of Atherosclerosis, evaluation of determinants of atrial fibrillation must take into account the potential confounding effect of health insurance status (a marker of access to health care), as diagnosis of this condition is often made during a patient’s encounter with a physician. 16
In health care evaluation studies, we are primarily interested in possible associations of a health care or preventive activity and a particular disease outcome, and factors such as preexisting disease and other prognostic and risk factors become potential confounders that must be taken into consideration. Consequently, although many of the same design issues remain, the focus in evaluation research is often on different issues of measurement and assessment. The randomized trial remains the optimal method for demonstrating the effectiveness of a health intervention. However, ethical issues may remain in play, as it may be unethical to withhold a treatment known to be effective in a randomized trial design. In initiating any evaluation study of health care, we should ask at the outset whether it is biologically and clinically plausible, given our current knowledge, to expect a specific benefit from the care being evaluated.
For practical reasons, nonrandomized observations are also necessary and must be capitalized on in our efforts to expand health services evaluation. Critics of randomized trials have pointed out that such studies have included—and can only include—a small fraction of all patients receiving care in the health care system, so that generalizability of the results is a potential problem. Although this is true, generalizability is a problem with any study, no matter how large the study population. Nevertheless, even as we further refine the methodology of clinical trials, we also need improved methods to enhance the information that can be obtained from nonrandomized evaluations of health services.
The study of specific components of care, rather than a care program per se, is essential. In this way, if an effective element can be identified in a mix of many modalities, the others can be eliminated and the quality of care can be enhanced in a cost-effective fashion.
In Chapter 18, the discussion of evaluation is extended to a specific type of health services program: screening (early detection) for disease in human populations.
References
1 Frost WH. Rendering account in public health. Am J Public Health. 1925;15:394–397.
2 Chapin CV. Comments on “Rendering An Account on Public Health,” by Frost. Am J Public Health. 1925;15:397–398.
3 Butler D. When Google got flu wrong. Nature. 2013;494(7436):155–156.
4 Ikuta K, Wang Y, Robinson A, et al. National trends in use and outcomes of pulmonary artery catheters among Medicare beneficiaries, 1999–2013. JAMA Cardiol. 2017;2(8):908–913.
5 Office for National Statistics. Review of Avoidable Mortality Definition. Cardiff: Government of the United Kingdom; 2015.
6 Khoja T, Farag MK. Synopsis of Indicators: Monitoring, Evaluation, and Supervision of Healthcare Quality. Kingdom of Saudi Arabia: Ministry of Health; 1995.
7 Kalra L, Evans A, Perez I, et al. Alternative strategies for stroke care: a prospective randomized controlled trial. Lancet. 2000;356:894–899.
8 Drummond AE, Pearson B, Lincoln NB, et al. Ten year follow-up of a randomized controlled trial of care in a stroke rehabilitation unit. BMJ. 2005;331:491–492.
9 Kahn KL, Rubenstein LV, Draper D, et al. The effects of DRG-based prospective payment system on quality of care for hospitalized Medicare patients: an introduction to the series. JAMA. 1990;264:1953–1955.
10 Kosecoff J, Kahn KL, Rogers WH, et al. Prospective payment system and impairment at discharge: the "quicker and sicker" story revisited. JAMA. 1990;264:1980–1983.
11 Wallenstein ME, Ananth CV, Kim JH, et al. Effect of surgical volume on outcomes for laparoscopic hysterectomy for benign indications. Obstet Gynecol. 2012;119:709–716.
12 Birkmeyer JD, Stukel TA, Siewers AE, et al. Surgeon volume and operative mortality in the United States. N Engl J Med. 2003;349:2117–2127.
13 Haruyama Y, Yamazaki T, Endo M, et al. Personal status of general health checkups and medical expenditure: a large-scale community-based retrospective cohort study. J Epidemiol. 2017;27(5):209–214.
14 Gierisch JM, Earp JA, Brewer NT, et al. Longitudinal predictors of nonadherence to maintenance of mammography. Cancer Epidemiol Biomarkers Prev. 2010;19(4):1103–1111.
15 Gordis L. Effectiveness of comprehensive-care programs in preventing rheumatic fever. N Engl J Med. 1973;289:331–335.
16 Lin GM, Colangelo LA, Lloyd-Jones DM, et al. Association of sleep apnea and snoring with incident atrial fibrillation in the Multi-Ethnic Study of Atherosclerosis. Am J Epidemiol. 2015;182:49–57.
Review Questions for Chapter 17
1 All of the following are measures of process of health care in a clinic except:
- Proportion of patients in whom blood pressure is measured
- Proportion of patients who have complications of a disease
- Proportion of patients advised to stop smoking
- Proportion of patients whose height and weight are measured
- Proportion of patients whose bill is reduced because of financial need
2 The extent to which a specific health care treatment, service, procedure, program, or other intervention does what it is intended to do when used in a community-dwelling population is termed its:
- Efficacy
- Effectiveness
- Effect modification
- Efficiency
- None of the above
3 The extent to which a specific health care treatment, service, procedure, program, or other intervention produces a beneficial result under ideal controlled conditions is its:
- Efficacy
- Effectiveness
- Effect modification
- Efficiency
- None of the above
4 A major problem in using a historical control design for evaluating a health service using case-fatality (CF) as an outcome is that if the CF is lower after provision of the health service was started, then:
- The lower CF could be caused by changing prevalence of the disease
- The lower CF may be a result of decreasing incidence
- The lower CF may be an indirect effect of the new health service
- The CF may have been affected by changes in factors that are not related to the new health service
- None of the above
Question 5 is based on the information given below:
In-Hospital Case-Fatality (CF) for 100 Men Not Treated in a Coronary Care Unit (CCU) and for 100 Men Treated in a CCU, According to Three Clinical Grades of Severity of Myocardial Infarction (MI)
                  Non-CCU                       CCU
Clinical Grade    Total   Died   CF (%)    Total   Died   CF (%)
Mild                60     12      20        10      3      30
Severe              36     18      50        60     18      30
Shock                4      4     100        30     13      43
The results shown are based on a comparison of the last 100 patients treated before the CCU was installed and the first 100 patients treated within the CCU. All 200 patients were admitted during the same month.
You may assume that this is the only hospital in the town and that the natural history of MI was unchanged during this period.
5 The authors concluded that the CCU was very beneficial for men with severe MI and for those in shock, because the in-hospital CFs for these categories were much lower in the CCU. This conclusion:
- Is correct
- May be incorrect because CFs were used rather than mortality rates
- May be incorrect because of a referral bias of patients to this hospital from hospitals in distant towns
- May be incorrect because of differences in the assignment of the clinical severity grade before and after the opening of the CCU
- May be incorrect because of failure to recognize a possible decrease in the annual incidence rate of MI in recent years
CHAPTER 18
Epidemiologic Approach to Evaluating Screening Programs
Keywords
early detection of disease; screening; preclinical phase; detectable preclinical phase; disease progression; referral or volunteer bias; length-biased sampling; lead time bias and survival; overdiagnosis bias; cost-benefit analysis
LEARNING OBJECTIVES
- To extend the discussion of the validity and reliability of screening tests introduced in Chapter 5.
- To revisit the natural history of disease and introduce the concepts of lead time and critical point.
- To describe the major sources of bias that must be taken into account in assessing study findings that compare screened and unscreened populations, including referral bias, length-biased sampling, lead time bias, 5-year survival, and overdiagnosis bias.
- To discuss various study designs for evaluating screening programs, including nonrandomized and randomized studies and the challenges of interpreting the results of these studies.
- To discuss problems in assessing the sensitivity and specificity of commercially developed screening tests.
- To introduce issues associated with cost-benefit analyses of screening.
In Chapter 1, we distinguished among primary, secondary, and tertiary prevention. In Section II, we discussed the design and interpretation of studies that aim to identify risk factors or etiologic factors for disease so that the occurrence of disease can be completely prevented—primary prevention. In this chapter, we address how epidemiology is used to evaluate the effectiveness of screening programs for the early detection of disease—secondary prevention. This subject is particularly important in both clinical practice and public health because there is increasing acceptance of a physician’s obligation to include prevention along with diagnosis and treatment as major responsibilities in the clinical care of patients.
The validity and reliability of screening tests were discussed in Chapter 5. In this chapter, we will discuss some of the methodologic issues that must be considered in deriving inferences about the benefits that may come to those who undergo screening tests.
The question of whether patients benefit from the early detection of disease includes the following components:
- Can the disease be detected early?
- What are the sensitivity and the specificity of the test?
- What is the predictive value of the test?
- How serious is the problem of false-positive test results?
- What is the cost of early detection in terms of funds, resources, and emotional impact?
- Can the subject be harmed by having a screening test?
- Do the individuals in whom disease is detected early benefit from the early detection, and is there an overall benefit to those who are screened?
In this chapter, we primarily address the last question. Several of the other issues in the preceding list are considered only in the context of this question.
The term early detection of disease means diagnosing a disease at an earlier stage than would usually occur in standard clinical practice. This usually denotes detecting disease at a presymptomatic stage, at which point the patient has no clinical complaint (no symptoms or signs) and therefore no reason to seek medical care for the condition. The assumption in screening is that an appropriate intervention is available for the disease that is detected and that the medical intervention can be more effectively applied if the disease is detected at an earlier stage.
At first glance, the question of whether people benefit from early detection of disease may seem somewhat surprising. Intuitively, it would seem obvious that early detection is beneficial and that intervention at an earlier stage of the disease process is more effective and/or easier to implement than a later intervention. In effect, these assumptions represent a "surgical" view; for example, every malignant lesion is localized at some early stage, and at this stage it can be successfully excised before regional spread occurs or certainly before widespread metastases develop. However, the intuitive attractiveness of such a concept should not blind us to the fact that throughout the history of medicine, deeply felt convictions have often turned out to be erroneous when they were not supported by data obtained from appropriately designed and rigorously conducted studies. Consequently, regardless of the attractiveness of the idea of the beneficial aspects of early disease detection, both to clinicians involved in prevention and therapy and to those involved in community-based prevention programs, the benefits of any screening program must be rigorously evaluated.
Box 18.1
Assessing the Effectiveness of Screening Programs Using Operational Measures
- Number of people screened
- Proportion of target populations screened and number of times screened
- Detected prevalence of preclinical disease
- Total costs of the program
- Costs per case found
- Costs per previously unknown case found
- Proportion of positive screenees brought to final diagnosis and treatment
- Predictive value of a positive test in population screened
Modified from Hulka BS. Degrees of proof and practical application. Cancer. 1988;62:1776–1780. Copyright © 1988 American Cancer Society. Reprinted by permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.
We are particularly interested in the question of what benefit is gained by people who undergo screening in a screening program. However, just as is the case with evaluation of health services (discussed in Chapter 17), there is little advantage to improving the process of screening if persons who are screened derive no benefit. That is, if early detection does not lead to any improvement in survival, what is the gain to patients of being detected earlier? Perhaps just a longer remaining time to worry, with poor quality of life! We will therefore examine some of the problems associated with determining whether early detection of disease confers benefits to the individual who undergoes screening (in other words, whether the outcome is improved by screening).
What do we mean by outcome? To answer the question of whether patients benefit, we must precisely define what we mean by benefit, and what outcome or outcomes are considered to be evidence of patient benefit. Some of the possible outcome measures that might be used are shown in Box 18.2.
Box 18.2
Assessing the Effectiveness of Screening Programs Using Outcome Measures
- Reduction of mortality in the population screened
- Reduction of case-fatality in screened individuals
- Increase in percent of cases detected at earlier stages
- Reduction in complications
- Prevention of or reduction in recurrences or metastases
- Improvement of quality of life in screened individuals
Natural History of Disease
To discuss the methodologic issues involved in evaluating the benefits of screening, let us examine in further detail the natural history of disease (first discussed in Chapter 6).
We will begin by placing screening in its appropriate place on the timeline of the natural history of disease and will do so in relation to the different approaches to prevention discussed in Chapter 1.
Fig. 18.1A is a schematic representation of the natural history of a disease in an individual. At some point, biologic onset of disease occurs. This may be a subcellular change, such as an alteration in DNA, which at this point is generally undetectable. At some later point the disease becomes symptomatic, or clinical signs develop (i.e., the disease now moves into a clinical phase). The clinical signs and symptoms (e.g., blood in the stool) prompt the patient to seek care, after which a diagnosis is made and appropriate therapy is instituted, the ultimate outcome of which may be cure, control of the disease, disability, or death.
As seen in Fig. 18.1B, the onset of symptoms marks an important point in the natural history of a disease. The period when disease is present can be divided into two phases. The period from biologic onset of the disease to the development of signs and symptoms is called the preclinical phase of the disease, which comes before the clinical phase of the disease.
The period from the time when signs and symptoms develop to an ultimate outcome such as possible cure, control of the disease, or death is referred to as the clinical phase of the disease. As seen in Fig. 18.1C and D, primary prevention (i.e., preventing the development of disease by preventing or reducing exposure to disease-causing agents) denotes an intervention before a disease has developed. (Prevention of risk factor exposure, such as immunization and prevention of smoking initiation, is also known as primordial prevention.) Secondary prevention, detecting disease at an earlier stage than usual, such as by screening, takes place during the preclinical phase of an illness (i.e., after the disease has developed but before clinical signs and symptoms have appeared). Tertiary prevention refers to treating clinically ill individuals to prevent complications of the illness (e.g., stroke rehabilitation), including death of the patient.
If we want to detect disease earlier than usual through programs of health education, we could encourage symptomatic persons to seek medical care sooner. However, a major challenge lies in identifying persons with disease who do not have any symptoms. Our focus in this chapter is on identifying disease in persons who have not yet developed symptoms and who are in the preclinical phase of illness.
Let us now take a closer look at the preclinical phase of the disease (Fig. 18.2). At some point during the preclinical phase, it becomes possible to detect the disease by using currently available tests (see Fig. 18.2A). The interval from this point to the development of signs and symptoms is the detectable preclinical phase of the disease (see Fig. 18.2B). When disease is detected by a screening test, the time of diagnosis is advanced to an earlier point in the natural history of the disease than would have happened if the screening was not done. The lead time is defined as the interval by which the time of diagnosis is advanced by screening for the early detection of disease compared with the usual time of diagnosis (see Fig. 18.2C). The concept of lead time is inherent in the idea of screening and then detecting a disease earlier than it would usually be found.
Another important concept in screening is that of a critical point in the natural history of a disease (Fig. 18.3A). 3 This is a point in the natural history before which treatment is more effective and/or less difficult to administer. If a disease is potentially curable, cure may be possible before this point but not later on. For example, in a woman with breast cancer, one critical point would be that at which the disease spreads from the breast to the axillary lymph nodes. If the disease is detected and treated before it spreads, the prognosis is much better than after spread to the nodes has taken place.
FIG. 18.3 (A) A single critical point in the natural history of a disease. (B) Multiple critical points in the natural history of a disease. (Modified from Hutchison GB. Evaluation of preventive services. J Chronic Dis. 1960;11:497–508.)
As shown in Fig. 18.3B, there may be multiple critical points in the natural history of a disease. For example, in the patient with breast cancer, a second critical point may be that at which disease spreads from the axillary nodes to other more distant parts of the body. Prognosis is still better when the disease is confined to the axillary lymph nodes than when systemic spread has occurred, but not as good as when the disease is confined to the breast. The concept of multiple critical points suggests that the earlier the diagnosis, the better the prognosis.
The critical point is somewhat theoretical, however, because we usually cannot identify when it has been reached. Nevertheless, it is a very important concept in screening. If we cannot envision one or more critical points in the natural history of a disease, there is clearly no rationale for screening and early detection. Early detection presumes that a biologic point exists in the natural history of a disease before which treatment will benefit a person more than if he or she is treated after that point.
Pattern of Disease Progression
We might expect to see a potential benefit from screening and early detection if the following two assumptions hold:
- All or most clinical cases of a disease first go through a detectable preclinical phase.
- In the absence of intervention, all or most cases in a preclinical phase progress to a clinical phase.
Both assumptions are reasonably self-evident. For example, if none of the preclinical cases progress to clinical cases, there is no reason to perform screening tests. Alternatively, if none of the clinical cases passes through a preclinical phase, there is no reason to perform screening tests. Thus both assumptions are important in assessing any potential benefit from screening.
Let us look at the example of screening for cervical cancer. It has been some 80 years since the Papanicolaou (Pap) test was developed to detect the presence of precancerous or cancerous cells of the cervix, the opening of the uterus. During this routine procedure, cervical cells are scraped from around the cervix and then examined. The biology of cervical cancer has been well documented: the disease goes through a series of steps from dysplasia to carcinoma in situ to invasive cervical cancer, a progression that often takes years. Thus early detection often allows treatment to stop the progression of this cancer. More recently, with the documentation of the viral origins of cervical cancer (human papillomavirus [HPV] infection), cervical cancer screening is now done by HPV detection with less frequent Pap tests. Fig. 18.4A shows the progression from a normal cervix to cervical cancer. We might expect that detection and treatment of more cases at the in situ (noninvasive) stage would be reflected in a commensurate reduction in the number of cases that progress to invasive disease.
FIG. 18.4 (A) Natural history of cervical cancer: I. Progression from normal cervix to invasive cancer. (B) Natural history of cervical cancer: II. Extremely rapid progression and spontaneous regression.
However, the two assumptions associated with early detection are open to question. In certain situations, and unlike what happens in cervical cancer, the preclinical phase may be so short that the disease is unlikely to be detected by any periodic screening program. In addition, there is increasing evidence that spontaneous regression may occur in some diseases; therefore not every preclinical case inexorably progresses to clinical disease. Importantly, this is the case with HPV detection in women—most HPV infections detected in routine screening will regress (spontaneously disappear) within the following 6 months!
However, evaluating the benefits of cervical cancer screening is complicated by the problem that some cases progress through the in situ stage so rapidly, and the preclinical stage is so brief, that for all practical purposes there is no preclinical stage during which disease can be detected by screening. In addition, nuclear DNA quantitation studies suggest that cervical intraepithelial abnormalities may exist either as a reversible state or as an irreversible precursor of invasive cancer. Data also suggest that some cases of cervical intraepithelial neoplasia detected by a Pap smear regress spontaneously, particularly in the earlier stages, but also in the later stage (carcinoma in situ). In one study, one-third of women with abnormal Pap smears who refused any intervention were later found to have normal Pap smears. In addition, data suggest that most, if not all, in situ cervical neoplasias are associated with different types of papillomaviruses. Only neoplasias associated with certain high-risk types of papillomavirus progress to invasive cancer, so we may be dealing with heterogeneity of both the causal agent and the disease.
The simple model of progression from normal cervix to invasive cervical cancer seen in Fig. 18.4A would suggest that early detection followed by effective intervention would be reflected by a commensurate reduction in the number of invasive lesions that subsequently develop. A more accurate presentation of the natural history of cervical cancer may be that seen in Fig. 18.4B. The extent of both phenomena, spontaneous regression and extremely rapid progression, clearly influences the size of the decrease in invasive disease that might be expected to result from early detection and intervention and must therefore be taken into account in assessing the benefits of screening. Although these issues have been demonstrated for cervical cancer, they are clearly relevant to evaluating the benefits of screening for many diseases.
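A short, purely hypothetical calculation may help make this point concrete; every parameter below is invented for illustration and none is taken from the cervical cancer literature.

```python
# Invented parameters: a back-of-the-envelope look at how spontaneous
# regression and very rapid progression shrink the reduction in invasive
# disease that early detection can produce.

in_situ_detected_and_treated = 1_000
p_spontaneous_regression     = 0.30  # would have regressed without treatment
p_rapidly_progressive        = 0.15  # invasive cancers whose preclinical phase
                                     # is too brief for periodic screening to catch

# Naive expectation: every treated in situ lesion is an invasive case averted.
naive_averted = in_situ_detected_and_treated

# Only lesions that would actually have progressed count as averted cases.
realistic_averted = in_situ_detected_and_treated * (1 - p_spontaneous_regression)

print(f"Naive invasive cases averted:   {naive_averted}")
print(f"After allowing for regression:  {realistic_averted:.0f}")
print(f"Invasive cancers screening can never catch: {p_rapidly_progressive:.0%}")
```

The gap between the naive and adjusted figures is one reason observed declines in invasive disease are usually smaller than the number of early lesions detected would suggest.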
Methodologic Issues
To interpret the findings in a study designed to evaluate the benefits of screening, certain methodologic problems must be taken into account. Most studies of screening programs that have been reported are not randomized trials because of the difficulties of randomizing a population for screening. The question, therefore, is whether we can examine a group of people who have been screened and compare their mortality to that of a group of people who have not been screened (i.e., use a cohort design to evaluate the effectiveness of screening).
Let us assume that we can compare a population of people who have been screened for a disease with a population of people who have not been screened for the disease. Let us assume further that a viable and effective treatment is available and will be used effectively for those in whom disease is detected. If we find a lower mortality from the disease in those in whom disease was identified through screening than in those in whom disease was not detected in this manner, can we conclude that screening and early detection of disease have been beneficial? Let us turn to some of the methodologic issues involved.
Selection Biases
Referral Bias (Volunteer Bias)
In coming to a conclusion about the benefits of screening, the first question we might ask is whether there was a selection bias in terms of who was screened and who was not. We would like to be able to assume that those who were screened had the same characteristics as those who were not screened (i.e., they were similar to one another in all ways except their screening history). However, there are many differences in the characteristics of those who participate in screening or take advantage of other health programs and those who do not. Many studies have shown volunteers to be healthier than the general population and to be more likely to comply (to be adherent) with medical recommendations. If, for example, persons whose disease had a better prognosis from the outset were either referred for screening or were self-selected, we might observe lower mortality in the screened group even if early detection played no role in improving prognosis. Of course, it is also possible that volunteers may include many people who are at high risk and who volunteer for screening because they have anxieties based on a positive family history or their own lifestyle characteristics. The problem is that we do not know in which direction the selection bias might operate and how it might affect the study results.
The problem of selection bias that most significantly affects our interpretation of the findings is best addressed by carrying out the comparison with a randomized experimental study in which care is taken that the two groups have comparable initial prognostic profiles (Fig. 18.5).
FIG. 18.5 Design of a randomized trial of the benefits of screening.
Length-Biased Sampling (Prognostic Selection)
The second type of problem that arises in interpreting the results of a comparison of a screened and an unscreened group is a possible selection bias; this does not relate to who comes for screening but rather to the type of disease that is detected by the screening. The question is: Does screening selectively identify cases of the disease which have a better prognosis? In other words, do the cases found through screening have a better natural history regardless of how early therapy is initiated? If the outcome of those in whom disease is detected by screening is found to be better than the outcome of those who were not screened, and in whom disease was identified during the usual course of clinical care, could the better outcome among those who are screened result from selective identification by screening of persons with a better prognosis? Could the better outcome be unrelated to the time of diagnostic and treatment interventions?
How could this come about? Recall the natural history of disease, with clinical and preclinical phases, as shown in Fig. 18.1B. We know that the clinical phase of illness differs in length for different people (i.e., there is a natural distribution of clinical illness parameters in every population). For example, some patients with colon cancer die soon after diagnosis, whereas others survive for many years. What appears to be the same disease may include individuals with different lengths of a clinical phase.
What about the preclinical phase in these individuals? Actually, each patient’s disease has a single continuous natural history, which we divide into preclinical and clinical phases (Fig. 18.6) on the basis of the point in time at which signs and symptoms develop. In some, the natural history is brief, and in others the natural history is protracted. This suggests that if a person has a slowly progressive natural history with a long clinical phase, the preclinical phase will also be long. In contrast, if a person has a rapidly progressive disease process and a short natural history, the clinical phase is likely to be short, and it seems reasonable to conclude that the preclinical phase will also be short. There are in fact data to support the notion that a long clinical phase is associated with a long preclinical phase and a short clinical phase is associated with a short preclinical phase. Lung cancer serves as an example: it has a short clinical phase and most likely also a short preclinical phase, as suggested by the inconsistent results from clinical trials of smokers screened by computed tomography, with some trials showing an approximately 15% to 20% effectiveness and others showing no effectiveness whatsoever. 4
FIG. 18.6 Short and long natural histories of disease: relationship of length of clinical phase to length of preclinical phase.
Remember that our purpose in screening is to detect the disease during the preclinical phase because during the clinical phase the patient is already aware of the problem and even without screening will probably seek medical care for symptoms. If we mount a one-time screening program in a community, which group of patients are we likely to identify—those with a short preclinical phase or those with a long preclinical phase?
To answer this question, let us consider a small population that is screened for a certain disease (Fig. 18.7). As shown in Fig. 18.7, each case has a preclinical and a clinical phase. The figure is drawn so that each preclinical phase is the same length as its associated clinical phase. Patients in the clinical phase will be identified in the usual course of medical care, so the purpose of the screening is to identify cases in the preclinical state (i.e., before any onset of signs or symptoms). Note that the lengths of the preclinical phases of the cases represented here vary. The longer the preclinical phase, the more likely the screening program will detect the case while it is still preclinical. For example, if we screen once a year for a disease for which the preclinical phase is only 24 hours long, we will clearly miss virtually all of the cases during the preclinical phase. However, if the preclinical phase is 1 year long, many more cases will be identified during that time. Screening tends to selectively identify those cases that have longer preclinical phases of illness. Consequently, even if the subsequent therapy had no effect, screening would still selectively identify persons with a long preclinical phase, who would then also experience a longer clinical phase (i.e., those with a better prognosis). These people would have a better prognosis even if there were no screening program or even if there were no true benefits from screening.
FIG. 18.7 Hypothetical population of individuals with long and short natural histories.
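This selective detection can be demonstrated with a small simulation; the exponentially distributed disease durations and the simple detection rule below are our own illustrative assumptions, not taken from the text.

```python
# A small simulation (all assumptions ours) of why a screening program
# preferentially detects cases with long preclinical phases.
import random

random.seed(42)

# Preclinical durations, in years: exponentially distributed with mean 1.
# As in Fig. 18.7, each case's clinical phase is assumed to mirror its
# preclinical phase, so "long preclinical" implies "long clinical."
durations = [random.expovariate(1.0) for _ in range(100_000)]

# Approximate a yearly screen: a case is caught while still preclinical
# with probability proportional to its preclinical duration (capped at 1).
detected = [d for d in durations if random.random() < min(d, 1.0)]

mean_all = sum(durations) / len(durations)
mean_det = sum(detected) / len(detected)
print(f"Mean preclinical phase, all cases:       {mean_all:.2f} years")
print(f"Mean preclinical phase, screen-detected: {mean_det:.2f} years")
# The detected mean is noticeably longer: screening oversamples slow,
# better-prognosis disease even when treatment confers no benefit at all.
```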
This problem can be addressed in several ways. One approach is to use an experimental randomized design in which care is taken to keep the groups comparable in terms of the lengths of the detectable preclinical phase of illness. However, this may not be easy. In addition, survival would have to be examined for all members of each group (i.e., the screened and unscreened). In the screened group, survival would be calculated for those in whom disease is detected by screening and for those in whom disease is detected between screening examinations, the so-called interval cases. We will return to the importance of interval cases later in this chapter.
Lead Time Bias
Another problem that arises in comparing survival in people who are screened with survival in those who are not screened is lead time bias (first illustrated in Fig. 18.2C)—how much earlier can the diagnosis be made if the disease is detected by screening compared with the usual timing of the diagnosis if screening were not carried out?
Consider four individuals with a certain disease shown by the four timelines in Fig. 18.8. The thicker part of each horizontal line denotes the apparent survival that is observed. The first timeline (A) shows the usual time of diagnosis and the usual time of death. The second timeline (B) shows an earlier time of diagnosis but the same time of death. Survival seems better because the interval from diagnosis to death is longer, but the patient is not any better off because death has not been delayed. The third timeline (C) shows earlier diagnosis and a delay in death from the disease—clearly a benefit to the patient (assuming that subsequent quality of life is good). Finally, the fourth timeline (D) shows earlier diagnosis, with subsequent prevention of death from the disease.
FIG. 18.8 (A) Outcome of diagnosis at the usual time, without screening. (B–D) Three possible outcomes of an earlier diagnosis as a result of a screening program.
The benefits we seek in screening are delay or prevention of death. Although we have chosen to focus on mortality in this chapter, we could also have used morbidity parameters, recurrence, quality of life, or patient satisfaction as valid measures of outcome.
Lead Time and 5-Year Survival
Five-year survival is a frequently used measure of therapeutic success, particularly in cancer therapy. Let us examine the possible effect of lead time on apparent 5-year survival.
Fig. 18.9A shows the natural history of disease in a hypothetical patient with colon cancer, which was diagnosed in the usual clinical context without any screening. Biologic onset of the disease was in 2008. The patient became aware of symptoms in 2016 and had a diagnostic workup leading to a diagnosis of colon cancer. Surgery was performed in 2016, but the patient died of colon cancer in 2018. This patient has survived for 2 years (2016–18) and clearly is not a 5-year survivor. If we use 5-year survival as an index of treatment success, this patient is a treatment failure.
FIG. 18.9 (A) Natural history of a patient with colon cancer without screening. Disease diagnosed and treated in 2016. (B) Disease detected by screening 3 years earlier, in 2013 (lead time). (C) Lead time bias resulting from screening 3 years earlier.
Consider what might happen to this patient if he resides in a community in which a screening program is initiated (see Fig. 18.9B). For this hypothetical example only, let us assume that there is actually no benefit from early detection (i.e., the natural history of colon cancer is unaffected by early intervention). In this case the patient is asymptomatic but undergoes a routine screening test in 2013, the result of which is positive. In 2013, surgery is performed, but the patient dies in 2018. The patient has survived 5 years and is now clearly a 5-year survivor. However, he is a 5-year survivor not because death has been delayed but because the diagnosis has been made earlier. When we compare this screening scenario with the scenario without screening (see Fig. 18.9A), it is apparent the patient has not derived any benefit from earlier detection in terms of having lived any longer. Indeed, the patient may have lost out in terms of quality of life because the earlier detection of disease by screening gave him an additional 3 years of postoperative and other medical care and may have deprived him of 3 years of normal life. This problem of an illusion of better survival only because of earlier detection is called the lead time bias, as shown in Fig. 18.9C.
Thus, even if there is no true benefit from early detection of a disease, there will appear to be a benefit associated with screening, even if death is not delayed, because of an earlier point of diagnosis from which survival is measured. This is not to say that early detection carries no benefit; rather, even without any benefit, the lead time associated with early detection creates the appearance of a benefit in the form of enhanced survival. Lead time must therefore be taken into account in interpreting the results of nonrandomized evaluations.
Fig. 18.10 shows the effect of the bias resulting from lead time on quantitative estimates of survival. Fig. 18.10A shows a situation in which no screening activity is being carried out. Five years after diagnosis, survival is 30%. If we institute a screening program with a 1-year lead time, the entire frame is shifted to the left (see Fig. 18.10B). If we now calculate survival at 5 years from the new time of diagnosis (see Fig. 18.10C), survival appears to be 50%, but only as a result of lead time bias. The problem is that the apparently better survival is not a result of screened people living longer; rather, it is a result of a diagnosis being made at an earlier point in the natural history of their disease. For many diseases, such as cancer, the patient cannot die of the disease before the onset of the clinical phase, and thus the time between early and usual diagnosis (i.e., the lead time) reflects what is also known as "immortal time bias."
FIG. 18.10 (A) Lead time bias-I: 5-year survival when diagnosis is made without screening. (B) Lead time bias-II: Shift of 5-year period by screening and early detection (lead time). (C) Lead time bias-III: Bias in survival calculation resulting from early detection. (Modified from Frank JW. Occult-blood screening for colorectal carcinoma: the benefits. Am J Prev Med. 1985;1:3–9.)
Consequently, in any comparison of screened and unscreened populations we must make an allowance for an estimated lead time in an attempt to identify any prolongation of survival above and beyond that resulting from the artifact of lead time. If early detection is truly associated with improved survival, survival in the screened group should be greater than survival in the control group plus the lead time. We therefore have to generate some estimate of the lead time for the disease being studied. 5
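As a concrete illustration of this adjustment, consider the following sketch; all of the numbers are hypothetical, including the 3-year lead time estimate.

```python
# A hypothetical illustration of the lead-time adjustment described above:
# credit screening with a survival gain only beyond the estimated lead time.

mean_survival_screened = 6.5  # years from (earlier) screen-detected diagnosis to death
mean_survival_control  = 4.0  # years from usual clinical diagnosis to death
estimated_lead_time    = 3.0  # estimated years by which screening advances diagnosis

apparent_gain = mean_survival_screened - mean_survival_control
adjusted_gain = mean_survival_screened - (mean_survival_control + estimated_lead_time)

print(f"Apparent survival gain:  {apparent_gain:+.1f} years")
print(f"Lead-time-adjusted gain: {adjusted_gain:+.1f} years")
# The apparent 2.5-year benefit becomes -0.5 years after adjustment: in
# this made-up example, screening only moved the diagnosis earlier.
```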
Another strategy is to compare mortality from the disease in the entire screened group with that in the unscreened group, rather than just the cumulative survival, or its complement, the case-fatality rate, in those in whom disease was detected by screening.
Overdiagnosis Bias
Another potential bias is that of overdiagnosis. At times, people who initiate a screening program have almost limitless enthusiasm for the program. Even cytopathologists reading Pap smears for cervical cancer may become so enthusiastic that they may tend to overread the smears (in other words, to make false-positive readings). If they do overread, some normal women will be included in the group thought to have positive Pap smears. Consequently, the abnormal group will be diluted with women who are free of cancer. If normal individuals in the screened group are more likely to be erroneously diagnosed as positive than are normal individuals in the unscreened group (i.e., labeled as having cancer when in reality they do not), one could get a false impression of increased rates of detection and diagnosis of early-stage cancer as a result of the screening. In addition, because many of the persons with a diagnosis of cancer in the screened group would actually not have cancer and would therefore have a good survival, the results would represent an inflated estimate of survival after screening in persons thought to have cancer, resulting in a mistaken conclusion that screening had been shown to improve survival from cancer in this population.
The possible quantitative impact of overdiagnosis resulting from screening is demonstrated in a hypothetical example shown in Fig. 18.11. Fig. 18.11A shows Scenario 1, in which there is no screening. In this scenario, 1,000 patients with clinical lung cancer are followed for 10 years. At that point, 900 have died and 100 are alive. The 10-year survival for the 1,000 patients is therefore 100/1,000, or 10%.
FIG. 18.11 The impact of overdiagnosis resulting from screening on estimation of survival. (A) Scenario 1—survival with no screening. (B) Scenario 2—when screening results in overdiagnosis: survival after 10 years. (C) Comparison of 10-year survival in Scenario 1 and Scenario 2. (Modified from Welch HG, Woloshin S, Schwartz LM. Overstating the evidence for lung cancer screening: the International Early Lung Cancer Action Program [I-ELCAP] study. Arch Intern Med. 2007;167:2289–2295.)
Fig. 18.11B shows Scenario 2, in which screening results in overdiagnosis. In this scenario, 4,000 people screen positive for lung cancer. Of these, 1,000 are the same patients with clinical lung cancer seen in Fig. 18.11A, and the other 3,000 are people who do not have lung cancer but are overdiagnosed by the screening test as being positive for lung cancer (false-positives).
After 10 years, these 3,000 people are still alive, as are the 100 people who had clinical lung cancer and survived as shown in Fig. 18.11A. The result is that of the 4,000 people who screened positive initially, 3,100 have survived for 10 years. As shown in the comparison of Scenario 1 and Scenario 2 in Fig. 18.11C, 10-year survival in Scenario 2 is now 78% compared with 10% in Scenario 1 in the original patient population of 1,000 who had clinical lung cancer. However, the apparently “better” survival seen in Scenario 2 is entirely due to the inclusion of 3,000 people who did not have lung cancer but were overdiagnosed by the screening method.
In effect, this is a misclassification bias, as discussed in Chapter 15. In this example, 3,000 people without lung cancer have been misclassified by the screening test as having lung cancer. Consequently, it is essential that in such studies of survival, the diagnostic process be rigorously standardized to minimize the potential problem of overdiagnosis.
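The arithmetic of Fig. 18.11 can be reproduced in a few lines; the counts are those given in the text.

```python
# Reproducing the overdiagnosis arithmetic of Fig. 18.11 (figures from the text).

# Scenario 1: no screening; 1,000 patients with clinical lung cancer.
clinical_cases = 1_000
alive_10y      = 100
survival_1 = alive_10y / clinical_cases
print(f"Scenario 1 10-year survival: {survival_1:.0%}")  # 10%

# Scenario 2: screening labels 4,000 people positive; 3,000 of them are
# overdiagnosed (no lung cancer) and all 3,000 survive.
overdiagnosed     = 3_000
screened_positive = clinical_cases + overdiagnosed
alive_2           = alive_10y + overdiagnosed
survival_2 = alive_2 / screened_positive
print(f"Scenario 2 10-year survival: {survival_2:.0%}")  # ~78%
# The improvement is an artifact: every extra survivor is someone who
# never had lung cancer in the first place.
```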
Study Designs for Evaluating Screening: Nonrandomized and Randomized Studies
Nonrandomized Studies
In discussing the methodologic issues involved in nonrandomized studies of screening, we have in essence been discussing nonrandomized observational studies of screened and unscreened persons—a cohort design (Fig. 18.12).
FIG. 18.12 Design of a nonrandomized cohort study of the benefits of screening.
The case-control design has also been used as a method of assessing the effectiveness of screening (Fig. 18.13). In this design the “cases” are people with advanced disease—the type of disease we hope to prevent by screening. Several proposals have been made for appropriate controls for such a study. Clearly, they should be “noncases” (i.e., people without advanced disease). Although the “controls” used in early case-control studies for evaluating screening were people with disease at an early stage, many researchers believe that people selected from the population from which the cases were derived are better controls. We then determine the prevalence of a history of screening among both the cases and the controls, so that screening is looked at as an “exposure.” If screening is effective, we would expect to find a greater prevalence of screening history among the controls than among those with advanced disease, and an odds ratio can be calculated, which will be less than 1.0 if screening is effective.
FIG. 18.13 Design of a case-control study of the benefits of screening.
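A minimal sketch of this calculation, with invented counts, might look as follows; only the direction of the odds ratio matters here.

```python
# Hypothetical 2x2 table for the case-control design in Fig. 18.13:
# cases have advanced disease; controls come from the same source
# population. "Exposure" = history of having been screened.
# All counts below are invented for illustration.

screened_cases,    unscreened_cases    = 30, 70  # advanced disease
screened_controls, unscreened_controls = 55, 45  # population controls

odds_ratio = (screened_cases / unscreened_cases) / (
    screened_controls / unscreened_controls
)
print(f"Odds ratio for screening history: {odds_ratio:.2f}")
# OR < 1.0 here (about 0.35), consistent with screening protecting
# against progression to advanced disease, subject of course to the
# selection and confounding problems discussed in this chapter.
```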
Randomized Studies
In this type of study, a population is randomized, half to screening and half to no screening. Such a study is difficult to mount and carry out and may be fraught with ethical concerns. Perhaps the best known randomized trial of screening is the trial of screening for breast cancer using mammography that was carried out at the Health Insurance Plan (HIP) of New York. 6 Shapiro and colleagues conducted a randomized trial in women enrolled in the prepaid HIP program, an early health maintenance organization (HMO) in New York. This study has become a classic in the literature in reporting evaluation of screening benefits through a randomized trial design, and it serves as a model for future studies of this type.
The study was begun in 1963. It was designed to determine whether periodic screening using clinical breast examination by a physician and mammography reduced breast cancer mortality in women aged 40 to 64 years. Approximately 62,000 women were randomized into a study group and a control group of approximately 31,000 each (Fig. 18.14). The study group was offered screening examinations; 65% appeared for the first examination and were offered additional examinations at annual intervals. Most of these women had at least one of the three annual screening examinations that were offered. Screening consisted of physical breast examination, mammography, and interview. Control women received the usual medical care in the prepaid medical program. Many reports have been published from this outstanding study, and we will examine only a few of the results here.
FIG. 18.14 Design of the Health Insurance Plan (HIP) randomized controlled trial begun in 1963 to study the efficacy of mammography screening. (Data from Shapiro S, Venet W, Strax P, et al., eds. Periodic Screening for Breast Cancer: The Health Insurance Plan Project and Its Sequelae, 1963–1986. Baltimore: Johns Hopkins University Press; 1988.)
Fig. 18.15 shows the number of breast cancer deaths and the mortality rates in both the study group (women who were offered screening mammography) and the control group after 5 years of follow-up.
FIG. 18.15 Numbers of deaths due to breast cancer and mortality rates from breast cancer in control and study groups; 5 years of follow-up after entry into study. Data for study group include deaths among women screened and those who refused screening. (Data from Shapiro S, Venet W, Strax P, et al. Selection, follow-up, and analysis in the Health Insurance Plan Study: a randomized trial with breast cancer screening. Natl Cancer Inst Monogr. 1985;67:65–74.)
Note that the data for the study group include deaths among women screened and those who refused screening. Recall the presentation on the problem of unplanned crossover in randomized trials. In that context, it was pointed out that the standard procedure in data analysis was to analyze according to the original randomization—an approach known as “intention to treat.” That is precisely what was done here. Once a woman was randomized to mammography, she was kept in that group for purposes of analysis even if she subsequently refused screening. Despite this, we see that breast cancer deaths are much higher in the control group than in the study group.
Fig. 18.16 shows 5-year case-fatality in the women who developed breast cancer in both groups. The case-fatality in the control group was 40%. In the total study group (women who were randomized to receive mammography, regardless of whether or not they were actually screened) the case-fatality was 29%. Shapiro and coworkers then divided this group into those who were screened and those who refused screening. In those who refused screening, the case-fatality was 35%. In those who were screened, the case-fatality was 23%.
FIG. 18.16 Five-year case-fatality among patients with breast cancer. Case-fatality figures for those in whom detection was due to screening allow for a 1-year lead time. (Data from Shapiro S, Venet W, Strax P, et al. Ten- to 14-year effect of screening on breast cancer mortality. J Natl Cancer Inst. 1982;69:349–355.)
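As a back-of-the-envelope consistency check (our own, not from the study), the intention-to-treat case-fatality should be roughly the case-weighted average of the screened and refused subgroups. The text gives no case counts per subgroup, so the 50/50 split below is an assumption.

```python
# Rough consistency check for the intention-to-treat figure; the case
# split between subgroups is assumed, not reported.

cf_screened = 0.23  # case-fatality among women actually screened
cf_refused  = 0.35  # case-fatality among women who refused screening
share_screened = 0.50  # assumed fraction of breast cancers in the screened subgroup
share_refused  = 0.50

cf_intention_to_treat = cf_screened * share_screened + cf_refused * share_refused
print(f"Implied intention-to-treat case-fatality: {cf_intention_to_treat:.0%}")
# With an even case split this reproduces the reported 29% for the total
# study group: the intention-to-treat figure is a case-weighted average.
```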
Shapiro and colleagues then compared survival in women whose breast cancer was detected at the screening examination with that in women whose breast cancer was identified between screening examinations (i.e., no breast cancer was identified at screening, and before the next examination a year later, the women had symptoms that led to the diagnosis of breast cancer). If the cancer had been detected by mammography, the case-fatality was only 13%. However, if the breast cancer was an interval case (i.e., diagnosed between examinations), the case-fatality was 38%. What could explain this difference in case-fatality? The likely explanation is that disease that was found between regular mammographic examinations was rapidly progressive. It was not detectable at the regular mammographic examination but was identified before the next regularly scheduled examination a year later because it was so aggressive. (Another possibility is that at least some apparent interval cases were in reality cases that had not been detected at the previous screening examination [i.e., they were false-negatives].)
These observations also support the notion discussed earlier in this chapter that a long clinical phase is likely to be associated with a long preclinical phase. Women in whom cancer findings were detected at screening had a long preclinical phase and a case-fatality of only 13%, indicating a long clinical phase as well. The women who had normal mammograms and whose disease became clinically apparent before the next examination had a short preclinical phase and, given the group’s high case-fatality, also had a short clinical phase.
Fig. 18.17 shows deaths from causes other than breast cancer in both groups over 5 years. Mortality was much higher in those who did not come for screening than in those who did. Because the screening was directed only at breast cancer, why should those who came for screening and those who did not have different mortality rates from causes other than breast cancer? The answer, clearly, is volunteer bias—the well-documented observation that people who participate in health programs differ in many ways from those who do not: in their health status, attitudes, educational and socioeconomic levels, and other factors. This is another demonstration that, for purposes of evaluating a health program, comparison of participants and nonparticipants is not a valid approach.
FIG. 18.17 Mortality from all causes excluding breast cancer per 10,000 person-years, Health Insurance Plan. (Data from Shapiro S, Venet W, Strax P, et al. Selection, follow-up, and analysis in the Health Insurance Plan Study: a randomized trial with breast cancer screening. Natl Cancer Inst Monogr. 1985;67:65–74.)
Before we leave our discussion of the HIP study, we might digress and mention an interesting application of these data carried out by Shapiro and coworkers. 7 Fig. 18.18 shows that, in the United States, 5-year relative survival from breast cancer is better in whites than in blacks.
FIG. 18.18 Five-year relative survival rates, by race, among women with breast cancer diagnosed 1964–73 (Surveillance, Epidemiology, and End Results program). (Data from Shapiro S, Venet W, Strax P, et al. Prospects for eliminating racial differences in breast cancer survival rates. Am J Public Health. 1982;72:1142–1145.)
The question has been raised whether this is due to a difference in the biology of the disease in blacks and in whites or to a difference between blacks and whites in accessing health care, which may delay the diagnosis and treatment of the disease in black patients. Shapiro and colleagues recognized that the randomized trial of mammography offered an unusual opportunity to address this question. The findings are shown in Fig. 18.19. Let us first look only at the survival curves for the control group consisting of blacks and whites (see Fig. 18.19A). The data are consistent with those in Fig. 18.18: blacks had a worse prognosis than did whites. Now let us also look at the curves for whites and blacks in the study group of women who were screened and for whom there was therefore no difference in access to care or use of care, because screening was carried out on a predetermined schedule (see Fig. 18.19B). We see considerable overlap of the two curves: essentially no difference. This strongly suggests that the screening had eliminated the racial difference in survivorship and that the usually observed difference between the races in the prognosis of breast cancer is in fact a result of poorer access to care or poorer use of care among blacks, with a consequent delay in diagnosis and treatment and hence poorer survival.
FIG. 18.19 (A) Cumulative case-survival rates, first 10 years after diagnosis by race, Health Insurance Plan (HIP) control groups. (B) Cumulative case-survival rates, first 10 years after diagnosis by race, HIP study and control groups. (From Shapiro S, Venet W, Strax P, et al. Prospects for eliminating racial differences in breast cancer survival rates. Am J Public Health. 1982;72:1142–1145.)
Further Examples of Studies Evaluating Screening
Mammography for Women 40 to 49 Years of Age
A major controversy in the 1990s centered on the question of whether mammography should be universally recommended for women in their 40s. The data from the Shapiro et al. study, as well as from other studies, established the benefit of regular mammography examinations for women 50 years and older. However, the data are less clear for women in their 40s. Many issues arise in interpreting the findings of randomized trials carried out in a number of different populations. Although a reduction of mortality has been estimated at 17% for women in their 40s who have annual mammograms, the data available are generally from studies that were not specifically designed to assess possible benefits in this age group. Moreover, many of the trials recruited women in their late 40s, raising the possibility that even the benefits observed could have resulted from mammograms performed after the women reached 50 years of age.
A related issue is seen in Fig. 18.20. When mortality over time is compared in screened and unscreened women 50 years of age or older (see Fig. 18.20A), the mortality curves diverge at approximately 4 years after enrollment, with the mammography group showing a lower mortality that persists over time. However, when screened and unscreened women in their 40s are compared (see Fig. 18.20B), the mortality curves do not suggest any differences in mortality for at least 11 to 12 years after enrollment. Further follow-up will be needed to determine whether the divergence observed in the mortality curves actually persists and represents a true benefit to women who had mammograms in their 40s. However, interpreting these curves is complicated because women who have been followed for 10 or more years in these studies would have passed age 50. Consequently, even if mortality in screened women declines after 11 years, any such benefit could be due to mammograms performed after age 50 rather than to mammograms performed in their 40s. Further follow-up of women enrolled in many of these studies, and in newly initiated studies that are enrolling women in their early 40s, may help to clarify these issues.
FIG. 18.20 Cumulative breast cancer mortality rates in screened and unscreened women (A) ages 50 to 69 years and (B) ages 40 to 49 years. • = screened; ○ = unscreened. (From Kerlikowske K. Efficacy of screening mammography among women aged 40 to 49 years and 50 to 69 years: comparison of relative and absolute benefit. Natl Cancer Inst Monogr. 1997;22:79–86. [A] Modified from Tabar L, Fagerberg G, Duffy SW, et al. Update of the Swedish two-county program of mammographic screening for breast cancer. Radiol Clin North Am. 1992;30:187–210. [B] Modified from Nystrom L, Rutqvist LE, Wall S, et al. Breast cancer screening with mammography: overview of Swedish randomized trials. Lancet. 1993;341:973–978.)
In 1997 a consensus panel was created by the National Institutes of Health (led by Professor Gordis) to review the scientific evidence for benefits of mammography in women ages 40 to 49. The panel concluded that the data available did not warrant a universal recommendation for mammography for all women in their 40s. The panel recommended that each woman should decide for herself (in consultation with her physician) whether to undergo mammography. 8 Her decision may be based not only on an objective analysis of the scientific evidence and consideration of her individual medical history, but also on how she perceives and weighs each potential risk and benefit, the values she places on each, and how she deals with uncertainty. Given both the importance and the complexity of the issues involved in assessing the evidence, a woman should have access to the best possible relevant information regarding both benefits and risks, presented in an understandable and usable form.
Most women will depend heavily on the knowledge and sophistication of their physicians rather than make the decision themselves about when to commence screening mammography. One important problem in this regard is that many physicians do not have sufficient knowledge of cancer screening statistics to provide the support needed by women and their families to carefully examine the results, conclusions, and validity of studies of mammography for women in their 40s. Wegwarth and coauthors reported the results of a national survey of primary care physicians in the United States: most mistakenly interpreted improved survival and increased detection with screening as evidence that screening saves lives, and few correctly recognized that reduced mortality in a randomized trial constitutes evidence of benefit of screening. 9
The consensus panel added that for women in their 40s who choose to have mammography performed, the costs of the mammograms should be reimbursed by third-party payers or covered by HMOs so that financial impediments will not influence a woman’s decision as to whether or not to have a mammogram. The recommendations of the panel were rejected by the National Cancer Institute, which had itself originally requested creation of the panel, and by other agencies. There were clear indications that strong political forces were operating at that time in favor of mammography for women in their 40s.
The controversy over mammography became an even broader one with the 2001 publication of a review by Olsen and Gøtzsche of the evidence supporting mammography at any age. 10 Among the issues raised by the investigators were concerns about the possible inadequacy of some of the randomizations; the possible unreliability of assessment of cause of death; their finding that, in some trials, women were excluded from the studies after randomization had taken place and that women with preexisting cancer were excluded only from the screened groups; and their assessment that the two best trials failed to find any benefit.
An accompanying Lancet editorial concluded by saying: “At present, there is no reliable evidence from large randomized trials to support screening mammography programmes.” 11 A 2004 article countered the arguments raised by Olsen and Gøtzsche and concluded that the prior consensus on mammography was correct. 12
However, the controversy continues unabated. In 2002 the US Preventive Services Task Force reviewed the evidence and recommended screening mammography every 1 to 2 years for women 40 years of age and older. Using an earlier version of the methodology than that described in Chapter 14, they classified the supporting evidence as “fair” on a scale of “good,” “fair,” or “poor.” 13 In 2009 this task force again reviewed the question of mammography for women in their 40s and recommended that women aged 50 to 74 years should have screening mammography every 2 years, but they also concluded as follows: “For biennial screening mammography in women aged 40 to 49 years, there is moderate certainty that the net benefit is small.” The task force gave its recommendation a “C” grade and pointed out that this grade is a recommendation against routine screening of women aged 40 to 49 years. They added, “The Task Force encourages individualized, informed decision making about when [at what age] to start mammography screening.” 14 The “C” grade was confirmed in a more recent recommendation statement from the task force. 15
In 2007 the American College of Physicians published new guidelines about mammography for women in their 40s, based on an extensive systematic review that addressed both benefits and potential harms. 16 , 17 The group concluded that the evidence of net benefit is less clear for women in their 40s than for women in their 50s and that mammography carries significant risks, saying: “We don’t think the evidence supports a blanket recommendation.” In 2011 the National Health Service in the United Kingdom issued its guidelines recommending that women aged 47 to 73 years undergo mammography every 3 years. 18
In 2015 the American Cancer Society (ACS) updated its guidelines for breast cancer screening for women at “average risk.” 19 The ACS recommended starting screening at age 45 years, with annual screening through age 54, after which biennial screening should be considered. As is clear, this is not an area where science, epidemiology, and public policy are totally aligned!
Thus the controversy between proponents and critics continues and is not likely to be settled to everyone’s satisfaction by expert pronouncements. The problems in methodology and interpretation are complex and will probably not be resolved by further large trials. Such trials are difficult and expensive to initiate and conduct, and because of the time needed to complete them, these trials are also limited in that the findings often do not reflect the most recent improvements in mammographic technology. However, with so much of the data equivocal and a focus of controversy, progress will most likely come from new technologies for detecting breast cancer. Meanwhile, women are left with a decision-making challenge regarding their own choices concerning mammography, given the major uncertainties in the available evidence.
Screening for Cervical Cancer
Perhaps no screening test for cancer has historically been used more widely than the Pap smear. One would therefore assume that there has been overwhelming evidence of its effectiveness in reducing mortality from invasive cervical cancer. Unfortunately, there has never been a properly designed randomized, controlled trial of cervical cancer screening; there probably never will be, because cervical cancer screening has been accepted as effective for the early detection of cervical cancer both by health authorities and by women.
In the absence of randomized trials, several alternative approaches have been used. Perhaps the most frequent evaluation design has been to compare incidence and mortality rates in populations with different rates of screening. A second approach has been to examine changes over time in rates of diagnosis of carcinoma in situ. A third approach has been that of case-control studies in which women with invasive cervical cancer are compared with control women and the frequency of past Pap smears is examined in both groups. All of these studies are generally affected by the methodologic problems raised previously in this chapter. Given the recognition that HPV is in the causal chain to cervical cancer, current guidelines recommend HPV testing along with Pap testing. The ACS recommends starting screening at age 21 with annual Pap tests (either conventional cytology or liquid based), with HPV testing added at age 30, or the use of high-risk HPV screening alone. 20 However, even infection with high-risk HPV types is often transient, so HPV screening alone can result in a large number of false-positives. Accordingly, the US Preventive Services Task Force recommends screening women aged 21 to 65 years with cytology (Pap test) every 3 years or, for women aged 30 to 65 years who want less frequent screening, cytology combined with HPV testing every 5 years. 4
Despite these reservations, the evidence indicates that many carcinomas in situ probably do progress to invasive cancer; consequently, early detection of cervical cancer in the in situ stage would result in a significant saving of life, even if it is lower than many optimistic estimates. Much of the uncertainty we face regarding screening for cervical cancer stems from the fact that no well-designed randomized trial was carried out before it became part of routine medical practice. This observation points out that in the United States, one set of standards must be met before new pharmacologic agents are licensed for human use, but another, less stringent, set of standards is applied to new technologies or new health programs. No drug would be licensed in the United States without evaluation through randomized, controlled trials, but unfortunately no such evaluation is required before screening or other types of programs and procedures are introduced. Of course, if universal prevention of HPV infection through vaccination of adolescents before they become sexually active were achieved, cervical cancer could largely be eliminated.
Screening for Neuroblastoma
Some of the issues just discussed are encountered in screening for neuroblastoma, which is a tumor that occurs in young children. The rationale for screening for neuroblastoma was outlined by Tuchman and colleagues 21 : (1) Outcome has improved little in the past several decades. (2) Prognosis is known to be better in children who manifest the disease before the age of 1 year. (3) At any age, children in advanced stages of disease have worse prognoses than those in early stages. (4) More than 90% of children presenting with clinical symptoms of neuroblastoma excrete higher than normal amounts of catecholamines in their urine. (5) These metabolites can easily be measured in urine samples obtained from diapers.
These facts constitute a strong rationale for neuroblastoma screening. Fig. 18.21 shows data from Japan, where a major effort at neuroblastoma screening was mounted. The percentage of children younger than 1 year in whom neuroblastoma was detected was compared before and after initiation of screening in Sapporo, a city in Hokkaido, and these data were compared with corresponding data from the rest of Hokkaido, where no screening program was mounted. After initiation of screening, a greater percentage of cases of neuroblastoma in children younger than 1 year was detected in Sapporo than in the rest of Hokkaido.
FIG. 18.21 Percentage of neuroblastoma cases younger than 1 year in Sapporo and Hokkaido, Japan, before and after screening. (Modified from Goodman SN. Neuroblastoma screening data: an epidemiologic analysis. Am J Dis Child. 1991;145:1415–1422; Based on data from Nishi M, Miyake H, Takeda T, et al. Effects of the mass screening of neuroblastoma in Sapporo City. Cancer. 1987;60:433–436. Copyright © 1987 American Cancer Society. Reprinted by permission of Wiley-Liss, Inc., a subsidiary of John Wiley & Sons, Inc.)
However, a number of serious problems arise in assessing the benefits of neuroblastoma screening. It is now clear that neuroblastoma is a biologically heterogeneous disease, and there is clearly a better prognosis from the start in some cases than in others. Many tumors have a good prognosis because they regress spontaneously, even without treatment. Furthermore, screening is most likely to detect slow-growing, less malignant tumors and is less likely to detect aggressive, fast-growing tumors.
Thus it is difficult to show that screening for neuroblastomas is, in fact, beneficial. Two large studies of neuroblastoma screening appeared in 2002. Woods and colleagues 22 studied 476,654 children in Quebec, Canada. Screening was offered to all the children at ages 3 weeks and 6 months. Mortality from neuroblastoma up to 8 years of age among children screened in Quebec was no lower than among four unscreened cohorts (Table 18.1) and no lower than in the rest of Canada, excluding Quebec, or in two historical cohorts (Table 18.2). Schilling and colleagues 23 studied 2,581,188 children in Germany who were offered screening at 1 year of age. They found that neuroblastoma screening did not reduce the incidence of disseminated disease and did not appear to reduce mortality from the disease, although mortality follow-up was not yet complete. Thus the data currently available do not support screening for neuroblastoma. The findings in these studies demonstrate the importance of understanding the biology and natural history of the disease and the need to obtain relevant and rigorous evidence regarding the potential benefits or lack of benefits when screening for any disease is being considered. The ability to detect a disease by screening cannot be equated with a demonstration of benefit to those screened.
TABLE 18.1
Rate of Death From Neuroblastoma by 8 Years of Age in the Screened Quebec Cohort, as Compared With the Rates in Four Unscreened Cohorts
Control Cohort No. of Deaths Expected in Quebec Based on the Control Cohort Standardized Mortality Ratio for Quebec (95% CI)
Ontario 19.8 1.11 (0.64–1.92)
Minnesota 24.4 0.90 (0.48–1.70)
Florida 15.7 1.40 (0.81–2.41)
Greater Delaware Valley 22.8 0.96 (0.56–1.66)
There were 22 deaths due to neuroblastoma in the screened Quebec cohort.
CI, Confidence interval.
From Woods WG, Gao R, Shuster JJ, et al. Screening of infants and mortality due to neuroblastoma. N Engl J Med. 2002;346:1041–1046.
TABLE 18.2
Rate of Death From Neuroblastoma by 8 Years of Age in the Screened Quebec Cohort, as Compared With the Rates in Unscreened Canadian Cohorts
Control Cohort No. of Deaths Expected in Quebec Based on the Control Cohort Standardized Mortality Ratio for Quebec (95% CI)
Historical Cohorts
Quebec 22.5 0.98 (0.54–1.77)
Canada 21.2 1.04 (0.64–1.69)
Concurrent Cohort
Canada, excluding Quebec 15.8 1.39 (0.85–2.30)
There were 22 deaths from neuroblastoma in the screened cohort. All data were collected by Statistics Canada.
CI, Confidence interval.
From Woods WG, Gao R, Shuster JJ, et al. Screening of infants and mortality due to neuroblastoma. N Engl J Med. 2002;346:1041–1046.
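The standardized mortality ratios in Tables 18.1 and 18.2 are simply observed deaths divided by expected deaths; for the Ontario comparison, for example, 22/19.8 = 1.11. A minimal Python sketch (assuming SciPy is available) reproduces the point estimates in Table 18.1. Note that the exact-Poisson intervals it produces are narrower than the published ones, which presumably also allow for sampling error in the control-cohort rates:

```python
from scipy.stats import chi2

def smr_with_ci(observed, expected, alpha=0.05):
    """SMR with exact Poisson limits on the observed count
    (treats the expected count as fixed and error-free)."""
    smr = observed / expected
    lo = chi2.ppf(alpha / 2, 2 * observed) / (2 * expected)
    hi = chi2.ppf(1 - alpha / 2, 2 * (observed + 1)) / (2 * expected)
    return smr, lo, hi

# 22 neuroblastoma deaths were observed in the screened Quebec cohort;
# the expected counts come from Table 18.1.
for cohort, expected in [("Ontario", 19.8), ("Minnesota", 24.4),
                         ("Florida", 15.7), ("Greater Delaware Valley", 22.8)]:
    smr, lo, hi = smr_with_ci(22, expected)
    print(f"{cohort}: SMR = {smr:.2f} ({lo:.2f}-{hi:.2f})")
# Point estimates match the table (1.11, 0.90, 1.40, 0.96); the published
# intervals are wider because they reflect additional sources of
# uncertainty that this simple calculation ignores.
```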
Problems in Assessing the Sensitivity and Specificity of Screening Tests
New screening programs are often initiated soon after a screening test becomes available. When such a test is developed, claims are often made—by manufacturers of test kits, investigators, or others—that the test has high sensitivity and specificity. However, as we shall see, from a practical standpoint this may be difficult to demonstrate.
Fig. 18.22A shows a 2 × 2 table, as we have seen in earlier chapters, tabulating reality (disease present or absent) against test results (positive or negative).
FIG. 18.22 (A) Problem of establishing sensitivity and specificity because of limited follow-up of those with negative test results. (B) Problem of establishing sensitivity and specificity because of limited follow-up of those with negative test results for human immunodeficiency virus (HIV) using the enzyme-linked immunosorbent assay (ELISA) test.
To calculate sensitivity and specificity, data are needed in all four cells. However, often only those with positive test results (a + b) (seen in the upper row of the figure) are sent for further testing. Data for those who test negative (c + d) are frequently not available, because these patients do not receive further testing. For example, as shown in Fig. 18.22B, the Western blot test serves as a gold standard for detecting human immunodeficiency virus (HIV) infection, and those with positive enzyme-linked immunosorbent assay (ELISA) results are sent for Western blot testing.
However, because those with negative ELISA results are generally not tested further, the data needed in the lower cells for calculating sensitivity and specificity of the ELISA are often not available from routine testing. To obtain such data, it is essential that some negative ELISA specimens also be sent for further testing, together with the ELISA-positive specimens.
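A short sketch with hypothetical counts makes the problem concrete: both indices require the test-negative row of the table, which routine practice does not verify.

```python
# Sketch of the 2x2 logic in Fig. 18.22, using hypothetical counts.
# Rows: screening test result; columns: true disease status (gold standard).
a, b = 90, 40    # test positive: diseased (a), nondiseased (b)
c, d = 10, 860   # test negative: diseased (c), nondiseased (d)

sensitivity = a / (a + c)   # true positives / all diseased
specificity = d / (b + d)   # true negatives / all nondiseased
print(f"Sensitivity = {sensitivity:.2f}, Specificity = {specificity:.2f}")

# In routine practice only the test positives (a + b) are sent for the
# gold-standard test, so c and d are unknown and neither index can be
# computed. That is why some test-negative specimens must also be verified.
```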
Interpreting Study Results That Show No Benefit of Screening
In this chapter, we have stressed the interpretation of results that show a difference between screened and unscreened groups. However, if we are unable to demonstrate a benefit from early detection of disease, any of the following interpretations may be possible:
- The apparent lack of benefit may be inherent in the natural history of the disease (e.g., the disease has no detectable preclinical phase or an extremely short detectable preclinical phase).
- The therapeutic intervention currently available may not be any more effective when it is provided earlier than when it is provided at the time of usual diagnosis.
- The natural history and currently available therapies may have the potential for enhanced benefit, but inadequacies of the care provided to those who screen positive may account for the observed lack of benefit (i.e., there is efficacy but poor effectiveness).
Cost-Benefit Analysis of Screening
Some people respond to cost-benefit issues by concentrating only on cost, asking: if the test is inexpensive, why not perform it? Consider the test for blood in the stool used in screening for colon cancer. Although the test itself costs only a few dollars for the filter paper kit and the necessary laboratory processing, the total cost must also include the colonoscopies performed for those whose results are "positive," as well as the cost of the complications that occasionally result from colonoscopy.
The balance of cost-effectiveness includes not only financial costs but also nonfinancial costs to the patient, including anxiety, emotional distress, and inconvenience. Is the test itself invasive? Even if it is not, does a positive result warrant invasive therapy? What is the false-positive rate of such tests? In what proportion of persons will invasive tests be carried out, or anxiety be generated, despite the reality that the individuals do not have the disease in question? Thus the "cost" of a test is not only the cost of the test procedure but also the cost of the entire follow-up process set in motion by a positive result, even if it turns out to be a false-positive result; an illustrative calculation follows Box 18.3. These considerations are reflected in the four major concerns voiced by the ACS in revising its guidelines for cancer screening (Box 18.3). 24
Box 18.3
Criteria Used by the American Cancer Society for Recommendations on Cancer-Related Checkups
- There must be good evidence that each test or procedure recommended is medically effective in reducing morbidity or mortality.
- The medical benefits must outweigh the risks.
- The cost of each test or procedure must be reasonable compared with its expected benefits.
- The recommended actions must be practical and feasible.
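As a rough illustration of the point about total cost, the expected cost per person screened can be written as the test cost plus the probability-weighted costs of the follow-up cascade. All numbers in the following sketch are hypothetical placeholders, not actual costs:

```python
# Illustrative only: every number below is a hypothetical placeholder.
# The point is that the cost of a screening program is driven by the
# follow-up cascade, not by the test kit alone.
test_cost = 5.0              # fecal occult blood test kit + processing
p_positive = 0.05            # fraction of screenees testing positive
colonoscopy_cost = 1000.0    # follow-up examination for each positive
p_complication = 0.002       # complication risk per colonoscopy
complication_cost = 20000.0  # cost of managing a complication

cost_per_screenee = (test_cost
                     + p_positive * (colonoscopy_cost
                                     + p_complication * complication_cost))
print(f"Expected cost per person screened: ${cost_per_screenee:.2f}")
# -> $57.00 with these placeholders, roughly 11x the test kit alone.
```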
Another view of cost-benefit was presented by Elmore and Choe. 25 In discussing screening mammography for women aged 40 to 49, they wrote:
Here’s one way to explain the evidence (with the caveat that numbers are rounded and simplified): For every 10,000 women who receive regular screening mammography starting at age 40 years, 6 of them might benefit through a decreased risk for death due to breast cancer. Yet even this modest benefit requires multiple screening examinations and follow-up for all 10,000 women for more than a decade. Stated another way, 9,994 women receive no mortality benefit at all, because most women will not develop breast cancer and some women will have cancer detected when it is too late for a cure. 25
Conclusion
This chapter has reviewed some of the major sources of bias that must be taken into account in assessing study findings that compare screened and unscreened populations. The biases of selection for screening and prognostic selection can be addressed, in large part, by using a randomized, controlled trial as the study design. Reasonable estimates of the lead time can be made if appropriate information is available. Few of the methods that are currently used to detect disease early have been subjected to evaluation by randomized trials, and most are probably not destined to be studied in this way. This is a result of several factors, including the difficulty and expense associated with conducting such studies and the ethical issues inherent in randomizing a population to receive or not receive modalities of care that are widely used and considered effective, even in the absence of strong supporting evidence. Consequently, we are obliged to maximize our use of evidence from nonrandomized approaches, and to do so, the potential biases and problems addressed in this chapter must be considered.
In approaching programs for early disease detection, we need to be able to identify groups who are at high risk. This would include not only those at risk for developing the disease in question but also those who are “at risk” for benefiting from the intervention. These are the groups for whom cost-benefit calculations will favor benefit. We must keep in mind that, even if a screening test, such as a Pap smear, is not in itself overly invasive, the intervention mandated by a positive screening test result may be highly invasive.
The overriding issue is how to make decisions when our data are inconclusive, inconsistent, or incomplete. We face this dilemma regularly, both in clinical practice and in the development of public health policy. These decisions must first consider the existing body of relevant scientific evidence. However, in the final analysis, the decision whether or not to screen a population for a disease is a value judgment that should take into account the incidence and severity of the disease, the feasibility of detecting the disease early, the likelihood of intervening effectively in those with positive screening results, and the overall cost-benefit calculation for an early detection program.
To improve our ability to make appropriate decisions, additional knowledge is needed regarding the natural history of disease and, specifically, regarding the definition of characteristics of individuals who are at risk for a poor outcome. Before new screening programs are introduced, we should argue strongly for well-conducted randomized, controlled trials, so that we will not be operating in an atmosphere of uncertainty at the time in the future when such trials have become virtually impossible to conduct. Nevertheless, given the fact that most medical and public health practices—including early detection of disease—have not been subjected to randomized trials and that decisions regarding early detection must be made on the basis of incomplete and equivocal data, it is essential that we as health professionals appreciate and understand the methodologic issues involved so that we can make the wisest use of the available knowledge on behalf of our patients. Even the best of intentions and passionate evangelism cannot substitute for rigorous evidence that supports or does not support the benefit of screening.
References
1 Whittier JG. "Maud Muller." In: The Panorama, and Other Poems. Boston: Ticknor and Fields; 1856.
2 Harte B. "Mrs. Judge Jenkins: Sequel to Maud Muller." In: East and West Poems. Boston: James R. Osgood and Company; 1871.
3 Hutchison GB. Evaluation of preventive services. J Chronic Dis. 1960;11:497–508.
4 Moyer VA, on behalf of the U.S. Preventive Services Task Force. Screening for lung cancer: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2014;160:330–338.
5 Szklo M, Nieto FJ. Epidemiology: Beyond the Basics. 3rd ed. Burlington, MA: Jones & Bartlett; 2014:141–145.
6 Shapiro S, Venet W, Strax P, et al., eds. Periodic Screening for Breast Cancer: The Health Insurance Plan Project and Its Sequelae, 1963–1986. Baltimore: Johns Hopkins University Press; 1988.
7 Shapiro S, Venet W, Strax P, et al. Prospects for eliminating racial differences in breast cancer survival rates. Am J Public Health. 1982;72:1142–1145.
8 Breast Cancer Screening for Women Ages 40–49. NIH Consensus Statement Online. 1997 January 21–23;15:1–35.
9 Wegwarth O, Schwartz LM, Woloshin S, et al. Do physicians understand cancer screening statistics? A national survey of primary care physicians in the United States. Ann Intern Med. 2012;156:340–349.
10 Olsen O, Gøtzsche PC. Cochrane review on screening for breast cancer with mammography. Lancet. 2001;358:1340–1342.
11 Horton R. Screening mammography: an overview revisited. Lancet. 2001;358:1284–1285.
12 Freedman DA, Petitti DB, Robins JM. On the efficacy of screening for breast cancer. Int J Epidemiol. 2004;33:43–55.
13 U.S. Preventive Services Task Force. Breast cancer screening: a summary of the evidence for the U.S. Preventive Services Task Force. Ann Intern Med. 2002;137:347–360.
14 U.S. Preventive Services Task Force. Breast cancer: screening. July 2010. http://www.uspreventiveservicestaskforce.org/uspstf09/breastcancer/brcanrs.htm .
15 Siu AL, on behalf of the U.S. Preventive Services Task Force. Screening for breast cancer: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2016;164:279–296.
16 Brewer NT, Salz T, Lillie SE. Systematic review: the long-term effects of false-positive mammograms. Ann Intern Med. 2007;146:502–510.
17 Qaseem A, Snow V, Sherif K, et al. Screening mammography for women 40 to 49 years of age: a clinical practice guideline from the American College of Physicians. Ann Intern Med. 2007;146:511–515.
18 Warner E. Breast-cancer screening. N Engl J Med. 2011;365:1025–1032.
19 Oeffinger KC, Fontham ETH, Etzioni R, et al. Breast cancer screening for women at average risk: 2015 guideline update from the American Cancer Society. JAMA. 2015;314(15):1599–1614.
20 American Cancer Society. The American Cancer Society Guidelines for the Prevention and Early Detection of Cervical Cancer. https://www.cancer.org/cancer/cervical-cancer/prevention-and-early-detection/cervical-cancer-screening-guidelines.html .
21 Tuchman M, Lemieux B, Woods WG. Screening for neuroblastoma in infants: investigate or implement? Pediatrics. 1990;86:791–793.
22 Woods WG, Gao R, Shuster JJ, et al. Screening of infants and mortality due to neuroblastoma. N Engl J Med. 2002;346:1041–1046.
23 Schilling FH, Spix C, Berthold F, et al. Neuroblastoma screening at one year of age. N Engl J Med. 2002;346:1047–1053.
24 Smith RA, Mettlin CJ, Davis KJ, et al. American Cancer Society guidelines for the early detection of cancer. CA Cancer J Clin. 2000;50:34–49.
25 Elmore JG, Choe JH. Breast cancer screening for women in their 40s: moving from controversy about data to helping individual women. Ann Intern Med. 2007;146:529–531.
Review Questions for Chapter 18
Questions 1 through 4 are based on the following information:
A new screening program was instituted in a certain country. The program used a screening test that is effective in detecting cancer Z at an early stage. Assume that there is no effective treatment for this type of cancer and therefore that the program results in no change in the usual course of the disease. Assume also that the rates noted are calculated from all known cases of cancer Z and that there were no changes in the quality of death certification of this disease.
1 What will happen to the apparent incidence rate of cancer Z in the country during the first year of this program?
- Incidence rate will increase
- Incidence rate will decrease
- Incidence rate will remain constant
2 What will happen to the apparent prevalence rate of cancer Z in the country during the first year of this program?
- Prevalence rate will increase
- Prevalence rate will decrease
- Prevalence rate will remain constant
3 What will happen to the apparent case-fatality for cancer Z in the country during the first year of this program?
- Case-fatality will increase
- Case-fatality will decrease
- Case-fatality will remain constant
4 What will happen to the apparent mortality rate from cancer Z in the country as a result of the program?
- Mortality rate will increase
- Mortality rate will decrease
- Mortality rate will remain constant
5 The best index (indices) for concluding that an early detection program for breast cancer truly improves the natural history of disease, 15 years after its initiation, would be:
- A smaller proportionate mortality for breast cancer 15 years after initiation of the early detection program compared to the proportionate mortality prior to its initiation
- Improved long-term survival rates for breast cancer patients (adjusted for lead time)
- A decrease in incidence of breast cancer
- A decrease in the prevalence of breast cancer
- None of the above
6 In general, screening should be undertaken for diseases with the following feature(s):
- Diseases with a low prevalence in identifiable subgroups of the population
- Diseases for which case-fatality is low
- Diseases with a natural history that can be altered by medical intervention
- Diseases that are readily diagnosed and for which treatment efficacy has been shown to be equivocal in evidence from a number of clinical trials
- None of the above
Question 7 is based on the information given below:
The diagram below shows the natural history of disease X:
7 Assume that early detection of disease X through screening improves prognosis. For a screening program to be most effective, at which point in the natural history in the diagram must the critical point be?
- Between A and B
- Between B and C
- Between C and D
- Anywhere between A and C
- Anywhere between A and D
8 Which of the following is not a possible outcome measure that could be used as an indicator of the benefit of screening programs aimed at early detection of disease?
- Reduction of case-fatality in screened individuals
- Reduction of mortality in the population screened
- Reduction of incidence in the population screened
- Reduction of complications
- Improvement in the quality of life in screened individuals
CHAPTER 19
Epidemiology and Public Policy
Keywords
Susceptibility; Risk assessment; Exposure assessment; Number needed to undergo the intervention to prevent one case/death; Systematic reviews and meta-analysis; Publication bias; Uncertainty
All scientific work is incomplete—whether it be observational or experimental.
All scientific work is liable to be upset or modified by advancing knowledge.
LEARNING OBJECTIVES
- To review the role of epidemiology in disease prevention and to contrast two possible strategies for prevention: targeting groups at high risk for disease as compared with focusing on the general population.
- To define risk assessment and discuss the role of epidemiology in risk assessment, including measurement of possible exposures.
- To discuss how epidemiology can be used to shape public policy through the courts in the United States.
- To introduce the systematic review and meta-analysis as tools to summarize all the available epidemiologic evidence to influence public policy and to discuss how publication bias may affect the results of both systematic reviews and meta-analyses.
- To identify some possible sources of uncertainty in using the results of epidemiologic studies as a basis for making public policy.
A major role of epidemiology is to serve as a basis for developing policies that affect human health, including primary and secondary prevention and control of disease. As seen in previous chapters, the findings from epidemiologic studies may be relevant to issues in both clinical practice and community health and to population approaches to disease prevention and health promotion. As discussed in Chapter 1, the practical applications of epidemiology are often viewed as being so integral to the discipline that they are incorporated into the very definition of epidemiology. Historically, epidemiologic investigations were initiated to address emerging challenges relating to human disease (most often communicable diseases) and the health of the public. Indeed, one of the major sources of excitement in epidemiology is the direct applicability of its findings to alleviate problems of human health. This chapter presents an overview of some issues and problems relating to epidemiology in its application in formulating and evaluating public policy.
Epidemiology and Prevention
The importance of epidemiology in prevention has been emphasized in several of the preceding chapters. Identifying populations at increased risk, ascertaining the cause(s) of their increased risk(s), and analyzing the costs and benefits of eliminating or reducing exposure to the causal factor or factors all require an understanding of basic epidemiologic concepts and of the possible interpretation of the findings of epidemiologic studies. In addition, assessing the strength of all available evidence and identifying any limits on the inferences derived or on the generalizability of the findings are critically important. Thus epidemiology is often considered to be the “basic science” of prevention.
How much epidemiologic data do we need to justify a prevention effort? Clearly there is no simple answer to this question. Some of the issues involved differ depending on whether primary or secondary prevention is being considered. If we are discussing primary prevention, the answer depends on the severity of the condition, the costs involved (in terms of dollars, human suffering, and loss of quality of life), the strength of the evidence implicating a certain causal factor or factors in the etiology of the disease in question, and the difficulty of reducing or eliminating exposure to that factor.
With secondary prevention, the issues are somewhat different. We must still consider the severity of the disease in question. In addition, however, we must ask whether we can detect the disease earlier than usual by screening and how invasive and expensive such detection would be. Additional considerations include whether a benefit accrues to a person who has the disease if treatment is initiated at an earlier-than-usual stage and whether there are harmful effects associated with screening. Epidemiology offers valuable approaches to resolve many of these issues.
In recent years considerable attention has been addressed to expanding what has been called the traditional risk-factor model of epidemiology, in which we explore the relationship of an independent factor (exposure) to a dependent factor (disease outcome) (Fig. 19.1). It has been suggested that this approach should be expanded in two ways: First, it should include measurement not only of the adverse outcome—the disease itself—but also of the economic, social, and psychological impacts resulting from the disease outcome on the individual, his or her family, and the wider community. Second, it is clear that exposure to a putative causal agent is generally not distributed uniformly in a population. The factors that determine whether a person becomes exposed must therefore be explored if prevention is to be successful in reducing the exposure (Fig. 19.2). The full model is even more complex, as seen in Fig. 19.3: The relationship is influenced by determinants of susceptibility of the individual to the exposure; these include genetic factors together with environmental influences and social determinants. Although such an expanded approach is intuitively attractive and provides an excellent framework in which to analyze public health problems, we still have to demonstrate whether certain exposures or other independent variables are associated with increased risks of specific diseases.
FIG. 19.1 Diagram of classic risk-factor epidemiology.
FIG. 19.2 Diagram of an expanded risk-factor epidemiology model to include determinants of exposure as well as social, psychological, family, economic, and community effects of the disease.
FIG. 19.3 Diagram of expanded risk-factor epidemiology model to include interrelationships of factors that determine susceptibility or vulnerability.
In any case, deciding how much data and what types of data we need for prevention will be societally driven, reflecting society’s values and priorities. Epidemiology, together with other disciplines, can provide much of the necessary scientific data that are relevant to addressing questions of risks and prevention. However, the final decision on initiating or sustaining a prevention program will be largely determined by economic and political considerations as well as societal values. At the same time, it is hoped that such decisions will also be based on a firm foundation of scientific evidence provided by epidemiology and other relevant disciplines.
It is important to distinguish between macroenvironmental and microenvironmental exposures. Macroenvironmental exposures are exposures to things such as air pollution, which affect populations or entire communities. Microenvironmental exposures are environmental factors that affect a specific individual, such as diet (and the availability of healthy foods), smoking (by the individual or exposure to secondhand smoke), and alcohol consumption (personally and the availability of alcohol in the community). From the prevention standpoint, macroenvironmental factors are in many ways easier to control and modify, as this can be accomplished by legislation and regulation (e.g., setting environmental standards for pollutants). In contrast, modification of microenvironmental factors depends on modifying individual habits and lifestyle and addressing the availability of healthy food, green space, and safe neighborhoods, which can often be a much greater challenge.
In dealing with microenvironmental factors, providing scientific evidence and risk estimates is frequently not enough to induce individuals to modify their lifestyles (e.g., stopping smoking). Individuals often differ in the extent to which they are willing to take risks in many aspects of their lives, including health. In addition, the behaviors of individuals may differ depending on whether they are confronted with the risk of an adverse outcome or the probability of a positive event (Fig. 19.4). In addition, individuals often place the blame elsewhere for health problems brought on by their own lifestyles. Thus risk communication, mentioned previously, must extend beyond communicating risk data to policy makers. It must also deal with communicating with the public in an understandable fashion in the context of people’s perceptions of their risk, so that individuals will be motivated to accept responsibility and act on behalf of their own health to the greatest extent possible. Epidemiologists should therefore work with health educators to more appropriately educate the public about personal risk issues.
© 1998.
FIG. 19.4 Risk of what? How the end point may affect an individual’s perception of risk and willingness to act. (Steve Kelley. © 1998 San Diego Union Tribune. Copley News Service.)
Population Approaches Versus High-Risk Approaches to Prevention
An important question in prevention is whether our approach should target groups known to be at high risk or whether primary prevention efforts should extend to the general population as a whole. This issue was first raised by Rose in 1985 3 and later amplified by Whelton in 1994 4 in a discussion of the prevention of hypertension as well as deaths from coronary heart disease (CHD).
Epidemiologic studies have demonstrated that the risk of death from CHD steadily increases with increases in both systolic and diastolic blood pressure; there is no known threshold. Fig. 19.5A and B shows the distribution of systolic blood pressures in the general US population (2001–2008) of men and of women above 18 years of age, respectively.
FIG. 19.5 (A) Mean systolic blood pressure for men aged 18 years and over, by age and hypertension status. (B) Mean systolic blood pressure for women aged 18 years and over, by age and hypertension status. (C) Adjusted hazard ratio of first occurrence of all-cause death, nonfatal myocardial infarction, or nonfatal stroke as a function of age (in 10-year increments) and systolic blood pressure (SBP). Reference systolic blood pressure for the hazard ratio: 140 mm Hg. Blood pressures (BP) are the on-treatment average of all postbaseline recordings. The quadratic terms for systolic blood pressure were statistically significant in all age groups (all P < .001). The adjustment was based on sex, race, history of myocardial infarction, heart failure, peripheral vascular disease, diabetes, stroke/transient ischemic attack, renal insufficiency, and smoking. DBP, Diastolic blood pressure. (A and B, From Wright JD, Hughes JP, Ostchega Y, et al. Mean systolic and diastolic blood pressure in adults aged 18 and over in the United States, 2001–2008. Natl Health Stat Report. 2011;(35):1–22, 24. C, Modified from Denardo SJ, Gong Y, Nichols WW, et al. Blood pressure and outcomes in very old hypertensive coronary artery disease patients: an INVEST substudy. Am J Med. 2010;123(8):719–726.)
Looking at the US general population above 50 years of age, Fig. 19.5C shows the risk of a composite end point of first occurrence of all-cause death, nonfatal myocardial infarction, or nonfatal stroke in relation to systolic blood pressure; the risk increases steadily with higher levels of systolic blood pressure. Individuals below 60 years of age with systolic blood pressures of 160 mm Hg had more than 1.5 times the risk of the composite CHD end point compared with those whose systolic blood pressure was below 140 mm Hg.
Based on the report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure (JNC 7), values as low as those defining prehypertension (systolic and diastolic blood pressures ranging from 120 to 139 mm Hg and 80 to 89 mm Hg, respectively) may result in a 20% excess risk of stroke. 5
It therefore seems reasonable to combine a high-risk approach with a population approach: one set of preventive measures addressed to those at particularly high risk and another designed for the primary prevention of hypertension and addressed to the general population.
Such analyses can have significant implications for prevention programs. The types of preventive measures that might be used for high-risk individuals often differ from those that are applicable to the general population. Those who are at high risk and are aware that they are at high risk are more likely to tolerate more expensive, uncomfortable, and even more invasive procedures. However, in applying a preventive measure to a general population, the measure must have a low cost and be only minimally invasive; it needs to be associated with relatively little pain or discomfort if it is to be acceptable to the general population.
Fig. 19.6 shows the goal of a population-based strategy, which is a downward shifting of the entire curve of blood pressure distribution when a blood pressure–lowering intervention is applied to an entire community, such as reduction of the salt content of processed foods. Because the blood pressure of most members of the population is above the very lowest levels that are considered optimal, even a small downward shift (shift to the left) in the curve is likely to have major public health benefits, as Rose suggested some three decades ago. 3 In fact, such a shift would prevent more strokes in the population than would successful treatment limited to “high-risk” individuals. Furthermore, Rose 3 pointed out that the high-risk strategy is essential to protecting susceptible individuals. Ultimately, however, our hope is to understand the basic cause of the incidence of a disease—in this case, elevated blood pressure—and to develop and implement the necessary means for its (primary) prevention. Rose concluded as follows:
FIG. 19.6 Representation of the effects of a population-based intervention strategy on the distribution of blood pressure. (From National Institutes of Health. Working Group Report on Primary Prevention of Hypertension. NIH Publication No. 93–2669. Washington, DC: National Heart, Lung, and Blood Institute; 1993:8.)
Realistically, many diseases will long continue to call for both approaches, and fortunately competition between them is usually unnecessary. Nevertheless, the priority of concern should always be the discovery and control of the causes of incidence. 3
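Rose's argument can be illustrated with a small simulation. The blood pressure distribution, risk model, and treatment effects below are all hypothetical assumptions chosen only to show the logic: when risk rises smoothly with blood pressure and most people sit in the middle of the distribution, a 2-mm Hg shift of the whole curve can prevent more events than a 10-mm Hg reduction confined to the small group with markedly elevated pressure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical SBP distribution (mm Hg); mean and SD are illustrative
# assumptions, not population estimates.
sbp = rng.normal(loc=127, scale=17, size=1_000_000)

def expected_events(pressures, beta=0.03):
    # Assumed risk model: event risk rises smoothly with SBP, no threshold.
    return np.exp(beta * (pressures - 120)).sum()

baseline = expected_events(sbp)

# High-risk strategy: lower SBP by 10 mm Hg, but only in the small group
# with markedly elevated pressure (>= 160 mm Hg here).
high_risk = np.where(sbp >= 160, sbp - 10, sbp)

# Population strategy: shift everyone down by just 2 mm Hg.
population = sbp - 2

for label, arm in [("High-risk", high_risk), ("Population", population)]:
    prevented = 100 * (1 - expected_events(arm) / baseline)
    print(f"{label} strategy prevents ~{prevented:.1f}% of expected events")
# Under these assumptions the population shift prevents roughly two to
# three times as many events as the high-risk strategy.
```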
Epidemiology and Clinical Medicine: Hormone Replacement Therapy in Postmenopausal Women
Epidemiology can also be considered a basic science of clinical investigation. Data obtained from epidemiologic studies are essential in clinical decision making in many situations. An understanding of epidemiology is crucial to the process of designing meaningful studies of the natural history of disease, the quality of different diagnostic methods, and the effectiveness of clinical interventions. Epidemiology is highly relevant to addressing the many uncertainties and dilemmas in clinical policy, not all of which can easily be resolved.
A dramatic example is the use of hormone replacement therapy (HRT) by postmenopausal women. In 1966 Robert Wilson, a physician, published a book titled Feminine Forever, which advocated HRT for postmenopausal women. After the publication of this book, millions of postmenopausal women began taking estrogens in the hope of retaining their youth and attractiveness and avoiding the unpleasant, often encountered symptoms of menopause, such as hot flashes, night sweats, and vaginal dryness. The medical community largely accepted Wilson’s recommendation for estrogen replacement, and even gynecology textbooks supported it. However, in the 1970s, an increased risk of uterine cancer was reported in women taking estrogen replacement. As a result, estrogen was subsequently combined with progestin, which counteracts the effect of estrogen on the uterine endometrial lining. This combination leads to monthly uterine bleeding that resembles a normal menstrual period.
A number of nonrandomized observational studies subsequently appeared and reported other health benefits, such as fewer heart attacks and strokes, less osteoporosis, and fewer hip fractures associated with HRT. Considering the entire body of evidence that had accumulated, support for the conclusion that estrogen protected women against heart disease appeared strong and generally consistent. Women were advised that when they reached 50 years of age, they should discuss with their physicians whether they should begin HRT to protect themselves against heart disease and other conditions associated with aging.
Recognizing that there was little supporting evidence from randomized trials using hard disease end points, such as risk of myocardial infarction, two randomized trials were initiated: the Heart and Estrogen/Progestin Replacement Study (HERS) and the Women’s Health Initiative (WHI). The HERS study 6 included 2,763 women with known CHD. It found that, in contrast to accepted beliefs, combination HRT increased women’s risk of myocardial infarction during the initial years after starting therapy. The study failed to find evidence that HRT offered protection during a follow-up period of almost 7 years (Fig. 19.7).
FIG. 19.7 Kaplan-Meier estimates of the cumulative incidence of coronary heart disease events (death and nonfatal myocardial infarctions). (From Grady D, Herrington D, Bittner V, et al., for the HERS Research Group. Cardiovascular disease outcomes during 6.8 years of hormone therapy: heart and estrogen/progestin replacement study follow-up [HERS II]. JAMA. 2002;288:49–57.)
The WHI 7 was a randomized, placebo-controlled trial of 16,608 women, designed in 1991 and 1992 to evaluate HRT for the primary prevention of heart disease and other conditions common in the elderly. The planned duration of the trial was 8.5 years. One component (study arm) of the trial was a randomized, placebo-controlled investigation of estrogen plus progestin in postmenopausal women who had an intact uterus. This component of the study was stopped 3 years early because, by that time, results had shown increased risks of heart attack, stroke, breast cancer, and blood clots (Fig. 19.8). Although the study showed a reduced incidence of osteoporosis, bone fractures, and colorectal cancer, overall the dangers of HRT outweighed the benefits.
FIG. 19.8 Disease rates for women assigned to estrogen plus progestin or to placebo in the Women’s Health Initiative (WHI) study. (WHI online. http://www.nhlbi.nih.gov/health/women/upd2002.htm. Accessed June 14, 2013.)
Only about 2.5% of the enrolled women had adverse events. On the basis of the study results, it has been estimated that, annually, for every 10,000 women taking estrogen plus progestin, we would expect 7 more women to have a heart attack (37 women taking estrogen plus progestin would have a heart attack compared with 30 women taking placebo), 8 more women to have a stroke, 8 more women to have breast cancer, and 18 more women to have blood clots. At the same time, we would expect 6 fewer cases of colorectal cancer and 5 fewer hip fractures.
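These per-10,000 figures can be restated as excess (or averted) cases attributable to HRT; the number needed to harm shown at the end is a derived quantity, not one reported by the WHI:

```python
# Rearranging the rates quoted above (annual events per 10,000 women on
# estrogen plus progestin versus placebo) into excess or averted cases.
excess_per_10000 = {
    "heart attack": 37 - 30,   # 7 more
    "stroke": 8,
    "breast cancer": 8,
    "blood clots": 18,
    "colorectal cancer": -6,   # fewer
    "hip fracture": -5,        # fewer
}
for outcome, diff in excess_per_10000.items():
    direction = "more" if diff > 0 else "fewer"
    print(f"{outcome}: {abs(diff)} {direction} cases per 10,000 women per year")

# Implied number needed to harm for the heart attack excess alone:
print(f"NNH (heart attack): about {10_000 / 7:.0f} women treated for 1 year")
```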
Many women who had been taking HRT were shocked by the results of the WHI. The findings strongly indicated that, in women taking estrogen plus progestin for protection against heart disease, the risks of cardiovascular end points were actually increased. These women were left uncertain as to whether to continue with HRT or whether to seek alternatives. Many also believed that they had been misled by the medical community because, for many years, they had been reassured about the effectiveness and safety of HRT by their physicians, despite the absence of clear data from placebo-controlled randomized trials. Complicating the decision-making process for women at the time of menopause is that the WHI did not address the question faced by many women who often take combination HRT for brief periods to prevent and relieve postmenopausal symptoms such as hot flashes.
A major methodologic question is why there was such a discrepancy between the results of the placebo-controlled randomized WHI study regarding risk of heart disease and the results of a large number of nonrandomized, observational studies that previously supported a protective benefit from combination HRT. This issue is of great importance because, in many areas of medicine and public health, we depend on the findings of nonrandomized, observational studies because the costs of randomized trials may be prohibitive, and randomized studies may not be feasible for other reasons.
Several explanations have been offered. 8–10 In the observational studies, the women who were prescribed HRT were often healthier women who had a better cardiovascular risk profile. Women who use HRT are often better educated, leaner, more physically active, less likely to be smokers, more health-conscious, and of higher socioeconomic status than women who do not. Often, women who were prescribed HRT were judged to be compliant, and compliers often have other healthier patterns of behavior. Thus, confounding by lifestyle and other factors may have taken place in the observational studies. In addition, when adverse effects occurred early in the observational studies and led to the discontinuation of HRT, these events might not always have been identified in the periodic cross-sectional measurements used. An additional explanation related to cardiovascular risk is that HRT use in the observational studies typically began soon after menopause, when the beneficial effects of HRT (such as its favorable effects on lipids and endothelial function) are known to occur, whereas the WHI trial included much older women with extensive underlying atherosclerosis, among whom the prothrombotic and inflammatory effects of HRT predominate. 11
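A small simulation can make this "healthy-user" confounding concrete. In the sketch below, every number (the prevalence of health consciousness, HRT uptake, and CHD risks) is invented, and HRT is given no true effect at all; confounding alone produces an apparent protective association, which disappears once we stratify on the confounder:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Invented scenario: health-conscious women are both more likely to take
# HRT and at lower baseline risk of CHD; HRT itself has NO effect here.
healthy = rng.binomial(1, 0.5, n)             # lifestyle/SES indicator
hrt = rng.binomial(1, np.where(healthy == 1, 0.6, 0.2))
chd = rng.binomial(1, np.where(healthy == 1, 0.01, 0.03))  # independent of HRT

crude_rr = chd[hrt == 1].mean() / chd[hrt == 0].mean()
print(f"Crude RR (confounded): {crude_rr:.2f}")   # < 1: spurious 'protection'

for h in (0, 1):
    m = healthy == h
    rr = chd[m & (hrt == 1)].mean() / chd[m & (hrt == 0)].mean()
    print(f"Stratum healthy={h}: RR = {rr:.2f}")  # approximately 1.0 in each stratum
```

The crude relative risk suggests benefit even though none exists; randomization in the WHI broke exactly this kind of association between treatment and baseline health.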
Clearly, in the future it will be essential to address these issues when nonrandomized observational studies are used as the basis for developing and disseminating clinical guidelines and for setting new public health policies.
Risk Assessment
A major use of epidemiology in relation to public policy is for risk assessment. Risk assessment has been defined as the characterization of the potential adverse health effects of human exposures to environmental hazards. Risk assessment is thus viewed as part of an overall process that flows from research to risk assessment and then to risk management, as shown in Fig. 19.9. Samet and colleagues 12 reviewed the relationship of epidemiology to risk assessment and described risk management as involving the evaluation of alternative regulatory actions and the selection of the strategy to be applied. Risk management is followed by risk communication, which is the communication of the findings of risk assessment to those who need to know the findings in order to participate in policy making and to take appropriate risk-management actions, including communications to the public at large.
FIG. 19.9 Relationships among the four steps of risk assessment and between risk assessment and risk management. (Modified from Committee on the Institutional Means for Assessment of Risks to Public Health, Commission on Life Sciences, National Research Council. Risk Assessment in the Federal Government: Managing the Process. Washington, DC: National Academy Press; 1983:21.)
The National Research Council (1983) listed four steps in the process of risk assessment 13 :
- Hazard identification: Determination of whether a particular chemical is causally linked to particular health effects
- Dose-response assessment: Determination of the relationship between the magnitude of exposure and the probability of occurrence of the health effects in question
- Exposure assessment: Determination of the extent of human exposure before or after the application of regulatory controls
- Risk characterization: Description of the nature—and often the magnitude—of human risk, including attendant uncertainty
Clearly epidemiologic data are essential in each of these steps, although epidemiology is not the only relevant scientific discipline in the process of risk assessment. In particular, toxicology plays a major role as well, and an important challenge remains to reconcile epidemiologic and toxicologic data when findings from the respective disciplines disagree.
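As a concrete illustration of the dose-response assessment step listed above, the following sketch fits a logistic dose-response model to hypothetical exposure data. The dose variable, effect size, and sample size are all invented for illustration, and the statsmodels library is assumed to be available; this is a sketch of the idea, not a prescribed method.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: continuous exposure dose and disease status (1/0)
rng = np.random.default_rng(0)
dose = rng.uniform(0, 10, 500)                  # e.g., invented ppm-years
p_true = 1 / (1 + np.exp(-(-3 + 0.25 * dose)))  # assumed true dose-response
disease = rng.binomial(1, p_true)

# Logistic regression: log-odds of disease as a linear function of dose
X = sm.add_constant(dose)
fit = sm.Logit(disease, X).fit(disp=False)
print(f"Estimated odds ratio per unit dose: {np.exp(fit.params[1]):.2f}")
```

A monotonically increasing fitted odds ratio per unit dose is one line of evidence (never the only one) that the relationship between exposure magnitude and disease probability is systematic rather than artifactual.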
A number of important methodologic problems affect the use of epidemiology in risk assessment. Because epidemiologic studies may address the relationship between an environmental exposure and the risk of a disease, rigorous assessment of each variable is critical. Perhaps the most significant problem is the assessment of exposures.
Assessment of Exposure
Data regarding exposure generally come from several types of sources (Box 19.1). Each type of source has advantages and disadvantages; the latter include lack of completeness and biases in reporting. Frequently investigators use several sources of information regarding exposure, but a problem often results when different sources yield conflicting information.
Box 19.1
Sources of Exposure Data
- Interviews
  - Subject
  - Surrogate
- Employment or other records
- Physician records
- Hospital records
- Disease registry records (e.g., cancer registries)
- Death certificates
Another problem in exposure assessment is that macroenvironmental factors generally affect many individuals simultaneously, so that individual exposures may be difficult to measure. As a result, ecologic approaches are often chosen, in which aggregate rather than individual measurements are used (described in Chapter 7), and the aggregation is often carried out over large areas and populations. The characteristics of the community are therefore ascribed to the individuals residing in that community, but the validity of characterizing an individual exposure by this process is often open to question (recall the “ecological fallacy”). Furthermore, personal exposure histories can be quite difficult to obtain either retrospectively or prospectively and may be subject to considerable measurement error. In addition, the long latent or induction period between exposure and development of disease makes it necessary to ascertain exposures from the distant past, which is particularly difficult. Sometimes it is possible to evaluate exposure to macroenvironmental factors at the individual level, as was done in an ancillary study within the Multi-Ethnic Study of Atherosclerosis (MESA Air). In this study, household levels of air pollution were estimated by considering distance from a major roadway 14 and by the use of a special device for the home monitoring of air pollution levels. 15
A parallel set of problems is seen when we try to characterize the occupational exposures of an individual worker and to link an exposure at work to an adverse health outcome. First, because a worker is likely to be exposed to many different agents in an industrial setting, it is often difficult to isolate the independent risk that can be ascribed to a single specific exposure. Second, because there is often a long latent period between the exposure and the subsequent development of disease, studies of the exposure-disease relationship may be difficult; for example, unless a concurrent prospective study can be done (see Chapter 8), recall may be poor and records of exposure may have been lost. Third, increased disease risks may occur among those living near an industrial plant, so that it may be difficult to ascertain how much of a worker’s risk results from living near the plant and how much is due to an occupational exposure in the work setting itself.
Perhaps the most fundamental problem in measuring exposures in epidemiologic studies is that sources and measures are often indirect. For example, considerable interest has arisen in recent years over the possible health effects of electromagnetic fields (EMFs). This interest followed a 1979 article by Wertheimer and Leeper, 16 which reported increased rates of leukemia in children living near high-voltage transmission lines. Subsequently, many methodologic questions were raised, and the question of whether such fields are associated with adverse health effects remains unresolved. For example, conclusions were discrepant between an update of two meta-analyses and a more recent meta-analysis done by the same first author! 17, 18
In studying EMFs, several approaches are used for measuring exposure, including the wiring configuration in the home, spot or 24-hour measurements of the fields, or self-reports of electrical appliance use. However, the results of different studies regarding risk of disease differ depending on the type of exposure measurement used. In fact, actual magnetic field measurements, even 24-hour measurements, generate weaker associations with childhood leukemia than do those for wire configuration codes. 19 This observation raises a question about any possible causal link between exposure to magnetic fields and the occurrence of disease.
Even the best indirect measure of exposure often leaves critical questions unanswered. First, exposure is generally not dichotomous; data are therefore needed regarding the dose of exposure to explore a possible dose-response relationship. Second, it is important to know whether the exposure was continuous or periodic. For example, in the pathogenesis of cancer, a periodic exposure with alternating exposure and nonexposure periods may allow for DNA repair during the nonexposure periods. In the case of a continuous exposure, no such repair can take place. Finally, information about latency is critical: How long is the latent period and what is its range? This knowledge is essential to focus efforts on ascertaining exposure during a particular time period in which a causal exposure might well have occurred.
Because of these problems in measuring exposure using indirect approaches, much interest has focused on the use of biologic markers of exposures. (Use of such biomarkers has often been termed molecular epidemiology.) 20 The advantage of using biomarkers is that they overcome some problems of limited recall or lack of awareness of an exposure. In addition, biomarkers can overcome errors resulting from variation in individual absorption or metabolism by focusing on a later step in the causal chain.
Biomarkers can be markers of exposure, markers of biologic changes resulting from exposures, or markers of risk or susceptibility. Fig. 19.10 schematically represents the different types of exposures we may choose to measure.
FIG. 19.10 What exposures are we trying to measure?
We might also wish to measure ambient levels of possibly toxic substances in a general environment, the levels to which a specific individual is exposed, the amount of substance absorbed, or the amount of substance or metabolite of the absorbed substance that reaches the target tissue. Biomarkers bring us closer to being able to measure an exposure at a specific stage in the process by which an exposure is linked to human disease. For example, we can measure not only environmental levels of a substance but also DNA adducts that reflect the effect of the substance on biologic processes in the body after absorption.
Nevertheless, despite these advantages, biomarkers generally give us a dichotomous answer—a person was either exposed or not exposed. Biomarkers generally do not shed light on several important questions, such as the following:
- What was the total exposure dose?
- What was the duration of exposure?
- How long ago did the exposure occur?
- Was the exposure continuous or periodic?
An example of some of these shortcomings is salivary cotinine, which is a biomarker of nicotine absorption in smokers. As it is a marker only for recent smoking, it does not provide information on duration of exposure or whether the habit was continuous or periodic.
The answers to these questions are crucial in properly interpreting the potential biologic importance of a given exposure. For example, in assessing the biologic plausibility of a causal inference being made from observations of exposure and outcome, we need relevant data that will permit us to determine whether the interval observed between the exposure and the development of the disease is (biologically) consistent with what we know from other studies about the incubation period of the disease.
In addition to these concerns, a potential limitation of the use of exposure biomarkers is that, in a traditional case-control study, collection of a biologic sample and measurement of a biomarker are done only after the onset of the disease. Thus it is impossible to find out whether the exposure was present prior to the onset of the disease of interest. This shortcoming, however, is not present in case-control studies within a cohort in which biologic samples, such as serum or urine, are frozen and stored at baseline—that is, before incident cases develop during follow-up of the cohort.
It should be pointed out that use of biomarkers is not new in epidemiology. In Ecclesiastes it is written: “There is nothing new under the sun.” 21 Even before the revolution in molecular biology, laboratory techniques were essential in many epidemiologic studies; these included bacterial isolates and cultures, phage typing of organisms, viral isolation, serologic studies, and assays of cholesterol lipoprotein fractions. With the tremendous advances made in molecular biology, a new variety of biomarkers has become available that is relevant to areas such as carcinogenesis. These biomarkers not only identify exposed individuals but also cast new light on the pathogenetic process of the disease in question.
Meta-Analysis
Several scientific questions arise when epidemiologic data are used for formulating public policy:
- Can epidemiologic methods detect small increases in risk that are clinically meaningful?
- How can we reconcile inconsistencies between animal and human data?
- How can we use incomplete or equivocal epidemiologic data?
- How can results be interpreted when the findings of epidemiologic studies disagree?
Many of the risks with which we are dealing may be quite small, but they may potentially be of great public health importance because of the large numbers of people exposed, with a resulting potential for adverse health effects in many people (recall the hypothesis proposed by Rose 3 ). However, an observed small increase in relative risk above 1.0 may easily result from bias or from other methodologic limitations, and such results must therefore be interpreted with great caution unless the results have been replicated and other supporting evidence has been obtained.
Given that the results of different epidemiologic studies may not be consistent and that at times they may be in dramatic conflict, attempts have been made to systematize the process of reviewing the epidemiologic literature on a given topic. One process, the systematic review, uses standardized methodology to select and assess peer-reviewed articles to synthesize the literature regarding a specific health topic. 22 Systematic reviews may be accompanied by a process called meta-analysis, which has been defined as “the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings.” 23 Meta-analysis allows for aggregating the results of a set of studies included in a systematic review, with appropriate weighting of each study for the number of subjects sampled and for other characteristics. It can help to give an overall perspective on an issue when the results of studies disagree.
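A minimal sketch of the weighting idea, using the common fixed-effect (inverse-variance) approach on hypothetical odds ratios and 95% confidence intervals; the study labels and all numbers below are invented, and real meta-analyses must also consider random-effects models and study quality:

```python
import numpy as np

# Hypothetical systematic review: (odds ratio, lower 95% CI, upper 95% CI)
studies = {
    "A": (1.4, 1.1, 1.8),
    "B": (1.2, 0.8, 1.7),
    "C": (1.6, 1.0, 2.5),
    "D": (0.9, 0.6, 1.4),
    "E": (1.3, 1.1, 1.6),
}

# Pool on the log-odds-ratio scale; each study is weighted by 1/variance,
# so larger (more precise) studies dominate the summary estimate.
log_or = np.array([np.log(o) for o, lo, hi in studies.values()])
se = np.array([(np.log(hi) - np.log(lo)) / (2 * 1.96)
               for o, lo, hi in studies.values()])
w = 1 / se**2
pooled = np.sum(w * log_or) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"Pooled OR = {np.exp(pooled):.2f} "
      f"(95% CI {np.exp(pooled - 1.96 * pooled_se):.2f} to "
      f"{np.exp(pooled + 1.96 * pooled_se):.2f})")

# Cochran's Q as a crude check of homogeneity across studies
Q = np.sum(w * (log_or - pooled) ** 2)
print(f"Cochran's Q = {Q:.2f} on {len(studies) - 1} df")
```

A large Q relative to its degrees of freedom signals heterogeneity, which is exactly the situation, discussed next, in which a single pooled number may mask important differences among studies.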
However, a number of problems and questions are associated with meta-analysis. First, should the analysis include all available studies or only published studies? Second, when the relative risks or odds ratios from various studies differ (i.e., the results are not homogeneous), meta-analysis results may mask important differences among individual studies. It is therefore essential that a systematic review resulting in a meta-analysis include only studies that meet well-established design and quality criteria. Third, the results of meta-analyses themselves may not always be reproducible by other analysts. Finally, a systematic review with or without meta-analysis is subject to the problem of publication bias (discussed later in this chapter). Fig. 19.11 shows a hypothetical “forest plot” and the definition of its components. The forest plot is the type of presentation that is frequently used to show the results of individual studies as well as the results of the meta-analysis. Fig. 19.12 shows a forest plot on the relationship of socioeconomic status and depression. Note that of the 51 studies included in this meta-analysis, 5 suggest a negative association. Thus the results of this meta-analysis are not entirely homogeneous.
FIG. 19.11 Hypothetical forest plot, with components labeled, showing the type of diagrammatic presentation frequently used to show results of individual studies (A–E) as well as the results of a meta-analysis.
FIG. 19.12 Odds ratios for major depression in the lowest socioeconomic status group in 51 prevalence studies published after 1979. Horizontal lines, 95% confidence interval. Squares show original estimates; diamonds show meta-analyzed results. (From Lorant V, Deliège D, Eaton W, et al. Socioeconomic inequalities in depression: a meta-analysis. Am J Epidemiol. 2003;157(2):98–112.)
Meta-analysis was originally applied mainly to randomized trials, but the technique is increasingly being used to aggregate nonrandomized, observational studies, including case-control and cohort studies. In these instances, the studies do not necessarily share a common research design. Hence the question arises as to how similar such studies need to be in order to be legitimately included in a meta-analysis. In addition, appropriate control of biases (such as selection bias and misclassification bias) is essential but often proves to be a formidable challenge in meta-analyses. In view of the considerations just discussed, meta-analysis remains a subject of considerable controversy.
A final problem with meta-analysis is that in the face of all the difficulties discussed, putting a quantitative imprint on the estimation of a single relative risk or odds ratio from all the studies may lead to a false sense of certainty regarding the magnitude of the risk. People often tend to have an inordinate belief in the validity of findings when a number is attached to them; as a result, many of the difficulties that arise in meta-analysis may at times be ignored.
Publication Bias
Chapter 16 discussed the use of twin studies as a means of distinguishing the contributions of environmental and genetic factors to the root cause of disease. In that discussion it was mentioned that the degree of concordance and discordance in twins is an important observation for drawing conclusions about the role of genetic factors, but that estimates of concordance reported in the literature may be inflated by publication bias, which is the tendency for articles to be published that report concordance for rare diseases in twin pairs.
Publication bias is not limited to genetic studies of twins; it can occur in any area of epidemiology. It is a particularly important phenomenon in the publication of articles regarding environmental risks and on the results of clinical trials. Publication bias may occur because investigators do not submit the results of their studies when the findings do not support “positive” associations and increased risks (that is, “null findings”). In addition, journals may differentially select for publication studies that they believe to be of greatest reader interest, and they may not find studies that report no association to fall in this category. As a result, a literature review that is limited to published articles may preferentially identify studies that report increased risk. Clearly such a review is highly selective in nature and omits many studies that have obtained what have been called “negative” results (i.e., results showing no effect), which may not have reached publication.
Publication bias therefore has a clear effect on systematic reviews and meta-analyses. One approach to this problem is to try to identify unpublished studies and to include them in the analysis, pulling studies from the "gray" literature, often conference presentations reporting null results that never proceed to journal publication. However, the difficulty here is that, in general, unpublished studies have not passed journal peer review; therefore their suitability for inclusion in a meta-analysis may be questionable. Regardless of whether we are discussing a traditional literature review or a structured meta-analysis, the problem of potential publication bias must be considered.
It has been proposed that, in order to prevent publication bias in systematic reviews (and thus in meta-analyses), study registers, such as those maintained by the Cochrane Collaboration, should be implemented. There are also strategies to evaluate publication bias in meta-analyses, including Begg's funnel plot and tests of its symmetry. These approaches are based on plotting each study's value of the measure of association (e.g., relative risk or odds ratio) against its precision (measured by its standard error, which is usually a function of its sample size). Using the relative risk as an example, as the standard errors increase, denoting decreasing precision, the relative risks become more variable, but they are expected to follow a symmetric distribution around the between-study mean relative risk. If the distribution is asymmetric, publication bias is likely.
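The following sketch illustrates the funnel-plot idea with simulated studies in which small "null" studies are suppressed to mimic publication bias. All numbers are invented, the true relative risk is set to 1.0, and numpy and matplotlib are assumed to be available:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulate 200 studies of a truly null exposure (true log RR = 0)
se = rng.uniform(0.05, 0.5, 200)          # small SE = large, precise study
log_rr = rng.normal(0.0, se)              # estimates scatter around the truth

# Mimic publication bias: large studies always appear, but small studies
# appear only if they report a significantly *increased* risk
published = (se < 0.15) | (log_rr / se > 1.96)

plt.scatter(log_rr[published], se[published], s=10)
plt.axvline(0.0, linestyle="--")
plt.gca().invert_yaxis()                  # precise studies plotted at the top
plt.xlabel("log relative risk")
plt.ylabel("standard error")
plt.title("Funnel plot: a missing lower-left corner suggests publication bias")
plt.show()
```

In the resulting plot, the imprecise studies survive only on the right-hand side of the null line, producing exactly the asymmetry that funnel-plot methods are designed to detect.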
Epidemiology in the Courts
As mentioned earlier, litigation has become a major path for policy making in the United States. Epidemiology is assuming ever-increasing importance in the legal arena. Particularly in the area of toxic torts, it provides one of the major types of scientific evidence that is relevant to the questions involved. Issues such as effects of dioxin, silicone breast implants, tobacco smoking, and EMFs are but a few examples.
However, the use of data from epidemiologic studies is not without its problems. Epidemiology answers questions about groups, whereas the court often requires information about individuals (it must causally link an individual's exposure to that individual's disease). Furthermore, considerable attention has been directed to the court's interpretation of evidence of causality. Whereas the legal criterion is often "more likely than not," that is, that the substance or exposure in question is "more likely than not" to have caused a person's disease, epidemiology relies to a great extent on the US Surgeon General's guidelines for causal inference. 24 It has been suggested that an attributable risk in the exposed greater than 50% might constitute evidence of "more likely than not." 25
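The arithmetic behind this 50% threshold is worth making explicit: the attributable fraction among the exposed is (RR − 1)/RR, which exceeds 50% exactly when the relative risk exceeds 2. A minimal sketch (the relative risks below are illustrative, not drawn from any case):

```python
def attributable_fraction_exposed(rr):
    """Attributable fraction among the exposed: (RR - 1) / RR."""
    return (rr - 1) / rr

for rr in (1.5, 2.0, 3.0):
    af = attributable_fraction_exposed(rr)
    verdict = "more likely than not" if af > 0.5 else "below the 50% threshold"
    print(f"RR = {rr}: attributable fraction = {af:.0%} -> {verdict}")

# RR = 2.0 is the break-even point: only above it is more than half of the
# disease among the exposed attributable to the exposure.
```

This is why, in toxic tort litigation, a relative risk of 2 is often treated as the dividing line for the "more likely than not" standard.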
Until recently, evidence from epidemiology was only reluctantly accepted in the courts, but this has changed to a point where epidemiologic data are often cited as the only source of relevant evidence in toxic tort cases. For many years the guiding principle for using scientific evidence in the courts in the United States was the Frye test, which states that for a study to be admissible, “it must be sufficiently established to have gained general acceptance in the field in which it belongs.” 26 Although terms such as “general acceptance” and “field in which it belongs” were left undefined, it did lead to an assessment of whether the scientific opinion expressed by an expert witness was generally accepted by other professionals in the discipline.
In 1993, in Daubert v. Merrell Dow Pharmaceuticals, 27 a case in which the plaintiff alleged that a limb deformity at birth was due to ingestion of the drug Bendectin during pregnancy, the US Supreme Court articulated a major change in the rules of evidence. The court ruled that “general acceptance” is not a necessary condition for the admissibility of scientific evidence in court. Rather, the trial judge is now considered a “gatekeeper” and is assigned the task of ensuring that an expert’s testimony rests on a reliable foundation and is relevant to the “task at hand.” Thus the judge “must make a preliminary assessment of whether the testimony’s underlying reasoning or methodology is scientifically valid and can be properly applied to the facts at issue.” Among the considerations cited by the court are whether the theory or technique in question can be and has been tested and whether the methodology has been subjected to peer review and publication.
Given their new responsibilities, judges presiding at trials in which epidemiology is a major source of evidence must have a basic knowledge of epidemiologic concepts, including, for example, study design, biases and confounding, and causal inference, if they are to be able to rule in a sound fashion on whether the approach used by the experts follows accepted "scientific method." Recognizing this need, the Federal Judicial Center has published the Reference Manual on Scientific Evidence for judges, which includes a section on epidemiology. 28 Although it is premature to know the ultimate effect of the Daubert ruling, given the tremendous increase in the use of epidemiology in the courts, it will clearly require enhanced knowledge of epidemiology by many parties involved in legal proceedings that use evidence derived from epidemiologic studies.
Sources and Impact of Uncertainty
In 1983, the National Research Council in the United States wrote:
The dominant analytic difficulty [in conducting risk assessments for policy decision making] is pervasive uncertainty … data may be incomplete, and there is often great uncertainty in estimates of the types, probability, and magnitude of health effects associated with a chemical agent, of the economic effects of a proposed regulatory action, and of the extent of current and possible future human exposures. 29
This insight remains as relevant today as when it was originally written. Uncertainty is a reality that we must accept and that must be addressed. Uncertainty is an integral part of science. What we believe to be “truth” today often turns out to be transient. Tomorrow a study may appear that contradicts or invalidates the best scientific information available to us today.
Uncertainty is relevant not only to risk assessments but also to issues of treatment, to issues of prevention such as screening, and to health economics issues. Clearly it is a relevant concern in the legal setting discussed earlier (Fig. 19.13).
FIG. 19.13 One jury’s approach to uncertainty. (Arnie Levin/The New Yorker Collection/The Cartoon Bank.)
Some of the possible sources of uncertainty are listed in Box 19.2. As seen there, the sources of uncertainty may be in the design of the study or in the conduct and implementation of the study, or they may result from the presentation and interpretation of the study findings. Many of these sources are addressed in earlier chapters.
Box 19.2
Examples of Possible Sources of Uncertainty in Epidemiology
- Uncertainty resulting from the design of the study
  - The study may not have been designed to provide a relevant answer to the question of interest
  - Biases that were not recognized or not adequately addressed
    (1) Selection bias
    (2) Information bias
  - Measurement errors, which may lead to misclassification
  - Inadequate sample size
  - Inappropriate choice of analytic methods
  - Failure to take into account potential confounders
  - Use of surrogate measures that may not correctly measure the outcomes that are the major dependent variables of interest
  - Problems of external validity (generalizability to the population of interest): the conclusions regarding potential interventions may not be generalizable to the target population
- Uncertainty resulting from deficiencies in the conduct and implementation of the study
  - Observations may be biased if observers were not blinded
  - Poor quality of laboratory or survey methods
  - Large proportion of nonparticipants and/or nonrespondents
  - Failure to identify reasons for nonresponse and characteristics of nonrespondents
- Uncertainty resulting from the presentation and interpretation of the study findings
  - How were the results expressed?
  - If the study assessed risk and possible etiology, were the factors involved described as risk factors or causal factors?
  - If the study assessed the effectiveness of a proposed preventive measure, was the benefit of the measure expressed as relative risk reduction or absolute risk reduction? Why was it chosen to be expressed as it was, and how was the finding interpreted?
One issue listed in Box 19.2 is whether, in a study of the effectiveness of a preventive measure, the results are described as a relative risk reduction or an absolute risk reduction. Often the percent reduction in mortality is selected because it gives a more optimistic view of the effectiveness of a preventive measure. If, however, absolute risk reduction is used, such as the number of individuals per 1,000 whose lives would be saved, the result appears less impressive (recall the disease risks associated with HRT presented earlier in this chapter). When the baseline rate of adverse events, such as mortality from the disease in the absence of screening, is low, a percent reduction will always seem more impressive than the corresponding absolute risk reduction, because the number of events that can actually be prevented is small even when the percent reduction is large. If, for example, the mortality in those not screened is 2 per 100,000 and in those screened is 1 per 100,000, the reduction resulting from screening is 50%, but the absolute difference is merely 1 per 100,000.
A more relevant measure of the effectiveness (and efficiency) of a preventive or curative measure is the number needed to undergo the intervention to prevent one case or one death from the disease. This measure is based on the absolute difference. For example, if the absolute difference between a new preventive strategy and the current (standard of care) strategy is 20%, the number needed to have the intervention in order to prevent the occurrence of one incident case is ([100 × 1] ÷ 20) = 5. However, if the difference is only 5%, this number becomes ([100 × 1] ÷ 5) = 20. Note that the relative effectiveness is the same if the mortality rates are, for example, 60% and 40% or 6% and 4% in two studies evaluating different novel interventions to prevent the same disease: ([60% − 40%] ÷ 60%) = 33.3% in the first study, and ([6% − 4%] ÷ 6%) = 33.3% in the second study. It is, however, clear that the first study deals with a more important public health problem for which prevention would be more efficient, as one case can be prevented by subjecting fewer individuals to the new approach.
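The arithmetic in the preceding two paragraphs can be wrapped in a few lines of code. The sketch below simply reuses the screening and mortality figures from the text (Python is used here purely for illustration):

```python
def summarize(risk_control, risk_treated):
    """Relative risk reduction, absolute risk reduction, and number
    needed to treat (or screen) for one pair of risks."""
    arr = risk_control - risk_treated   # absolute risk reduction
    rrr = arr / risk_control            # relative risk reduction
    nnt = 1 / arr                       # number needed to treat/screen
    return rrr, arr, nnt

# Screening example from the text: 2 vs. 1 deaths per 100,000
rrr, arr, nnt = summarize(2 / 100_000, 1 / 100_000)
print(f"RRR = {rrr:.0%}, ARR = {arr * 100_000:.0f} per 100,000, NNT = {nnt:,.0f}")
# RRR = 50%, ARR = 1 per 100,000, NNT = 100,000

# Two trials with identical relative effectiveness but very different impact
for rc, rt in [(0.60, 0.40), (0.06, 0.04)]:
    rrr, arr, nnt = summarize(rc, rt)
    print(f"{rc:.0%} vs {rt:.0%}: RRR = {rrr:.0%}, NNT = {nnt:.0f}")
# 60% vs 40%: RRR = 33%, NNT = 5;  6% vs 4%: RRR = 33%, NNT = 50
```

The 50% relative reduction in the screening example corresponds to a number needed to screen of 100,000, which illustrates why the choice between relative and absolute expressions can so strongly color a policy decision.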
Another issue that contributes to uncertainty in policy making that is not generally related to specific epidemiologic studies is how we deal with anecdotal evidence, such as that provided by a person who states that she was screened for breast cancer 10 years earlier, received early treatment, and is alive and apparently well 10 years after the screening. There is often a tendency to accept such evidence as supporting the effectiveness of the screening in reducing mortality from the disease. However, anecdotal evidence has two major problems. First, it does not take into account slow-growing tumors that might have been detected by screening but might not have affected survival even if the patient had not been screened. Second, it does not take into account very fast-growing tumors that screening would have missed, so that the person would not have received early treatment. That is, for those giving anecdotal evidence of survival after screening, there is no comparison group of individuals who were screened but did not survive. As an unknown sage has said, “The plural of anecdote is not ‘data.’ ” Nevertheless, despite these major limitations, anecdotal evidence given by patients who have survived serious illnesses may have a strong emotional impact, which may significantly influence policy makers.
Ultimately the impact of scientific uncertainty on the formulation of public policy will depend on how the major stakeholders deal with uncertainty. Among the different groups of stakeholders are scientists (including epidemiologists), policy makers, politicians, and the public (or the target populations). Each of these groups may have a different level of sophistication and a different level and type of self-interest; each may view data differently and be influenced to varying degrees by colleagues, friends, and various constituencies in society. Moreover, individuals have different personalities, with different levels of risk tolerance and different ways of dealing with uncertainty. An important mediator is the set of values that each individual holds on issues such as the value of a human life and the principles that should guide the allocation of limited resources in a society. The result is a complex interplay between the uncertainty arising from the characteristics of a study and the network of relationships among the elements just described. A schematic of some of the interrelationships influencing the effect of uncertainty on public policy is shown in Fig. 19.14. These factors are clearly major concerns in formulating appropriate public health and clinical policy, and they must be taken into account if a plan of action is to be successfully developed and implemented to address health issues in the population.
FIG. 19.14 Schematic presentation of some of the factors involved in the impact of uncertainty on the decision-making process for health policy.
Policy Issues Regarding Risk: What Should the Objectives Be?
Public policy is generally recognized to be largely developed through the processes of legislation and regulation. As discussed earlier, in the United States, litigation has also become an important instrument for developing and implementing public policy. Ideally, each of these processes should reflect societal values and aspirations.
Certain major societal issues must be considered in making decisions about risk. Among the questions that must be confronted are the following:
- What percentage of the population should be protected by the policy?
- What level of risk is society willing to tolerate?
- What level of control of risk is society willing to pay for?
- Who should make decisions about risk?
At first glance, it might seem appealing to protect the entire population from any amount of risk, but realistically this is difficult if not impossible to accomplish. Regardless of what we learn from risk data about populations, there are clearly rare individuals who are extraordinarily sensitive to minute concentrations of certain chemicals. If the permissible amount of a chemical is to be set at a level that protects every worker, it is possible that entire manufacturing processes might have to be halted. Similarly, if we demand zero risk for workers or for others who may be exposed, the economic base of many communities might be destroyed. Policy making therefore requires a balance between what can be done and what should be done. The degree of priority attached to elimination of all risk and the decision as to what percentage of risk should be eliminated clearly are not scientific decisions but rather depend on societal values. It is hoped that such societal decisions will capitalize on available epidemiologic and other scientific knowledge in the context of political, economic, ethical, and social considerations.
Conclusion
The objectives of epidemiology are to enhance our understanding of the biology, pathogenesis, and other determinants of disease in order to improve human health and to prevent and better treat disease. A thorough understanding of the methodologic issues that arise is needed to interpret epidemiologic results properly as a basis for formulating both clinical and public health policy. The appropriate and judicious use of the results of epidemiologic studies is fundamental to assessing risks to human health and to controlling those risks; such use is therefore important to both primary and secondary prevention. Policy makers are often obliged to develop policy in the presence of incomplete or equivocal scientific data. In clinical medicine, decisions in both the diagnostic and therapeutic processes are likewise often made with incomplete or equivocal data; this difficulty has perhaps been a more overt impediment in public health and community medicine. No simple set of rules can eliminate it. As H. L. Mencken wrote: "There is always an easy solution to every human problem—neat, plausible, and wrong." 30 A major challenge remains to develop the best process for formulating rational policies under such circumstances, a process that is relevant for both clinical medicine and public health.
References
1 Hill AB. The environment and disease: association or causation?. Proc R Soc Med. 1965;58:295–300.
2 Jones FB. Saturday Evening Post. November 29, 1953.
3 Rose G. Sick individuals and sick populations. Int J Epidemiol. 1985;14:22–38.
4 Whelton PK. Epidemiology of hypertension. Lancet. 1994;344:101–106.
5 Chobanian A, et al. The seventh report of the Joint National Committee on prevention, detection, evaluation and treatment of high blood pressure: the JNC 7 report. JAMA. 2003;289:2560–2572.
6 Grady D, Herrington D, Bittner V, et al, for the HERS Research Group. Cardiovascular disease outcomes during 6.8 years of hormone therapy: heart and estrogen/progestin replacement study follow-up (HERS II). JAMA. 2002;288:49–57.
7 Writing Group for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s Health Initiative randomized controlled trial. JAMA. 2002;288:321–333.
8 Grodstein F, Clarkson TB, Manson JE. Understanding the divergent data on postmenopausal hormone therapy. N Engl J Med. 2003;348:645–650.
9 Michels KB. Hormone replacement therapy in epidemiologic studies and randomized clinical trials—are we checkmate?. Epidemiology. 2003;14:3–5.
10 Whittemore AS, McGuire V. Observational studies and randomized trials of hormone replacement therapy: what can we learn from them?. Epidemiology. 2003;14:8–10.
11 Manson JE, Bassuk SS, Harman SM, et al. Postmenopausal hormone therapy: new questions and the case for new clinical trials. Menopause. 2006;13:139–147.
12 Samet JM, Schnatter R, Gibb H. Epidemiology and risk assessment. Am J Epidemiol. 1998;148:929–936.
13 National Research Council Committee on the Institutional Means for Assessment of Risks to Public Health. Risk Assessment in the Federal Government: Managing the Process. Washington, DC: National Academy Press; 1983:21.
14 Auchincloss AH, Diez Roux AV, Dvonch JT, et al. Associations between recent exposure to ambient fine particulate matter and blood pressure in the Multi-Ethnic Study of Atherosclerosis (MESA). Environ Health Perspect. 2008;116:486–491.
15 Cohen MA, Adar SD, Allen RW, et al. Approach to estimating participant pollutant exposures in the Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air). Environ Sci Technol. 2009;43(13):4687–4693.
16 Wertheimer N, Leeper E. Electrical wiring configurations and childhood cancer. Am J Epidemiol. 1979;109:273–284.
17 Kheifets L, Monroe J, Vergara X, et al. Occupational electromagnetic fields and leukemia and brain cancer: an update of two meta-analyses. J Occup Environ Med. 2008;50:677–688.
18 Kheifets L, Ahlbom A, Crespi CM, et al. Pooled analysis of recent studies on magnetic fields and childhood leukaemia. Br J Cancer. 2010;103:1128–1135.
19 Calvente I, Fernandez MF, Villalba J, et al. Exposure to electromagnetic fields (non-ionizing radiation) and its relationship with childhood leukemia: a systematic review. Sci Total Environ. 2010;408(16):3062–3069.
20 Bonassi S, Taioli E, Vermeulen R. Omics in population studies: a molecular epidemiology perspective. Environ Mol Mutagen. 2013;54(7):455–460.
21 Ecclesiastes 1:9.
22 Porta M. A Dictionary of Epidemiology. 5th ed. New York: Oxford University Press; 2008.
23 Glass GV. Primary, secondary and meta-analysis of research. Educ Res. 1976;5:3–8.
24 U.S. Department of Health, Education, and Welfare. Smoking and Health: Report of the Advisory Committee to the Surgeon General. Washington, DC: Public Health Service; 1964.
25 Black B, Lilienfeld DE. Epidemiology proof in toxic tort litigation. Fordham Law Rev. 1984;52:732–785.
26 Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
27 Daubert v. Merrell Dow Pharmaceuticals, Inc., 113 S. Ct. 2786 (1993).
28 Green M, Freedman M, Gordis L. Reference guide on epidemiology. In: Reference Manual on Scientific Evidence. 3rd ed. Washington, DC: The National Academies Press; 2011:549.
29 National Research Council Committee on the Institutional Means for Assessment of Risks to Public Health. Risk Assessment in the Federal Government: Managing the Process. Washington, DC: National Academy Press; 1983:11.
30 Mencken HL. The divine afflatus. New York Evening Mail. November 16, 1917. Reprinted in: Mencken HL. Prejudices: Second Series. New York: Alfred A. Knopf; 1920.