
CHAPTER 13. READING THE MEDICAL LITERATURE

Purpose of the Chapter

This final chapter has several purposes.

Most importantly, it ties together concepts and skills presented in previous chapters and applies these concepts very specifically to reading medical journal articles. Throughout the text, we have attempted to illustrate the strengths and weaknesses of some of the studies discussed, but this chapter focuses specifically on those attributes of a study that indicate whether we, as readers of the medical literature, can use the results with confidence. The chapter begins with a brief summary of major types of medical studies. Next, we examine the anatomy of a typical journal article in detail, and we discuss the contents of each component: abstract or summary, introduction, methods, results, discussion, and conclusions. In this examination, we also point out common shortcomings, sources of bias, and threats to the validity of studies.

Clinicians read the literature for many different reasons. Some articles are of interest because you want only to be aware of advances in a field. In these instances, you may decide to skim the article with little interest in how the study was designed and carried out. In such cases, it may be possible to depend on experts in the field who write review articles to provide a relatively superficial level of information. On other occasions, however, you want to know whether the conclusions of the study are valid, perhaps so that they can be used to determine patient care or to plan a research project. In these situations, you need to read and evaluate the article with a critical eye in order to detect poorly done studies that arrive at unwarranted conclusions.

To assist readers in their critical reviews, we present a checklist for evaluating the validity of a journal article. The checklist notes some of the characteristics of a well-designed and well-written article. The checklist is based on our experiences with medical students, house staff, journal clubs, and interactions with physician colleagues.
It also reflects the opinions expressed in an article describing how journal editors and statisticians can interact to improve the quality of published medical research (Marks et al, 1988). A number of authors have found that only a minority of published studies meet the criteria for scientific adequacy. The checklist should help you use your time most effectively by allowing you to differentiate valid articles from poorly done studies so that you can concentrate on the more productive ones.

Two recently published guidelines increase our optimism that the quality of the published literature will continue to improve. The International Conference on Harmonization (ICH) E9 guideline "Statistical Principles for Clinical Trials" (1999) addresses issues of statistical methodology in the design, conduct, analysis, and evaluation of clinical trials. Application of these principles is intended to facilitate the general acceptance of analyses and conclusions drawn from clinical trials. The International Committee of Medical Journal Editors published the Uniform Requirements for Manuscripts Submitted to Biomedical Journals in 1997. Under "Statistics," the document states: "Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results. . . . When data are summarized in the results section, specify the statistical methods used to analyze them." The requirements also recommend using confidence intervals and avoiding sole reliance on P values.

Review of Major Study Designs

Chapter 2 introduced the major types of study designs used in medical research, broadly divided into experimental studies (including clinical trials); observational studies (cohort, case-control, cross-sectional/surveys, case-series); and meta-analyses. Each design has certain advantages over the others as well as some specific disadvantages; they are briefly summarized in the following paragraphs.
(A more detailed discussion is presented in Chapter 2.) Clinical trials provide the strongest evidence for causation because they are experiments and, as such, are subject to the fewest problems or biases. Trials with randomized controls are the study design of choice when the objective is to evaluate the effectiveness of a treatment or a procedure.
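Random assignment is what confers this protection against bias. As a minimal sketch (the arm labels, block size, and seed are illustrative, not from any particular trial), permuted-block randomization keeps the arms balanced while giving every subject a known probability of each treatment:

```python
import random

def block_randomize(n_subjects, block_size=4, seed=2024):
    """Assign subjects to 'treatment'/'control' in shuffled blocks so the
    arms stay balanced and every subject has a known (50%) chance of each."""
    rng = random.Random(seed)  # fixed seed shown only for reproducibility
    assignments = []
    while len(assignments) < n_subjects:
        # each block holds equal numbers of each arm, in random order
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n_subjects]

arms = block_randomize(12)
print(arms.count("treatment"), arms.count("control"))  # 6 6
```

Contrast this with systematic schemes such as assigning every other patient, which are predictable and therefore not random.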

Drawbacks to using clinical trials include their expense and the generally long time needed to complete them. Cohort studies are the best observational study design for investigating the causes of a condition, the course of a disease, or risk factors. Causation cannot be proved with cohort studies, because they do not involve interventions. Because they are longitudinal studies, however, they incorporate the correct time sequence to provide strong evidence for possible causes and effects. In addition, in cohort studies that are prospective, as opposed to historical, investigators can control many sources of bias. Cohort studies have disadvantages, of course. If they take a long time to complete, they are frequently weakened by patient attrition. They are also expensive to carry out if the disease or outcome is rare (so that a large number of subjects need to be followed) or requires a long time to develop. Case-control studies are an efficient way to study rare diseases, examine conditions that take a long time to develop, or investigate a preliminary hypothesis. They are the quickest and generally the least expensive studies to design and carry out. Case-control studies also are the most vulnerable to possible biases, however, and they depend entirely on high-quality existing records. A major issue in case-control studies is the selection of an appropriate control group. Some statisticians have recommended the use of two control groups: one similar in some ways to the cases (such as having been hospitalized or treated during the same period) and another made up of healthy subjects. Cross-sectional studies and surveys are best for determining the status of a disease or condition at a particular point in time; they are similar to case-control studies in being relatively quick and inexpensive to complete.
Because cross-sectional studies provide only a snapshot in time, they may lead to misleading conclusions if interest focuses on a disease or other time-dependent process. Case-series studies are the weakest kinds of observational studies and represent a description of typically unplanned observations; in fact, many would not call them studies at all. Their primary use is to provide insights for research questions to be addressed by subsequent, planned studies.

Studies that focus on outcomes can be experimental or observational. Clinical outcomes remain the major focus, but emphasis is increasingly placed on functional status and quality-of-life measures. It is important to use properly designed and evaluated methods to collect outcome data. Evidence-based medicine makes great use of outcome studies.

Meta-analysis may likewise focus on clinical trials or observational studies. Meta-analyses differ from traditional review articles in that they attempt to evaluate the quality of the research and quantify the summary data. They are helpful when the available evidence is based on studies with small sample sizes or when studies come to conflicting conclusions. Meta-analyses do not, however, take the place of well-designed clinical trials.

The Abstract & Introduction Sections of a Research Report

Journal articles almost always include an abstract or summary of the article prior to the body of the article itself. Most of us are guilty of reading only the abstract on occasion, perhaps because we are in a great hurry or have only a cursory interest in the topic. This practice is unwise when it is important to know whether the conclusions stated in the article are justified and can be used to make decisions. This section discusses the abstract and introduction portions of a research report and outlines the information they should contain.
The Abstract

The major purposes of the abstract are (1) to tell readers enough about the article so they can decide whether to read it in its entirety and (2) to identify the focus of the study. The International Committee of Medical Journal Editors (1997) recommended that the abstract "state the purposes of the study or investigation, basic procedures (selection of study subjects or experimental animals; observational and analytic methods), main findings (specific data and their statistical significance, if possible) and the principal conclusions." An increasing number of journals, especially those we consider to be of high quality, now use structured abstracts in which authors succinctly provide the above-mentioned information in separate, easily identified paragraphs (Haynes et al, 1990).

We suggest asking two questions to decide whether to read the article: (1) If the study has been properly designed and analyzed, would the results be important and worth knowing? (2) If the results are statistically significant, does the magnitude of the change or effect also have clinical significance; if the results are not statistically significant, was the sample size sufficiently large to detect a meaningful difference or effect? If the answers to these questions are yes, then it is worthwhile to continue to read the report. Structured abstracts are a boon to the busy reader and frequently contain enough information to answer these two questions.

The Introduction or Abstract

At one time, the following topics were discussed (or should have been discussed) in the introduction section; however, with the advent of the structured abstract, many of these topics are now addressed directly in that section. The important issue is that the information be available and easy to identify.

Reason for the Study

The introduction section of a research report is usually fairly short. Generally, the authors briefly mention previous research that indicates the need for the present study. In some situations, the study is a natural outgrowth or the next logical step of previous studies. In other circumstances, previous studies have been inadequate in one way or another. The overall purpose of this information is twofold: to provide the necessary background information to place the present study in its proper context and to provide reasons for doing the present study. In some journals, the main justification for the study is given in the discussion section of the article instead of in the introduction.

Purpose of the Study

Regardless of the placement of background information on the study, the introduction section is where the investigators communicate the purpose of their study.
The purpose of the study is frequently presented in the last paragraph or final sentences of the introduction. The purpose should be stated clearly and succinctly, in a manner analogous to a 15-second summary of a patient case. For example, in the study described in Chapter 5, Dennison and colleagues (1997, p. 15) did this very well; they stated their objective as follows: "To evaluate, in a population-based sample of healthy children, fruit juice consumption and its effects on growth parameters during early childhood." This statement concisely communicates the population of interest (healthy children), the focus of the study or independent variable (fruit juice consumption), and the outcome (effects on growth). As readers, we should be able to determine whether the purpose of the study was conceived prior to data collection or whether it evolved after the authors viewed their data; the latter situation is much more likely to capitalize on chance findings. The lack of a clearly stated research question is the most common reason medical manuscripts are rejected by journal editors (Marks et al, 1988).

Population Included in the Study

In addition to stating the purpose of the study, the structured abstract or introduction section may include information on the study's location, length of time, and subjects. Alternatively, this information may be contained in the methods section. This information helps readers decide whether the location of the study and the type of subjects included in the study are applicable in the readers' own practice environment. The time covered by a study gives important clues regarding the validity of the results. If a study on a particular therapy covers too long a period, patients entering at the beginning of the study may differ in important ways from those entering at the end.
For example, major changes may have occurred in the way the disease in question is diagnosed, and patients entering near the end of the study may have had their disease diagnosed at an earlier stage than did patients who entered the study early (see the section titled "Detection Bias"). If the purpose of the study is to examine sequelae of a condition or procedure, the period covered by the study must be sufficiently long to detect consequences.

The Method Section of a Research Report

The method section contains information about how the study was done. Simply knowing the study design provides a great deal of information, and this information is often given in a structured abstract. In addition, the method section contains information regarding the subjects who participated in the study or, in animal or inanimate studies, information on the animals or materials. The procedures used should be described in sufficient detail that the reader knows how measurements were made. If methods are novel or require interpretation, information should be given on the reliability of the assessments. The study outcomes should be specified along with the criteria used to assess them. The method section also should include information on the sample size for the study and on the statistical methods used to analyze the data; this information is often placed at the end of the method section. Each of these topics is discussed in detail in this section.

How well the study has been designed is of utmost importance. The most critical statistical errors, according to a statistical consultant to the New England Journal of Medicine, involve improper research design: "Whereas one can correct incorrect analytical techniques with a simple reanalysis of the data, an error in research design is almost always fatal to the study; one cannot correct for it subsequent to data collection" (Marks et al, 1988, p. 1004). Many statistical advances have occurred in recent years, especially in the methods used to design, conduct, and analyze clinical trials, and investigators should offer evidence that they have obtained expert advice.

Subjects in the Study

Methods for Choosing Subjects

Authors should provide several critical pieces of information about the subjects included in their study so that we readers can judge the applicability of the study results.
Of foremost importance is how the patients were selected for the study and, if the study is a clinical trial, how treatment assignments were made. Randomized selection or assignment greatly enhances the generalizability of the results and avoids biases that otherwise may occur in patient selection (see the section titled "Bias Related to Subject Selection"). Some authors believe it is sufficient merely to state that subjects were randomly selected or treatments were randomly assigned, but most statisticians recommend that the type of randomization process be specified as well. Authors who report the randomization methods provide some assurance that randomization actually occurred, because some investigators have a faulty view of what constitutes randomization. For example, an investigator may believe that assigning patients to the treatment and control groups on alternate days makes the assignment random. As we emphasized in Chapter 4, however, randomization involves one of the precise methods that ensure that each subject (or treatment) has a known probability of being selected.

Eligibility Criteria

The authors should present information to illustrate that major selection biases (discussed in the section titled "Bias Related to Subject Selection") have been avoided, an aspect especially important in nonrandomized trials. The issue of which patients serve as controls was discussed in Chapter 2 in the context of case-control studies. In addition, the eligibility criteria for both inclusion and exclusion of subjects in the study must be specified in detail. We should be able to state, given any hypothetical subject, whether this person would be included in or excluded from the study.
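One informal test of well-specified criteria is whether they can be written down as an unambiguous rule. The sketch below uses hypothetical criteria (invented for illustration, not taken from any particular study) expressed as a predicate that classifies any candidate subject:

```python
def eligible(subject):
    """Return True if a candidate meets the (hypothetical) study criteria.

    Inclusion: age 18-75 and a confirmed diagnosis.
    Exclusion: pregnancy or prior abdominal surgery.
    """
    if not (18 <= subject["age"] <= 75):
        return False
    if not subject["diagnosis_confirmed"]:
        return False
    if subject["pregnant"] or subject["prior_abdominal_surgery"]:
        return False
    return True

candidate = {"age": 52, "diagnosis_confirmed": True,
             "pregnant": False, "prior_abdominal_surgery": False}
print(eligible(candidate))  # True
```

If a published set of criteria cannot be translated into such a rule without guesswork, the criteria were not specified in enough detail.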
Sauter and coworkers (2002) gave the following information on patients included in their study: "Patients undergoing CHE in our Surgical Department were consecutively included into the study provided that they did not meet one of the following exclusion criteria: (a) inflammatory bowel disease, history of intestinal surgery, or diarrhea within the preceding 2 years, (b) body weight > 90 kg, (c) pregnancy, (d) abnormal liver function tests . . ., (e) diabetes mellitus, (f) history of radiation of the abdominal region, and (g) drug therapy with antibiotics, lipid-lowering agents, laxatives, and cholestyramine."

Patient Follow-Up

For similar reasons, sufficient information must be given regarding the procedures the investigators used to follow up patients, and they should state the numbers lost to follow-up. Some articles include this information in the results section instead of in the methods section. The description of follow-up and dropouts should be sufficiently detailed to permit the reader to draw a diagram of the information. Occasionally, an article presents such a diagram, as was done by Hébert and colleagues in their study of elderly residents in Canada (1997), reproduced in Figure 13-1. Such a diagram makes very clear the number of patients who were eligible, those who were not eligible for specific reasons, the dropouts, and so on.

Figure 13-1.

Flow of the subjects through the study, a representative sample of elderly people living at home in Sherbrooke, Canada, 1991-1993. (Reproduced, with permission, from Figure 1 in Hébert R, Brayne C, Spiegelhalter D: Incidence of functional decline and improvement in a community-dwelling very elderly population. Am J Epidemiol 1997;145:935-944.)

Bias Related to Subject Selection

Bias is an error related to selecting subjects or procedures or to measuring a characteristic; ideally, it should not occur in studies. Biases are sometimes called measurement errors or systematic errors to distinguish them from random error (random variation), which occurs any time a sample is selected from a population. This section discusses selection bias, a type of bias common in medical research. Selection biases can occur in any study, but they are easier to control in clinical trials and cohort designs. It is important to be aware of selection biases, even though it is not always possible to predict exactly how their presence affects the conclusions. Sackett (1979) enumerated 35 different biases. We discuss some of the major ones that seem especially important to the clinician. If you are interested in a more detailed discussion, consult the article by Sackett and the text by Feinstein (1985), which devotes several chapters to the discussion of bias (especially Chapter 4, in the section titled "The Meaning of the Term Probability," and Chapters 15-17).

Prevalence or Incidence Bias

Prevalence (Neyman) bias occurs when a condition is characterized by early fatalities (some subjects die before they are diagnosed) or silent cases (cases in which the evidence of exposure disappears when the disease begins). Prevalence bias can result whenever a time gap occurs between exposure and selection of study subjects and the worst cases have died.
A cohort study begun prior to the onset of disease is able to detect occurrences properly, but a case-control study that begins at a later date consists only of the people who did not die. This bias can be prevented in cohort studies and avoided in case-control studies by limiting eligibility for the study to newly diagnosed or incident cases. The practice of limiting eligibility is common in population-based case-control studies in cancer epidemiology.

To illustrate prevalence or incidence bias, let us suppose that two groups of people are being studied: those with a risk factor for a given disease (eg, hypertension as a risk factor for stroke) and those without the risk factor. Suppose 1000 people with hypertension and 1000 people without hypertension have been followed for 10 years. At this point, we might have the situation shown in Table 13-1.

Table 13-1. Illustration of Prevalence Bias: Actual Situation.
Number of Patients in 10-Year Cohort Study

Patients                 Alive with         Dead from   Alive with No
                         Cerebrovascular    Stroke      Cerebrovascular
                         Disease                        Disease
With hypertension             50              250            700
Without hypertension          80               20            900

A cohort study begun 10 years ago would conclude correctly that patients with hypertension are more likely to develop cerebrovascular disease than patients without hypertension (300 to 100) and far more likely to die from it (250 to 20). Suppose, however, a case-control study is undertaken at the end of the 10-year period without limiting eligibility to newly diagnosed cases of cerebrovascular disease. Then the situation illustrated in Table 13-2 occurs.

Table 13-2. Illustration of Prevalence Bias: Result with Case-Control Design.
Number of Patients in Case-Control Study at End of 10 Years

Patients                 With Cerebrovascular   Without Cerebrovascular
                         Disease                Disease
With hypertension               50                    700
Without hypertension            80                    900

The odds ratio is calculated as (50 x 900)/(80 x 700) = 0.80, making it appear that hypertension is actually a protective factor for the disease! The bias introduced in an improperly designed case-control study of a disease that kills off one group faster than the other can lead to a conclusion exactly the opposite of the correct conclusion that would be obtained from a well-designed case-control study or a cohort study.

Admission Rate Bias

Admission rate bias (Berkson's fallacy) occurs when the study admission rates differ, which causes major distortions in risk ratios. As an example, admission rate bias can occur in studies of hospitalized patients when patients (cases) who have the risk factor are admitted to the hospital more frequently than either the cases without the risk factor or the controls with the risk factor. This fallacy was first pointed out by Berkson (1946) in evaluating an earlier study that had concluded that tuberculosis might have a protective effect on cancer. This conclusion was reached after a case-control study found a negative association between tuberculosis and cancer: The frequency of tuberculosis among hospitalized cancer patients was less than the frequency of tuberculosis among the hospitalized control patients who did not have cancer. These counterintuitive results occurred because a smaller proportion of patients who had both cancer and tuberculosis were hospitalized and thus available for selection as cases in the study; chances are that patients with both diseases were more likely to die than patients with cancer or tuberculosis alone. It is important to be aware of admission rate bias because many case-control studies reported in the medical literature use hospitalized patients as sources for both cases and controls.
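The arithmetic of the prevalence-bias example in Tables 13-1 and 13-2 is easy to verify; the counts below are the illustrative figures from those tables:

```python
def relative_risk(a, n_exposed, c, n_unexposed):
    """Risk of disease among the exposed relative to the unexposed (cohort view)."""
    return (a / n_exposed) / (c / n_unexposed)

def odds_ratio(a, b, c, d):
    """Cross-product ratio from a 2 x 2 table: (a*d)/(b*c)."""
    return (a * d) / (b * c)

# Cohort view (Table 13-1): 300 of 1000 hypertensives vs 100 of 1000
# normotensives developed cerebrovascular disease over 10 years
print(round(relative_risk(300, 1000, 100, 1000), 2))  # 3.0

# Case-control view at 10 years (Table 13-2): the 250 and 20 deaths are
# gone, so only survivors can be sampled as cases
print(round(odds_ratio(50, 700, 80, 900), 2))  # 0.8
```

The same exposure looks strongly harmful in the cohort data yet spuriously protective in the late case-control data.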
The only way to control for admission rate bias is to include an unbiased control group, best accomplished by choosing controls from a wide variety of disease categories or from a population of healthy subjects. Some statisticians suggest using two control groups in studies in which admission bias is a potential problem.

Nonresponse Bias and the Volunteer Effect

Several steps discussed in Chapter 11 can be taken to reduce potential bias when subjects fail to respond to a survey. Bias that occurs when patients either volunteer or refuse to participate in studies is similar to nonresponse bias. This effect was studied in the nationwide Salk polio vaccine trials in 1954 by using two different study designs to evaluate the effectiveness of the vaccine (Meier, 1989). In some communities, children were randomly assigned to receive either the vaccine or a placebo injection.

Some communities, however, refused to participate in a randomized trial; they agreed, instead, that second graders could be offered the vaccination and first and third graders could constitute the controls. In analyzing the data, researchers found that families who volunteered their children for participation in the nonrandomized study tended to be better educated and to have a higher income than families who refused to participate. They also tended to be absent from school with a higher frequency than nonparticipants. Although in this example we might guess how absence from school could bias results, it is not always easy to determine how selection bias affects the outcome of the study; it may cause the experimental treatment to appear either better or worse than it should. Investigators should therefore reduce the potential for nonresponse bias as much as possible by using all possible means to increase the response rate and obtain the participation of most eligible patients. Using databases reduces response bias, but sometimes other sources of bias are present, that is, reasons that a specific group or selected information is underrepresented in the database.

Membership Bias

Membership bias is essentially a problem of preexisting groups. It arises because one or more of the same characteristics that cause people to belong to the groups are related to the outcome of interest. For example, investigators have not been able to perform a clinical trial to examine the effects of smoking; some researchers have claimed it is not smoking itself that causes cancer but some other factor that simply happens to be more common in smokers. As readers of the medical literature, we need to be aware of membership bias because it cannot be prevented, and it makes the study of the effect of potential risk factors related to life-style very difficult.
A problem similar to membership bias is called the healthy worker effect; it was recognized in epidemiology when workers in a hazardous environment were unexpectedly found to have a higher survival rate than the general public. After further investigation, the cause of this incongruous finding was determined: Good health is a prerequisite in persons who are hired for work, but being healthy enough to work is not a requirement for persons in the general public.

Procedure Selection Bias

Procedure selection bias occurs when treatment assignments are made on the basis of certain characteristics of the patients, with the result that the treatment groups are not really similar. This bias frequently occurs in studies that are not randomized and is especially a problem in studies using historical controls. A good example is the comparison of a surgical versus a medical approach to a problem such as coronary artery disease. In early studies comparing surgical and medical treatment, patients were not randomized, and the evidence pointed to the conclusion that patients who received surgery were healthier than those treated medically; that is, only healthier patients were subjected to the risks associated with the surgery. The Coronary Artery Surgery Study (CASS, 1983) was undertaken in part to resolve these questions. It is important to be aware of procedure selection bias because many published studies describe a series of patients, some treated one way and some another way, and then proceed to make comparisons and draw inappropriate conclusions as a result.

Procedures Used in the Study and Common Procedural Biases

Terms and Measurements

The procedures used in the study are also described in the method section. Here authors provide definitions of measures used in the study, especially any operational definitions developed by the investigators. If unusual instruments or methods are used, the authors should provide a reference and a brief description.
For example, the study of screening for domestic violence in emergency department patients by Lapidus and colleagues (2002) defined domestic violence as "past or current physical, sexual, emotional, or verbal harm to a woman caused by a spouse, partner, or family member." Domestic violence screening was defined as "assessing an individual to determine if she has been a victim of domestic violence."

The journal Stroke has the practice of presenting the abbreviations and acronyms used in an article in a box, which makes them clear and easy to refer to while reading other sections of the article. For example, in reporting their study of sleep-disordered breathing and stroke, Good and colleagues (1996) presented a list of abbreviations at the top of the column that describes the subjects and methods in the study.

Several biases may occur in the measurement of various patient characteristics and in the procedures used or evaluated in the study. Some of the more common biases are described in the following subsections.

Procedure Bias

Procedure bias, discussed by Feinstein (1985), occurs when groups of subjects are not treated in the same manner. For example, the procedures used in an investigation may lead to detection of other problems in patients in the treatment group and make these problems appear to be more prevalent in this group. As another example, the patients in the treatment group may receive more attention and be followed up more vigorously than those in another group, thus stimulating greater compliance with the treatment regimen. The way to avoid this bias is by carrying out all maneuvers except the experimental factor in the same way in all groups and examining all outcomes using similar procedures and criteria.

Recall Bias

Recall bias may occur when patients are asked to recall certain events, and subjects in one group are more likely to remember the event than those in the other group. For example, people take aspirin commonly and for many reasons, but patients diagnosed as having peptic ulcer disease may recall the ingestion of aspirin with greater accuracy than those without gastrointestinal problems.
In the study of the relationship between juice consumption and growth, Dennison and associates (1997) asked parents to keep a daily journal of all the liquid consumed by their children; a properly maintained journal helps reduce recall bias.

Insensitive-Measure Bias

Measuring instruments may not be able to detect the characteristic of interest or may not be properly calibrated. For example, routine x-ray films are an insensitive method for detecting osteoporosis because bone loss of approximately 30% must occur before a roentgenogram can detect it. Newer densitometry techniques are more sensitive and thus avoid insensitive-measure bias.

Detection Bias

Detection bias can occur when a new diagnostic technique is introduced that is capable of detecting the condition of interest at an earlier stage. Survival for patients diagnosed with the new procedure inappropriately appears to be longer, merely because the condition was diagnosed earlier. A spin-off of detection bias, called the Will Rogers phenomenon (named for the American humorist Will Rogers), was described by Feinstein and colleagues (1985). They found that a cohort of subjects with lung cancer first treated in 1953-1954 had lower 6-month survival rates for patients with each of the three main stages (localized, nodal involvement, and metastases) as well as for the total group than did a 1977 cohort treated at the same institutions. Newer imaging procedures were used with the later group; however, according to the old diagnostic classification, this group had a prognostically favorable zero-time shift in that their disease was diagnosed at an earlier stage. In addition, by detecting metastases in the 1977 group that were missed in the earlier group, the new technologic approaches resulted in stage migration; that is, members of the 1977 cohort were diagnosed as having a more advanced stage of the disease, whereas they would have been diagnosed as having earlier-stage disease in 1953-1954.
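The arithmetic behind this paradox can be sketched with invented survival figures (the percentages below are chosen only to make the point, not taken from the Feinstein data):

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical 6-month survival rates (%) for individual patients under
# the old staging system
early_stage = [80, 70, 60, 30, 30]  # two patients harbor undetected metastases
late_stage = [25, 20, 15]

print(mean(early_stage), mean(late_stage))  # 54.0 20.0

# Newer imaging detects the hidden metastases, so the two poorest-prognosis
# "early" patients migrate to the late-stage group
early_after = [80, 70, 60]
late_after = [30, 30, 25, 20, 15]

print(mean(early_after), mean(late_after))  # 70.0 24.0
```

No individual patient's prognosis changed, yet the average survival "improves" in both stages once the poorest-prognosis early-stage patients are reclassified.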
The individuals who migrated from the earlier-stage group to the later-stage group tended to have the poorest survival in the earlier-stage group, so removing them resulted in an increase in survival rates in the earlier-stage group. At the same time, these individuals, now assigned to the later-stage group, were better off than most other patients in this group, and their addition to the group resulted in an increased survival in the later-stage group as well. The authors stated that the 1953–1954 and 1977 cohorts actually had similar survival rates when patients in the 1977 group were classified according to the criteria that would have been in effect had there been no advances in diagnostic techniques.

Compliance Bias

Compliance bias occurs when patients find it easier or more pleasant to comply with one treatment than with another. For example, in the treatment of hypertension, a comparison of α-methyldopa versus hydrochlorothiazide may demonstrate inaccurate results because some patients do not take α-methyldopa owing to its unpleasant side effects, such as drowsiness, fatigue, or impotence in male patients.

Assessing Study Outcomes

Variation in Data

In many clinics, a nurse collects certain information about a patient (eg, height, weight, date of birth, blood pressure, pulse) and records it on the medical record before the patient is seen by a physician. Suppose a patient's blood pressure is recorded as 140/88 on the chart; the physician, taking the patient's blood pressure again as part of the physical examination, observes a reading of 148/96. Which blood pressure reading is correct? What factors might be responsible for the differences in the observations? We use blood pressure and other clinical information to examine sources of variation in data and ways to measure the reliability of observations. Two classic articles in the Canadian Medical Association Journal (McMaster University Health Sciences Centre, Department of Clinical Epidemiology and Biostatistics, 1980a; 1980b) discuss sources of clinical disagreement and ways disagreement can be minimized.

Factors That Contribute to Variation in Clinical Observations

Variation (or variability) in clinical observations and measurements made on the same subject can be classified into three categories: (1) variation in the characteristic being measured, (2) variation introduced by the examiner, and (3) variation owing to the instrument or method used. It is especially important to control variation due to the latter two factors as much as possible so that the reported results will generalize as intended. Substantial variability may occur in the measurement of biologic characteristics. For example, a person's blood pressure is not the same from one time to another, and thus, blood pressure values vary. A patient's description of symptoms to two different physicians may vary because the patient may forget something.
Medications and illness can also affect the way a patient behaves and what information he or she remembers to tell a nurse or physician. Even when no change occurs in the subject, different observers may report different measurements. When examination of a characteristic requires visual acuity, such as the reading on a sphygmomanometer or the features on an x-ray film, differences may result from the varying visual abilities of the observers. Such differences can also play a role when hearing (detecting heart sounds) or feeling (palpating internal organs) is required. Some individuals are simply more skilled than others in history taking or performing certain examinations. Variability also occurs when the characteristic being measured is a behavioral attribute. Two examples are measurements of functional status and measurements of pain; here the additional component of patient or observer interpretation can increase apparent variability. In addition, observers may tend to observe and record what they expect based on other information about the patient. These factors point out the need for a standardized protocol for data collection. The instrument used in the examination can be another source of variation. For instance, mercury column sphygmomanometers are less inherently variable than aneroid models. In addition, the environment in which the examination takes place, including lighting and noise level, presence of other individuals, and room temperature, can produce apparent differences. Methods for measuring behavior-related characteristics such as functional status or pain usually consist of a set of questions answered by patients and hence are not as precise as instruments that measure physical characteristics.

Several steps can be taken to reduce variability. Taking a history when the patient is calm and not heavily medicated and checking with family members when the patient is incapacitated are both useful in minimizing errors that result from a patient's illness or the effects of medication. Collecting information and making observations in a proper environment is also a good strategy. Recognizing one's own strengths and weaknesses helps one evaluate the need for other opinions. Blind assessment, especially of subjective characteristics, guards against errors resulting from preconceptions. Repeating questionable aspects of the examination or asking a colleague to perform a key aspect (blindly, of course) reduces the possibility of error. Having well-defined operational guidelines for using classification scales helps people use them in a consistent manner. Ensuring that instruments are properly calibrated and correctly used eliminates many errors and thus reduces variation.

Ways to Measure Reliability and Validity

A common strategy to ensure the reliability or reproducibility of measurements, especially for research purposes, is to replicate the measurements and evaluate the degree of agreement. We discussed intra- and interrater reliability in Chapter 5 and discussed reliability and validity in detail in Chapter 11. One approach to establishing the reliability of a measure is to repeat the measurements at a different time or by a different person and compare the results. When the outcome is nominal, the kappa statistic is used; when the scale of measurement is numerical, the statistic used to examine the relationship is the correlation coefficient (Chapter 8) or the intraclass correlation (Chapter 11).
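As a concrete sketch of the kappa statistic mentioned above, the following short Python function computes chance-corrected agreement between two raters on a nominal scale. The ratings are hypothetical (two observers classifying the same ten films as normal, N, or abnormal, A), not data from any study cited here.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on a nominal scale."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical example: two observers rate 10 films as normal (N) or abnormal (A)
a = ["N", "N", "A", "A", "N", "A", "N", "N", "A", "N"]
b = ["N", "N", "A", "N", "N", "A", "N", "A", "A", "N"]
print(round(cohen_kappa(a, b), 2))  # 0.58: moderate agreement beyond chance
```

Here the raters agree on 8 of 10 films (observed agreement 0.80), but their marginal frequencies alone would produce agreement 0.52 by chance, giving kappa of about 0.58.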
Hébert and colleagues (1997) used the Functional Autonomy Measurement System (SMAF) instrument to measure cognitive functioning and depression. They evaluated its validity by comparing the SMAF score with the nursing time required for care (r = 0.88) and by comparing disability scores for residents living in settings with different levels of care. The high correlation between nursing time and score indicates that patients with higher (more dependent) scores required more nursing time, a reasonable expectation. Another indication of validity is higher disability scores among residents living in settings where they were provided with a high level of care and lower scores among residents who live independently.

Blinding

Another aspect of assessing the outcome is related to ways of increasing the objectivity and decreasing the subjectivity of the assessment. In studies involving the comparison of two treatments or procedures, the most effective method for achieving objective assessment is to have both patient and investigator be unaware of which method was used. If only the patient is unaware, the study is called blind; if both patient and investigator are unaware, it is called double-blind. Ballard and colleagues (1998) studied the effect of antenatal thyrotropin-releasing hormone in preventing lung disease in preterm infants in a randomized study. Experimental subjects were given the hormone, and controls were given placebo. The authors stated: "The women were randomly assigned within centers to the treatment or placebo group in permuted blocks of four. The study was double-blinded, and only the pharmacies at the participating centers had the randomization schedule." Blinding helps to reduce a priori biases on the part of both patient and physician. Patients who are aware of their treatment assignment may imagine certain side effects or expect specific benefits, and their expectations may influence the outcome.
Similarly, investigators who know which treatment has been assigned to a given patient may be more watchful for certain side effects or benefits. Although we might suspect an investigator who is not blinded to be more favorable to the new treatment or procedure, just the opposite may happen; that is, the investigator may bend over backward to keep from being prejudiced by his or her knowledge and therefore may err in favor of the control. Knowledge of treatment assignment may be somewhat less influential when the outcome is straightforward, as is true for mortality. With mortality, it is difficult to see how outcome assessment can be biased. Many examples exist, however, in which the outcome appears to be objective even though its evaluation contains subjective components. Many clinical studies attempt to ascribe reasons for mortality or morbidity, and judgment begins to play a role in these cases. For example, mortality is often an outcome of interest in studies involving organ transplantation, and investigators wish to differentiate between deaths from failure of the organ and deaths from an unrelated cause. If the patient dies in an automobile accident, for example, investigators can easily decide that the death is not due to organ rejection; but in most situations, the decision is not so easy. The issue of blinding becomes more important as the outcome becomes less amenable to objective determination. Research that focuses on quality-of-life outcomes, such as chest pain status, activity limitation, or recreational status, requires subjective measures. Although patients cannot be blinded in many studies, the subjective outcomes can be assessed by a person, such as another physician, a psychologist, or a physical therapist, who is blind to the treatment the patient received.

Data Quality and Monitoring

The methods section is also the place where steps taken to ensure the accuracy of the data should be described. Increased variation and possibly incorrect conclusions can occur if the correct observation is made but is incorrectly recorded or coded. Dennison and colleagues (1997, p. 16) stated: "All questionnaire data were dual-entered and verified before being entered into a . . . database." Dual or duplicate entry decreases the likelihood of errors because it is unusual for the same entry error to occur twice. Multicenter studies provide additional data quality challenges. It is important that an accurate and complete protocol be developed to ensure that data are handled the same way in all centers.
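The dual-entry idea can be sketched in a few lines of Python. Everything below is a hypothetical illustration (the function, record IDs, and field names are ours, not from Dennison's study): each record is keyed twice by independent operators, and any field where the two passes disagree is flagged for manual review.

```python
def dual_entry_discrepancies(first_pass, second_pass):
    """Compare two independent entries of the same records field by field.

    Assumes both passes cover identical records and fields. Returns a list
    of (record_id, field, value_1, value_2) tuples needing manual review.
    Hypothetical sketch of dual-entry verification, not a study's actual code.
    """
    flagged = []
    for record_id, fields in first_pass.items():
        for field, value_1 in fields.items():
            value_2 = second_pass[record_id][field]
            if value_1 != value_2:
                flagged.append((record_id, field, value_1, value_2))
    return flagged

# Hypothetical records: systolic/diastolic blood pressure entered twice
entry_1 = {"001": {"sbp": 140, "dbp": 88}, "002": {"sbp": 148, "dbp": 96}}
entry_2 = {"001": {"sbp": 140, "dbp": 88}, "002": {"sbp": 184, "dbp": 96}}
print(dual_entry_discrepancies(entry_1, entry_2))  # flags the 148/184 transposition
```

Because the same keystroke error rarely occurs twice, the flagged list is typically short and can be resolved against the original forms.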
Gelber and colleagues (1997) studied data collected from 63 centers in North America in setting normative values for cardiovascular autonomic nervous system tests. They reported: "All site personnel were trained by a member of the Autonomic Nervous System (ANS) Reading Center in the use of the equipment and testing methodology. . . . All data were analyzed at a single Autonomic Reading Center. The analysis program contains internal checks which alert the analyzing technician to any aberrant data points overlooked during the editing process and warns the technician when the results suggest that the test may have been performed improperly. The analysis of each study was reviewed by the director of the ANS Reading Center." In addition to standardized training, the data entry process itself was monitored for potential errors.

Determining an Appropriate Sample Size

Specifying the sample size needed to detect a difference or an effect of a given magnitude is one of the most crucial pieces of information in the report of a medical study. Recall that missing a significant difference is called a type II error, and this error can happen when the sample size is too small. We provide many illustrations in the chapters that discuss specific statistical methods, especially Chapters 5, 6, 7, 8, 10, and 11. Determination of sample size is referred to as power analysis or as determining the power of a study. An assessment of power is essential in negative studies, studies that fail to find an expected difference or relationship; we feel so strongly about this point that we recommend that readers disregard negative studies that do not provide information on power. Harper (1997) studied the use of paracervical block to diminish cramping associated with cryosurgery.
She stated: "To have a power of 80% to detect a difference of 20 mm on the visual analog scale at the 0.05 level of significance (assuming a standard deviation of 30 mm), the power analysis a priori showed that 35 women would be needed in each cohort. The first 35 women who met the inclusion and exclusion criteria for cryosurgery were treated in the usual manner with no anesthetic block given before cryosurgery. The variances of the actual responses were greater than anticipated in the a priori power analysis, leading to the subsequent enrollment of the next five women qualifying for the study for a total of 40 women in the usual treatment group. This increase in enrollment maintained the power of the study." Thus, as a result of analysis of the data, the investigator opted to increase the sample size to maintain power. We repeatedly emphasize the need to perform a power analysis prior to beginning a study and have illustrated how to estimate power using statistical programs for that purpose. Investigators planning complicated studies or studies involving a number of variables are especially advised to contact a statistician for assistance.

Evaluating the Statistical Methods

Statistical methods are the primary focus of this text, and only a brief summary and some common problems are listed here. At the risk of oversimplification, the use of statistics in medicine can be summarized as follows: (1) to answer questions concerning differences; (2) to answer questions concerning associations; and (3) to control for confounding issues or to make predictions. If you can determine the type of question investigators are asking (from the stated purpose of the study) and the types and numbers of measures used in the study (numerical, ordinal, nominal), then the appropriate statistical procedure should be relatively easy to specify. Tables 10–1 and 10–2 in Chapter 10 and the flowcharts in Appendix C were developed to assist with this process. Some common biases in evaluating data are discussed in the next sections.

Fishing Expedition

A fishing expedition is the name given to studies in which the investigators do not have clear-cut research questions guiding the research. Instead, data are collected, and a search is carried out for results that are significant. The problem with this approach is that it capitalizes on chance occurrences and leads to conclusions that may not hold up if the study is replicated. Unfortunately, such studies are rarely repeated, and incorrect conclusions can remain a part of accepted wisdom.
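The kind of a priori power analysis quoted earlier from Harper (1997) can be approximated with the standard two-sample normal formula, n per group = 2(z for α/2 + z for power)²(σ/Δ)². The sketch below plugs in her stated values (20-mm difference, 30-mm standard deviation, α = 0.05, power 80%) and lands near the 35 women per cohort she reported; the function name is ours, and dedicated software may give a slightly different count because of the t-distribution correction.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Two-sample normal-approximation sample size per group:
    n = 2 * (z_{1-alpha/2} + z_{power})**2 * (sd / delta)**2
    Illustrative sketch; exact software uses a t-distribution correction.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for a two-sided 0.05 test
    z_beta = z.inv_cdf(power)           # 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * (sd / delta) ** 2)

# Harper's design: detect a 20-mm difference, SD 30 mm, alpha 0.05, power 80%
print(n_per_group(delta=20, sd=30))  # about 35-36 per group
```

Note how the ratio σ/Δ drives the answer: halving the detectable difference to 10 mm would quadruple the required sample size.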
Multiple Significance Tests

Multiple tests in statistics, just as in clinical medicine, result in increased chances of making a type I, or false-positive, error when the results from one test are interpreted as being independent of the results from another. For example, a factorial design for analysis of variance in a study involving four groups measured on three independent variables has the possibility of 18 comparisons (6 comparisons among the four groups on each of three variables), ignoring interactions. If each comparison is made at P < 0.05, the probability of finding one or more comparisons significant by chance is considerably greater than 0.05. The best way to guard against this bias is by performing the appropriate global test with analysis of variance prior to making individual group comparisons (Chapter 7) or using an appropriate method to analyze multiple variables (Chapter 10). A similar problem can occur in a clinical trial if too many interim analyses are done. Sometimes it is important to analyze the data at certain stages during a trial to learn if one treatment is clearly superior or inferior. Many trials are stopped early when an interim analysis determines that one therapy is markedly superior to another. For instance, the Women's Health Initiative trial on estrogen plus progestin (Writing Group for WHI, 2002) and the study of finasteride and the development of prostate cancer (Thompson et al, 2003) were both stopped early. In the estrogen study, the conclusion was that overall health risks outweighed the benefits, whereas finasteride was found to prevent or delay prostate cancer in a significant number of men. In these situations, it is unethical to deny patients the superior treatment or to continue to subject them to a risky treatment. Interim analyses should be planned as part of the design of the study, and the overall probability of a type I error (the α level) should be adjusted to compensate for the multiple comparisons.

Migration Bias

Migration bias occurs when patients who drop out of the study are also dropped from the analysis. The tendency to drop out of a study may be associated with the treatment (eg, its side effects), and dropping these subjects from the analysis can make a treatment appear more or less effective than it really is. Migration bias can also occur when patients cross over from the treatment arm to which they were assigned to another treatment. For example, in crossover studies comparing surgical and medical treatment for coronary artery disease, patients assigned to the medical arm of the study sometimes require subsequent surgical treatment for their condition. In such situations, the appropriate method is to analyze the patient according to his or her original group; this is referred to as analysis based on the intention-to-treat principle.

Entry Time Bias

Entry time bias may occur when time-related variables, such as survival or time to remission, are counted differently for different arms of the study. For example, consider a study comparing survival for patients randomized to a surgical versus a medical treatment in a clinical trial. Patients randomized to the medical treatment who die at any time after randomization are counted as treatment failures; the same rule must be followed with patients randomized to surgery, even if they die prior to the time surgery is performed. Otherwise, a bias exists in favor of the surgical treatment.

The Results Section of a Research Report

The results section of a medical report contains just that: results of (or findings from) the research directed at questions posed in the introduction. Typically, authors present tables or graphs (or both) of quantitative data and also report the findings in the text. Findings generally consist of both descriptive statistics (means, standard deviations, risk ratios, etc) and results of statistical tests that were performed.
Results of statistical tests are typically given as either P values or confidence limits; authors seldom give the value of the statistical test itself but, rather, give the P value associated with the statistical test. The two major aspects for readers evaluating the results section are the adequacy of information and the sufficiency of the evidence to withstand possible threats to the validity of the conclusions.

Assessing the Data Presented

Authors should provide adequate information about measurements made in the study. At a minimum, this information should include sample sizes and either means and standard deviations for numerical measures or proportions for nominal measures. For example, in describing the effect of sex, race, and age on estimating percentage body fat from body mass index, Jackson and colleagues (2002) specified the independent variables in the methods section along with how they were coded for the regression analysis. They also gave the equation of the standard error of the estimate that they used to evaluate the fit of the regression models. In the results section, they included a table of means and standard deviations of the variables broken down by sex and race. In addition to presenting adequate information on the observations in the study, good medical reports use tables and graphs appropriately. As we outlined in Chapter 3, tables and graphs should be clearly labeled so that they can be interpreted without referring to the text of the article. Furthermore, they should be properly constructed, using the methods illustrated in Chapter 3.

Assuring the Validity of the Data

A good results section should have the following three characteristics. First, authors of medical reports should provide information about the baseline measures of the group (or groups) involved in the study, as did Jackson and colleagues (2002) in Table 2 of their article.
Tables like this one typically give information on the gender, age, and any important risk factors for subjects in the different groups and are especially important in observational studies. Even with randomized studies, it is always a good idea to show that, in fact, the randomization worked and the groups were similar. Investigators often perform statistical tests to demonstrate the lack of significant difference on the baseline measures. If it turns out that the groups are not equivalent, it may be possible to make adjustments for any important differences by one of the covariance adjusting methods discussed in Chapter 10.

Second, readers should be alert for the problem of multiple comparisons in studies in which many statistical tests are performed. Multiple comparisons can occur because a group of subjects is measured at several points in time, for which repeated-measures analysis of variance should be used. They also occur when many study outcomes are of interest; investigators should use multivariate procedures in these situations. In addition, multiple comparisons result when investigators perform many subgroup comparisons, such as between men and women, among different age groups, or between groups defined by the absence or presence of a risk factor. Again, either multivariate methods or clearly stated a priori hypotheses are needed. If investigators find unexpected differences that were not part of the original hypotheses, these should be advanced as tentative conclusions only and should be the basis for further research. Third, it is important to watch for inconsistencies between information presented in tables or graphs and information discussed in the text. Such inconsistencies may be the result of typographic errors, but sometimes they are signs that the authors have reanalyzed and rewritten the results or that the researchers were not very careful in their procedures. In any case, more than one inconsistency should alert us to watch for other problems in the study.

The Discussion & Conclusion Sections of a Research Report

The discussion and conclusion section(s) of a medical report may be one of the easier sections for clinicians to assess. The first and most important point to watch for is consistency among comments in the discussion, questions posed in the introduction, and data presented in the results. In addition, authors should address the consistency, or lack thereof, between their findings and those of other published results. Careful readers will find that a surprisingly large number of published studies do not address the questions posed in their introduction.
A good habit is to refer to the introduction and briefly review the purpose of the study just prior to reading the discussion and conclusion. The second point to consider is whether the authors extrapolated beyond the data analyzed in the study. For example, are there recommendations concerning dosage levels not included in the study? Have conclusions been drawn that require a longer period of follow-up than that covered by the study? Have the results been generalized to groups of patients not represented by those included in the study? Finally, note whether the investigators point out any shortcomings of the study, especially those that affect the conclusions drawn, and discuss research questions that have arisen from the study or that still remain unanswered. No one is in a better position to discuss these issues than the researchers who are intimately involved with the design and analysis of the study they have reported.

A Checklist for Reading the Literature

It is a rare article that meets all the criteria we have included in the following lists. Many articles do not even provide enough information to make a decision about some of the items in the checklist. Nevertheless, practitioners do not have time to read all the articles published, so they must make some choices about which ones are most important and best presented. Bordage and Dawson (2003) developed a set of guidelines for preparing a study and writing a research grant that contains many topics that are relevant to reading an article as well. The companion to our text, the book by Greenberg and colleagues (2000), is recommended for suggestions in reading the epidemiologic literature. Greenhalgh (1997b) presents a collection of articles on various topics published in the British Medical Journal, and the Journal of the American Medical Association has published a series of excellent articles under the general title of "Users' Guides to the Medical Literature."
The following checklist is fairly exhaustive, and some readers may not want to use it unless they are reviewing an article for their own purposes or for a report. The items on the checklist are included in part as a reminder to the reader to look for these characteristics. Its primary purpose is to help clinicians decide whether a journal article is worth reading and, if so, what issues are important when deciding if the results are useful. The items in italics can often be found in a structured abstract. An asterisk (*) designates items that we believe are the most critical; these items are the ones readers should use when a less comprehensive checklist is desired.

Reading the Structured Abstract

*A. Is the topic of the study important and worth knowing about?
*B. What is the purpose of the study? Is the focus on a difference or a relationship? The purpose should be clearly stated; one should not have to guess.
C. What is the main outcome from the study? Does the outcome describe something measured on a numerical scale or something counted on a categorical scale? The outcome should be clearly stated.
D. Is the population of patients relevant to your practice; that is, can you use these results in the care of your patients? The population in the study affects whether or not the results can be generalized.
E. If statistically significant, do the results have clinical significance as well?

Reading the Introduction

*A. What research has already been done on this topic and what outcomes were reported? The study should add new information.

Reading the Methods

*A. Is the appropriate study design used (clinical trial, cohort, case-control, cross-sectional, meta-analysis)?
B. Does the study cover an adequate period of time? Is the follow-up period long enough?
*C. Are the criteria for inclusion and exclusion of subjects clear? How do these criteria limit the applicability of the conclusions? The criteria also affect whether or not the results can be generalized.
*D. Were subjects randomly sampled (or randomly assigned)? Was the sampling method adequately described?
E. Are standard measures used? Is a reference to any unusual measurement/procedure given if needed? Are the measures reliable/replicable?
F. What other outcomes (or dependent variables) and risk factors (or independent variables) are in the study? Are they clearly defined?
*G. Are statistical methods outlined? Are they appropriate? (The first question is easy to check; the second may be more difficult to answer.)
*H. Is there a statement about power, the number of patients that are needed to find the desired outcome? A statement about sample size is essential in a negative study.
I. In a clinical trial:
1. How are subjects recruited?
*2. Are subjects randomly assigned to the study groups? If not:
a. How are patients selected for the study to avoid selection biases?
b. If historical controls are used, are methods and criteria the same for the experimental group; are cases and controls compared on prognostic factors?
*3. Is there a control group? If so, is it a good one?
4. Are appropriate therapies included?
5. Is the study blind? Double-blind? If not, should it be?
6. How is compliance ensured/evaluated?
*7. If some cases are censored, is a survival method such as Kaplan–Meier or the Cox model used?
J. In a cohort study:
*1. How are subjects recruited?
2. Are subjects randomly selected from an eligible pool?
*3. How rigorously are subjects followed? How many dropouts does the study have and who are they?
*4. If some cases are censored, is a survival method such as Kaplan–Meier curves used?
K. In a case-control study:
*1. Are subjects randomly selected from an eligible pool?
2. Is the control group a good one (bias-free)?
3. Are records reviewed independently by more than one person (thereby increasing the reliability of data)?

L. In a cross-sectional (survey, epidemiologic) study:
1. Are the questions unbiased?
*2. Are subjects randomly selected from an eligible pool?
*3. What is the response rate?
M. In a meta-analysis:
*1. How is the literature search conducted?
2. Are the criteria for inclusion and exclusion of studies clearly stated?
*3. Is an effort made to reduce publication bias (because negative studies are less likely to be published)?
*4. Is there information on how many studies are needed to change the conclusion?

Reading the Results

*A. Do the reported findings answer the research questions?
*B. Are actual values reported (means, standard deviations, proportions) so that the magnitude of differences can be judged by the reader?
C. Are many P values reported, thus increasing the chance that some findings are bogus?
*D. Are groups similar on baseline measures? If not, how did investigators deal with these differences (confounding factors)?
E. Are the graphs and tables, and their legends, easy to read and understand?
*F. If the topic is a diagnostic procedure, is information on both sensitivity and specificity (false-positive rate) given? If predictive values are given, is the dependence on prevalence emphasized?

Reading the Conclusion and Discussion

*A. Are the research questions adequately discussed?
*B. Are the conclusions justified? Do the authors extrapolate more than they should, for example, beyond the length of time subjects were studied or to populations not included in the study?
C. Are the conclusions of the study discussed in the context of other relevant research?
D. Are shortcomings of the research addressed?

Exercises

Questions 1–65 are available as an interactive quiz (see Table of Contents page).

Questions 66–70

These questions constitute a set of extended matching items. For each of the situations outlined here, select the most appropriate statistical method to use in analyzing the data from the choices a–i that follow. Each choice may be used more than once.

a. Independent-groups t test
b. Chi-square test
c. Wilcoxon rank sum test
d. Pearson correlation
e. Analysis of variance
f. Mantel–Haenszel chi-square
g. Multiple regression
h. Paired t test
i. Odds ratio

66. Investigating average body weight before and after a supervised exercise program
67. Investigating gender of the head of household in families of patients whose medical costs are covered by insurance, Medicaid, or self
68. Investigating a possible association between exposure to an environmental pollutant and miscarriage
69. Investigating blood cholesterol levels in patients who follow a diet either low or moderate in fat and who take either a drug to lower cholesterol or a placebo

70. Investigating physical functioning in patients with diabetes on the basis of demographic characteristics and level of diabetic control

Questions 71–75

These questions constitute a set of multiple true–false items. For each of the statements, determine whether the statement is true or false. Table 13–12 contains the variables used by Lamas and colleagues (1992) to predict aspirin therapy before myocardial infarction (MI). Refer to the table to answer the following questions.

Table 13–12. Logistic-Regression Model to Predict Aspirin Therapy before Infarction.

Variable                                    Odds Ratio (95% CI)
PTCA before MI                              2.66 (1.57–4.51)
Catheterization before MI                   2.22 (1.59–3.10)
Previous MI                                 1.95 (1.49–2.54)
CABG before MI                              1.80 (1.27–2.55)
White race                                  1.47 (0.97–2.23)
Randomization after 1/28/88                 1.43 (1.11–1.85)
Angina before MI                            1.22 (0.94–1.60)
Hypertension                                1.19 (0.95–1.50)
Male sex                                    1.19 (0.86–1.64)
Married status                              1.03 (0.79–1.35)
Age                                         1.02 (1.01–1.04)
Education after high school                 0.97 (0.77–1.22)
Orthopedic disease                          0.97 (0.55–1.69)
Type of hospital (academic vs community)    0.92 (0.71–1.18)
Diabetes                                    0.83 (0.62–1.08)

CI = confidence interval; PTCA = percutaneous transluminal coronary angioplasty; MI = myocardial infarction; CABG = coronary-artery bypass grafting.

Source: Adapted and used, with permission, from Table 2 in Lamas GA, Pfeffer MA, Hamm P, Wertheimer J, Rouleau JL, Braunwald E: Do the results of randomized clinical trials of cardiovascular drugs influence medical practice? N Engl J Med 1992;327:241–247.

71. Patients who had had a previous MI were significantly more likely to take aspirin.
72. Race was a more significant predictor of aspirin therapy than age.
73. Older patients were significantly more likely to take aspirin.
74. Diabetic patients were significantly less likely to take aspirin.
75. The type of hospital was significantly associated with aspirin use.
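A useful rule for true–false items like these: a 95% confidence interval for an odds ratio that excludes 1.0 corresponds to significance at the 0.05 level, whereas an interval that contains 1.0 does not. The following minimal Python sketch applies that rule to a few rows of Table 13–12 (the confidence limits are copied from the table; the function name is ours).

```python
def significant_or(low, high):
    """A 95% CI for an odds ratio that excludes 1.0 implies P < 0.05."""
    return not (low <= 1.0 <= high)

# Selected 95% confidence intervals from Table 13-12
table = {
    "Previous MI": (1.49, 2.54),              # excludes 1.0 -> significant
    "White race": (0.97, 2.23),               # contains 1.0 -> not significant
    "Age": (1.01, 1.04),                      # excludes 1.0 -> significant
    "Type of hospital": (0.71, 1.18),         # contains 1.0 -> not significant
    "Diabetes": (0.62, 1.08),                 # contains 1.0 -> not significant
}
for variable, (low, high) in table.items():
    verdict = "significant" if significant_or(low, high) else "not significant"
    print(f"{variable}: {verdict}")
```

Applied to the questions, this logic shows, for example, that previous MI and age are significant predictors, while diabetes and type of hospital are not.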
