2
In an ideal health care system, doctors would operate on patients only when there is a strong medical basis for doing so. Researchers would use rigorous methods to evaluate whether specific operations are safe and effective for specific conditions. If compelling evidence emerged that a surgical procedure might not work as expected, physicians and medical researchers would investigate. Once the truth about the benefits of the operation was discovered, the information would be disseminated. Clinical guidelines would be revised, and physician behavior would rapidly change. In sum, treatments would be based on the best available scientific evidence, and progress would be continuous, uniform across regions, and swift.
Unfortunately, this is not how the American medical system always works. A large body of research leaves little doubt that sizable gaps exist between these aspirations and real-world performance. As chapter 1 discussed, new tests and procedures are often widely adopted before they are rigorously evaluated, and practice norms often do not change quickly and consistently in response to credible evidence. Doctors in some regions of the country may continue to perform discredited procedures long after providers elsewhere have abandoned them. Table 2.1, drawn from Timbie and colleagues, displays recent examples of comparative effectiveness studies and their uneven translation into clinical practice.1 Many reasons have been suggested for the failure of evidence to alter clinical practice, including financial incentives, ambiguity of study results, local practice norms, and cognitive biases in the interpretation of new information.2 These broad explanations are convincing, but they do not address the role of key organizational actors (including physicians, medical societies, and policy makers) in the uptake of clinical evidence. To gain a feel for how things work (or fail to work) on the ground, it is useful to take a detailed look at the production, use, and nonuse of evidence in a particular case.
TABLE 2.1. Recent Comparative Effectiveness Studies and Their Translation Outcomes
Source: Copyrighted and published by Project HOPE/Health Affairs as exhibit 1 in Justin W. Timbie, D. Steven Fox, Kristin Van Busum, and Eric C. Schneider. 2012. “Five Reasons That Many Comparative Effectiveness Studies Fail to Change Patient Care and Clinical Practice.” Health Affairs (Millwood) 31 (10): 2168–75. The published article is archived and available online at www.healthaffairs.org.
Here, we examine the use of arthroscopic surgery to treat osteoarthritis (OA) of the knee. By the mid-1980s, arthroscopy had become a preferred method of treatment for knee OA. It was believed that debridement and lavage resulted in less pain and postoperative swelling than other surgical procedures and helped patients avoid the need for a total knee replacement.3 The use of knee arthroscopy was encouraged by the growing use of MRIs, which allowed doctors to view structural abnormalities inside the joint. More than 650,000 debridement and lavage procedures were performed annually in the early 2000s, at which time each procedure cost roughly $5,000.4 About half of all knee OA patients typically said they felt better after their arthroscopies, but some experts doubted that the operations were beneficial. In the mid-1990s, a medical research team decided to test whether the surgery was effective. The team was led by J. Bruce Moseley, a board-certified orthopedic surgeon at Baylor College of Medicine and the team physician of the Houston Rockets and of the 1996 U.S. Olympics Basketball “Dream Team,” and Nelda P. Wray, a physician and health services researcher at the Veterans Administration in Houston. The Moseley-Wray team conducted a rare randomized placebo-controlled clinical trial of a surgical procedure, in which some patients underwent fake operations designed to mimic the real interventions. Patients assigned to the placebo arm received only incisions while they were asleep. Tests of knee function showed that patients who received the real operation did not outperform those who received the placebo on physical tasks, and they reported no more pain relief than those who received only the placebo incisions.5 In short, a surgical procedure performed on millions of patients to ease the pain of arthritic knees worked no better than a fake operation.
On the surface, the widespread use of an operation that works no better than a fake operation is easy to “explain.” A patient has knee pain and doesn’t know the best treatment owing to the uncertainties and information asymmetries of medical care,6 and his or her doctor says, “You will benefit from surgery.” The patient receives “care” when he or she undergoes the surgery. That intervention produces an effect, albeit a placebo effect, but the typical patient would not know this. Moreover, if the knee surgery is the standard in the profession, even shopping around for different treatment options, which is rare, wouldn’t help. There is little in the interaction between doctor and patient that would lead the patient to conclude that a safer and less disruptive treatment might have been available.
But this account, accurate as it may be, evades the really important questions. How did this situation arise? Why were doctors performing a procedure without scientifically demonstrated benefits? What was the state of the medical evidence base before the publication of the Moseley study? How did the medical and public policy communities react when the study came out? How much and how quickly did clinical practices change afterward? These are among the questions this chapter explores.
The U.S. medical system does respond to evidence. When credible research is published in a top-flight medical journal suggesting that a common treatment does not work as believed, as occurred here, the results are widely discussed. However, the uptake of evidence by clinicians is often slow, inconsistent, and variable across communities. The generation, use, and translation of evidence are technical tasks, but they are carried out through decision-making and resource allocation processes that are shaped and constrained by the economic, political, and organizational interests of health sector actors, including doctors. There are no mechanisms in programs like Medicare to ensure that important gaps in the medical evidence base are expeditiously identified and systematically filled to protect the interests of patients and taxpayers. When the effectiveness of a treatment is challenged by a gold-standard study, medical societies do not always behave in ways that put patient interests unambiguously first. Rather than using the emergence of credible evidence that an expensive, invasive procedure works no better than safer alternatives as a welcome opportunity to take a hard look at the conceptual and empirical support for existing treatment protocols, medical societies may give the narrowest possible constructions to research findings. They do so for many reasons, including a desire to maintain professional autonomy and minimize government and insurance industry interference with existing clinical practices. The evidence translation process—turning research findings into better clinical decision making and improved outcomes for patients—sometimes moves at a glacial pace and does not establish an expectation that other therapies (even related interventions in the same practice area) will satisfy higher evidentiary thresholds before diffusing into wide use. 
Indeed, as we will see later in this chapter, after debridement and lavage were discredited and coverage changes occurred, the number of these operations did begin to decline. However, the number of knee surgeries for tears of the meniscus cartilage of the knee increased dramatically. Many were probably not at all surprised when yet another randomized controlled trial found that the latter procedure also worked no better than sham surgery.7 Arthroscopic surgery on the meniscus is still the most common orthopedic procedure in the United States, performed about 700,000 times a year at an estimated cost of $4 billion.8 Experts agree that the procedure is appropriate under some circumstances, especially for younger patients and for tears from sports injuries. But the vast majority—as many as 80 percent—of tears occur in older patients owing to wear and aging. Some experts believe surgery in those cases is often not appropriate.9
The knee case is illustrative of systemic problems in the promotion of evidence-based medicine. While the involvement of more than 50 medical societies in the “Choosing Wisely” campaign to eliminate low-value, wasteful services is a positive step, the initiative thus far has had little impact. A study in JAMA Internal Medicine found that for seven treatment and testing services listed by the Choosing Wisely campaign as usually unnecessary, use of only two had declined. Use of the other five services either had not changed or had increased.10 Clearly, many medical societies are failing to take seriously their professional responsibility to deliver evidence-based care to patients and ensure that the nation does not squander scarce health care dollars on treatments of dubious worth.
The remainder of the chapter is organized as follows. First, we describe how the orthopedic community mostly ignored clear warnings that arthroscopic surgery for OA of the knee might not be effective. Next, we review the Moseley-Wray study’s findings and describe the less-than-optimal responses they prompted from medical societies and policy makers. Finally, we briefly look at the debate surrounding another procedure for knee OA (arthroscopic partial meniscectomy) that has grown in popularity despite a lack of evidence, and identify signs of the same performance pathologies.
Arthroscopic Surgery for OA of the Knee
About 12 percent of those aged 65 and over experience frequent knee pain from OA.11 When anti-inflammatory medication and physical therapy fail to relieve symptoms, doctors may recommend two forms of arthroscopic knee surgery: lavage or debridement with lavage. Lavage (derived from the French “laver,” meaning “to wash”) is a procedure in which the knee joint is thoroughly washed out. Debridement (derived from a French word related to “debris”) “cleans up” the knee by cutting away loose tissue, trimming torn and degenerated meniscus (the shock-absorbing cartilage in the knee joint), and smoothing out the remaining meniscus. These procedures originated in the medical practices of the 1930s and 1940s but did not take off until the development of fiberoptics in the 1970s.12 By the mid-1980s, arthroscopy for OA of the knee had become extremely popular. Arthroscopies are typically performed under general anesthesia and require a recovery period during which the patient experiences pain and decreased mobility. Although serious complications from the procedure are rare, debridement may have a negative effect on the outcomes of a subsequent total knee replacement.13
THE WEAK EVIDENTIARY BASIS OF THE PROCEDURE
The spread of arthroscopic surgery for OA began as surgeons reported their experiences with arthroscopic surgery at scientific conferences and in journal articles. Colleagues became excited about the procedure and started performing it on their patients. These clinical experiences, rather than scientifically rigorous evaluations or a sound theoretical understanding of the mechanism behind alleged treatment effects, provided the basis for the widespread adoption of the procedure in the United States.
A review of the relevant medical literature before the 2002 Moseley study highlights the weak evidentiary basis of this surgery. Although there had been reports on operations similar to lavage and debridement several decades earlier, the first modern study of arthroscopy for OA of the knee was published in 1981. Norman F. Sprague III debrided the joints of patients for whom other therapies had failed.14 The outcome measures were subjective assessments of pain and functioning. About fourteen months after the surgery, patients were asked to compare their current level of pain and functioning to their recollection of what they had experienced before the operation.15 Approximately 75 percent of the patients said their pain and functioning had improved or stayed the same. This early study had two major design weaknesses that persisted in research for two decades. First, the research design did not permit the effect of the surgery to be distinguished from the natural history of the disease or from a placebo effect. Placebo effects are especially common when a subjective measure, such as pain, is the target of the treatment. Second, the assessment was not “blinded.” Those conducting the evaluations knew what procedures individual patients had received. Sprague’s study was actually based on patients’ self-reported experiences, though a similar problem would arise if third-party evaluators knew each patient’s treatment history, as unblinded assessment is an established source of bias.
At least twenty-five additional studies evaluating the effectiveness of lavage and debridement with lavage were published during the next twenty years.16 Like Sprague, most of these follow-up studies relied on retrospective evaluations of the surgery using either patient charts or follow-up interviews—methodologies vulnerable to producing biased results. In a typical example, Yang and Nisonson evaluated 105 postoperative knees by using a 12-point scale that included both subjective measures and measurements of range of motion. Sixty-five percent of patients scored 9 or higher, the cutoff the authors selected to determine which knees should be labeled “good” or “excellent.” The 65 percent “success rate” was taken as supportive of the use of arthroscopic surgery. However, the Yang and Nisonson study lacked a control group against which to measure the benefit from the intervention, and the physician who conducted the case evaluations knew whether individual patients had been operated on (i.e., the study was not blinded).17
The pre-Moseley literature did contain some randomized trials. A few well-designed randomized clinical trials were conducted to gauge the effectiveness of one of the two arthroscopic surgeries, lavage without debridement.18 In contrast to most of the case series evidence, these studies failed to produce convincing evidence that lavage was effective in reducing arthritis severity as measured by outcomes such as function or stiffness. With a much larger number of case histories reporting positive effects from the procedure, however, these negative findings were downplayed. No randomized trial prior to Moseley and colleagues evaluated debridement versus a placebo version of the operation.19
DOUBTS ABOUT THE EFFECTIVENESS OF THE PROCEDURE
Some physicians did question whether or how arthroscopy benefited patients with OA. If there is a strong theoretical argument for how an operation produces benefits, the need for empirical support is lessened. There was, however, no broadly accepted theoretical mechanism to explain how arthroscopic surgery helped patients. Several hypotheses were offered to explain the positive findings reported in the typically nonrandomized, unblinded clinical studies. Some researchers pointed out that the arthritic joint contains irritating debris and enzymes. It was suggested that flushing out these sources of irritation might reduce pain.20 Debridement was said to reduce pain by reducing mechanical problems of the knee such as “catching” or “popping” and improving the distribution of weight on joint surfaces.21 These explanations were recognized as speculative. Many studies reporting positive clinical findings were at a loss to explain them. In one review article generally supportive of the surgery, for example, Hanssen and colleagues acknowledged that “the mechanism of pain relief following arthroscopic treatment of osteoarthritis is obscure.”22
Questions about the use of arthroscopy for patients with knee arthritis were raised at medical conferences and in professional journals. In a symposium (“Uses and Abuses of Arthroscopy”) conducted at the 1992 annual meeting of the American Academy of Orthopaedic Surgeons (AAOS) and an article reprinted in one of the leading journals in the field, practitioners and researchers openly discussed both the state of knowledge regarding arthroscopic surgery and how the financial incentives of the health care system might distort the decision to recommend surgery. The symposium was noteworthy for its blunt language regarding a common orthopedic procedure. Dr. John Goodfellow, a former president of the British Orthopaedic Association, mocked the use of arthroscopic surgeries for OA and suggested they might be “pseudotreatments.” He commented that “no one has performed the double-blind controlled trial that would be necessary to distinguish between the placebo effects of any operation and the direct benefits of debridement.”23 Despite the obvious and recognized need for rigorous investigation to determine whether the procedure worked, there was no sense of urgency. The lack of hard evidence for the procedure’s effectiveness was apparently more a regrettable condition than a pressing problem for physicians and researchers to solve.
This troubling situation, in which the lack of hard evidence for the use of a procedure is a poorly kept secret among practitioners, is not unusual in American medical practice. Many popular surgical procedures rest on weak evidence, including certain procedures done to relieve lower back, neck, spine, or neurological pain, tendonitis, and impingement in the shoulder.24 Questions have been raised about vertebroplasty, a procedure in which a form of bone cement is injected into the broken spinal bones of patients with osteoporosis. A 2009 randomized, double-blind trial found no beneficial effect compared to a sham procedure in which needles were inserted into the back without injecting cement.25 Two years after the study’s publication, the procedure was still being performed, and insurance coverage was unchanged.26 Another example of a questionable operation is “spinal fusion surgery,” in which a bone graft is used to fuse together two vertebrae in an attempt to relieve lower back pain from degenerated discs. The operation, which has been increasingly popular in the United States, can have serious side effects, including blindness and paralysis. Even as use of the surgery grew, it remained unclear whether spinal fusion works well for its most common indications, according to Deyo and colleagues.27
A WIDESPREAD PROBLEM
The problems we observe in the production of evidence in the case of arthroscopic surgery have long been known to experts. An article in Science magazine in 1978 noted that, although new drugs and devices must satisfy federal regulations that require rigorous testing in animals followed by carefully controlled testing, “new surgical operations may or may not be tested in animals, may be introduced as human therapy with or without review by a human experimentation committee and with or without a formal experimental design, and may or may not be evaluated by long-term follow-up observation.”28 A 1975 commentary in the Journal of the American Medical Association pointed out that a double standard exists for surgery. Drugs must undergo Food and Drug Administration (FDA) review—a requirement that is not costless but is almost certainly socially appropriate. Yet no agency has been created to protect the patient from harmful and ineffective operations. This same double standard exists among journal editors and reviewers, “who regularly reject inadequately controlled medical trials and regularly publish inadequately controlled surgical trials.”29
What accounts for this double standard and for the poor state of scientific knowledge about surgical procedures more generally? Some argue that ethical considerations limit the use of rigorous, placebo-controlled clinical trials to study surgical interventions. Although it is morally acceptable to use placebos when evaluating a new drug, it is sometimes thought unacceptable to do so when studying a new operation, even though this research approach has produced startling results on the occasions it has been used.30 Opponents of sham surgery trials argue that it is too difficult to design an operation as inert as the usual sugar pill used as a placebo in drug trials.
Proponents of placebo arms in studies of surgical procedures argue that such research designs are ethically sound.31 Ethical reservations about sham surgery, proponents claim, confuse the ethics of clinical research with the ethics of personal medical therapy. According to bioethicist Franklin G. Miller, “Clinical trials routinely administer interventions to patients that carry risks that are not compensated by medical benefits to them, but are justified by the anticipated value of scientific knowledge accruing from their use. For example, invasive procedures such as blood draws, lumbar punctures, and biopsies are often administered to measure trial outcomes.”32 If there is no consensus among the medical community on the benefits and use of a procedure, if the scientific need for blinding is compelling, if participants are informed about the risks to the placebo control group, and if the risks to subjects in the control group are minimized, then sham surgery trials may be ethically appropriate, as the American Medical Association (AMA) has acknowledged.33
The ethical argument against sham surgery studies can easily be turned on its head. As physician David H. Spodick wrote,
The omission of adequate standards for surgical therapies should be especially surprising, since even the most essential operation involves inevitable trauma—physical, metabolic, and psychic—not to mention the risks of anesthesia. Indeed, when evaluated under comparable conditions against the outcome of alternative treatments or of no treatment, a new operation resulting in net loss to the patients, or in the same degree of recovery that would have occurred spontaneously, might be fantasized as a well-intended “assault.”34
Following Spodick’s line of reasoning, if ethics were decisive, one might reasonably expect placebo-controlled arms to be more prevalent in the evaluation of surgery than in the evaluation of any other kind of medical intervention.
Pioneering Sham Surgery Study
The landmark study “A Controlled Trial of Arthroscopic Surgery for Osteoarthritis of the Knee,” published in the July 11, 2002, issue of the New England Journal of Medicine, subjected a questionable procedure to rigorous scientific analysis. The article reported the results of a double-blind clinical trial in which Houston Veterans Administration hospital patients were randomly assigned to receive lavage, debridement with lavage, or “sham surgery.” For patients in the debridement with lavage group, the joint was lavaged, “rough articular cartilage was shaved (chondroplasty was performed), loose debris was removed, all torn or degenerated meniscal fragments were trimmed, and the remaining meniscus was smoothed to a firm and stable rim.”35 One of the study’s lead authors, surgeon J. Bruce Moseley, had long been skeptical of debridement and lavage. “I just didn’t quite understand why people were reporting so much benefit when seemingly there wasn’t very much done,” he said.36 Nelda Wray, the coleader of the research team, thought the placebo effect might be responsible for patients feeling better after the operation. The two researchers agreed that a sham surgery clinical trial was the best way to test whether the procedure had any benefit beyond a possible placebo effect.37
Participants in the trial were told that they might receive only placebo surgery and had to give their informed consent. The researchers created a placebo arm that mimicked the sight and sound of the actual procedure and included both sedation and an incision. For ethical reasons, patients in the sham surgery group were not placed under general anesthesia (they instead received a short-acting tranquilizer that caused them to fall asleep). Special care was taken to ensure that patients understood that they might receive sham surgery; subjects were required to transcribe the informed consent form prior to signing it. Subjects receiving the real procedures and those receiving the fake operations were equally likely to guess they were in the placebo group.
Moseley performed all the procedures himself and did not know whether the patient would receive a real or fake operation until he opened a sealed envelope when patients were wheeled into surgery. The effects of the procedures were assessed through subjective measures of pain and function as well as through objective measures of function, such as the amount of time it took for patients to walk thirty meters and climb up and down a flight of stairs. Measurements were obtained at several points throughout a two-year follow-up period. Evaluators did not know whether patients were in the treatment or control groups. The results suggested that the surgeries were no better than the placebo operations. At no point during the two-year period did either of the actual surgery groups report a statistically significant improvement in pain or function versus the placebo group. The average outcome measures produced by the placebo group were statistically superior to the surgery groups at two weeks. At all other time periods, the outcomes were statistically equivalent.
Although many scientific breakthroughs receive little attention from broader publics, the study’s novel research design and provocative results made for arresting news copy. The study was praised in the lead editorial of the New England Journal of Medicine and produced extensive, if short-lived, media coverage.38 Front-page articles appeared in major national newspapers.39 The study’s finding that a major procedure worked no better than a fake operation was something of a bombshell, but the results were probably not shocking to many experts. As one leading evidence-based medical research organization stated on its web site, “The results of the [sham surgery study] should not be a surprise based on the evidence from the literature. There never was any good evidence that lavage or debridement were useful things to do.”40
STRENGTHS AND LIMITATIONS OF THE STUDY
The Moseley study was a significant advance on previous research. Whereas most prior investigations in orthopedics were retrospective case-series studies in which surgeons simply reported their experience with a procedure, the Moseley study was a double-blinded, placebo-controlled randomized trial—the “gold standard” in research medicine. Of course, even the best scientific studies have limitations. There were some features of the research design that might cast doubt on the validity or generalizability of the findings. An understanding of the scientific basis of the criticisms is therefore necessary to assess the reactions the study generated from interested parties.
One limitation of the study’s research design is that it could not determine whether the improvement observed in the sham surgery group was due to some active placebo effect (such as some psychological benefit from receiving a physician’s attention) or to the natural history of the patient’s disease. If the decision to have surgery follows a period of unusually severe symptoms, simple regression to the mean might produce improvement over time in the absence of any medical intervention. Although there are reasons to believe the placebo effect was responsible for the improvement in patients’ conditions, the issue cannot be resolved because the study lacked a natural history arm in which patients received neither a real nor a sham procedure.41 However, although apportioning the improvement in the placebo arm among its sources would be an important research finding, it has no bearing on whether the arthroscopic procedures are superior to the sham surgery.42
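The regression-to-the-mean point can be illustrated with a small simulation. The sketch below is not drawn from the Moseley study; the pain scale, the enrollment threshold of 70, and the noise levels are all hypothetical assumptions chosen only to show the mechanism. Simulated people are "enrolled" when a noisy pain reading is unusually high, nobody is treated, and pain is measured again later.

```python
import random
import statistics


def simulate_untreated_followup(n_patients=10_000, threshold=70.0, seed=0):
    """Return (mean pain at enrollment, mean pain at follow-up, number enrolled).

    Hypothetical model: each person has a stable "usual" pain level, and every
    measurement adds independent day-to-day noise. Nobody receives treatment.
    """
    rng = random.Random(seed)
    enrollment_scores = []
    followup_scores = []
    for _ in range(n_patients):
        usual_pain = rng.gauss(50.0, 10.0)               # assumed stable level
        pain_today = usual_pain + rng.gauss(0.0, 10.0)   # noisy reading
        if pain_today >= threshold:                      # enroll only on a bad day
            pain_later = usual_pain + rng.gauss(0.0, 10.0)  # fresh noise, no treatment
            enrollment_scores.append(pain_today)
            followup_scores.append(pain_later)
    return (
        statistics.mean(enrollment_scores),
        statistics.mean(followup_scores),
        len(enrollment_scores),
    )


if __name__ == "__main__":
    before, after, enrolled = simulate_untreated_followup()
    print(f"enrolled: {enrolled}")
    print(f"mean pain at enrollment: {before:.1f}")
    print(f"mean pain at follow-up:  {after:.1f}")
```

Under these assumptions the enrolled group reports less pain at follow-up even though no intervention occurred, which is why a trial without a natural history arm cannot separate a placebo effect from the ordinary waxing and waning of symptoms.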
A potentially more important criticism is the claim that the study used a flawed method for selecting patients for inclusion in the clinical trial. To gain entry into the trial, patients had to report knee pain. Critics argued that prior to the Moseley study the orthopedic community already recognized that patients who present with pain only are unlikely to benefit from the surgery, but that there are subgroups of patients with knee arthritis for whom the procedure is efficacious, including patients with early-stage arthritis and those with mechanical symptoms, such as joint “locking,” “giving way,” “popping,” or “clicking.”43 The implication is the sham surgery study selected the wrong patients as subjects and that its findings should not fundamentally reorient orthopedic practice.
This argument is unconvincing for several reasons. First, it appears to be based on a crude misreading of the study design. The study did include patients on the basis of a pain measure but excluded patients only if they had a severe deformity or a meniscal tear that was observed preoperatively.44 Almost all the patients included in the trial had both pain and mechanical symptoms. Mechanical problems of the knee are ubiquitous among patients with OA. In older patients with joint pain, it is nearly always possible to find some kind of mechanical symptom.45 In follow-up correspondence appearing in the New England Journal of Medicine, the authors reported that 96 percent (172 out of 180) of the patients in the study had at least one mechanical problem.46 The study results therefore suggest that neither lavage nor debridement with lavage is an effective treatment beyond placebo for patients who present with pain and one or more mechanical problems.
Second, the sham surgery performed as well as the actual operations on the entire sample. If there was in fact a subgroup of patients for whom the operation was effective, it might be expected that the average improvement for the treatment groups, which consist of a mix of “appropriate” and “inappropriate” patients, would be attenuated but not entirely absent; there was, however, no consistent pattern of relative improvement by the treatments versus the placebo group. Third, although there were assertions about the criteria used for patient selection for surgery prior to the Moseley study, no empirical evidence was presented regarding actual patient selection practices. One study that examined the ability of doctors to anticipate which patients were likely to benefit from arthroscopy found they performed only slightly better than chance.47 Finally, it should be noted that even if the Moseley study can be cast aside completely, the state of the evidence merely reverts to the pre-Moseley state; and, prior to Moseley, there was no methodologically convincing evidence that arthroscopic surgery works for the subgroups identified by the Moseley critics.48
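The attenuation argument is simple arithmetic, sketched below with purely hypothetical numbers (the function name, the 25 percent responder share, and the 8-point benefit are illustrative assumptions, not figures from the trial). If only a fraction of treated patients truly benefit, the average benefit across the whole treatment group is diluted in proportion to that fraction, but it does not vanish.

```python
def diluted_average_effect(responder_share, responder_benefit,
                           nonresponder_benefit=0.0):
    """Average benefit over placebo in a mixed group of patients.

    responder_share: fraction of treated patients who truly benefit (0 to 1).
    responder_benefit: hypothetical benefit over placebo for those patients.
    nonresponder_benefit: benefit over placebo for everyone else (default 0).
    """
    if not 0.0 <= responder_share <= 1.0:
        raise ValueError("responder_share must be between 0 and 1")
    return (responder_share * responder_benefit
            + (1.0 - responder_share) * nonresponder_benefit)


if __name__ == "__main__":
    # Hypothetical: a quarter of patients gain 8 points over placebo,
    # and the rest gain nothing.
    print(diluted_average_effect(0.25, 8.0))
```

Whether a diluted effect of a given size would have been statistically detectable depends on the trial's sample size and outcome variability, which this sketch does not model.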
To summarize: Moseley found no evidence that arthroscopic surgery relieves pain or improves function any better than a placebo operation for patients with knee arthritis. If there were any large beneficial effects, the study very likely would have found them. “Despite their current popularity, lavage and debridement are probably not efficacious as treatments for most persons with osteoarthritis of the knee,” editorialized the New England Journal of Medicine.49 A leading expert stated that he believes the study’s findings mean that 80 to 90 percent of the arthroscopies that have been performed on patients with arthritic knees should not have been done.50
Medical and Policy Community’s Reactions to Moseley’s Findings
At first glance, the knee surgery case might seem to illustrate the U.S. research and policy-making enterprise working at its best. Moseley, Wray, and their colleagues performed a first-class study that used powerful scientific methods to address a substantively important question. The study was conducted at a Veterans Administration hospital and paid for by federal research grants. It was published in the New England Journal of Medicine, and its results were disseminated to the public by media outlets. Contrary to the argument that large, bureaucratic organizations are hidebound and slow to react, key federal agencies, including the Centers for Medicare and Medicaid Services, altered their coverage policies in direct response to the study’s findings.
But these first impressions are misleading. If we probe more deeply into the case, the system’s performance appears troubling. After the publication of the Moseley study, the orthopedic medical societies pressured the government to adopt a very narrow interpretation of the study’s findings, one that preserved surgeons’ clinical autonomy and minimized the need to revise prior medical practices. The coverage decisions of federal health agencies were largely in line with the associations’ position. Moreover, that the sham surgery trial was conducted in the first place depended on individual initiative. Although doubts about the efficacy of arthroscopic surgery for knee arthritis had been raised at least since the early 1990s, the value of the procedure might never have been rigorously tested if the Moseley-Wray team had not fortuitously come along to conduct a critical sham surgery test. There were no institutions in place to make detection and investigation of questionable procedures a routine matter.
SURGEONS AND PROFESSIONAL SOCIETIES
We suspect that many individual orthopedic surgeons found it hard to believe Moseley’s stunning finding that the surgery had no advantage over the placebo. What orthopedic surgeons knew—saw with their own eyes—is that their patients clearly improved after the operation. “I’ve done thousands of these in people with osteoarthritic knees, and they really are better,” said surgeon Robert W. Jackson, a fierce defender of the procedures.51 These surgeons may have failed to recognize that the study did not claim the procedure had no impact, only that the observed benefits are due to the placebo effect or natural history of the disease rather than the surgeon’s skill.
The major initial reaction to the study from professional associations was not confusion but opposition. The professional associations defended the practices of their members and argued that Moseley’s findings should not discredit use of the procedure for patients with OA. Professional groups like the American Association of Orthopaedic Surgeons (AAOS) argued that Moseley had failed to examine the benefits of the procedure for various subgroups, such as those with mechanical symptoms and normal alignment, and asserted that responsible surgeons already practiced proper patient selection.
At one level, the resistance to Moseley and colleagues’ findings from the specialty associations reflects a difference in professional norms and orientations. As study coauthor Nelda Wray told us in an interview, “I speak the language of science, and the orthopedists do not.”52 In addition, the study was a direct economic threat to the specialists, observed David T. Felson, a leading physician who coauthored a New England Journal of Medicine editorial about the sham surgery study.53
Some surgeons who questioned the use of arthroscopy for patients with knee arthritis insisted it was vital to preserve insurance coverage and maintain professional autonomy. “I’m both a patient and a physician,” said AAOS chief executive Dr. William J. Tipton Jr., explaining to a New York Times reporter that he has osteoarthritis. “My knee is buckling now, but I’m not going to have arthroscopy done. I recognize that it’s not going to help.” But Tipton said he would hate to see insurers refuse to pay on the basis of the Moseley study. If that occurs, he said, surgeons will complain. “This is where eyebrows are going to be raised,” he said. “There’s going to be a certain group of physicians who are very upset. This is another example of managed care at its lowest, with payers calling the shots. I think it’s not good medicine.”54
FEDERAL AGENCIES AND PRIVATE INSURERS
The federal government pays for a significant share of health care provision through Medicare and other programs and influences coverage decisions in the private sector. When a state-of-the-art medical study finds that an expensive medical procedure works no better than a fake operation for most people with a common medical condition, and there is no convincing counterevidence, this should be reflected in health policy decisions. Yet federal health officials have traditionally been reluctant to tread on physicians’ professional autonomy, especially with respect to clinical judgments about what services patients require. The founding premise of the Medicare program was that the clinical autonomy of participating physicians would be protected. Over time, the federal government’s efforts to control health care costs, along with the growth of managed care in the private sector, have resulted in some erosion of physicians’ autonomy, but the presumption remains that doctors can best judge what treatments patients need.55
Federal bureaucrats are often faulted for being slow to act in the face of new information, but federal health agencies began reviewing their coverage policies immediately after the publication of the Moseley study. The final decisions of these agencies, however, followed a questionable, narrow reading of the study’s findings preferred by the professional associations. Although the Veterans Administration initially recommended that arthroscopies for knee arthritis not be performed absent “clear clinical evidence of significant derangement,” it subsequently announced that the Moseley study would not change the standard of practice at VA hospitals after all. Both an internal review by VA officials and an expert panel on orthopedic surgery concluded that the findings of the sham surgery study were not sufficient to limit or prohibit knee arthroscopy within the VA. The main reason given for the decision not to limit coverage was that outside experts asserted that surgeons rarely perform arthroscopy solely for pain associated with OA.56 The federal Centers for Medicare and Medicaid Services (CMS) also made policy decisions that were largely in line with the position of the orthopedic associations.
Following the publication of the Moseley study, which CMS analysts believed was so important that it simply could not be ignored, the agency began a careful review of the scientific evidence to determine whether arthroscopic surgery for patients with arthritic knees should be nationally covered under Medicare.57 The agency met with Dr. Wray, spoke with Dr. Moseley on the phone, and then held two separate meetings (in November 2002 and January 2003) with representatives of key professional associations.
The AAOS, the Arthroscopy Association of North America (AANA), the American Association of Hip and Knee Surgeons, and affiliated groups prepared a joint report to “provide CMS with clinical and scientific information” about arthroscopic procedures for patients with arthritic knees. The major conclusion of the report was that many patients with OA of the knee, especially those with early degenerative arthritis and mechanical symptoms, can be significantly helped with arthroscopic surgery.58 All the research studies cited in support of this conclusion, however, suffered from the basic methodological problems characterizing the research literature prior to the Moseley study. These problems were not acknowledged. The report also did not address Moseley and Wray’s response that 172 of 180 subjects in their study had mechanical symptoms. Nor did it indicate that the orthopedic groups should (and would) seek to generate hard evidence to support their claims about the benefits of the procedure for population subgroups by sponsoring replications of the study.
In July 2003, CMS concluded that coverage should be changed in response to the Moseley study and the subsequent review of the evidence.59 The agency concluded that there was no evidence to support lavage alone for OA patients and that the procedure would henceforth be nationally noncovered. With respect to debridement, CMS determined that the procedure would be nationally noncovered when patients presented with knee pain only or with severe OA. CMS decided, however, to maintain local Medicare contractors’ discretion to cover the surgery if physicians requested it for patients with pain and other indications (for example, mechanical symptoms). This policy was subsequently echoed in the coverage decisions of major private insurers.60
CMS was unable to present solid evidence in support of its decision to maintain coverage of debridement for the vast majority of patients with arthritic knees (mechanical symptoms being ubiquitous in the OA population). Indeed, the agency deemed the available medical evidence on the issue to be “inconclusive because of methodological deficiencies.” Including the Moseley and Wray investigation, there were only four studies that addressed debridement in patients with mechanical symptoms as the indication for surgery, and three of them were case series that lacked random assignment to control groups and used nonvalidated assessment scales. CMS acknowledged that the level of evidence supporting the usefulness of the procedure was “suboptimal” and that case series studies in general are considered methodologically weak in their ability to minimize bias. Nonetheless, CMS continued to pay for debridement. It argued that the three case series “consistently” pointed to improvements in outcomes. The CMS coverage analysis acknowledged the unusually high quality of the Moseley study, calling it the only “large-scale, well-designed” randomized clinical trial in the pool of evidence, but argued that the study failed to “specifically address the issue of reduction of mechanical symptoms.” The fact that virtually all the patients in the Moseley study had one or more mechanical symptoms, as the authors reported in follow-up letters to the New England Journal of Medicine and other professional journals, went unmentioned. Although CMS’s official policy is to take into account the scientific quality of medical evidence, in practice its coverage decision process weighs most heavily the findings of studies published in peer-reviewed journals, even if the methodological quality of a study is problematic.61 CMS’s coverage review process is thus based on a high degree of trust in the medical profession’s ability to determine appropriate medical practices and to self-regulate.
The AAOS welcomed CMS’s coverage decision: “The coverage decision parallels the position of the musculoskeletal societies. CMS recognized that arthroscopy is appropriate in virtually all circumstances in which the orthopaedic community now employs this technology,” an association newsletter stated.62 In fact, CMS did not perform an empirical investigation of actual surgical practices regarding patient selection for OA of the knee and consequently could not provide assurances regarding current surgical practices. The societies recognized that the CMS decision left key coverage decisions in the hands of local Medicare contractors. The implementation process would therefore be critical. The AAOS promised its members that the organization would provide carriers with “specific and detailed instructions to implement the coverage decision appropriately. The instructions should clearly reflect the limited applicability of the policy decision.”63 Moseley moved on to other endeavors, and no interest group existed to counterbalance the messages that Medicare contractors would hear from surgeons about the indications for arthroscopy.64
CMS was in a difficult political position because a decision to deny coverage for a procedure will create cases where CMS judgments substitute for those of a patient’s doctor. The fact that CMS initiated the coverage review of arthroscopy in the first place reflects well on the agency and its staff. To go further, and deny coverage of both lavage and debridement, would likely have caused the agency to come under withering attack from orthopedists and other medical specialties. The critical policy failure is not that CMS continued to pay for a procedure for a (large) population subgroup in the absence of solid data in one specific case, but that as a result of existing law and decades of precedent the agency routinely makes coverage and reimbursement decisions without taking into account the comparative effectiveness of treatments.65 The creation of a new CER institute under the ACA is unlikely to solve this problem. More fundamental shifts in Medicare payment approaches will be required.66
MEDICAL RESEARCH: ABSENCE OF REPLICATION STUDIES
It is interesting to take note of what did not happen after the publication of the landmark New England Journal of Medicine sham surgery study. The medical research community did not immediately carry out new clinical trials that repeated the research design to check the robustness of the findings, even though many prominent orthopedic specialists called for just this to occur. The medical societies did not demand an immediate study to address the “supposed” methodological issues they maintained were the source of their skepticism of the Moseley-Wray study. Replication studies would have allowed any lingering concerns about the stunning finding that arthroscopy works no better than a placebo to be addressed in a scientifically valid way. At the same time, a carefully designed replication study would have allowed researchers to investigate whether the procedure works for a specific subgroup, such as those with mechanical symptoms or a meniscal tear.
Since the Moseley study, only one other randomized controlled trial focusing on the efficacy of arthroscopic debridement and lavage on pain and function has been published. Kirkley and colleagues at the University of Western Ontario, Canada, randomized patients with moderately advanced OA either to debridement plus a standard physical therapy (PT) regimen or to the PT regimen alone. While the surgical group had an initial improvement in pain or functional status compared to the PT group at the three-month follow-up visit, there were no clinically meaningful or statistically significant differences in improvement between the two groups at any subsequent visits. Thus, the Kirkley study failed to find evidence for the clinical benefits of debridement over and above standard PT. An important methodological limitation of the Kirkley RCT, however, is that it did not include a placebo arm.67
IMPACT ON PRACTICE
If the U.S. health care system were working well, the stunning evidence from the Moseley study would have triggered a dramatic decrease in the number of arthroscopies performed for patients with knee OA, especially because there was no good evidence that the procedures worked in the first place. There are signs that the use of debridement has indeed begun to decline, but the decrease in utilization has been gradual. A large gap remains between what research indicates and what doctors do.68 Certainly practice has not come close to matching the belief of experts like Felson that 80–90 percent of these arthroscopic procedures should no longer be done.69 Not only has the decline in utilization been slower than would be predicted if medical practices were fully responsive to changes in the evidence base, but the timing of the decline suggests that the revision in Medicare coverage policy was a key driver, not just the emergence of better evidence. Further, it is difficult to determine how much of the observed trend of declining utilization reflects a real change in physicians’ behavior that resulted in fewer arthroscopies actually being performed on patients with knee OA. As discussed below, other possible causes of the observed trend are changes in how doctors diagnosed patients and coded procedures for purposes of insurance reimbursement and payment.
There is no national system to track the medical procedures performed on patients with knee OA. There have been several attempts, however, to measure the changes in the number of debridement and lavage procedures for knee OA patients using available databases. For example, Sunny Kim and colleagues used the U.S. National Ambulatory Surgical Database and found that the number of knee arthroscopies performed for OA patients declined by 18 percent from 1996 to 2006.70 Howard and colleagues examined ambulatory surgery data from Florida “and found that the number of [these procedures] per 100,000 adults declined 47 percent between 2001 and 2010.”71 Utilization declined after the 2002 Moseley study and then again after the 2004 CMS coverage change and the 2008 Kirkley study. Despite the observed decline in utilization, some experts questioned how much physician behavior really changed. In 2008, Moseley told the New York Times, “What happened after our study was that organized orthopedics rallied the troops to try and discredit our study as much as possible. People continued to practice the way they practiced.”72
There are several reasons why trends of declining usage may not accurately measure the actual impact of evidence on physician behavior. First, some doctors could have kept performing the procedures as before, but altered what they said was patients’ main problem, giving diagnoses other than OA. After the Moseley study, AAOS clinical guidelines recommended against debridement and lavage for patients with a primary diagnosis of symptomatic OA of the knee. This guideline did not apply, however, to “patients who had a primary diagnosis of meniscal tear, loose body, or other mechanical derangement, with concomitant diagnosis of osteoarthritis of the knee (emphasis added).”73 Meniscal tears are highly prevalent among people with knee arthritis, including among patients not experiencing pain. (Tears are also common among the general population.) Having a meniscal tear and knee pain does not mean that the tear is the cause of the pain; knee pain arises for many reasons.74 The increasing use of MRIs has therefore produced many incidental findings of meniscal tears that are not the cause of symptoms.75
Second, some doctors who had previously debrided OA patients could have begun performing a closely related procedure, arthroscopic partial meniscectomy (APM). An indication for this procedure is symptomatic meniscal tear.76 As noted above, meniscal tears are ubiquitous among the OA population. The key question is whether “surgeons simply used different codes (such as meniscal tear) when performing arthroscopy for OA, or actually performed fewer of these procedures in the OA population.”77 This question is difficult to answer. Over time, the number of procedures done for OA has declined, and the number done for meniscal tears has increased. The number of procedures done for OA is small in relation to the number of arthroscopies done for other indications, making it difficult to discern whether the former’s declining utilization has contributed to the growing utilization of other procedures. There is a clear possibility that orthopedic surgeons have been “unmoved by the pivotal trials … and are still scoping the same patients and their knees, yet possibly coding the procedure differently,” wrote Järvinen and colleagues.78
Déjà Vu All Over Again? The Case of Arthroscopic Partial Meniscectomy
When a strong body of research demonstrates that a common procedure does not work as advertised, the finding should prompt the medical community to take a hard look at other widely used treatments in the same practice area that rest on the same evidence base and to consider whether their use should also be reevaluated. The cumulative evidence about the lack of demonstrated effectiveness of the procedures examined in the Moseley-Wray study should have raised serious doubts that arthroscopic surgery would be useful for patients who have symptomatic meniscal tears in the setting of OA. Although the focus of the Moseley and Kirkley studies was on OA, there were participants in these studies who had mechanical problems and/or meniscal tears, and if the procedures were very beneficial for these subgroups, the data would likely have revealed signs of it. Still, it is understandable (indeed desirable) that surgeons would want to see evidence from studies specifically focused on the question of whether APM helps patients who have meniscal tears along with OA. As in the lavage and debridement case, however, the research designs of the early studies were poor. Then, when well-done studies, including Sihvonen and colleagues’ sham-controlled randomized trial,79 demonstrated that APM does not work as advertised, the orthopedic community again responded in ways that raise questions about its responsiveness to evidence.
By 2006, around 700,000 APM procedures were being performed annually in the United States.80 These interventions rested on a weak evidence base. The first studies demonstrating the benefits of APM were nonrandomized studies with generally small samples.81 The case for performing these procedures weakened further with the publication of two important randomized controlled trials in the New England Journal of Medicine in 2013. The first demonstrated that APM combined with physical therapy provides no better relief of knee symptoms than physical therapy alone in patients with a meniscal tear and OA.82 The second—which echoed the power and elegance of the Moseley study of debridement and lavage—showed in a double-blind, sham-controlled randomized trial that APM works no better than a fake operation for people with a degenerative meniscal tear and no knee OA.83 The Sihvonen study deliberately recruited knee patients without OA because these are the patients who were “most likely to have a good response,” and if APM cannot work under the best circumstances (e.g., on those with a meniscal tear but no OA), it follows that the procedure won’t be effective in “less optimal routine settings.”84
Despite the lack of evidence for the clinical efficacy of APM relative to nonsurgical alternatives, the orthopedic community continued to resist the obvious conclusion from well-designed research studies; namely, that APM does not have demonstrated clinical benefits compared to cheaper and less invasive alternatives. Noting that the 700,000 APM procedures performed in the United States generate $4 billion in direct medical spending, one of the authors of the sham surgery study predicted that the research findings “will not be welcomed with open arms.”85 At a packed session at the annual meeting of the AAOS, the conclusions and methodology of the study “were heavily criticized by notable orthopedic leaders.”86 Criticisms from the orthopedic community were also raised in letters to the New England Journal of Medicine. Many of the concerns raised against the APM sham surgery study were similar to the kinds of criticisms directed at the earlier Moseley study, including the alleged exclusion of patients with mechanical problems. In fact, 47 percent of the participants in the Sihvonen study had preoperative mechanical symptoms.87
The most pointed criticism of the APM study, however, was not patient selection. Rather, it was that the sham used in the study was not a “true” sham but in fact lavage, and lavage is (according to a letter to the New England Journal of Medicine written by three physicians critical of the study) “an accepted surgical procedure.”88 A University of Maryland surgeon quoted in a newspaper article said: “If you scope the knee (without touching the cushion), that will often help even if you don’t completely address the torn-meniscus issue.” When fluid is injected, he added, “you’re taking out the junky, thick, irritating fluid that can give a lot of people their pain.”89
We see where we are. According to leading orthopedic surgeons, the negative results of the most rigorous study of the benefits of APM should not be believed. And the key reason the study should not be believed is that the patients in the placebo arm received an injection of fluid, an intervention found equivalent to a placebo intervention in the 2002 Moseley-Wray study. In sum, several doctors claimed that APM should not be discredited because the study providing evidence of its lack of effectiveness was built on the findings of an earlier study that showed that lavage works no better than fake irrigation.
This episode shows the broader value of examining reactions to the Moseley-Wray study: the case makes subsequent developments readily recognizable. Patterns repeat with surprisingly little variation over time. The problems we have documented regarding the medical system’s uptake of evidence are deep-seated and widespread, rather than the result of minor performance failures in one part of an otherwise well-functioning system.