15b
Michael Chmielewski and Mayson Trujillo
Flory (this volume) presents an overview of categorical personality disorders (PDs) and briefly discusses some of their well-established problems. Specifically, categorical PDs have excessive diagnostic comorbidity, extreme heterogeneity, and arbitrary boundaries with normality. As Flory notes, these problems are not unique to personality disorders. In fact, categorical diagnoses for nearly all forms of psychopathology appear to be less reliable and valid than commonly believed (Chmielewski, Clark, Bagby, & Watson, 2015; Regier et al., 2013). In addition, the categorical PD model suffers from poor convergent and discriminant validity, excessive use of the not otherwise specified diagnosis, and low diagnostic stability (for reviews see Clark, 2007; Hopwood et al., 2018; Widiger & Samuel, 2005; Widiger & Trull, 2007).
Flory then relates several commonly touted advantages of categorical PDs, including clinical utility, facilitating communication, reimbursement, public policy, and a common international language. However, such claims are not universally accepted. As Hopwood et al. (2018) argue, the clinical utility of categorical PDs is, at best, questionable, as most categorical PDs do not have validated interventions. It is also debatable whether categorical PDs have an advantage in terms of providing a universal language and facilitating communication. Clinicians already use dimensional models in their practice, and the majority of clinicians and researchers support moving away from categorical models to dimensional models (Bernstein, Iscan, Maser, & Boards of the Directors of ARPD and ISSPD, 2007; Helzer, Wittchen, Krueger, & Kraemer, 2008; Nelson, Huprich, Shankar, Sohnleitner, & Paggeot, 2017; Widiger & Samuel, 2005).
This debate notwithstanding, the purported advantages of the categorical PD model hinge on diagnoses being reliable. In other words, a patient should receive the same diagnosis regardless of the hospital or clinic they attend. Likewise, researchers examining specific categorical PDs must be recruiting similar patients into their studies. To the extent that this does not occur, patient care is compromised; potential mechanisms or risk factors (e.g., neurobiological, genetic, trait, cognitive, and environmental) will be elusive; modeling the natural course of personality pathology or identifying what leads to changes in diagnostic status becomes nearly impossible; the effectiveness and efficacy of new treatments cannot be accurately determined; and research findings will be unlikely to replicate (Chmielewski et al., 2015; Chmielewski, Ruggero, Kotov, Liu, & Krueger, 2017; Chmielewski & Watson, 2009; Clarke et al., 2013; Nathan & Langenbucher, 1999). Finally, claims of a “common language” become nearly impossible to support when everyone is, unknowingly, interpreting the language differently. In sum, the claimed advantages require that different diagnosticians, conducting separate independent interviews, come to the same PD diagnosis.
Unfortunately, there is a tendency in the literature to give, at best, cursory attention to this critical issue. Although this is common, we believe it is detrimental to the continued advancement of science. Instead, we argue it is essential to evaluate the assessment of personality disorders with the same scientific rigor with which we evaluate interventions, search for underlying mechanisms, and test hypotheses.
Taking a More Rigorous Approach to Diagnostic Reliability
We fully agree with Flory that it is essential that researchers report the level of diagnostic reliability in their specific studied sample (this is far too rare in the literature), as well as details about interviewer training, additional patient information made available, and how consensus was reached. However, these are only initial steps. Here we focus on two additional issues that are critical for a more serious approach to diagnostic reliability: (1) how should diagnostic reliability be assessed, and (2) how reliable should diagnoses be?
Most kappa estimates for PDs come from the audio/video recording method, in which one clinician conducts the interviews and provides diagnoses and a second clinician then independently provides diagnoses based on recordings of the interviews. This method inflates reliability estimates for several reasons (Chmielewski et al., 2015; Kraemer, Kupfer, Clarke, Narrow, & Regier, 2012). When interviewing clinicians determine a patient does not meet diagnostic criteria for a PD, they do not ask about any remaining symptoms (this is true for semi-structured interviews because most use “skip outs”). This forces the second clinician to agree that no diagnosis is present because they do not have the symptom information necessary to confer a diagnosis. It is also impossible for the second clinician to clarify interview items, probe patient responses, or ask more in-depth questions about specific symptoms; had separate interviews been conducted, each clinician might have obtained different information. Moreover, for a variety of reasons (e.g., transient states, comfort with the interviewer, clinician skill), patients may volunteer different information to one clinician than to another. In other words, the audio/video recording method artificially constrains the information to be identical and does not allow for truly independent ratings. As such, it is a poor proxy for whether patients would receive the same diagnosis at different hospitals or clinics and whether researchers are studying similar patients (Chmielewski et al., 2015; Kraemer et al., 2012).
A more realistic and ecologically valid estimate of diagnostic reliability is provided by the test-retest method (Chmielewski et al., 2015; Kraemer et al., 2012). In this method, two independent interviewers, with no knowledge of the patients’ diagnostic status, independently conduct separate interviews over an interval short enough that true change in diagnostic status is highly unlikely. Conceptually, this is similar to dependability estimates for self-report measures (Chmielewski et al., 2017; Chmielewski & Watson, 2009; Gnambs, 2014; McCrae, Kurtz, Yamagata, & Terracciano, 2011). High levels of test-retest diagnostic reliability/dependability are essential when assessing personality and personality pathology. Indeed, these analyses are critical for establishing the validity of PD assessments and provide a compelling way to determine which models of personality pathology are more valid.
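To make the statistic at issue concrete, Cohen's kappa indexes agreement between two raters after correcting for the agreement expected by chance. The following is a minimal sketch, using hypothetical diagnostic data rather than data from any study cited here:

```python
# Cohen's kappa: chance-corrected agreement between two raters.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion
# of agreement and p_e is the agreement expected by chance from each
# rater's marginal base rates.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in c1.keys() | c2.keys())
    return (p_o - p_e) / (1 - p_e)

# Hypothetical independent diagnoses from two clinicians
# (1 = PD present, 0 = absent); they agree on 8 of 10 patients.
r1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
r2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(r1, r2), 2))  # prints 0.58
```

Note that 80 percent raw agreement reduces to a kappa of roughly .58 once chance agreement is removed, which is one reason raw agreement figures overstate diagnostic reliability.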
Basing Expectations on the Nature of the Construct and Benchmarks
There is a tendency in the literature to interpret nearly any kappa coefficient, regardless of its magnitude, as evidence of “adequate” reliability. At best, general guidelines or “rules of thumb” are referenced. As defined in the DSM-5, personality pathology must have “an enduring pattern,” be “pervasive and inflexible,” and be “stable over time” (American Psychiatric Association, 2013, p. 645). It is important to acknowledge that if categorical PDs “carved nature at its joints” and the measures/interviews assessing them did so without measurement error, then test-retest diagnostic reliability would be 1.0. We acknowledge that achieving this goal is not practical because of the ubiquitous nature of measurement error. However, we propose that evaluating estimates of diagnostic reliability in terms of how much they deviate from the expectations set by the DSM-5 definition provides a more meaningful context for evaluating assessments and prevents the practice of interpreting any reliability estimate as acceptable.
We propose that estimates of dependability for measures of “normal” personality traits serve as ideal minimal benchmarks for test-retest diagnostic reliability. This is because “normal” personality traits and categorical PDs are both defined by stable patterns of thoughts, feelings, and behaviors. A recent meta-analysis of dependability estimates for Big Five personality measures reported a median dependability of .82 (Gnambs, 2014). Interview measures of Big Five traits have demonstrated similar levels of dependability (Big Five Domains, ICC = .81 to .93; Trull et al., 1998). These benchmarks are not unrealistic or arbitrary; they represent what can be achieved given typical assessment and measurement tools. As such, we believe they represent minimal acceptable standards for the test-retest reliability/dependability of personality pathology measures and interviews.
In Flory’s review, the majority of categorical PDs demonstrated test-retest diagnostic reliability that was substantially below 1.0. In fact, the grand mean kappa across all PDs was only .61, with mean kappas for specific diagnoses ranging from .40 (paranoid) to .75 (borderline). Moreover, these values failed to reach benchmark standards from the normal personality literature.
Advancing Clinical Science
Taking a more rigorous approach to diagnostic reliability leads to a fundamentally different conclusion than that reached by Flory. Indeed, the reliability of categorical PDs using these interviews appears to be far lower than the DSM-5 conceptualization would indicate. Moreover, these interviews demonstrate substantial levels of measurement error when a more rigorous approach to reliability is taken. These findings represent critical problems for the categorical PD model. As expressed by Hopwood et al. (2018, p. 83), “We are concerned about the implications of retaining a categorical system that has been so thoroughly shown to be empirically and clinically problematic.”
As noted by Flory, diagnostic reliability is required for diagnostic validity. However, the role of test-retest diagnostic reliability/dependability is even more critical for the validity of personality pathology because stability is the core feature of PD. As Blashfield and Livesley (1991, p. 265) note, “short-term stability must be expected” and “Failure to demonstrate stability … raises questions about validity.” The fact that all the interviews reviewed by Flory demonstrated poor test-retest reliability/dependability suggests the fault is not with the interviews themselves. Rather, it is clear that the limits of what can be achieved by the categorical PD model have been reached.
Flory makes the important point that dimensionalizing the categorical PDs results in substantially higher levels of reliability. Indeed, even though the interviews were all developed to assess categorical PDs, dimensional scores created by summing symptoms from these measures achieved test-retest reliability/dependability that was substantially closer to 1.0 (grand mean r/ICC = .72; mean r/ICC range = .61 to .79) and approached benchmarks from the personality literature; this mirrors meta-analyses from the broader psychopathology literature demonstrating that dimensionalizing categorical DSM diagnoses results in a 15 percent increase in reliability and a 37 percent increase in criterion validity (Markon, Chmielewski, & Miller, 2011). Nevertheless, dimensionalizing categorical PDs fails to address the lack of discriminant validity across disorders and the extremely heterogeneous nature of the symptoms included within PDs. In addition, dimensionalizing the current PDs does not allow for research into which elements might be more or less stable over time, which, as Flory notes, is an important area for future research.
The DSM-5 also includes maladaptive personality traits within Section III of the manual, which offer several advantages over the categorical PD model or dimensional representations of it. These pathological traits eliminate comorbidity and heterogeneity (Krueger & Markon, 2014); capture the important variance in the categorical PD model (Hopwood, Thomas, Markon, Wright, & Krueger, 2012; Miller, Few, Lynam, & MacKillop, 2015); are strongly associated with dimensional models which have considerable research documenting their genetic underpinnings, cross-cultural validity, course, and correlates (De Fruyt et al., 2013; Quilty, Ayearst, Chmielewski, Pollock, & Bagby, 2013; Suzuki, Griffin, & Samuel, 2016; Widiger & Trull, 2007); have important links to functioning and other clinical constructs (Chmielewski et al., 2017; Hopwood et al., 2013); and have potential to guide treatment (Hopwood et al., 2018; Hopwood, Zimmermann, Pincus, & Krueger, 2015). Moreover, the DSM-5 pathological traits demonstrate test-retest dependability (mean domain dependability r = .83 to .88) that exceeds that of the categorical PD model and equals the aforementioned benchmarks (Chmielewski et al., 2017; Suzuki et al., 2016).
In sum, assessments of categorical PDs fail to achieve levels of diagnostic reliability indicated by their operationalization in DSM-5 and fail to reach benchmarks demonstrated by assessments of personality traits. If the goal is to provide optimal patient care and to advance clinical science, then adopting a trait-based dimensional model of personality pathology is necessary.
References
American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Arlington, VA: American Psychiatric Publishing.
Bernstein, D. B., Iscan, C., Maser, J., & Boards of the Directors of ARPD and ISSPD (2007). Opinions of personality disorder experts regarding the DSM-IV personality disorders classification system. Journal of Personality Disorders, 21(5), 536–551.
Blashfield, R. K., & Livesley, W. J. (1991). Metaphorical analysis of psychiatric classification as a psychological test. Journal of Abnormal Psychology, 100(3), 262–270.
Chmielewski, M., Clark, L. A., Bagby, R. M., & Watson, D. (2015). Method matters: Understanding diagnostic reliability in DSM-IV and DSM-5. Journal of Abnormal Psychology, 124(3), 764–769.
Chmielewski, M., Ruggero, C. J., Kotov, R., Liu, K., & Krueger, R. F. (2017). Comparing the dependability and associations with functioning of the DSM-5 Section III trait model of personality pathology and the DSM-5 Section II personality disorder model. Personality Disorders: Theory, Research, and Treatment, 8(3), 228–236.
Chmielewski, M., & Watson, D. (2009). What is being assessed and why it matters: The impact of transient error on trait research. Journal of Personality and Social Psychology, 97(1), 186–202.
Clark, L. A. (2007). Assessment and diagnosis of personality disorder: Perennial issues and an emerging reconceptualization. Annual Review of Psychology, 58, 227–257.
Clarke, D. E., Narrow, W. E., Regier, D. A., Kuramoto, S. J., Kupfer, D. J., Kuhl, E. A., … Kraemer, H. C. (2013). DSM-5 field trials in the United States and Canada, Part I: Study design, sampling strategy, implementation, and analytic approaches. American Journal of Psychiatry, 170(1), 43–58.
De Fruyt, F., De Clercq, B., Bolle, M. D., Wille, B., Markon, K., & Krueger, R. F. (2013). General and maladaptive traits in a Five-Factor Framework for DSM-5 in a university student sample. Assessment, 20(3), 295–307.
Gnambs, T. (2014). A meta-analysis of dependability coefficients (test–retest reliabilities) for measures of the Big Five. Journal of Research in Personality, 52, 20–28.
Helzer, J. E., Wittchen, H.-U., Krueger, R. F., & Kraemer, H. C. (2008). Dimensional options for DSM-V: The way forward. In J. E. Helzer, H. C. Kraemer, R. F. Krueger, H.-U. Wittchen, P. J. Sirovatka, & D. A. Regier (Eds.), Dimensional Approaches in Diagnostic Classification: Refining the Research Agenda for DSM-V (pp. 115–127). Arlington, VA: American Psychiatric Association.
Hopwood, C. J., Kotov, R., Krueger, R. F., Watson, D., Widiger, T. A., Althoff, R. R., … Blais, M. A. (2018). The time has come for dimensional personality disorder diagnosis. Personality and Mental Health, 12(1), 82–86.
Hopwood, C. J., Thomas, K. M., Markon, K. E., Wright, A. G. C., & Krueger, R. F. (2012). DSM-5 personality traits and DSM–IV personality disorders. Journal of Abnormal Psychology, 121(2), 424–432.
Hopwood, C. J., Wright, A. G., Krueger, R. F., Schade, N., Markon, K. E., & Morey, L. C. (2013). DSM-5 pathological personality traits and the personality assessment inventory. Assessment, 20(3), 269–285.
Hopwood, C. J., Zimmermann, J., Pincus, A. L., & Krueger, R. F. (2015). Connecting personality structure and dynamics: Towards a more evidence-based and clinically useful diagnostic scheme. Journal of Personality Disorders, 29(4), 431–448.
Kraemer, H. C., Kupfer, D. J., Clarke, D. E., Narrow, W. E., & Regier, D. A. (2012). DSM-5: How reliable is reliable enough? American Journal of Psychiatry, 169(1), 13–15.
Krueger, R. F., & Markon, K. E. (2014). The role of the DSM-5 personality trait model in moving toward a quantitative and empirically based approach to classifying personality and psychopathology. Annual Review of Clinical Psychology, 10, 477–501.
Markon, K. E., Chmielewski, M., & Miller, C. J. (2011). The reliability and validity of discrete and continuous measures of psychopathology: A quantitative review. Psychological Bulletin, 137(5), 856–879.
McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011). Internal consistency, retest reliability, and their implications for personality scale validity. Personality and Social Psychology Review, 15(1), 28–50.
Miller, J. D., Few, L. R., Lynam, D. R., & MacKillop, J. (2015). Pathological personality traits can capture DSM-IV personality disorder types. Personality Disorders: Theory, Research, and Treatment, 6(1), 32–40.
Nathan, P. E., & Langenbucher, J. W. (1999). Psychopathology: Description and classification. Annual Review of Psychology, 50, 79–107.
Nelson, S. M., Huprich, S. K., Shankar, S., Sohnleitner, A., & Paggeot, A. V. (2017). A quantitative and qualitative evaluation of trainee opinions of four methods of personality disorder diagnosis. Personality Disorders: Theory, Research, and Treatment, 8(3), 217–227.
Quilty, L. C., Ayearst, L., Chmielewski, M., Pollock, B. G., & Bagby, R. M. (2013). The psychometric properties of the Personality Inventory for DSM-5 in an APA DSM-5 field trial sample. Assessment, 20(3), 362–369.
Regier, D. A., Narrow, W. E., Clarke, D. E., Kraemer, H. C., Kuramoto, S. J., Kuhl, E. A., & Kupfer, D. J. (2013). DSM-5 field trials in the United States and Canada, Part II: Test-retest reliability of selected categorical diagnoses. American Journal of Psychiatry, 170(1), 59–70.
Suzuki, T., Griffin, S. A., & Samuel, D. B. (2016). Capturing the DSM-5 Alternative Personality Disorder Model Traits in the Five-Factor Model’s nomological net. Journal of Personality, 85(2), 220–231.
Trull, T. J., Widiger, T. A., Useda, J. D., Holcomb, J., Doan, B. T., Axelrod, S. R., … Gershuny, B. S. (1998). A structured interview for the assessment of the Five-Factor Model of Personality. Psychological Assessment, 10(3), 229–240.
Widiger, T. A., & Samuel, D. B. (2005). Diagnostic categories or dimensions? A question for the Diagnostic and Statistical Manual of Mental Disorders–Fifth Edition. Journal of Abnormal Psychology, 114(4), 494–504.
Widiger, T. A., & Trull, T. J. (2007). Plate tectonics in the classification of personality disorder: Shifting to a dimensional model. American Psychologist, 62(2), 71–83.