15

Categorical Assessment of Personality Disorders: Considerations of Reliability and Validity

Janine D. Flory

DSM-5 was published in 2013, and although there was widespread consensus on the shortcomings of the existing paradigm for diagnosing personality disorders (PDs) (First et al., 2002), the revised manual included (a) no changes to the text or structure of the existing ten personality disorder categories, plus PD not otherwise specified, and (b) a proposed alternative dimensional model designated in a separate section of DSM-5 entitled “Emerging Measures and Models” (Section III). This was a disappointing outcome to many who see the value in a dimensional classification approach, and the pathway to future revision has been documented in numerous publications (Krueger, 2013; Skodol, Morey, Bender, & Oldham, 2013; Widiger, 2013).

There have been multiple efforts to understand why even a compromise hybrid classification was not adopted. The view that a major shift would be disruptive to clinical practice and research (Gunderson, 2013) appeared to have substantial influence. In a parallel effort, the committee that is revising the International Classification of Diseases-11 Mental or Behavioral Disorders section on PDs has proposed a dimensional approach (Tyrer, Crawford, & Mulder, 2011). Here, too, there is debate about a proposed dimensional approach to supplant or augment the existing categorical diagnoses (Herpertz et al., 2017; Hopwood et al., 2018) and the final product is expected for release in 2019.

The well-known and widely accepted limitations of the existing ten DSM-5 categories for capturing personality dysfunction in a clinically informative and valid manner include the common co-occurrence of two or more categorical diagnoses within individuals, heterogeneity within each diagnostic category (i.e., two people with the same diagnosis may share only one symptom), unclear boundaries between normal and abnormal personality functioning, and incomplete coverage of personality dysfunction (Morey, Benson, Busch, & Skodol, 2015; Widiger & Trull, 2007). These drawbacks are not unique to personality disorders, as there are many psychiatric diagnoses that lend themselves to the use of dimensional conceptualization (Haslam, Holland, & Kuppens, 2012) and co-occurrence of psychiatric disorders within individuals is common. However, the case for a dimensional conceptualization seems to be particularly salient for personality dysfunction, as there is widespread acceptance of the view that personality traits adhere to a universal, dimensional structure that is heritable and observed cross-culturally (Widiger & Costa, 2012). Further, the latent structure of personality is similar in clinical and non-clinical samples (O’Connor, 2002). On the other hand, the frequently cited advantages of the categorical method of diagnosis include clinical utility and the facilitation of communication among providers, public policy stakeholders, and reimbursement entities; categories also provide an international language for describing personality dysfunction and for developing assessment tools (Herpertz et al., 2017; Kendell & Jablensky, 2003).

A review of categorical and dimensional approaches for diagnosis of PDs was presented by Miller and colleagues (Miller, Few, & Widiger, 2012) and provides a still timely and comprehensive review. Readers are referred to this work for a full account of all published interviews and questionnaires that measure DSM-III-R to DSM-IV-TR PDs. Because this chapter was written and published just prior to the publication of DSM-5, the review also provides an important bridge between the two manuals and includes a description of proposals for changes to PD diagnostic procedures for DSM-5. The proposed changes included the possibility of deleting some diagnoses from the DSM altogether (e.g., dependent, narcissistic, schizoid, histrionic, and paranoid); the use of a prototype matching procedure; the addition of personality functioning rating scales to consider self and other functioning, which are key constructs that signify impairment; and the use of personality trait profiles. As is now known, none of these approaches were adopted in the main part of DSM-5, but some aspects were included in Section III as noted above.

Self-Report Scales for Diagnosis of Personality Disorders

Although there is a tendency to conflate the categorical approach with interviews and the dimensional approach with self-report (Strickland et al., 2019) and assume that only interviews can be used to obtain a categorical diagnosis of PDs, there are more than 20 self-report scales that have been developed to make provisional or likely diagnoses of the PDs. Eleven of these scales diagnose all ten PDs included in DSM-IV-TR/DSM-5. A brief discussion of the use of self-report scales for diagnosing personality disorders is presented here. For further discussion and information about scoring procedures, readers are referred to the review by Miller and colleagues (Miller et al., 2012) which also includes original sources for the inventories and scales.

Self-report scales that are used to diagnose personality disorders include those that are atheoretical (e.g., Structured Clinical Interview for DSM-IV Axis II Personality Disorders [SCID-II]; First, Gibbon, Spitzer, Williams, & Benjamin, 1997), listing each diagnostic criterion as it appears in the DSM, and those that are embedded within comprehensive personality inventories (e.g., Schedule for Nonadaptive and Adaptive Personality [SNAP], Clark, 1993; Revised NEO Personality Inventory [NEO-PI-R], Costa & McCrae, 1992; Millon Clinical Multiaxial Inventory [MCMI-III], Millon, Millon, & Davis, 1997; and Minnesota Multiphasic Personality Inventory [MMPI-2], Butcher et al., 2001). The latter inventories provide information about personality traits or temperament embedded within a theoretical model of personality in addition to PD diagnoses, or in the case of the MMPI-2, other facets of psychopathology. The major advantage to using self-report inventories is the relative ease and low cost of obtaining diagnoses. Moreover, some instruments include validity scales to offset the biases inherent to self-assessment.

When using self-report scales to diagnose PDs, it is important to remember that extreme scores do not necessarily equal impairment. By definition, PDs (and by extension, all psychiatric diagnoses) require that symptoms or traits are associated with functional impairment and/or distress. Thus, the optimal use of dimensional self-report scales should include an assessment of self and other functioning. These aspects are incorporated into the DSM-5 alternative model and include well-accepted characteristics of adaptive personality functioning. The elements associated with a coherent sense of self include a consideration of identity (e.g., self-esteem, boundaries between self and other, emotion regulation) and self-direction (e.g., pursuit of goals, prosocial behavior, self-reflection). With respect to interpersonal functioning, this includes empathy (comprehension and tolerance of other perspectives) and intimacy (e.g., desire and capacity for closeness, mutual regard for others). Because it might be difficult for people to rate their own impairment, clinician ratings might be useful in augmenting self-reported personality. In 2018, First and colleagues (First, Skodol, Bender, & Oldham, 2018) published a structured clinical interview to assess the DSM-5 alternative model for personality disorders. This interview includes two modules, and the first module (Bender, Skodol, First, & Oldham, 2018) includes an assessment of view of self and quality of interpersonal relationships. Buer Christensen and colleagues (Buer Christensen et al., 2018) published an inter-rater reliability study of this interview showing good to excellent rater reliability; test-retest reliability was lower for some sub-domains, but this might reflect actual changes in functioning rather than poor agreement over time.

Given that the PDs described in DSM-5 are unchanged from DSM-IV-TR, the chapter by Miller and colleagues (Miller et al., 2012) is particularly helpful for people who are new to PD assessment in that it provides thorough descriptions of the most commonly used self-report scales (and structured interviews, reviewed below). The descriptions include information about the number of items on the scales and interviews, whether they cover one or more DSM formulations and the key disadvantages/advantages of each instrument. Moreover, a large portion of the chapter is devoted to an exhaustive presentation of convergent validity associations for PD scores using (1) self-report scales compared to other self-reports and (2) self-report scales compared to dimensional scores derived from interviews. Almost without fail, Miller and colleagues noted that there was higher convergence between two self-rated scores than between self-report and interview-based dimensional assessments. The lower convergence between self-rated versus interview-based scores is likely due to at least two factors including method (in)variance and the expected discrepancy between self-rated traits and those rated by a clinician.

Notably, both comparisons yielded the conclusion that there is little to no agreement between measures designed to diagnose obsessive-compulsive PD (OCPD). For both comparisons (self-report versus self-report and self-report versus interview), avoidant PD showed the highest convergence. It should be noted that disagreement across assessment instruments may reflect lack of agreement regarding conceptualization of the PD in addition to, or instead of, measurement differences. Also presented in Miller et al.’s (2012) chapter is a table of comparisons between interview-assessed PDs, which is notably less comprehensive owing to the general lack of research comparing PD diagnoses derived from structured interviews. This point will be highlighted again below.

Structured Interviews for Diagnosis of Personality Disorders

Although unstructured diagnostic interviews represent the most common practice in the clinical setting, these interviews vary between and within clinicians as the goal is often to come to any diagnosis quickly for treatment planning, including choosing an intervention that can address symptoms. Five semi-structured interviews have been developed and are used widely to diagnose the DSM-IV-TR/DSM-5 personality disorders. A sixth interview was published in 2018 and is designed to assess the DSM-5 Alternative Model for Personality Disorders (First et al., 2018). This section will present and evaluate the most commonly used interviews that have empirical support and have been used to derive categorical diagnoses of PDs. Rater reliability characteristics of these interviews for diagnosing categories will be presented, in alphabetical order, followed by a general discussion of the validity of categorical diagnoses. The use of these interviews in epidemiological samples to describe prevalence will also be noted.

The Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS) was administered as part of the National Institute on Alcohol Abuse and Alcoholism’s National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) in two waves: 2001–2002 (Wave 1) and 2004–2005 (Wave 2). NESARC is a representative sample of nearly 50,000 civilian adults in the United States and was designed to assess current and past alcohol consumption and other mood, anxiety, and personality disorders as operationalized for DSM-IV (Hasin & Grant, 2015). Seven personality disorders were assessed at Wave 1 and the remaining three were assessed at Wave 2. In contrast to the other interviews for PD diagnosis described in this chapter, the AUDADIS was administered by lay interviewers after extensive training (Grant et al., 2004). Diagnostic criteria are queried in the order they appear in the DSM-IV. Following a positive response, respondents were asked whether the item reflected how they “felt or acted most of the time throughout their life regardless of the situation or whom they were with” to assess whether a given PD symptom represented a long-term pattern. Additionally, each positive response was followed by a question about whether the symptom caused problems at work, school or with other people to assess functional impairment. To meet criteria for a categorical diagnosis, the requisite number of items had to be endorsed and at least one item had to be associated with impairment.

The population targeted by NESARC includes non-institutionalized adults in the United States. Of general relevance for the US population with PDs, the sample included people in boarding and rooming houses, non-transient hotels and shelters, college housing and group homes. For the reliability study, randomly selected subsamples of 400 respondents were recontacted. Each PD was assessed at only one of the eight regional offices: Antisocial PD was assessed from the regional office in Los Angeles and the other PDs were assessed from the Kansas City office. Gender distributions were approximately equal in these two regions, but the Los Angeles sample was approximately 40 percent Hispanic. The Kansas City regional office sample was predominantly white, non-Hispanic. Retest reliability was assessed for Wave 1 interview items two to three months after the initial interview and the kappa values ranged from .40 (Histrionic PD) to .67 (Antisocial PD) (Grant et al., 2003), mostly reflecting a moderate level of agreement. It is not clear from the description of this reliability study whether the kappa reflects inter-rater agreement or whether the same interviewer administered both assessments, which would reflect intra-rater consistency, or alternatively, temporal stability.
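For readers less familiar with the kappa statistic reported throughout this section, the following is a minimal sketch of how Cohen's kappa is computed from two sets of categorical ratings. The function name and the ten paired ratings are hypothetical and are not drawn from NESARC or any study cited here. The arithmetic is the same whether the two sets of ratings come from two raters or from two occasions; only the study design determines whether the coefficient is read as inter-rater agreement or retest stability.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed agreement: proportion of cases given the same rating twice.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical data: 1 = PD diagnosed, 0 = not diagnosed,
# for ten respondents assessed on two occasions.
time_1 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
time_2 = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(time_1, time_2)
print(f"kappa = {kappa:.2f}")
```

In this invented example the two assessments agree on 8 of 10 respondents (80 percent), but because 52 percent agreement would be expected by chance given the marginal rates, kappa is approximately .58.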

The Diagnostic Interview for DSM-IV Personality Disorders (DIPD-IV) is a semi-structured interview developed by Zanarini and colleagues (Zanarini, Frankenburg, Sickel, & Yong, 1996) to assess DSM-IV-TR PDs. The interview includes 252 questions to assess the 11 PDs (including PD not otherwise specified), which are assessed in the order presented in DSM-IV-TR. Respondents are asked yes/no questions followed by open-ended questions to determine how to rate each criterion on a 3-point scale (0 = absent or clinically insignificant; 1 = present but uncertain of clinical significance; 2 = present and clinically significant). The time frame for the items is the past two years. Inter-rater and retest reliability characteristics were reported by Zanarini and colleagues (Zanarini, Frankenburg, Chauncey, & Gunderson, 1987) on the initial version of the interview, based on DSM-III PDs in a clinical sample. Forty-three inpatients were assessed by three raters at a single time-point; a separate set of 54 patients was assessed one week apart by two different raters. Approximately half of the sample was female and all of the people in the study were white. With respect to rater reliability, only one inter-rater coefficient was below .75 (paranoid personality disorder) and the other coefficients ranged from .87 to 1.0 (antisocial, avoidant). Retest reliability coefficients ranged from .46 (passive aggressive PD) to .85 (borderline PD).

The DIPD-IV was used to assess PDs in the Collaborative Longitudinal Personality Disorders Study (CLPS) (Gunderson et al., 2000), which is a cohort of nearly 700 treatment-seeking adults recruited across a range of clinical sites and followed over time. The sample included treatment-seeking adults who met DSM-IV-TR diagnostic criteria for one of four PDs (schizotypal PD, borderline PD, avoidant PD, or obsessive-compulsive PD) or who belonged to a comparison group of patients with major depressive disorder, but no PD. Median inter-rater reliability coefficients in a subset of this cohort were reported by Zanarini and colleagues (2000) and ranged from .58 (paranoid PD) to .71 (obsessive-compulsive PD); antisocial PD ratings were highest (1.0). Retest kappa coefficients ranged from .39 (paranoid PD) to 1.0 (narcissistic PD). The demographic characteristics were not reported for this reliability study, but participants were selected from the larger CLPS study, which included 64 percent women and was 76 percent white (Gunderson et al., 2000). Participants in the reliability study were selected based on “availability,” which suggests that the sample might include people with less severe pathology. There are no estimates of rater reliability for several of the PDs, as the coefficients were only calculated if the PD was diagnosed in at least five people. Despite incomplete coverage of all of the PD categories, results reported with this cohort have provided useful information about the longitudinal course of personality, including the relationship between change in dimensions versus diagnostic categories (Samuel et al., 2011).

The International Personality Disorder Examination (IPDE) (Loranger et al., 1994) was developed as part of a joint project between the National Institutes of Health and the World Health Organization. Originally developed for the assessment of PDs according to the DSM-III-R and the International Classification of Diseases, Tenth Revision (ICD-10), the interview was revised for DSM-IV-TR. The interview assesses 11 DSM-IV-TR and 10 ICD-10 PDs and was developed in field trials administered in 11 countries across North America, Europe, Africa, and Asia with the aim of diagnosing PDs in different languages and cultures. The Personality Disorder Examination (PDE) was the basis for the IPDE and is described by Loranger et al. (1991). The IPDE includes 157 items rated on a scale of 0 (absent or within normal range), 1 (present to an accentuated degree), or 2 (pathological, meets criterion) and the behavior or trait must be present for five or more years. The interview procedure also includes the IPDE Screening Questionnaire, which is a 77-item T/F self-report scale. Based upon review of the screening questionnaire, the IPDE interview can be administered. The questions are arranged in topical sections that assess background characteristics and aspects about work, self, and interpersonal relationships.

Loranger et al. (1994) reported inter-rater reliability and temporal stability characteristics in a large multi-site international sample of 716 patients. Twenty percent of the interviews were independently rated by a silent observer and 34 percent were reassessed after an average of six months. The sample included approximately equal numbers of men and women and some participants were psychiatric inpatients at the time of assessment. Demographic information was not reported in the published report, but respondents resided in Asia (India and Japan), Africa (Kenya), the United States, and Europe. Kappa statistics for rater reliability and temporal stability were only calculated if at least 5 percent of the sample met diagnostic criteria. Inter-rater agreement for three PDs that met this minimum frequency was between .70 and .80 (dependent, avoidant, and borderline PD), but histrionic PD agreement was low (.34). Temporal stability for these four PD diagnoses ranged from .45 to .70.

The Structured Interview for DSM-IV Personality Disorders (SIDP-IV) (Pfohl, 1995) consists of 337 questions that cover the ten DSM-IV Axis II PDs as well as four optional PDs: Mixed Personality Disorder (i.e., PDNOS), Self-Defeating, Depressive, and Negativistic Personality Disorders. The initial version of the interview was developed for DSM-III (Pfohl, Stangl, & Zimmerman, 1983) and the interview was revised for DSM-III-R (Pfohl, Blum, Zimmerman, & Stangl, 1989). The questions of the SIDP-IV are organized into topical sections that assess different aspects of potential functional impairment, including interests and activities, work style, close relationships, social relationships, emotions, observational criteria (rated by interviewer based on a respondent’s behavior during the interview), self-perception, perception of others, stress and anger, and social conformity. The authors of the interview argue that this structure is conducive to establishing rapport. They also describe the questions as “non-pejorative” in that they do not directly use the wording from DSM, which might be off-putting to a respondent. All items are rated on a scale of 0 (not present) to 3 (strongly present). To assess the necessary condition of temporal stability, respondents are also asked whether a trait has been prominent for most of the last five years. The authors also recommend that a close friend or relative be interviewed for additional information (informant interview).

With respect to inter-rater reliability of the SIDP-IV, Damen, De Jong, and Van der Kroft (2004) published a study of 50 Dutch men and women who were seeking treatment for opioid use disorder. The interview was translated from English into Dutch and two raters conducted all of the interviews; both were present for every interview in an observer/rater design and they were blind to each other’s ratings. The authors computed kappa coefficients for each item of the interview and reported that 78 percent of the items had an agreement level > .75. One item (preoccupation with fantasies of unlimited success, power, brilliance, beauty, or ideal love from the Narcissistic PD criteria) had a kappa below .40 and three items could not be determined due to low variability in the ratings. Kappa coefficients for the categorical diagnoses ranged from .65 to 1.0 (mean = .86). Jane, Pagan, Turkheimer, Fiedler, and Oltmanns (2006) also conducted a rater-reliability study with the SIDP-IV, interviewing more than 400 US Air Force recruits selected from a sample of more than 2000. The PD sample included people who scored highly on self-report measures of personality pathology and others who were nominated by peers as having problematic personality characteristics. A third of the sample (60 percent male) was randomly selected from the larger group as control participants. The interviews were tape recorded and rated by a second rater who was blind to the interviewer’s ratings. In this detailed report, the authors present intra-class correlation coefficients for each item as well as a narrative summary of the items that showed the highest and lowest concordance for each personality disorder. Race and ethnicity characteristics were not reported for the sample. Intra-class correlation coefficients for the categorical diagnoses ranged from .35 (Narcissistic PD) to .85 (Dependent PD); Schizoid PD did not occur in this sample, and only one person was diagnosed with Schizotypal PD.

The Structured Clinical Interview for DSM-5 Personality Disorders (SCID-5-PD) (First, Williams, Benjamin, & Spitzer, 2016) is a 302-item interview developed for the assessment of DSM-5 personality disorders. This is a recent revision of the SCID-II, which was originally developed for DSM-III and revised for DSM-III-R and DSM-IV. Although the diagnostic criteria were not changed from DSM-IV to DSM-5, the authors of the SCID-II reviewed and revised several items for clarity. Note that this group also developed an interview for the assessment of the DSM-5 Alternative Model for Personality Disorders (First et al., 2018). The SCID-5-PD includes assessment procedures for the 11 DSM-IV Personality Disorders (including PDNOS), presented in the order they appear in the DSM-5. The SCID-5 Personality Questionnaire can be used as a screening instrument, administered prior to interview. The rater reviews the responses and only queries positive responses to the self-report T/F questionnaire. Training materials are available from the authors, including a taped recording of a sample interview using DSM-IV-TR materials so that a new rater can check his or her ratings against a reference interview. As with the other interviews in the SCID portfolio, there is a computer-assisted version. Results from eight reliability studies using DSM-III-R and DSM-IV criteria are summarized in the SCID-5-PD user’s guide (First et al., 2016). Sample sizes ranged from 31 to 284 and the studies were conducted in the United States, Japan, Italy, and the Netherlands. Kappa coefficients vary widely across groups, with some reliability coefficients as low as .02. However, a review of these studies indicates that the studies conducted with the DSM-IV and DSM-5 versions show acceptable to excellent reliability for categorical diagnoses. Finally, Somma and colleagues (2016) conducted a rater reliability study in 104 patients using the Italian translation of the SCID-5-PD and obtained kappa coefficients in the range of .82 (any PD) to 1.0 on all but two PD diagnoses (avoidant and dependent PD), which were .66 and .58, respectively.

Validity of Categorical Diagnoses Derived from Structured Interviews

If an assessment method does not measure a construct reliably, there is no reason to assess the validity of the methodology. Given that the interviews described above have all established moderate to high levels of reliability for at least some diagnoses, it is then a reasonable question to ask whether the categorical diagnoses derived from the interviews are valid.

In exploring the validity of the tests above, several lines of evidence can be examined. First, a comparison of two interviews can be conducted to evaluate whether there is agreement with respect to the diagnoses that are made. This type of head-to-head comparison has rarely been conducted and/or published. O’Boyle and Self (1990) administered the PDE (the precursor to the IPDE) and the SCID-II for DSM-III-R to 20 adults with depression and observed poor agreement between the interviews. In contrast, Pilkonis and colleagues (1995) reported a comprehensive and systematic examination of reliability and validity characteristics of the PDE and the SIDP-III-R in a sample of 108 treatment-seeking adults. The process of rater training and making a best-estimate diagnosis by consensus was described, a process that uses “longitudinal, expert and all data” (LEAD) to come to diagnostic consensus (Kranzler, Kadden, Babor, & Rounsaville, 1994; Pilkonis, Heape, Ruddy, & Serrao, 1991). The inter-rater reliability coefficient for any PD diagnosis in this study was .55 (78 percent agreement) for the PDE and .58 (81 percent agreement) for the SIDP-III-R. They note that the high prevalence of personality dysfunction in the sample constrained the reliability of the categorical diagnoses. Rater reliability coefficients for dimensional scores for the two interviews ranged from .85 to .92 on the PDE and from .82 to .90 on the SIDP-III-R. With respect to validity, the authors designated clinical consensus among several raters as the “gold standard” of diagnosis. Agreement between group consensus and diagnoses derived from either of the two interviews was low as both methods underestimated the presence versus absence of a personality disorder. The authors concluded by describing the advantages of using dimensional scores derived from the ratings of individual criteria to enhance reliability and validity of the ratings.

A second form of convergent validity is attained by examining correlations between interview-derived categories and self-report scales of normal and/or abnormal personality. As noted above, Miller et al. (2012) compiled results from this literature and show that the correlations are generally low to moderate. While this can be interpreted as low convergent validity for diagnostic categories, this is not the only explanation. Low association between clinician-rated and self-rated items has been observed for other forms of psychopathology (e.g., PTSD) (Forbes, Creamer, & Biddle, 2001) and suggests that the two methods tap different things or, alternatively, that first-order correlations may not be the optimal means for examining agreement between clinicians and patients (Monson et al., 2008). Clinicians make informed judgments about temporal stability and level of impairment when asking about whether a characteristic is present or not. Experienced clinicians or diagnosticians make determinations about ratings and/or symptom severity based on a highly contextualized view of psychopathology. Further, clinicians make decisions regarding differential diagnosis related to comorbid conditions that an individual rating his or her own behavior might not consider. For some traits (e.g., Narcissism), the individual may lack insight into whether there is functional impairment associated with the trait and/or not understand the source of such impairment.

A third means of evaluating the validity of diagnostic categories is the examination of sensitivity and specificity rates, which provide information about false positives (being diagnosed with a PD when there is no PD) and false negatives (not being diagnosed with a PD when a PD diagnosis is warranted). Very few studies have reported sensitivity and specificity for categorical diagnoses of PD. The utility of the SCID-II personality questionnaire as a screener for the diagnosis of PD using structured interviews has been examined in three studies (Ekselius, Lindstrom, von Knorring, Bodlund, & Kullgren, 1994; Jacobsberg, Perry, & Frances, 1995; Nussbaum & Rogers, 1992). All three studies reported a low rate of false negatives.
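To make the terms in this paragraph concrete, the sketch below computes sensitivity and specificity for a screening questionnaire scored against an interview-based diagnosis. The function name and all four cell counts are hypothetical and do not come from the three studies cited above.

```python
def screening_accuracy(true_pos, false_neg, true_neg, false_pos):
    """Sensitivity, specificity, and error rates for a screener.

    The reference standard is assumed to be the interview-based diagnosis.
    """
    cases = true_pos + false_neg        # people with the PD per the interview
    non_cases = true_neg + false_pos    # people without the PD per the interview
    if cases == 0 or non_cases == 0:
        raise ValueError("both cases and non-cases are required")
    sensitivity = true_pos / cases
    specificity = true_neg / non_cases
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_rate": 1 - sensitivity,
        "false_positive_rate": 1 - specificity,
    }


# Hypothetical screener results for 150 respondents:
# 50 with an interview-based PD diagnosis, 100 without.
result = screening_accuracy(true_pos=45, false_neg=5, true_neg=60, false_pos=40)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With these invented counts the screener misses few true cases (sensitivity of .90) at the cost of many false positives (specificity of .60), the pattern one would expect from a questionnaire intended to be followed by an interview.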

Prevalence of Categorical Personality Disorders in Large Cohorts

The AUDADIS was developed for use in the NESARC cohort (Hasin & Grant, 2015) and has not been used to assess PDs in other samples or cohorts. However, NESARC is a nationally representative sample of the US population and is the largest cohort to date with interview-assessed PDs. Limited data sets are available for investigators, making this an important and useful resource for examining PDs and comorbidity with alcohol use and mood and anxiety disorders. It has been noted that the practice of only requiring that a single item be associated with impairment may have resulted in false positives. The prevalence estimate for any personality disorder diagnosis in this sample was 21.5 percent. In a reanalysis of the data, Trull and colleagues (Trull, Jahng, Tomko, Wood, & Sher, 2010) made a categorical diagnosis only if there was impairment associated with each of the requisite number of criteria in a specific diagnostic category; as expected, observed rates were lower for all categories and were more in line with previously published epidemiological studies (i.e., 6–15 percent; see below). Trull and colleagues also noted that the use of two separate data collection waves should ideally be modeled in analyses as method variance may account for observed patterns of covariation between symptoms and/or categories.
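The difference between the two scoring rules can be illustrated with a short sketch. It is a simplified reading of the rules as described in this chapter, not the published scoring algorithms; the class and function names, the seven-criterion category, and the threshold of four are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CriterionResponse:
    endorsed: bool    # respondent reported the symptom as a long-term pattern
    impairing: bool   # respondent reported that it caused problems


def diagnose_single_impairment(responses, threshold):
    """Original rule: enough criteria endorsed, and at least one is impairing."""
    endorsed = [r for r in responses if r.endorsed]
    return len(endorsed) >= threshold and any(r.impairing for r in endorsed)


def diagnose_all_impairing(responses, threshold):
    """Stricter rule: only endorsed criteria that are also impairing count."""
    impairing = [r for r in responses if r.endorsed and r.impairing]
    return len(impairing) >= threshold


# Hypothetical respondent: four of seven criteria endorsed, one with impairment.
respondent = [
    CriterionResponse(endorsed=True, impairing=True),
    CriterionResponse(endorsed=True, impairing=False),
    CriterionResponse(endorsed=True, impairing=False),
    CriterionResponse(endorsed=True, impairing=False),
    CriterionResponse(endorsed=False, impairing=False),
    CriterionResponse(endorsed=False, impairing=False),
    CriterionResponse(endorsed=False, impairing=False),
]
print(diagnose_single_impairment(respondent, threshold=4))
print(diagnose_all_impairing(respondent, threshold=4))
```

This invented respondent receives the diagnosis under the single-impairment rule but not under the stricter rule, which is the mechanism by which the reanalysis would be expected to yield lower prevalence estimates.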

As noted above, the NESARC cohort is a large, nationally representative sample, but the regional centers that collected the data for the reliability subsamples did not all assess PDs. All but one of the PDs were assessed in a single Midwestern regional office, while Antisocial PD was assessed only on the West Coast. Very little research has been devoted to documenting race and ethnicity with respect to the prevalence of PDs, despite the impact that cultural factors have on self-concept, worldview, and interpersonal behavior (McGilloway, Hall, Lee, & Bhui, 2010). Given the importance of identity and interpersonal functioning in defining personality dysfunction, these factors warrant further study and inclusion in research designed to document prevalence of PDs. None of the reliability studies cited here had demographically representative samples and some studies did not even report the demographic characteristics of the sample. Thus, bias in sampling related to inadequate representation of population race and ethnicity prevalence should be considered as another potential source of error in establishing true estimates of reliability of the diagnoses.

The IPDE has been administered in several large samples to determine the prevalence of DSM PDs, including a subsample of the National Comorbidity Study-Replication (NCS-R) (Lenzenweger, Lane, Loranger, & Kessler, 2007). The prevalence of any DSM-IV PD in this US cohort was 11.9 percent. Benjet and colleagues (Benjet, Borges, & Medina-Mora, 2008) administered the IPDE DSM-IV screening questionnaire in a true prevalence epidemiological study to more than 2300 adults in Mexico and reported a prevalence rate of 6.1 percent for any PD. The SID-P-III-R was administered in a representative sample in Norway to more than 2000 adults (Torgersen et al., 2008), showing a true prevalence rate of 13.4 percent. The SCID-II was administered in two longitudinal cohorts, both designed to document the developmental transition from adolescence to adulthood. The prevalence of any DSM-III-R PD in these two cohorts was 15.7 percent (Cohen, Crawford, Johnson, & Kasen, 2005) and 12.7 percent (Johnson, Cohen, Kasen, Skodol, & Oldham, 2008). The latter prevalence rate includes depressive and passive-aggressive PD. Additionally, the SCID-II was used to estimate the prevalence of DSM-III-R PDs in a family study in Germany, reporting a rate of any PD at 10 percent.

Conclusions

Reliability

A categorical diagnosis of any PD can generally be assessed reliably using the structured interviews described above, with the exceptions already noted. Several conclusions can be drawn from this review. First, the makeup of the sample will greatly affect reliability. The relative proportion of PDs, the presence of comorbidities, recruitment strategies (e.g., from clinical settings versus population-based), and the demographic makeup of the sample can all greatly affect the rate of PD in a sample, and this in turn affects whether a PD can be assessed reliably. For example, when a sample is representative of the population, the rate of any PD will be low. A low number of disagreements between any two raters will not have a large impact on reliability. In contrast, when the sample is treatment seeking and has a high base rate of personality dysfunction, the same low number of disagreements can lead to low reliability coefficients. Second, some PD diagnoses can be assessed more reliably than others. Jane and colleagues (Jane et al., 2006) have argued that the items and diagnoses with the highest level of rater agreement are those that are easily observable and/or reportable and that do not require a high degree of insight on the part of the interviewee. This conclusion is supported by the above review, which notes that reliability coefficients for histrionic, paranoid, and narcissistic PDs are lower than .60 in several studies, regardless of the interview that was used to assess PDs.
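
The dependence of reliability on sample makeup can be made concrete with a small calculation. The following is a minimal sketch, assuming that agreement between two raters is summarized with Cohen's kappa (the studies reviewed above did not all use this coefficient). The function name and every count are hypothetical and are not drawn from any study cited here. The number of disagreements is held fixed at four per 100 interviews, and only the base rate of the diagnosis is varied.

```python
def cohens_kappa(both_yes: int, a_only: int, b_only: int, both_no: int) -> float:
    """Cohen's kappa for two raters making a yes/no diagnosis.

    both_yes / both_no: interviews on which the raters agree.
    a_only / b_only: interviews on which only one rater assigns the diagnosis.
    """
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    a_rate = (both_yes + a_only) / n  # proportion diagnosed by rater A
    b_rate = (both_yes + b_only) / n  # proportion diagnosed by rater B
    # Agreement expected by chance, given each rater's own base rate.
    expected = a_rate * b_rate + (1 - a_rate) * (1 - b_rate)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical tables: 100 interviews, four disagreements in each,
# listed as (both_yes, a_only, b_only, both_no).
tables = {
    "base rate about 5 percent": (3, 2, 2, 93),
    "base rate about 50 percent": (48, 2, 2, 48),
    "base rate about 90 percent": (88, 2, 2, 8),
}

for label, counts in tables.items():
    print(f"{label}: kappa = {cohens_kappa(*counts):.2f}")
```

Raw agreement is 96 percent in all three tables, yet the chance-corrected coefficient differs because expected chance agreement depends on the base rate. A reliability coefficient therefore cannot be interpreted without knowing the composition of the sample in which it was obtained.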

Third, “dimensionalizing” categories can enhance reliability. Most of the reliability studies described above report findings both for categorical diagnoses and for dimensional scores that are obtained by summing responses to the interview items. Other than the AUDADIS, all the interviews described above use a 3- or 4-point ordinal scale for rating each interview item. These values can be summed within PD categories to obtain a dimensional score that reflects symptom severity or, alternatively, the clinician’s confidence in making the rating for each criterion. In all cases, the use of summed scores resulted in appreciably higher reliability coefficients. This method of adding dimensionality to categorical diagnoses has been recommended to strengthen the reliability of PD diagnoses (Helzer, Kraemer, & Krueger, 2006). Research groups that study “discrete” groups based on interview-assessed categories are encouraged to also report dimensional scores for descriptive purposes. These scores will greatly enhance the ability to compare results across studies and research groups, and they lend themselves to a more powerful examination of social and biological correlates and other indices of convergent and divergent validity.
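
The summing procedure can be sketched in a few lines. This is an illustration only, assuming a 0 to 3 ordinal rating per criterion, a rating of 2 or higher counting as "criterion present," and a cutoff of five criteria for the diagnosis. These values, the function names, and the ratings are hypothetical and are not the scoring rules of any particular interview described above.

```python
from typing import List


def dimensional_score(ratings: List[int]) -> int:
    """Sum the ordinal item ratings within one PD category."""
    return sum(ratings)


def criteria_met(ratings: List[int], present_at: int = 2) -> int:
    """Count criteria rated at or above the 'present' threshold."""
    return sum(1 for r in ratings if r >= present_at)


def categorical_diagnosis(ratings: List[int], min_criteria: int,
                          present_at: int = 2) -> bool:
    """Diagnosis is assigned when enough criteria are rated as present."""
    return criteria_met(ratings, present_at) >= min_criteria


# Hypothetical ratings by two raters of the same interview,
# nine criteria for one PD category, each rated 0-3.
rater_a = [2, 2, 2, 2, 1, 0, 0, 0, 0]
rater_b = [2, 2, 2, 2, 2, 0, 0, 0, 0]

for name, ratings in (("Rater A", rater_a), ("Rater B", rater_b)):
    print(name,
          "dimensional score:", dimensional_score(ratings),
          "diagnosis:", categorical_diagnosis(ratings, min_criteria=5))
```

In this invented case the two raters differ by a single point on a single criterion, so their summed scores are nearly identical, but they fall on opposite sides of the diagnostic cutoff. The dichotomous diagnosis registers a complete disagreement where the dimensional score registers a small one.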

Finally, in addition to reporting summed scores along with dichotomous groups, it is incumbent upon research groups that use interviews (or self-reports) to create dichotomous groups for study to report reliability among their own raters and to provide details about the consensus process. A description of how temporal stability and functional impairment were determined is critical.

Categories and Dimensions Together

The combination of categories and dimensional traits can be particularly informative for researchers and clinicians who examine PDs over time. The early definitional view of PDs, that they are immutable from early adulthood and set a lifelong course of dysfunction, has been altered by longitudinal and treatment research (and clinical practice) showing that recovery from PD is possible. As noted above, several self-report scales have been developed to provide both categorical and dimensional measures of PD, and some provide dimensional measures of personality or temperament. There is a distinct advantage to using such a measure instead of a structured interview that queries and scores only maladaptive behavior and traits. In 2007, Clark proposed that PDs can be reconceptualized as representing “change within relative stability” (Clark, 2007). That is, the sentinel behaviors that signal dysfunction to self and others are inherently more difficult to assess reliably in a single observation, whereas traits and PD diagnoses will render more stable coefficients. Thus, extreme trait measures in the absence of a PD diagnosis might signal vulnerability to episodic dysfunction. The interview measures, in contrast, even when summed across categories, provide information only about abnormal behavior and convey nothing about normative variation in personality. When feasible, the use of such a measure together with a structured interview in a research study can also provide information about convergent validity using multi-method assessments.

Finally, as can be seen from this review, a relatively large number of self-reports and interviews have been used to assess PDs according to the categorical model. While this demonstrates a robust interest in this area of investigation, research establishing inter-rater reliability across instruments is somewhat dated and incomplete. The shift from a categorical conceptualization and assessment of PDs to the alternative model is (hopefully) an iterative process, and future research and clinical practice that incorporate both approaches will represent progress in the assessment and conceptualization of personality disorders.

References

American Psychiatric Association. (2000). Diagnostic and Statistical Manual of Mental Disorders (4th ed., Text Revision). Washington, DC: American Psychiatric Association.

American Psychiatric Association. (2013). Diagnostic and Statistical Manual of Mental Disorders (5th ed.). Arlington, VA: American Psychiatric Publishing.

Bender, D., Skodol, A. E., First, M. B., & Oldham, J. M. (2018). Structured Clinical Interview for the DSM-5 Alternative Model for Personality Disorders: Module I. Washington, DC: American Psychiatric Press.

Benjet, C., Borges, G., & Medina-Mora, M. E. (2008). DSM-IV personality disorders in Mexico: Results from a general population survey. Revista Brasileira de Psiquiatria, 30(3), 227–234.

Buer Christensen, T., Paap, M. C. S., Arnesen, M., Koritzinsky, K., Nysaeter, T. E., Eikenaes, I., … Hummelen, B. (2018). Interrater reliability of the Structured Clinical Interview for the DSM-5 Alternative Model of Personality Disorders Module I: Level of Personality Functioning Scale. Journal of Personality Assessment, 100(6), 630–641.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). Minnesota Multiphasic Personality Inventory–2. Minneapolis: University of Minnesota Press.

Clark, L. A. (1993). Manual for the Schedule for Nonadaptive and Adaptive Personality (SNAP). Minneapolis: University of Minnesota Press.

Clark, L. A. (2007). Assessment and diagnosis of personality disorder: Perennial issues and an emerging reconceptualization. Annual Review of Psychology, 58, 227–257.

Cohen, P., Crawford, T. N., Johnson, J. G., & Kasen, S. (2005). The children in the community study of developmental course of personality disorder. Journal of Personality Disorders, 19(5), 466–486.

Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five Factor Inventory (NEO-FFI) Professional Manual. Odessa, FL: Psychological Assessment Resources.

Damen, K. F., De Jong, C. A., & Van der Kroft, P. J. (2004). Interrater reliability of the structured interview for DSM-IV personality in an opioid-dependent patient sample. European Addiction Research, 10(3), 99–104.

Ekselius, L., Lindstrom, E., von Knorring, L., Bodlund, O., & Kullgren, G. (1994). SCID II interviews and the SCID Screen questionnaire as diagnostic tools for personality disorders in DSM-III-R. Acta Psychiatrica Scandinavica, 90(2), 120–123.

First, M. B., Bell, C. C., Cuthbert, B., Krystal, J. H., Malison, R., Offord, D. R., Reiss, D., Shea, T., Widiger, T., & Wisner, K. L. (2002). Personality disorders and relational disorders: A research agenda for addressing crucial gaps in DSM. In D. J. Kupfer, M. B. First, & D. A. Regier (Eds.), A Research Agenda for DSM-V (pp. 123–199). Washington, DC: American Psychiatric Press.

First, M. B., Gibbon, M., Spitzer, R. L., Williams, J. B. W., & Benjamin, L. S. (1997). Structured Clinical Interview for DSM-IV Axis II Personality Disorders (SCID-II). Washington, DC: American Psychiatric Press.

First, M. B., Skodol, A. E., Bender, D. S., & Oldham, J. M. (2018). User’s Guide for the SCID-5-AMPD: Structured Clinical Interview for the DSM-5 Alternative Model for Personality Disorders. Arlington, VA: American Psychiatric Association.

First, M. B., Williams, J. B. W., Benjamin, L. S., & Spitzer, R. L. (2016). Structured Clinical Interview for DSM-5 (SCID-5): User’s Guide. Arlington, VA: American Psychiatric Association.

Forbes, D., Creamer, M., & Biddle, D. (2001). The validity of the PTSD checklist as a measure of symptomatic change in combat-related PTSD. Behaviour Research and Therapy, 39(8), 977–986.

Grant, B. F., Dawson, D. A., Stinson, F. S., Chou, P. S., Kay, W., & Pickering, R. (2003). The Alcohol Use Disorder and Associated Disabilities Interview Schedule-IV (AUDADIS-IV): Reliability of alcohol consumption, tobacco use, family history of depression and psychiatric diagnostic modules in a general population sample. Drug and Alcohol Dependence, 71(1), 7–16.

Grant, B. F., Hasin, D. S., Stinson, F. S., Dawson, D. A., Chou, S. P., Ruan, W. J., & Pickering, R. P. (2004). Prevalence, correlates, and disability of personality disorders in the United States: Results from the national epidemiologic survey on alcohol and related conditions. Journal of Clinical Psychiatry, 65(7), 948–958.

Gunderson, J. G. (2013). Seeking clarity for future revisions of the personality disorders in DSM-5. Personality Disorders: Theory, Research, and Treatment, 4(4), 368–376.

Gunderson, J. G., Shea, M. T., Skodol, A. E., McGlashan, T. H., Morey, L. C., Stout, R. L., … Keller, M. B. (2000). The Collaborative Longitudinal Personality Disorders Study: Development, aims, design, and sample characteristics. Journal of Personality Disorders, 14(4), 300–315.

Hasin, D. S., & Grant, B. F. (2015). The National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) Waves 1 and 2: Review and summary of findings. Social Psychiatry and Psychiatric Epidemiology, 50(11), 1609–1640.

Haslam, N., Holland, E., & Kuppens, P. (2012). Categories versus dimensions in personality and psychopathology: A quantitative review of taxometric research. Psychological Medicine, 42(5), 903–920.

Helzer, J. E., Kraemer, H. C., & Krueger, R. F. (2006). The feasibility and need for dimensional psychiatric diagnoses. Psychological Medicine, 36(12), 1671–1680.

Herpertz, S. C., Huprich, S. K., Bohus, M., Chanen, A., Goodman, M., Mehlum, L., … Sharp, C. (2017). The challenge of transforming the diagnostic system of personality disorders. Journal of Personality Disorders, 31(5), 577–589.

Hopwood, C. J., Kotov, R., Krueger, R. F., Watson, D., Widiger, T. A., Althoff, R. R., … Zimmermann, J. (2018). The time has come for dimensional personality disorder diagnosis. Personality and Mental Health, 12(1), 82–86.

Jacobsberg, L., Perry, S., & Frances, A. (1995). Diagnostic agreement between the SCID-II screening questionnaire and the Personality Disorder Examination. Journal of Personality Assessment, 65(3), 428–433.

Jane, J. S., Pagan, J. L., Turkheimer, E., Fiedler, E. R., & Oltmanns, T. F. (2006). The interrater reliability of the Structured Interview for DSM-IV Personality. Comprehensive Psychiatry, 47(5), 368–375.

Johnson, J. G., Cohen, P., Kasen, S., Skodol, A. E., & Oldham, J. M. (2008). Cumulative prevalence of personality disorders between adolescence and adulthood. Acta Psychiatrica Scandinavica, 118(5), 410–413.

Kendell, R., & Jablensky, A. (2003). Distinguishing between the validity and utility of psychiatric diagnoses. American Journal of Psychiatry, 160(1), 4–12.

Kranzler, H. R., Kadden, R. M., Babor, T. F., & Rounsaville, B. J. (1994). Longitudinal, expert, all data procedure for psychiatric diagnosis in patients with psychoactive substance use disorders. Journal of Nervous and Mental Disease, 182(5), 277–283.

Krueger, R. F. (2013). Personality disorders are the vanguard of the post-DSM-5.0 era. Personality Disorders: Theory, Research, and Treatment, 4(4), 355–362.

Lenzenweger, M. F., Lane, M. C., Loranger, A. W., & Kessler, R. C. (2007). DSM-IV personality disorders in the National Comorbidity Survey Replication. Biological Psychiatry, 62(6), 553–564.

Loranger, A. W., Lenzenweger, M. F., Gartner, A. F., Susman, V. L., Herzig, J., Zammit, G. K., … Young, R. C. (1991). Trait-state artifacts and the diagnosis of personality disorders. Archives of General Psychiatry, 48(8), 720–728.

Loranger, A. W., Sartorius, N., Andreoli, A., Berger, P., Buchheim, P., Channabasavanna, S. M., … Regier, D. A. (1994). The International Personality Disorder Examination: The World Health Organization/Alcohol, Drug Abuse, and Mental Health Administration international pilot study of personality disorders. Archives of General Psychiatry, 51(3), 215–224.

McGilloway, A., Hall, R. E., Lee, T., & Bhui, K. S. (2010). A systematic review of personality disorder, race and ethnicity: Prevalence, aetiology and treatment. BMC Psychiatry, 10, 33.

Miller, J. D., Few, L. R., & Widiger, T. A. (2012). Assessment of personality disorders and related traits: Bridging DSM-IV-TR and DSM-5. In T. A. Widiger (Ed.), The Oxford Handbook of Personality Disorders (pp. 108–140). New York: Oxford University Press.

Millon, T., Millon, C., & Davis, R. (1997). MCMI-III Manual (2nd ed.). Minneapolis: National Computer Systems.

Monson, C. M., Gradus, J. L., Young-Xu, Y., Schnurr, P. P., Price, J. L., & Schumm, J. A. (2008). Change in posttraumatic stress disorder symptoms: Do clinicians and patients agree? Psychological Assessment, 20(2), 131–138.

Morey, L. C., Benson, K. T., Busch, A. J., & Skodol, A. E. (2015). Personality disorders in DSM-5: Emerging research on the alternative model. Current Psychiatry Reports, 17(4), 558.

Nussbaum, D., & Rogers, R. (1992). Screening psychiatric patients for Axis II disorders. Canadian Journal of Psychiatry, 37(9), 658–660.

O’Boyle, M., & Self, D. (1990). A comparison of two interviews for DSM-III-R personality disorders. Psychiatry Research, 32(1), 85–92.

O’Connor, B. P. (2002). The search for dimensional structure differences between normality and abnormality: A statistical review of published data on personality and psychopathology. Journal of Personality and Social Psychology, 83(4), 962–982.

Pfohl, B. (1995). Structured Interview for DSM-IV Personality (SIDP-IV). Iowa City, IA: University of Iowa College of Medicine.

Pfohl, B., Blum, N., Zimmerman, M., & Stangl, D. (1989). Structured Interview for DSM-III-R Personality: SIDP-R. Iowa City, IA: University of Iowa.

Pfohl, B., Stangl, D., & Zimmerman, M. (1983). Structured Interview for DSM-III Personality (SIDP). Iowa City, IA: University of Iowa Hospitals and Clinics.

Pilkonis, P. A., Heape, C. L., Proietti, J. M., Clark, S. W., McDavid, J. D., & Pitts, T. E. (1995). The reliability and validity of two structured diagnostic interviews for personality disorders. Archives of General Psychiatry, 52(12), 1025–1033.

Pilkonis, P. A., Heape, C. L., Ruddy, J., & Serrao, P. S. (1991). Validity in the diagnosis of personality disorders: The use of the LEAD standard. Psychological Assessment, 3, 46–54.

Samuel, D. B., Hopwood, C. J., Ansell, E. B., Morey, L. C., Sanislow, C. A., Markowitz, J. C., … Grilo, C. M. (2011). Comparing the temporal stability of self-report and interview assessed personality disorder. Journal of Abnormal Psychology, 120(3), 670–680.

Skodol, A. E., Morey, L. C., Bender, D. S., & Oldham, J. M. (2013). The ironic fate of the personality disorders in DSM-5. Personality Disorders: Theory, Research, and Treatment, 4(4), 342–349.

Somma, A., Fossati, A., Terrinoni, A., Williams, R., Ardizzone, I., Fantini, F., … Ferrara, M. (2016). Reliability and clinical usefulness of the personality inventory for DSM-5 in clinically referred adolescents: A preliminary report in a sample of Italian inpatients. Comprehensive Psychiatry, 70, 141–151.

Strickland, C. M., Hopwood, C. J., Bornovalova, M. A., Rojas, E. C., Krueger, R. F., & Patrick, C. J. (2019). Categorical and dimensional conceptions of personality pathology in DSM-5: Toward a model-based synthesis. Journal of Personality Disorders, 33(2), 185–213.

Torgersen, S., Czajkowski, N., Jacobson, K., Reichborn-Kjennerud, T., Roysamb, E., Neale, M. C., & Kendler, K. S. (2008). Dimensional representations of DSM-IV cluster B personality disorders in a population-based sample of Norwegian twins: A multivariate study. Psychological Medicine, 38(11), 1617–1625.

Trull, T. J., Jahng, S., Tomko, R. L., Wood, P. K., & Sher, K. J. (2010). Revised NESARC personality disorder diagnoses: Gender, prevalence, and comorbidity with substance dependence disorders. Journal of Personality Disorders, 24(4), 412–426.

Tyrer, P., Crawford, M., & Mulder, R. (2011). Reclassifying personality disorders. Lancet, 377(9780), 1814–1815.

Widiger, T. A. (2013). A postmortem and future look at the personality disorders in DSM-5. Personality Disorders: Theory, Research, and Treatment, 4(4), 382–387.

Widiger, T. A., & Costa, P. T., Jr. (2012). Integrating normal and abnormal personality structure: The Five-Factor Model. Journal of Personality, 80(6), 1471–1506.

Widiger, T. A., & Trull, T. J. (2007). Plate tectonics in the classification of personality disorder: Shifting to a dimensional model. American Psychologist, 62(2), 71–83.

Zanarini, M. C., Frankenburg, F. R., Chauncey, D. L., & Gunderson, J. G. (1987). The Diagnostic Interview for Personality Disorders: Interrater and test-retest reliability. Comprehensive Psychiatry, 28(6), 467–480.

Zanarini, M. C., Frankenburg, F. R., Sickel, A. E., & Yong, L. (1996). The Diagnostic Interview for DSM Personality Disorders (DIPD). Belmont, MA: McLean Hospital.

Zanarini, M. C., Skodol, A. E., Bender, D., Dolan, R., Sanislow, C., Schaefer, E., … Gunderson, J. G. (2000). The Collaborative Longitudinal Personality Disorders Study: Reliability of axis I and II diagnoses. Journal of Personality Disorders, 14(4), 291–299.