There is a movement in psychology to require the usage of psychological measures to aid in the diagnosis of some mental illnesses. While developing the DSM-V, there was some discussion of including results of specific measures as part of diagnostic criteria. Considering the two articles you read, Limitations of Diagnostic Precision and Predictive Utility in the Individual Case: A Challenge for Forensic Practice and Diagnostic Utility of the NAB List Learning Test in Alzheimer’s Disease and Amnestic Mild Cognitive Impairment, write a response paper discussing the utility of using psychological measures as suggested above in the diagnosis of mental illness.
ORIGINAL ARTICLE
Limitations of Diagnostic Precision and Predictive Utility
in the Individual Case: A Challenge for Forensic Practice
David J. Cooke · Christine Michie
Received: 24 August 2007 / Accepted: 2 February 2009 / Published online: 11 March 2009
American Psychology-Law Society/Division 41 of the American Psychological Association 2009
Abstract Knowledge of group tendencies may not assist
accurate predictions in the individual case. This has
importance for forensic decision making and for the
assessment tools routinely applied in forensic evaluations.
In this article, we applied Monte Carlo methods to examine
diagnostic agreement with different levels of inter-rater
agreement given the distributional characteristics of PCL-R
scores. Diagnostic agreement and score agreement were
substantially less than expected. In addition, we examined
the confidence intervals associated with individual predictions
of violent recidivism. On the basis of empirical
findings, statistical theory, and logic, we conclude that
predictions of future offending cannot be achieved in the
individual case with any degree of confidence. We discuss
the problems identified in relation to the PCL-R in terms of
the broader relevance to all instruments used in forensic
decision making.
There is an important disjunction between the perspective
of science and the perspective of the law; while science
seeks universal principles that apply across cases, the law
seeks to apply universal principles to the individual case.
Bridging these perspectives is a major challenge for psychology
(Faigman, 2007). It is recognized by statisticians
that knowledge of group tendencies—even when precise—
may not assist accurate evaluation of the individual case
(e.g., Colditz, 2001; Henderson & Keiding, 2005; Rockhill,
2001; Tam & Lopman, 2003). It is a statistical truism that
the mean of a distribution tells us about everyone, yet no
one. This has serious implications for the use of psychological
tests in forensic decision making. To illustrate these
limitations, we focus on one of the most widely used, and
perhaps the most extensively validated, test in the forensic
arena—the Psychopathy Checklist Revised (PCL-R1; Hare,
2003). We emphasize, however, that all psychological tests
used in the same way in the forensic arena will suffer from
similar limitations (e.g., VRAG, Quinsey, Harris, Rice, &
Cormier, 1998; Static-99, Hanson & Thornton, 1999;
COVR, Monahan et al., 2005).
Mental health professionals are frequently asked to
opine whether an individual might be violent in the future;
psychopathic personality disorder is an important risk
factor to consider (Hart, 1998). The PCL-R is the most
frequently used measure of psychopathic personality disorder;
it has been described as the ‘‘gold standard’’ for that
purpose (Edens, Skeem, Cruise, & Cauffman, 2001; as
cited in Hare, 2003). There can be little doubt that the PCL-R
has made a major contribution to our understanding of
violence (Hart, 1998); nonetheless, it is important for the
field to consider both its strengths and its limitations.
Findings for this instrument will have implications for less
well-validated tools. In this introduction, we consider two
issues; first, the use of PCL-R scores in forensic practice
and second, the general problem of the precision of predictions
about an individual case.
D. J. Cooke (✉) · C. Michie
Department of Psychology, Glasgow Caledonian University,
Glasgow G4 0BA, UK
e-mail: djcooke@rgardens.vianw.co.uk
1 The PCL-R is a 20-item rating scale of traits and behaviors intended
for use in a range of forensic settings. Definitions of each item are
provided and evaluators rate the lifetime presence of each item on a
3-point scale (0 = absent, 1 = possibly or partially present, and
2 = definitely present) on the basis of an interview with the
participant and a review of case history information.
Law Hum Behav (2010) 34:259–274
DOI 10.1007/s10979-009-9176-x
PCL-R SCORES AND FORENSIC PRACTICE
Much of the interest in the construct psychopathy comes
from the relationship between the PCL-R and future
criminal behavior (Lyon & Ogloff, 2000). Previous
research suggests that psychopathy—as assessed using the
Psychopathy Checklist-Revised (PCL-R; Hare, 1991)—is
an important risk marker for criminal and violent behavior
(Douglas, Vincent, & Edens, 2006; Hart, 1998; Hart &
Hare, 1997; Hemphill, Hare, & Wong, 1998; Leistico,
Salekin, DeCoster, & Rogers, 2008; Salekin, Rogers, &
Sewell, 1996). In fact, the PCL-R has been lauded as an
‘‘unparalleled’’ single predictor of violence (Salekin et al.,
1996). Hart (1998) argued that failure to consider psychopathy
in a violence risk assessment may constitute
professional negligence. This empirical base has resulted in
the PCL-R being used, not merely to measure the trait
strength of psychopathy in an individual, but also to make
predictions about what he or she will do in the future (Hare,
1993). As we demonstrate formally below, this additional
step of prediction means that the potential for imprecision
in forensic evidence is greatly increased: It expands the
gulf between inferences about groups and inferences about
individuals.
The PCL-R has been incorporated into statutory or legal
decision making (Hare, 2003). Within England and Wales,
a PCL-R score above a cut-off of 25 or 30 can lead to
detention in either a Special Hospital or a prison (Maden &
Tyrer, 2003); in certain Canadian provinces parole boards
explicitly consider PCL-R scores (Hare, 2003), and in
Texas psychopathy assessments are mandated by statute for
sexual predator evaluation (Edens & Petrila, 2006).2 The
PCL-R plays a role in criminal sentencing, including
decisions regarding indefinite commitment and capital
punishment, institutional placement and treatment, conditional
release, juvenile transfer, child custody, witness
credibility, civil torts, and indeterminate civil commitment
(DeMatteo & Edens, 2006; Fitch & Ortega, 2000; Hart,
2001; Hemphill & Hart, 2002; Lyon & Ogloff, 2000;
Walsh & Walsh, 2006; Zinger & Forth, 1998). The PCL-R
is regarded by many as the best method for operationalizing
the construct of psychopathy. For example, Lyon and
Ogloff (2000) argued that ‘‘…it is critical that the assessment
is made using the PCL-R’’ (p. 166) when evidence
about violence risk, based on psychopathy, is provided.
Because of its central role in forensic decision making it is
vital to assess its strengths and limitations and, by comparison,
the limitations of less well-validated procedures.
PREDICTIONS FOR INDIVIDUALS VERSUS
PREDICTIONS FOR GROUPS
Prediction is the raison d’être of many forensic instruments
(e.g., VRAG, Quinsey et al., 1998; Static-99, Hanson &
Thornton, 1999; COVR, Monahan et al., 2005). While this is
not true of the PCL-R its frequent use in forensic practice is
underpinned by the assumption—implicit or explicit—that it
can predict future offending (Walsh & Walsh, 2006). How
precise can such predictions be? The precision of any estimate
of a parameter (e.g., mean rate of recidivism of a group)
can be measured by the width of a confidence interval (CI); a
CI gives an estimated range of values, which is likely to
include an unknown population parameter. If independent
samples are taken repeatedly from the same population, and a
CI calculated for each sample, then a certain percentage
(confidence level) of the intervals will include the unknown
population parameter. Typically, 95% of these intervals
should include the unknown population parameter; other
intervals may be used (e.g., 68% and 99%). The width of this
interval provides a measure of the precision—or certainty—
that we can have in the estimate of the population parameter.
The width of a CI of a population parameter is linked, in part,
to the sample size used to estimate the population parameter
(see below for a more technical explanation).
The prevailing prediction paradigm has two stages.
First, the parameters (mean, slope, and variance) of a
regression model linking an independent variable (e.g.,
PCL-R score) to a dependent variable (e.g., likelihood of
reconviction) are estimated. Each of these parameters has
uncertainty associated with it, which can be expressed
by confidence bands about the regression line. Second, a
new case is selected and the PCL-R score is assessed, the
model is applied and the likelihood of reconviction is
estimated. The best estimate of the likelihood of reconviction
for a new case will be identical to the point on the
regression line for that PCL-R score. This new estimate has
a CI—also known as a prediction interval—that expresses
the precision, or certainty, that should be associated with
the prediction made about the new case. Often the two
steps are conflated, with the unrecognized assumption
being made that the prediction interval for the new case is
comparable to the CIs for the model. It is not (see below).
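The difference between the two kinds of interval can be illustrated with a minimal sketch. It assumes the simplest possible model (an intercept-only regression, i.e., a group mean), normally distributed scores, and a normal-approximation critical value of 1.96; the function name `interval_half_widths` and the simulated data are hypothetical and are not part of the original analysis.

```python
import math
import random
import statistics

Z_95 = 1.96  # normal approximation to the 95% critical value (assumption)


def interval_half_widths(sample, z=Z_95):
    """Return (CI half-width for the group mean, PI half-width for one new case)."""
    n = len(sample)
    s = statistics.stdev(sample)
    ci_half = z * s / math.sqrt(n)          # uncertainty about the group mean
    pi_half = z * s * math.sqrt(1 + 1 / n)  # uncertainty about a single new case
    return ci_half, pi_half


rng = random.Random(1)
for n in (25, 100, 400):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ci_half, pi_half = interval_half_widths(sample)
    # The ratio equals sqrt(n + 1): more data narrows the CI, but hardly the PI.
    print(n, round(ci_half, 3), round(pi_half, 3), round(pi_half / ci_half, 2))
```

Under these assumptions the confidence interval shrinks as the sample grows, whereas the prediction interval stays close to ±1.96 standard deviations however large the sample becomes.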
The problem of making predictions for individuals from
statistical models is now recognized in other disciplines. In
relation to medical risks, Rose (1992) expressed the position
clearly: ‘‘Unfortunately the ability to estimate the
average risk for a group, which may be good, is not matched
by any corresponding ability to predict which
individuals are going to fall ill soon’’ (p. 48). In relation to
reoffending, Copas and Marshall (1998) made a related
point ‘‘…the score is not a prediction about an individual
[italics added], but an estimate of what rate of conviction
2 The PCL-R is the most commonly used instrument for assessing
psychopathy in this setting (Mary Alice Conroy, Personal communication,
10 April 2007).
might be expected of a group [italics added] of offenders
who match that individual on the set of covariates used by
the score’’ (p. 170) (see also Altman & Royston, 2000;
Bradfield, Huntzickler, & Fruehan, 1970; Colditz, 2001;
Elmore & Fletcher, 2006; Henderson, Jones, & Stare, 2001;
Henderson & Keiding, 2005; Rockhill, 2001; Rockhill,
Kawachi, & Colditz, 2000; Tam & Lopman, 2003; Wald,
Hackshaw, & Frost, 1999).
It is not generally recognized that a risk factor must have
a very strong relative risk (i.e., >50) if it is to have utility as
a screening instrument at the individual level (Rockhill
et al., 2000; see also Kennaway, 1998). However, others set
the bar higher:
A risk factor has to be extremely strongly associated
with a disease within a population before it can be
considered to be a potentially useful screening test.
Even a relative odds of 200 between the highest and
lowest fifths will yield a detection rate of no more
than about 56% for a 5% false positive rate… (Wald
et al., 1999, p. 1564).
To put this in perspective, the relative risk for the
association between lung cancer and smoking is between
10 and 15 (Rockhill et al., 2000), depending on the definition
of exposure. The relative risk for the PCL-R and
recidivism is something of the order of 3 for general
recidivism and 4 for violent recidivism (Hare, 2003).
Does the application of current forensic tools provide an
adequate basis for testimony concerning the individual case?
In this article, we attempt to answer this question by considering
three issues pertaining to PCL-R data. How
confident can clinicians and legal decision makers be, first, in
the use of critical diagnostic cut-offs; second, in the
numerical value of PCL-R scores; and third, in individual
predictions of violent recidivism? We describe two studies.
The first study addresses the accuracy of diagnostic decisions
and the potential range of discrepancies between two raters.
The second study addresses the accuracy of prediction of
future violence in the individual case. The results have relevance
beyond the PCL-R to the use of other psychometric
instruments in forensic practice: The same limitations may
apply to many forensic assessment instruments.
STUDY ONE
In the first study, we examined diagnostic accuracy, specifically
the allocation of individuals around two critical
cut-offs, i.e., around 30 and around 25; the first is the
standard PCL-R cut-off for the diagnosis of psychopathy
and the second, often adopted in the UK, has proven useful
in that context including in decisions regarding treatment
allocation (Hare, 2003).
The inter-rater reliability figures presented in the PCL-R
manual can be regarded as good (Nunnally & Bernstein,
1994); intraclass correlation coefficients for single ratings
(ICC1) are estimated in some research studies as being
above .80 (Male offenders = .86; Male forensic psychiatric
patients = .88; Hare, 2003, Table 5.4).3 Edens and
Petrila (2006) indicated that these are probably ‘‘best case’’
estimates and ‘‘real world’’ reliabilities may be substantially
poorer.4 Murrie, Boccaccini, Johnson, and Janke
(2008), in one ‘‘real world’’ study, demonstrated poor
agreement (ICC1 = .39). These views and findings echo
concerns expressed by Hare (1998), that while researchers
take great pains to ensure reliability in their studies, the
level of reliability achieved by individual clinicians
remains unknown—and by implication—is likely to be
poorer than published studies. Inter-rater reliability is not
the only relevant consideration: Diagnostic precision is
also influenced by the underlying distribution of test scores.
Diagnostic precision is influenced by the location of the
cut-off and the shape of the distribution of scores—both
skewness and kurtosis. Estimates of the precision of a test
score (e.g., standard errors of measurement, SEM) are
weighted toward the mean of the distribution whereas cut-offs
are generally located substantially above the mean.
Item Response Theory (IRT) studies demonstrate that the
measurement precision of the PCL-R—in terms of measurement
information—falls toward the diagnostic cut-off
(Cooke, Michie, & Hart, 2006); thus, the SEM estimated
on the mean will provide an optimistic estimate of diagnostic
precision.
The SEM cannot be directly translated into estimates of
precision of diagnosis because of the impact of the score
distributions. Equally, it is not possible to estimate misclassification
rates directly using ICC1 values; therefore,
simulation approaches are required. Study one describes a
simulation that examines the impact of unreliability on
diagnostic accuracy.
Method
Monte Carlo studies allow the investigation of the properties
of distributions and estimates of parameters where
results cannot be derived theoretically (Mooney, 1997;
Robert, 2004). Large numbers of simulated datasets can be
3 The estimates of reliability are frequently obtained by re-rating the
same interview or with an observer simultaneously rating within an
interview. This will tend to inflate reliability, but not validity, as the
same information source is being used.
4 The case of THE PEOPLE, Plaintiff and Respondent, v. KURT
ADRIAN PARKER, a Sexually Violent Predator Act case, highlights
the variability that can emerge in some cases; five accredited experts
furnished five PCL-R scores that ranged from 10 to 25. (Edens, John,
Personal Communication, 22 May 2006).
created based on an explicit and replicable data-generation
process. The effect of known features designed into the
data, such as levels of inter-rater reliability, on outcomes,
such as diagnostic precision, can be assessed. Multiple
trials of procedures are carried out to allow precise estimation
of outcomes. Mooney (1997) argued that Monte
Carlo simulations could allow social scientists to test
classical parametric inference methods and provide more
accurate statistical models. In our view, this mainstream
statistical technique is underused in forensic research.
Materials
We used Monte Carlo techniques based on distribution
information from two datasets of PCL-R total scores: (1)
data for North American Male Offenders (Table 9.1, Hare,
2003) and (2) data from UK prisoners (Cooke, Michie,
Hart, & Clark, 2005).5
The first distribution, being the largest, probably provides
the best estimate of the true distribution of scores
underlying the PCL-R and is described as ‘‘approximately
normal’’ (Hare, 2003, p. 55). Given the potential impact of
a departure from normality we tested whether this distribution
was in fact normal. The departure from normality
was highly significant (Kolmogorov–Smirnov = .068,
df = 5408, p < .0001, Skewness = -.33, Kurtosis =
-.570). Examination of Fig. 1 demonstrates that around the
standard cut-off of 30 cases are over-represented while in
the right tail of the distribution they are under-represented.
In the simulation study, we generated two random variables
per case using MATHCAD.13 (2005). These random variables were
scaled according to one of the two datasets referred
to above with mean μ and standard deviation σ.6 This gives
two uncorrelated ratings (x1 and x2) from the distribution of
scores: x1 is our first rating on the subject, PCL1. We then
calculated a linear combination of the two ratings to provide a
second rating on the same subject, which has a correlation of
ρ with the first rating. The linear combination is
$$\mathrm{PCL}_2 = \operatorname{round}\left\{ \mu + (x_1 - \mu)\rho + (x_2 - \mu)\sqrt{1 - \rho^2} \right\},$$
using rounding
to ensure an integer score. This process gives two random,
correlated scores from the distribution. There is a very small
probability of obtaining second ratings less than 0 or greater
than 40: These scores have been taken as 0 or 40, respectively.
Assuming that the ICC1 represents the best estimate of
the correlation between the two scores, we estimated the
distributions for four values of reliability, i.e., ICC1 values
of .75, .80, .85, and .90. The .80 value is a lower bound
estimate for reasonable practice. Hare (1998) indicated that
at least this level should be achievable with ‘‘…properly
conducted assessments’’ (p. 107). The .85 level may be
achievable by one rater with good training; the .90 level,
perhaps the best case scenario, is the level achievable where
two independent sets of ratings are averaged. Values above
.90 are rarely if ever achievable—Hare (1998) describes .95
and higher as ‘‘unbelievably high’’ (p. 107). The .75 provides
a lower-bound estimate of what may be obtained in
clinical practice. These values probably represent optimistic
estimates for actual clinical practice; we did not assess the
‘‘worst case’’ scenarios implied by Edens and Petrila (2006)
and by Murrie et al. (2008). The estimation procedure was
repeated 10,000,000 times for each of the four levels of
ICC1 to provide stable estimates of the distribution of the
correlated ratings and to ensure at least 10,000 cases within
each of the extreme score bands. We examine discrepancies
in two ways: First, in terms of diagnostic disagreement and
second, in terms of disagreements about total scores.
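The data-generation process described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' Mathcad code: scores are drawn from a normal distribution rather than from the empirical score distributions, the mean and SD (`MU`, `SIGMA`) are illustrative values not taken from either dataset, and the function names are hypothetical.

```python
import math
import random

MU, SIGMA = 22.0, 8.0  # illustrative mean and SD, not taken from either dataset


def clamp_score(value):
    """Round to an integer and keep it inside the 0-40 PCL-R range."""
    return min(40, max(0, round(value)))


def simulate_pair(rho, rng, mu=MU, sigma=SIGMA):
    """Draw one pair of ratings whose correlation is approximately rho."""
    x1 = clamp_score(rng.gauss(mu, sigma))  # first rating, PCL1
    x2 = rng.gauss(mu, sigma)               # independent second draw
    pcl2 = clamp_score(mu + (x1 - mu) * rho + (x2 - mu) * math.sqrt(1 - rho ** 2))
    return x1, pcl2


def agreement_rates(rho, cutoff, n_pairs, seed=0):
    """Return proportions (both below, both at/above, different) for a cut-off."""
    rng = random.Random(seed)
    both_below = both_above = 0
    for _ in range(n_pairs):
        pcl1, pcl2 = simulate_pair(rho, rng)
        if pcl1 >= cutoff and pcl2 >= cutoff:
            both_above += 1
        elif pcl1 < cutoff and pcl2 < cutoff:
            both_below += 1
    different = n_pairs - both_below - both_above
    return both_below / n_pairs, both_above / n_pairs, different / n_pairs


for rho in (0.75, 0.80, 0.85, 0.90):
    print(rho, agreement_rates(rho, cutoff=30, n_pairs=100_000))
```

Because the sketch substitutes a normal distribution for the skewed empirical distributions, its output should be read as qualitative only; it will not reproduce the tabulated values.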
What is the level of diagnostic agreement? Kappa (κ)
coefficients measure the proportion of diagnostic agreements
corrected for observed base rates (Fleiss, 1981).
Conventionally, κ > .75 represents excellent agreement,
.40 ≤ κ ≤ .75 represents fair to good agreement and
κ < .40 represents poor agreement (Gail & Benichou,
2000). Kappa values for three distributions and the four
ICC1 values are given in Table 1. We calculated Kappa
coefficients for agreement in diagnosis between the two
ratings using both common cut-offs, i.e., 30 and 25. The
vast majority of Kappa values are only in the fair to good
range; few values approach the excellent range.
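As a worked check, Kappa can be recomputed from the agreement percentages printed in Table 1. The sketch below assumes that disagreements are split evenly between the two raters (so both raters share the same base rate), which is an assumption of this illustration rather than a statement from the article; the function name `kappa_from_agreement` is hypothetical.

```python
def kappa_from_agreement(both_below, both_above, different):
    """Cohen's kappa for a 2x2 table given agreement percentages.

    Assumes the 'different' cases split evenly between the two raters.
    """
    total = both_below + both_above + different
    a = both_above / total
    d = both_below / total
    off = different / total / 2          # each off-diagonal cell
    p_obs = a + d                        # observed agreement
    p_above = a + off                    # shared marginal rate at/above cut-off
    p_below = d + off
    p_exp = p_above ** 2 + p_below ** 2  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


# North American male offenders, cut-off 30 (values as printed in Table 1)
print(round(kappa_from_agreement(73.3, 11.5, 15.1), 2))  # correlation .80
print(round(kappa_from_agreement(75.5, 13.4, 11.1), 2))  # correlation .90
```

Under the even-split assumption, these two rows give values of .51 and .64, matching the tabulated coefficients; observed agreement of about 85% therefore corresponds to a Kappa of only about .5 once the low base rate of high scores is taken into account.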
Kappa is an omnibus statistic, which is useful for
summarizing group results; however, it tells us little about
agreement in the individual case. The potential for misclassification
is clearer when distributions of disagreements
Fig. 1 Distribution of North American male prisoners and normal curve (x-axis: PCL-R score, 0–40; y-axis: frequency, 0–300)
5 We carried out a similar analysis of data for Male Forensic
Psychiatric Patients (Table 9.2, Hare, 2003); the results, which
demonstrate the same pattern, can be obtained from the first author.
6 A full description of the simulation study including the Mathcad
code can be obtained from the first author.
are considered. The distributions based on the North
American Male Offenders are in Table 2. For ease of
interpretation, we tabulated the distributions in 5-point
ranges. Examination of the sub-table for ICC1 = .80
indicates that if one rater gives a score between 30 and 34,
i.e., just above the diagnostic cut-off then only in 46% of
occasions—approximately half the time—will the other
rater obtain a score within the same range. In 44% of the
occasions, the second rater would place the individual
below the critical cut-off. Even in the best case scenario,
i.e., ICC1 = .90, if one rater gives a score between 30 and
34 then only in 60% of occasions will the other rater obtain
a score within the same range. On 29% of occasions, the
second rater would place the participant below the critical
cut-off.
The distributions based on the UK prisoners are in
Table 3. Examination of the table for ICC1 = .80 indicates
that if one rater gives a score between 30 and 34, i.e., just
above the diagnostic cut-off then only in 39% of occasions
will the second rater obtain a score within the same range. In
54% of the cases, the second rater would place the individual
below the critical cut-off. As previously, even in the best case
scenario, i.e., ICC1 = .90, if one rater gives a score between
30 and 34 then only in 53% of cases will the other rater obtain
a score within the same range. In 39% of cases, the second
rater would place the participant below the critical cut-off.
In the UK, the cut-off of 25, as well as 30, is often applied
(DSPD Programme, 2005; Hare, 2003). Examination of the
table for ICC1 = .80 indicates that if one rater gives a score
between 25 and 29, i.e., just above the UK diagnostic cut-off,
then only in 29% of occasions will the other rater obtain a
score within the same range. On 49% of occasions, the second
rater would place the individual below the critical cut-off.
Even in the best case scenario, i.e., ICC1 = .90, if one
rater gives a score between 25 and 29 then only in 37% of
cases will the other rater obtain a score within the same
range. On 37% of occasions, the second rater would place the
participant below the critical cut-off.
Therefore, in broad terms, all of the findings reported
above demonstrate that the allocation of an individual
above or below diagnostic cut-offs is much less precise
than previously thought.
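The percentages quoted above are read directly from single columns of Tables 2 and 3. The following minimal sketch shows the arithmetic for one column (North American male offenders, correlation of .80, first rating of 30–34); the variable and function names are hypothetical, and the values are those printed in Table 2.

```python
BANDS = ["0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-40"]

# Column "30-34" of Table 2 at a correlation of .80: distribution of the
# second rating when the first rating falls between 30 and 34.
COLUMN_30_34 = [0.0, 0.0, 0.0, 0.0, 0.063, 0.379, 0.465, 0.093]


def split_around_cutoff(column, same_band, cutoff_band):
    """Return (P(second rating in same band), P(second rating below cut-off))."""
    same = column[same_band]
    below = sum(column[:cutoff_band])  # all bands lying below the cut-off band
    return same, below


same, below = split_around_cutoff(
    COLUMN_30_34,
    same_band=BANDS.index("30-34"),
    cutoff_band=BANDS.index("30-34"),
)
print(round(same, 3), round(below, 3))
```

The two results, .465 and .442, correspond to the 46% and 44% figures reported for this case.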
Another way of considering the precision of PCL-R
scores is to examine expected discrepancies in scores based
on variations in ICC1 while taking into account the distributional
characteristics of the PCL-R scores. The PCL-R
manual suggests that in 68% of cases the discrepancies
between two raters should be up to 3 points, and in 95% of
cases it should be up to 6 points (Hare, 2003). This assumes
normality of the PCL-R score distribution, an assumption
that is not met (see above). The cumulative distribution of
score discrepancies estimated from the Monte Carlo studies
is tabulated in Table 4. With the North American prisoner
sample and an ICC1 of .80, a discrepancy of between 8 and
9 points would be expected in 9% of cases, around 10
points in 5% of cases, and between 12 and 13 points in 1%
of cases. With the UK prisoner sample, and an ICC1 of .80,
a discrepancy of between 8 and 9 points would be expected
in 9% of cases, around 10 points in 5% of cases, and
around 12 points in 1% of cases.
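For comparison, the discrepancies implied by the manual's normal-theory statement can be computed directly. The sketch below assumes that discrepancies between two raters are normally distributed with mean 0 and a standard deviation of 3 points, as the 68%/3-point and 95%/6-point figures imply; the function name is hypothetical.

```python
from statistics import NormalDist


def prob_discrepancy_at_least(points, sd=3.0):
    """Two-sided probability that two ratings differ by at least `points`."""
    return 2 * (1 - NormalDist(0.0, sd).cdf(points))


for points in (3, 6, 8, 10):
    print(points, round(prob_discrepancy_at_least(points), 3))
```

Under this assumption a discrepancy of 8 points or more would occur in fewer than 1% of cases, far less often than the simulated values for an ICC1 of .80 suggest.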
An alternative approach to summarize the range of
possible discrepancies is to estimate the distribution of a
2nd PCL-R rating given the 1st PCL-R rating. This conditional
distribution can be summarized by a CI that
contains 95% of the 2nd ratings. This interval is thus
defined by the lower and upper limits LL and UL given by
$$\Pr(LL \le \text{2nd rating} \le UL \mid \text{1st rating}) = 0.95.$$
Results for both 68% and 95% CIs for ICC1 = .80, and
for both samples, are presented in Table 5. For example, in
the North American prisoner sample, if rater one obtains a
total score of 30, then the 95% CI for rater two’s total score
will be between 19 and 36 (i.e., between the 35th and 99th
percentile).
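A normal-theory approximation to this conditional interval follows from the linear combination used in the simulation: given a first rating x1, the second rating has mean μ + ρ(x1 − μ) and standard deviation σ√(1 − ρ²). The sketch below is an illustration only; the mean and SD passed in the example are illustrative values rather than the sample statistics, and the function name is hypothetical.

```python
import math

Z_95 = 1.96  # normal approximation to the 95% critical value (assumption)


def conditional_interval(first_rating, rho, mu, sigma, z=Z_95):
    """Approximate 95% interval for a second rating given the first rating."""
    center = mu + rho * (first_rating - mu)             # regression to the mean
    half = z * sigma * math.sqrt(1 - rho ** 2)          # conditional spread
    lower = max(0.0, center - half)                     # keep within 0-40
    upper = min(40.0, center + half)
    return lower, center, upper


# Illustrative mean and SD (assumed values, not the sample statistics)
print(conditional_interval(first_rating=30, rho=0.80, mu=22.0, sigma=8.0))
```

Unlike the simulation-based limits in Table 5, this approximation is symmetric about the conditional mean and ignores the shape of the score distribution, so it should be expected to differ from the tabulated limits, particularly at the upper end.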
All the estimates in this study are conservative; that is,
they assume that the SEM that applies at the mean applies
Table 1 Kappa coefficients and levels of agreement for four levels of correlation (ρ) for two distributions
ρ      Both < 30   Both ≥ 30   Different   κ     Both < 25   Both ≥ 25   Different   κ
North American male offenders
0.75   72.9        10.6        16.4        .46   48.0        29.6        22.4        .54
0.80   73.3        11.5        15.1        .51   48.9        31.0        20.0        .59
0.85   74.4        12.2        13.5        .56   50.5        32.3        17.2        .64
0.90   75.5        13.4        11.1        .64   51.7        34.2        14.1        .71
United Kingdom prisoners
0.75   91.8        2.1         6.1         .38   79.8        7.6         12.6        .47
0.80   92.0        2.4         5.6         .43   83.1        6.8         10.2        .52
0.85   92.4        2.7         5.0         .49   81.2        9.1         9.7         .60
0.90   92.7        3.1         4.2         .57   82.0        10.1        7.9         .67
around the cut-off. However, this is an unwarranted
assumption. The overall variance of errors of measurement is
a weighted average of the errors that pertain across the range
of true score values. Precision of measurement of the PCL-R
drops as scores approach the diagnostic cut-off (e.g., Cooke
& Michie, 1997; Cooke et al., 2006). Thus, the degree of
diagnostic misclassification and score discrepancy is likely
to be greater in practice than demonstrated in the simulation
above. The conditional SEM (CSEM)7 is the square root of
the variance of errors at a particular level of true scores. To
Table 2 Distribution of diagnostic disagreements by four levels of
correlation between raters based on distribution of North American
male offenders
PCL-R score
0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–40
ρ = 0.75
0–4 .209 .086 .033 .006 0 0 0 0
5–9 .505 .305 .200 .079 .008 0 0 0
10–14 .245 .318 .250 .188 .072 .005 0 0
15–19 .041 .235 .266 .270 .239 .078 .002 0
20–24 0 .053 .192 .256 .310 .281 .097 0
25–29 0 .002 .056 .155 .236 .369 .371 .130
30–34 0 0 .003 .044 .121 .230 .445 .551
35–40 0 0 0 .001 .014 .037 .085 .319
ρ = 0.80
0–4 .245 .096 .032 .002 0 0 0 0
5–9 .523 .372 .211 .065 .003 0 0 0
10–14 .217 .315 .288 .198 .055 .001 0 0
15–19 .014 .198 .280 .306 .240 .055 0 0
20–24 0 .019 .162 .269 .340 .283 .063 0
25–29 0 0 .026 .137 .247 .391 .379 .076
30–34 0 0 0 .023 .106 .237 .465 .585
35–40 0 0 0 0 .010 .031 .093 .339
ρ = 0.85
0–4 .285 .105 .021 0 0 0 0 0
5–9 .552 .423 .217 .038 0 0 0 0
10–14 .158 .328 .333 .202 .028 0 0 0
15–19 .005 .141 .303 .354 .231 .024 0 0
20–24 0 .003 .121 .287 .386 .265 .029 0
25–29 0 0 .005 .111 .266 .430 .351 .038
30–34 0 0 0 .008 .085 .252 .519 .569
35–40 0 0 0 0 .003 .029 .101 .393
ρ = 0.90
0–4 .361 .103 .009 0 0 0 0 0
5–9 .578 .486 .215 .012 0 0 0 0
10–14 .061 .348 .390 .190 .010 0 0 0
15–19 0 .063 .319 .422 .216 .005 0 0
20–24 0 0 .067 .304 .449 .238 .004 0
25–29 0 0 0 .071 .267 .500 .289 .006
30–34 0 0 0 0 .056 .239 .609 .494
35–40 0 0 0 0 0 .018 .098 .501
The tables show column percentages, which sum to 1 within rounding
error. The rows therefore do not sum to 1
Table 3 Distribution of diagnostic disagreements by four levels of correlation
between raters based on distribution of UK prisoners
PCL-R score
0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–40
ρ = 0.75
0–4 .395 .158 .064 .020 .002 0 0 0
5–9 .477 .352 .228 .107 .035 .002 0 0
10–14 .128 .332 .309 .219 .131 .032 0 0
15–19 0 .153 .271 .336 .265 .198 .041 0
20–24 0 .005 .119 .215 .316 .297 .266 .031
25–29 0 0 .009 .086 .162 .261 .291 .254
30–34 0 0 0 .017 .083 .191 .328 .529
35–40 0 0 0 0 .006 .020 .075 .186
ρ = 0.80
0–4 .438 .163 .059 .011 0 0 0 0
5–9 .485 .387 .232 .097 .014 0 0 0
10–14 .077 .354 .328 .229 .111 .016 0 0
15–19 0 .095 .296 .354 .289 .154 .023 0
20–24 0 0 .083 .237 .331 .321 .214 .007
25–29 0 0 .002 .067 .176 .287 .303 .224
30–34 0 0 0 .005 .077 .200 .387 .545
35–40 0 0 0 0 .002 .022 .073 .224
ρ = 0.85
0–4 .470 .168 .042 .003 0 0 0 0
5–9 .474 .437 .224 .069 .003 0 0 0
10–14 .056 .342 .375 .226 .075 .003 0 0
15–19 0 .053 .308 .404 .287 .106 .001 0
20–24 0 0 .050 .251 .383 .325 .140 0
25–29 0 0 0 .047 .194 .321 .328 .145
30–34 0 0 0 0 .058 .229 .447 .578
35–40 0 0 0 0 0 .017 .084 .277
ρ = 0.90
0–4 .530 .169 .020 0 0 0 0 0
5–9 .450 .501 .219 .033 0 0 0 0
10–14 .019 .312 .457 .207 .037 0 0 0
15–19 0 .018 .286 .489 .275 .050 0 0
20–24 0 0 .017 .252 .450 .318 .062 0
25–29 0 0 0 .018 .214 .367 .325 .046
30–34 0 0 0 0 .024 .257 .528 .587
35–40 0 0 0 0 0 .008 .085 .367
7 Professional standards indicate that the CSEM is an important piece
of information that should be provided in a test manual. For example,
Standard 2.14 ‘‘Conditional standard error of measurements should be
reported at several score levels if constancy cannot be assumed.
Where cut scores are specified for selection or classification, the
standard errors of measurement should be reported in the vicinity of
each cut score.’’ (American Educational Research Association/
American Psychological Association, 1999; p. 35 emphasis added).
evaluate the true level of agreement of diagnosis likely to
apply around a cut-off it is necessary to take the CSEM into
account.
Item Response Theory indicates that the error of measurement
varies with location on the trait (θ).
IRT gives
$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}$$
where I(θ) is the information at θ.
CTT gives
$$SEM = SD\sqrt{1 - \rho}$$
Let ρ1 be the correlation at location 1 (θ1), ρ2 be the
correlation at location 2 (θ2).
Then
$$\rho_2 = 1 - (1 - \rho_1)\frac{I(\theta_1)}{I(\theta_2)}$$
Location 1 is θ = 0.0 (PCL-R = 20) and Location 2 is
θ = 1.0 (PCL-R = 30) (Approximate locations from Hare,
2003; Fig. 6.6; see also Cooke & Michie, 1997). Overall,
the impact of the location of the estimated ICC1 is limited,
dropping—at a maximum—from .75 to .69. However, as
noted above, even small drops in ICC1 (e.g., from .85 to
.80) can substantially affect the misclassification rate and
the range of likely score discrepancies (see Table 6). It is
noteworthy that the magnitude of the drop appears to be
proportionately larger the poorer the mean estimated level
of inter-rater reliability. This suggests that the effect of the
CSEM is larger in cases that start with a relatively poor
level of inter-rater reliability. Equally, this would suggest
that proportionately greater discrepancies would, in
general, be obtained when factor or facet scores are
considered because they have lower levels of reliability
than the total scores (Hare, 2003).
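The adjustment of reliability for location on the trait can be sketched numerically. The information ratio used below (1.24) is not taken from the test manual; it is back-calculated from the reported maximum drop from .75 to .69, since (1 − .69)/(1 − .75) = 1.24, and is applied to the other reliability levels purely for illustration. The function name is hypothetical.

```python
def reliability_at_location(rho1, info1, info2):
    """Reliability at location 2 implied by reliability rho1 at location 1.

    Follows from SEM^2 = SD^2 * (1 - rho) being proportional to 1 / I(theta).
    """
    return 1 - (1 - rho1) * info1 / info2


INFO_RATIO = 1.24  # assumed I(theta1) / I(theta2), back-calculated from .75 -> .69

for rho1 in (0.75, 0.80, 0.85, 0.90):
    rho2 = reliability_at_location(rho1, INFO_RATIO, 1.0)
    print(rho1, round(rho2, 3), round(rho1 - rho2, 3))
```

For a fixed information ratio, the absolute drop equals (1 − ρ1) multiplied by (ratio − 1), so it grows as the starting reliability falls, which is consistent with the pattern described above.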
STUDY TWO
The use of the PCL-R in court is frequently justified based
on its predictive utility, the support being garnered from
between-subject designs (Edens & Petrila, 2006; Hare,
2003; Walsh & Walsh, 2006). In this study, we are concerned
with the individual. We examine the confidence that
can be placed in a prediction that an individual with a
particular PCL-R score will be reconvicted for a violent
offence.
All measurements and estimates entail error. As noted
above, the degree of error is expressed by CIs. For
Table 4 Cumulative distribution of expected discrepancies between two raters for different levels of correlation based on two sample distributions
                            North American male offenders    United Kingdom prisoners
                            Correlation                      Correlation
Point discrepancy   SEMa    0.75    0.80    0.85    0.90     0.75    0.80    0.85    0.90
0                   1.00    1.00    1.00    1.00    1.00     1.00    1.00    1.00    1.00
1                   .741    .934    .927    .919    .901     .936    .928    .918    .900
2                   .503    .804    .785    .758    .704     .804    .785    .758    .703
3                   .317    .679    .647    .595    .518     .681    .647    .595    .518
4                   .182    .560    .516    .453    .352     .561    .518    .454    .357
5                   .095    .449    .397    .327    .217     .452    .400    .332    .222
6                   .046    .349    .292    .215    .121     .353    .298    .221    .125
7                   .020    .262    .205    .136    .058     .269    .212    .141    .063
8                   .007    .190    .137    .079    .023     .196    .142    .083    .026
9                   .002    .132    .086    .041    .007     .137    .090    .044    .009
10                  .001    .088    .050    .018    .002     .091    .054    .021    .003
11                  –       .055    .026    .007    –        .058    .030    .009    .001
12                  –       .032    .012    .002    –        .035    .015    .004    –
13                  –       .017    .005    –       –        .019    .007    .001    –
14                  –       .008    .002    –       –        .010    .003    –       –
15                  –       .003    –       –       –        .005    .001    –       –
16                  –       .001    –       –       –        .002    –       –       –
17                  –       –       –       –       –        .001    –       –       –
a This column shows the cumulative distribution of discrepancies which was calculated assuming that discrepancies between two raters are
normally distributed and that the SEM is 3 (Hare, 2003, pp. 66–67)
Law Hum Behav (2010) 34:259–274 265
example, while the mean rate of reoffending for a “High
Risk” group may be estimated as 55%, the 95% CI
indicates that the true mean rate of reoffending
for this group will lie between 44% and 66%, 95% of the
time, i.e., 19 times out of 20 (Hart, Michie, & Cooke,
2007). However, the clinician and the decision maker are
interested in the individual case not the group. Therefore,
how much confidence can the clinician and decision maker
have in predictions of reoffending in the individual case
based on PCL-R scores? We examine CIs for group and
individual predictions.
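The normal-theory reference column in Table 4 (the column marked a) can be approximated with a few lines of code. This is a sketch under a stated assumption: the discrepancy between two raters is treated as normally distributed with mean 0 and standard deviation 3, which is one reading of the table footnote. The function name is hypothetical. On that reading the calculation gives values close to several tabulated entries, for example about .74 for a discrepancy of at least 1 point and about .32 for at least 3 points.

```python
from statistics import NormalDist


def prob_discrepancy_at_least(k: float, sd: float = 3.0) -> float:
    """Two-sided tail probability P(|D| >= k) for D ~ Normal(0, sd).

    sd = 3 is an assumption based on the Table 4 footnote (SEM of 3).
    """
    return 2.0 * (1.0 - NormalDist(0.0, sd).cdf(k))


for k in range(0, 11):
    print(k, round(prob_discrepancy_at_least(k), 3))
```

The empirical columns of Table 4 fall away more slowly than this reference column, which is the sense in which large discrepancies are more likely than the normal-theory SEM suggests.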
Participants
Two hundred fifty-five male prisoners between 18 and
40 years of age (M = 26.8, SD = 5.9) were interviewed in
Scotland’s largest prison for a study of psychological
characteristics and violence (Cooke, Michie, & Ryan,
2001; Michie & Cooke, 2006). Prisoners were selected by
systematic random sampling of the prison. The average
sentence length was 39 months (SD = 23 months; range
= 3 months to 10 years and life).
PCL-R Ratings
PCL-R ratings were made according to instructions in the
test manual (Hare, 1991). All PCL-R evaluations were
conducted by trained raters using both interview and file
review (ICC1 = .86).
Assessment of Recidivism
Reconviction data were obtained from two sources: The
Scottish Criminal Records Office (SCRO) and the Police
National Computer (PNC). The average follow-up period
was 29 months. The point-biserial correlation between
PCL-R scores and recidivism (r = .31) was above average
for the field (Walters, 2003). For the purposes of illustration,
we consider reconviction for violence that resulted in
a prison sentence (i.e., generally a more serious violent
offence). Follow-up data were available for 190 cases and
PCL-R data for 184 of these.
Table 5 The 68% and 95% confidence intervals for 2nd PCL-R total
score given 1st PCL-R score and ICC = 0.8

1st      Prisoners                    UK
PCL-R    LL    LL    UL    UL         LL    LL    UL    UL
         .95   .68   .68   .95        .95   .68   .68   .95
 0        0     0     9    12          0     0     8    13
 1        0     0    10    13          0     0     9    14
 2        0     1    11    14          0     0    10    15
 3        0     2    12    15          0     1    11    16
 4        0     3    12    15          0     1    12    16
 5        0     4    13    16          0     2    12    17
 6        0     4    14    17          0     3    13    18
 7        0     5    15    18          1     4    14    19
 8        1     6    16    19          2     5    15    20
 9        2     7    16    19          2     5    16    20
10        3     8    17    20          3     6    16    21
11        4     8    18    21          4     7    17    22
12        4     9    19    22          5     8    18    23
13        5    10    20    23          6     9    19    24
14        6    11    20    23          6     9    20    24
15        7    12    21    24          7    10    21    25
16        8    12    22    25          8    11    21    26
17        8    13    23    26          9    12    22    27
18        9    14    24    27         10    13    23    28
19       10    15    24    27         10    13    24    28
20       11    16    25    28         11    14    24    29
21       12    16    26    29         12    15    25    30
22       12    17    27    30         13    16    26    31
23       13    18    28    31         14    17    27    32
24       14    19    28    31         14    17    28    32
25       15    20    29    32         15    18    28    33
26       16    20    30    33         16    19    29    34
27       16    21    31    34         17    20    30    35
28       17    22    32    35         18    21    31    36
29       18    23    32    35         18    21    32    36
30       19    24    33    36         19    22    32    37
31       20    24    34    37         20    23    33    38
32       20    25    35    38         21    24    34    39
33       21    26    36    38         22    25    35    40
34       22    27    36    39         22    25    36    40
35       23    28    37    40
36       24    28    38    40         24    27    37    40
37       24    29    39    40         25    28    38    40
38       25    30    40    40         26    29    39    40
39       26    31    40    40
40       27    32    40    40
Table 6 Values of conditional standard error of measurement at
diagnostic cut-off of 30 for different values of SEM and distributions
of the three samples

Sample                           SEM ρ1    CSEM ρ2
North American male prisoners    0.75      0.70
                                 0.80      0.76
                                 0.85      0.82
                                 0.90      0.88
United Kingdom prisoners         0.75      0.67
                                 0.80      0.74
                                 0.85      0.80
                                 0.90      0.87
Analysis
There are standard methods for estimating CIs for groups;
however, methods for estimating CIs for the individual
case are not generally covered in the standard statistical
texts used in psychology and they may, we suspect, be
unfamiliar to the majority of psychologists. We explicate
the method here. First, we consider the general case of CI
estimation before considering the specific approach based
on linear logistic regression used for our analysis.
Any CI has the general form:
$$\text{Estimate} \pm t \times (\text{estimate of standard error})$$
where t is the Student’s t-statistic with the appropriate
degrees of freedom.
Suppose we are interested in a single variable, e.g.,
x = IQ, and have taken a sample of size n ($x_1, x_2, \ldots, x_n$) to
estimate the mean and variance of IQ in the population of
interest, then the sample mean ($\bar{x}$) is the estimate of the
population mean. The accuracy of this estimate is given by
a CI
$$\bar{x} \pm t_n \sqrt{\frac{s^2}{n}}$$
where s is an estimate of the standard deviation of x.
Suppose we are now interested in predicting the next
observation in the population, $x_{n+1}$. Then a CI for the
prediction (i.e., the prediction interval) is given by
$$\bar{x} \pm t_n \sqrt{s^2\left(1 + \frac{1}{n}\right)}.$$
Note that the estimate of both the mean and the
prediction is $\bar{x}$ but that the prediction interval is (much)
wider than the CI for the mean. Note also that the size of
the sample from which the model was derived has little
influence on the width of the prediction interval.
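A small numerical sketch may make the contrast concrete. It assumes a standard deviation of 15 for the IQ example (a conventional value, not given in the text) and, for simplicity, uses the normal quantile of about 1.96 in place of the t-statistic, which is a reasonable stand-in only when n is large. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Normal quantile used in place of t_n; adequate only for large n (assumption).
Z95 = NormalDist().inv_cdf(0.975)


def ci_half_width(s: float, n: int) -> float:
    """Half-width of the 95% CI for the mean: q * sqrt(s^2 / n)."""
    return Z95 * sqrt(s ** 2 / n)


def pi_half_width(s: float, n: int) -> float:
    """Half-width of the 95% prediction interval: q * sqrt(s^2 * (1 + 1/n))."""
    return Z95 * sqrt(s ** 2 * (1.0 + 1.0 / n))


S_ASSUMED = 15.0  # assumed SD of IQ scores, for illustration only
for n in (100, 1000, 10000):
    print(n, round(ci_half_width(S_ASSUMED, n), 2), round(pi_half_width(S_ASSUMED, n), 2))
```

As n grows a hundredfold, the CI for the mean shrinks tenfold, while the prediction interval stays close to 1.96 times the standard deviation because the 1 inside the bracket dominates the 1/n term.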
In the linear regression situation, we have a sample of n
pairs of observations ($(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$) from
which we estimate the intercept and slope of the line by B0
and B1 in the usual way. The accuracy of estimation of the
line would be given by the CI for the mean y for a given x.
This is calculated in the standard manner (Steel, Torrie, &
Dickey, 1997).
$$y_L, y_U = B_0 + B_1 x \pm t_n \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{SS(X)}\right)}$$
If we have a new case for which we know the x-value,
$x_{n+1}$, and wish to predict the y-value, this is given by
$$\hat{y}_{n+1} = B_0 + B_1 x_{n+1}$$
which is the mean value of y for the given x. The CI for this
prediction (i.e., the prediction interval) is given by
$$y_L, y_U = B_0 + B_1 x_{n+1} \pm t_n \sqrt{\hat{\sigma}^2\left(1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{SS(X)}\right)}$$
Again this prediction interval is much wider than the
interval for the line and is not influenced to any significant
degree by the size of the sample from which the model was
developed. The square root term is the standard error of the
predicted value. Here, the expression in brackets takes into
account three sources of error. The first is the variability in
participants, the second is the error in the estimate of
variance ($\hat{\sigma}^2$); and the third allows for the fact that the error
in prediction varies with distance from the mean PCL-R
score.
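The two regression intervals differ only by the leading 1 inside the bracket, and a short sketch shows how much that term matters. The sample size of 184 and the mean score of 12.5 are taken from this study; the residual variance, SS(X), and the critical value of about 1.97 are hypothetical inputs chosen for illustration, and the function name is likewise an assumption.

```python
from math import sqrt


def interval_half_widths(sigma2_hat, n, x, x_bar, ss_x, t_crit):
    """Return (CI half-width for the line, prediction-interval half-width) at x."""
    # Terms shared by both intervals: 1/n and the distance-from-mean term.
    leverage = 1.0 / n + (x - x_bar) ** 2 / ss_x
    ci_half = t_crit * sqrt(sigma2_hat * leverage)
    # The prediction interval adds 1 for the variability of a single new case.
    pi_half = t_crit * sqrt(sigma2_hat * (1.0 + leverage))
    return ci_half, pi_half


# Hypothetical inputs (not reported in the article).
SIGMA2_HAT = 1.0
SS_X_AT_184 = 9000.0
T_CRIT = 1.97

for n in (184, 1840):
    # SS(X) grows roughly in proportion to n for a fixed score distribution.
    ci_half, pi_half = interval_half_widths(
        SIGMA2_HAT, n, 25.0, 12.5, SS_X_AT_184 * n / 184, T_CRIT
    )
    print(n, round(ci_half, 3), round(pi_half, 3))
```

With these inputs, a tenfold increase in n narrows the interval for the line by a factor of about three, while the prediction interval barely moves.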
Linear logistic regression is the appropriate method for
modeling the prediction of a binary outcome (e.g.,
reconviction). In linear logistic regression the model is
given by
$$\Pr(\text{event}) = \frac{1}{1 + e^{-Z}}$$
where
$$Z = B_0 + B_1 x$$
We have a linear regression of Z on x so the equation for
the CI for Z is the same as the linear regression case. A
prediction interval for Z for a new individual from the same
population with score $x_0$ can be constructed by $Z_L$ and $Z_U$
(lower and upper values, respectively) from the equation
$$Z_L, Z_U = B_0 + B_1 x_0 \pm t_n \sqrt{\hat{\sigma}^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(X)}\right)}$$
and the interval is then transformed in the following
manner. Since the probability is a monotonic function, the
prediction interval for the probability is given by
$$\left(\frac{1}{1 + e^{-Z_L}},\; \frac{1}{1 + e^{-Z_U}}\right)$$
(note, once again, that the size of sample on which the
model was developed has little influence on the width of
the prediction interval).
Initially, we estimated the linear regression of Z on the
PCL-R total score (Fig. 2), and then estimated the mean
probability of reconviction by PCL-R score (Fig. 3). This is
a monotonic function with the probability of reconviction
accelerating with increasing PCL-R score. Those with an
average PCL-R score (i.e., 12.5) had a 14% probability of
being reconvicted for a violent crime and being sentenced
to prison. Examination of the 95% CIs for the estimate of
the mean rate of reconviction indicated that for an average
PCL-R score of 12.5 the true probability of reconviction
was between 10% and 20% (19 times out of 20) and for a
PCL-R score of 25 the 95% CI was 18–54%. For a score of
30, the 95% CI was 21–70% demonstrating that 95% CIs
generally widen the further scores are from the mean.
Fundamentally, however, within the clinical or judicial
context, the individual—not the group—is the focus of
decision making. Therefore, we estimated the CIs for the
likelihood that an individual would be reconvicted using
the method outlined above. For an individual with a mean
PCL-R score of 12.5, the best estimate was that he will
reoffend 14% of the time; but this estimate was very
imprecise because the 95% CI was 0–98%, i.e., the true
value of the prediction would lie in this range 95% of the
time. For an individual with a score of 25, the 95% CI was
0–99% and for an individual with a score of 30 the 95% CI
was 0–99.5%.
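The reported intervals can be checked for internal consistency on the logit scale. The sketch below assumes, as the equations above imply, that the prediction interval is symmetric around the point estimate in Z and is then mapped back to a probability. It takes a reported point estimate and upper limit, backs out the half-width in Z, and returns the lower limit this implies. The function names are hypothetical.

```python
from math import exp, log


def logistic(z: float) -> float:
    """Map Z to a probability: 1 / (1 + e^(-Z))."""
    return 1.0 / (1.0 + exp(-z))


def logit(p: float) -> float:
    """Inverse of the logistic function."""
    return log(p / (1.0 - p))


def implied_lower_limit(point: float, upper: float) -> float:
    """Lower probability limit implied by a Z interval symmetric about logit(point)."""
    z_hat = logit(point)
    half_width = logit(upper) - z_hat
    return logistic(z_hat - half_width)


# Point estimates and upper limits as reported for PCL-R scores of 12.5 and 25.
for point, upper in ((0.14, 0.98), (0.33, 0.99)):
    print(point, upper, implied_lower_limit(point, upper))
```

In both cases the implied lower limit is well under 1%, in line with the reported lower bound of 0%.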
To illustrate the extent of the uncertainty associated with
an individual prediction, we calculated the probability
density function associated with a prediction that an individual
with a PCL-R score of 25 would return to prison
within 2 years for a violent crime (point estimate .33). The
probability function is a means of describing degree of
uncertainty. It can be viewed as a smoothed version of a
histogram depicting relative frequencies of the range of
probabilities of reoffending consequent on the variability in
the original sample. Figure 4 displays a relatively flat
probability density function around the point estimate of
.33, with values ranging from 0 to 1.0, indicating that a
broad range of values is likely in any individual case.
One anonymous reviewer made the compelling case that
a more liberal definition of harmful behavior that included
more forms of offending should be considered. We carried
out the same analyses with three other outcome variables,
i.e., convictions for any crime or offence within 2 years of
release; any convictions leading to incarceration and any
conviction for a violent crime over the same time period.
Figure 5 displays results for any convictions within 2 years
of release and reveals the same pattern as for violent crime:
The CI associated with the regression line being much
narrower than the prediction interval.8
Another anonymous reviewer suggested that our results
might be due to sample size (this is highly unlikely given
the mathematical basis for the analysis, see above) or
because of where the sample was drawn. We carried out
further analyses to clarify these points using data from the
MacArthur study of Mental Disorder and Violence (Monahan
et al., 2001). Psychopathy as measured using the
Psychopathy Checklist: Screening Version (PCL:SV; Hart,
Cox, & Hare, 1995) was the strongest risk factor for future
violence in that study, i.e., violence in the 20 weeks following
discharge; the sample size with PCL:SV ratings was
over four times the Scottish sample (n = 860). The point-biserial
correlation between PCL:SV scores and recidivism
was similar to the equivalent correlation in the Scottish
sample (r = 0.34, cf. r = 0.31). Figure 6 displays essentially
the same pattern as the Scottish data: The slope is
similar in shape to the Scottish curve, a slowly accelerating
curve with risk of violence increasing with PCL:SV score.
As would be expected from consideration of the equations
above, increasing the sample size (among other things) has
resulted in a narrower CI around the regression line. But
critically—as would be expected from the mathematics—
increasing the sample size does not result in a narrower
prediction interval.
Perhaps a nonpsychological example may facilitate
explanation. Given someone’s height, how well can we
predict his or her weight? This example has several
advantages for the purpose of illustrating the pervasive
nature of the problem of predicting in the individual case.
First, the reliability of the measurement of height and
Fig. 2 Group and individual CIs for linear regression of Z on PCL-R
Total Score (x-axis: PCL Total Score; y-axis: Z; curves: Mean, Prediction)
Fig. 3 Group and individual CIs around the prediction of violent
reoffending resulting in return to prison based on PCL-R score
(x-axis: PCL-R total score; y-axis: Probability of recidivism; curves: Mean, Prediction)
8 Detailed descriptions of the results for the additional three outcome
variables can be obtained from the first author.
weight should be substantially higher than the measurement
of either psychopathy or violent behavior. Second,
the prediction is immediate and not degraded by the passage
of time. Third, the relationship between height and
weight is stronger than that between psychopathy and
violent behavior. How well can we predict in the individual
case under these more benign conditions? We carried out a
Monte Carlo simulation9 based on two sets of findings: The
height of males in the UK is normally distributed (Guilford,
Rona, & Chinn, 1992); and the relationship between height
and weight can be assumed to be linear (Hawthorne,
Murdoch, & Womersley, 1979). Figure 7 presents the linear
regression of weight on height with a sample of 2000.
The CI of the regression line is very narrow; however,
when the prediction interval is calculated for an individual
it is very wide. For example, for an individual of average
height (i.e., 1.75 m) his predicted weight would be 81.5 kg
but the prediction interval is between 61.3 and 101.8 kg—a
range of around 40 kg.
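The height and weight simulation can be sketched as follows, using the parameters given in footnote 9: heights with mean 1.75 m and SD 0.07, the line Weight = 82.7 × Height − 63.4, and an error variance of 98.4. This is not the authors' Mathcad code. It uses Python's standard random generator with an arbitrary seed and the normal quantile in place of t (reasonable at n = 2,000), so the output will be close to, but not identical with, the published figures. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate(n: int = 2000, seed: int = 0):
    """Generate n (height, weight) pairs from the footnote 9 parameters."""
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    heights = [rng.gauss(1.75, 0.07) for _ in range(n)]
    weights = [82.7 * h - 63.4 + rng.gauss(0.0, sqrt(98.4)) for h in heights]
    return heights, weights


def intervals_at(x0, xs, ys, level=0.95):
    """Least-squares fit; return (prediction, CI half-width, PI half-width) at x0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    b0 = y_bar - b1 * x_bar
    sigma2_hat = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    q = NormalDist().inv_cdf(0.5 + level / 2.0)  # stands in for t with n - 2 df
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / ss_x
    ci_half = q * sqrt(sigma2_hat * leverage)
    pi_half = q * sqrt(sigma2_hat * (1.0 + leverage))
    return b0 + b1 * x0, ci_half, pi_half


heights, weights = simulate()
predicted, ci_half, pi_half = intervals_at(1.75, heights, weights)
print(f"predicted weight {predicted:.1f} kg")
print(f"CI for the line +/- {ci_half:.2f} kg; prediction interval +/- {pi_half:.1f} kg")
```

The interval for the line is a fraction of a kilogram wide, whereas the prediction interval spans roughly 40 kg in total, mirroring the pattern described above.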
In conclusion, the results demonstrate that PCL-R (and
PCL:SV) scores provide little reliable information about the likelihood that an individual will reoffend violently.10
This is not a problem peculiar to the PCL-R but will reflect
individual variability on any scale (e.g., VRAG, Quinsey
et al., 1998; Static-99, Hanson & Thornton, 1999; COVR,
Monahan et al., 2005).
DISCUSSION
One broad conclusion can be drawn from these two studies:
Clinicians must be extremely cautious in what they claim
Fig. 5 Group and individual CIs around the prediction of any
reoffending based on PCL-R score
(x-axis: PCL-R total score; y-axis: Probability of reconviction; curves: Mean, Prediction)
Fig. 6 Group and individual CIs around the prediction of violence in
the 20 weeks following discharge: Data from MacArthur Study of
Mental Disorder and Violence
(x-axis: PCL:SV; y-axis: Probability of violence; curves: Confidence interval for line, Confidence interval for prediction)
Fig. 4 Probability density function of the probability of a return to
prison on conviction of a violent crime for a PCL-R score of 25
(x-axis: Probability of recidivism; y-axis: Probability density)
9 Following Guilford et al. (1992), the height of adult males in the
UK between 1973 and 1988 was shown to be normally distributed
with a mean of approximately 1.75 m (SD = 0.07). The relationship
between weight and height for men aged 40–59 can be shown to be
linear with the relationships for nonsmokers being Weight = 82.7 ×
Height − 63.4 with r = 0.50 (Hawthorne et al., 1979). A sample of
2,000 pairs of height and weight were generated using Mathcad. For
each subject, a height H was generated from a N(1.75, 0.07²)
distribution. A predicted weight was then calculated; a weight W was
generated by adding an error from a N(0, 98.4) distribution. The
correlation between height and weight for this sample of 2,000 cases
did not differ significantly from that reported in Hawthorne et al.
(1979) (0.50 vs. 0.499). The linear regression of weight on height
together with the CI for the regression line and the prediction interval
for an individual whose height was known were calculated and
presented in Fig. 7. Linear regression rather than logistic regression
was used because both variables are continuous. The basic distinction
between confidence intervals and prediction intervals remains the
same.
10 We do not consider additional issues that would add “noise” into
the system including recalibration of the PCL-R in a new jurisdiction
in terms of the metric equivalence—or otherwise—of the scores
(Cooke et al., 2005), the differences of reliability in clinical practice
against research settings, and variations in the predictive validity of
the PCL-R in a setting where detection and conviction rates may be
different, etc.
regarding diagnoses, numerical scores, and risk potential of
individual clients based merely on a PCL-R score. First,
allocation above and below key diagnostic cut-offs (i.e., 30
or 25 on the PCL-R) is subject to far greater variability than
previously demonstrated. Second, the precision of numerical
scores is less than previously considered. Third, the
clinician can have little confidence in statistical predictions
regarding an individual’s likelihood of future offending
based on a PCL-R score or the scores of violence risk
assessment instruments. Fourth, the concatenation of these
two sources of imprecision—score and predictive—is
likely to further intensify uncertainty about what any one
individual will do in the future.
We emphasize again that these problems are not unique
to the PCL-R: The shape of underlying score distributions
will influence the precision of any scores estimated or any
diagnoses derived. Statistical predictions about individuals
will always be poor (Hart et al., 2007). As noted above, all
psychological tests used in the same way in the forensic
arena may suffer from similar limitations (e.g., VRAG,
Quinsey et al., 1998; Static-99, Hanson & Thornton, 1999;
COVR, Monahan et al., 2005).
Neither are these problems unique to psychology. They
bedevil—as our height and weight example demonstrates—
any attempts to use group data to predict individual outcomes
accurately, whether the outcome is, for example,
heart attacks, cancer, juvenile delinquency, or recidivism
(Copas & Marshall, 1998; Elmore & Fletcher, 2006; Rose,
1992; Scott, 2003). The problem reflects inherent human
variation.
There are perhaps two broad findings to note when it
comes to considering the precision of our estimates of trait
strength (or indeed, diagnosis). First, the use of aggregate
statistics (e.g., Kappa or ICC1) to measure agreement, or to
infer precision of our measurement processes, can obscure
clinically important imprecision at the level of the individual.
Second, untested assumptions (e.g., that scores are
normally distributed) can be misleading when it comes to
estimating the precision of our estimates. The findings from
the Monte Carlo study (Study 1, described above) whether
expressed in terms of diagnostic agreement, score disagreement,
or range of score discrepancies may provide
some explanation for the growing evidence of clinically
significant discrepancies in PCL-R ratings in forensic settings
(Boccaccini, Turner, & Murrie, 2008; Edens &
Petrila, 2006; Murrie et al., 2008; Murrie, Boccaccini,
Turner, et al., in press).
Ethical forensic practice requires practitioners to maximize
their reliability. There are no panaceas but four steps
may assist. The first step is ongoing education and training,
not only regarding the research base of tests and measures
used in forensic practice, but also regarding advanced
clinical skills. Advanced clinical skills would include
techniques for interviewing these challenging individuals to
ensure the collection of relevant information; these
skills would also include techniques for generating case
formulations to ensure the appropriate application of the
information collected (Cooke, 2008, 2009a; Logan &
Johnstone, 2008). The second step is ensuring the availability
of comprehensive file information. The quality of file
information influences both the magnitude and reliability of
scores (Alterman, Cacciola, & Rutherford, 1993). The third
step is the use of multiple raters in high stakes cases; average
ratings should be eschewed, consensus ratings should be
sought. The fourth step is the implementation of audit systems—
including peer review—for the detection of rater
drift (Cooke, 2009b).
Deriving Inferences About Individuals
from Inferences About Groups
We recognize that some of our conclusions may be surprising—
perhaps even controversial—as there is a
widespread acceptance of the prediction paradigm. However,
should we be surprised that we find it difficult to
predict what any individual will do in the future? Consider
just some of the factors that affect predictive accuracy: The
lack of reliability in the predictor and outcome variables;
the relative weakness of the association between these
variables; the inherent variability across individuals—and
within individuals and their circumstances across time—
and the multitudinous causes that result in violent crime.
Perhaps we have become over-confident. Studies of judgment
under uncertainty have indicated human tendencies
both to be overconfident in predictions (Kahneman &
Tversky, 1973) and overly narrow in CI estimates (Alpert
& Raiffa, 1982). Professionals are not immune from these
biases.
Fig. 7 Group and individual CIs around the prediction of weight
from knowledge of an individual’s height: Simulation with n = 2,000
(x-axis: Height (m); y-axis: Weight (kg); curves: Predicted Weight, Mean, Prediction)
The findings we present about predictions in individual
cases reflect a problem of inference that is long recognized
in psychology and other disciplines more generally (e.g.,
Altman & Royston, 2000; Henderson & Keiding, 2005).
Discussing child development, Lewin (1931; as cited in
Richters, 1997) noted “An inference from the average to
the particular case is … impossible.” (Richters, 1997, p.
199). Discussing the medical application of prognostic
models, Altman and Royston (2000) noted “…the distinction
between what is achievable at the group and
individual levels is not well understood” (p. 454). The
problem pertains even under ideal conditions: Henderson
and Keiding (2005), discussing survival time prediction in
relation to virulent non-small-cell lung cancer, indicated
“…the intrinsic statistical variations in life times are so
large that predictions based on statistical models and
indices are of little use for individual patients. This applies
even when the prognostic model is known to be true and
there is no statistical uncertainty in parameter estimation”
(p. 703). Why is this so?
Confidence Intervals and Prediction Intervals
As we indicated in our exegesis of the statistical principles
underlying this problem, the CIs for model parameters are
different from the CIs around the prediction for a new case.
The latter is always substantially wider than the former.
Also, prediction intervals are little influenced by the size of
the sample used to develop the statistical model (Steel
et al., 1997). Collecting bigger samples is not a solution.
We demonstrated this empirically by contrasting the
Scottish sample with the MacArthur sample.
The distinction between CIs and prediction intervals is
made in other areas of assessment, e.g., intelligence testing.
An example may demonstrate the pervasiveness of the
prediction problem when applied to the individual. The
Wechsler Abbreviated Scale of Intelligence (WASI; Psychological
Corporation, 1999) is a brief test of intellectual
functioning, which can be used to predict an individual’s
performance on the “gold standard” Wechsler Intelligence
Scale for Children—Third Edition (WISC-III; Wechsler,
1991). Note that these are very reliable tests, that they
measure within the same conceptual domain (using very
similar procedures), and that the correlation between the two
tests is very high (Full-scale IQ r = .87). An individual
assessed on the WASI with a Full Scale IQ of 70 (90% CI
66–76) will have a predicted WISC-III Full Scale IQ of 70
(90% Prediction Interval 62–87; Psychological Corporation,
1999). Thus, even in these ideal conditions—the same
conceptual domain, highly reliable tests that are highly
correlated—the prediction interval is 2.5 times greater than
the equivalent CI. It is not surprising that the difference
between the CI and the prediction interval is even greater
when the link between PCL-R scores and future violence is
considered. In the Scottish sample, for a mean PCL-R score
the best estimate of the probability of reoffending violently
is 14%, the CI is between 10% and 20%, whereas the prediction
interval is between 0% and 98%. In this case, the
prediction interval is almost ten times the CI.
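The two ratios quoted in this paragraph follow directly from the interval widths, as the arithmetic below shows. The interval limits are those reported in the text; the helper name is hypothetical.

```python
def width(lower: float, upper: float) -> float:
    """Width of an interval given its lower and upper limits."""
    return upper - lower


# WASI Full Scale IQ of 70: 90% CI 66-76, 90% prediction interval 62-87.
wasi_ratio = width(62, 87) / width(66, 76)

# Scottish sample at the mean PCL-R score: CI 10%-20%, prediction interval 0%-98%.
pclr_ratio = width(0, 98) / width(10, 20)

print(wasi_ratio, pclr_ratio)
```

The first ratio is 2.5 and the second is 9.8, which is the basis for describing the PCL-R prediction interval as almost ten times the CI.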
This problem of moving from the general to the specific
is not merely a matter of statistics; it is also a matter of
logic (Hájek & Hall, 2002; Hart et al., 2007). The application
of between-subject information to guide within-subject
causal inference is subject to the logical fallacy of
division (Rorer, 1990). One form of this fallacy rests on
drawing an invalid conclusion about an individual member
of a group based on the collective properties of the group.
For example, it is obviously fallacious to argue that if, in
general, intelligent people earn more than less intelligent
people then Jules, with an IQ of 120, will earn more than
Jim with an IQ of 100. Equally, it is fallacious to argue that
because, in general, people who score highly on the PCL-R
re-offend more than people who do not score highly, Bill
with a PCL-R score of 30 will re-offend more often than
Brian with a PCL-R score of 10. A common defense of the
actuarial approach is founded upon this fallacy. “If it is
alright for life insurance companies, it should be alright for
psychology.” The analogue is false. The actuary makes a
profit by predicting the proportion of insured lives that will
end in a particular time period: The actuary has no interest
in predicting the deaths of particular individuals.
There is a growing awareness in psychology that
between-subject models cannot test or support causal
accounts (e.g., pertaining to earning potential or violence)
that are valid at the individual level (Borsboom, Mellenbergh,
& van Heerden, 2003; Richters, 1997). With a
between-subjects design it is possible to argue legitimately
that within-population differences in psychopathy can cause
within-population differences in violent reoffending.
However, this position cannot be defended at the level of the
individual; this is because there is an unspoken assumption
that the mechanisms that operate at the level of the individual
also explain variations between individuals. Richters
(1997) clarified the basis of the problem:
The extraordinary human capacity for equifinal and
multifinal functioning, however, render the structural
homogeneity assumption untenable. Very similar
patterns of overt functioning may be caused by
qualitatively differing underlying structures both
within the same individual at different points in time,
and across different individuals at the same time
(equifinality) (pp. 206–207).
Individuals are violent for different reasons: Any one
individual may be violent for different reasons on different
occasions.
In summary, on the basis of empirical findings, statistical
theory, and logic it is clear that predictions of future
offending cannot be achieved, with any degree of confidence,
in the individual case.
CONCLUSION
We emphasize again that the problems identified in this
article are not unique to the PCL-R. In some sense our
ability to demonstrate these problems with the PCL-R is a
reflection of the success of this test: It is used extensively
and thus large datasets are available; it has been subject to
considerable psychometric evaluation. Other tools used in
forensic settings will be subject to similar limitations. For
example, the precision with which individuals can be
allocated to risk “bins” by actuarial risk tools is influenced
by the reliability of scoring and the underlying distribution
of scores. Faigman (2007) argued that psychology has
ignored the problem of translating scientific research into
findings that help triers of fact; he indicates that psychology
has to take on the “monumental intellectual challenge”
(p. 313) of making the inferential leap between population-level
findings and individual-level findings relevant to
courts. We ignore this challenge at our peril. Tentative
steps toward meeting this challenge are discussed elsewhere
(Cooke, 2009b).
This article is not without limitations. First, it is based on
males. We know little about the reliability of the diagnosis
and predictive utility in females or, indeed, whether the
instrument functions adequately in females or other populations
(Forouzan & Cooke, 2005; Verona & Vitale, 2006).
Second, the study is focused on adults. The potential for life-changing
decisions may be even greater when related procedures
are applied to adolescents; less information is
generally available to make a diagnosis in adolescents
(Edens & Petrila, 2006). The methods we used are explicated
in detail in this article so that others can apply them to
their own—hopefully diverse—datasets.
REFERENCES
Alpert, M., & Raiffa, H. (1982). A progress report on the training of
probability assessors. In D. Kahneman, P. Slovic, & A. Tversky
(Eds.), Judgment under uncertainty: Heuristics and biases (pp.
294–305). New York: Cambridge University Press.
Alterman, A. I., Cacciola, J. S., & Rutherford, M. J. (1993). Reliability
of the Revised Psychopathy Checklist in substance abuse patients.
Psychological Assessment, 5, 442–448. doi:10.1037/1040-3590.5.4.442.
Altman, D. G., & Royston, P. (2000). What do we mean by validating
a prognostic model? Statistics in Medicine, 19, 453–473.
doi:10.1002/(SICI)1097-0258(20000229)19:4<453::AID-SIM350>3.0.CO;2-5.
American Educational Research Association/American Psychological
Association. (1999). Standards for educational and psychological
testing. Washington, DC: American Educational Research
Association.
Boccaccini, M. T., Turner, D. B., & Murrie, D. C. (2008). Do some
evaluators report consistently higher or lower PCL-R scores than
others? Findings from a statewide sample of sexually violent
predator evaluations. Psychology, Public Policy, and Law, 14,
262–283. doi:10.1037/a0014523.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The
theoretical status of latent variables. Psychological Review, 110,
203–219. doi:10.1037/0033-295X.110.2.203.
Bradfield, R. B., Huntzickler, P. B., & Fruehan, G. J. (1970). Errors of
group regression for prediction of individual energy expenditure.
The American Journal of Clinical Nutrition, 23, 1015–1016.
Colditz, G. A. (2001). Cancer culture; epidemics, human behavior,
and the dubious search for new risk factors. American Journal of
Public Health, 91, 357–359. doi:10.2105/AJPH.91.3.357.
Cooke, D. J. (2008). Psychopathy as an important forensic construct:
Past, present and future. In D. Canter & R. Zukauskiene (Eds.),
Psychology, crime & law. New horizons—International perspectives.
Aldershot: Ashgate.
Cooke, D. J. (2009a). Psychopathy. In E. A. Campbell & J. Brown
(Eds.), Cambridge handbook of forensic psychology. Cambridge:
Cambridge University Press.
Cooke, D. J. (2009b). Strengths and limitations of the Psychopathy
Checklist Revised (PCL-R) in courts and other tribunals (Paper
under preparation).
Cooke, D. J., & Michie, C. (1997). An Item Response Theory
evaluation of Hare’s Psychopathy Checklist. Psychological
Assessment, 9, 2–13. doi:10.1037/1040-3590.9.1.3.
Cooke, D. J., Michie, C., & Hart, S. D. (2006). Facets of clinical
psychopathy: Towards clearer measurement. In C. J. Patrick
(Ed.), Handbook of psychopathy (pp. 91–106). New York: The
Guilford Press.
Cooke, D. J., Michie, C., Hart, S. D., & Clark, D. (2005). Assessing
psychopathy in the United Kingdom: Concerns about cross-cultural
generalisability. The British Journal of Psychiatry, 186,
339–345. doi:10.1192/bjp.186.4.335.
Cooke, D. J., Michie, C., & Ryan, J. (2001). Evaluating risk for
violence: A preliminary study of the HCR-20, PCL-R and VRAG
in a Scottish prison sample. Edinburgh: Scotland Office.
Copas, J., & Marshall, P. (1998). The offender group reconviction
scale: A statistical reconviction score for use by probation
officers. Applied Statistics, 47, 159–171. doi:10.1111/1467-9876.
00104.
DeMatteo, D., & Edens, J. F. (2006). The role and relevance of the
Psychopathy Checklist-Revised in court. A case law survey of
U.S. courts (1991–2004). Psychology, Public Policy, and Law,
12, 214–241. doi:10.1037/1076-8971.12.2.214.
Douglas, K. S., Vincent, G. M., & Edens, J. F. (2006). Risk for
criminal recidivism: The role of psychopathy. In C. J. Patrick
(Ed.), Handbook of psychopathy (pp. 533–554). New York: The
Guilford Press.
DSPD Programme. (2005). Dangerous and Severe Personality
Disorder (DSPD) High Secure Services for Men. London: DSPD
Programme, Department of Health, Home Office, HM Prison
Service.
Edens, J. F., & Petrila, J. (2006). Legal and ethical issues in the
assessment and treatment of psychopathy. In C. J. Patrick (Ed.),
Handbook of psychopathy (pp. 573–588). New York: The Guilford
Press.
Elmore, J. G., & Fletcher, S. W. (2006). The risk of cancer risk
prediction: “What is my risk of getting breast cancer?” Journal of
the National Cancer Institute, 98, 1673–1675.
Faigman, D. L. (2007). The limits of science in the courtroom. In E.
Borgida & S. T. Fiske (Eds.), Beyond common sense: Psychological
science in the courtroom (pp. 303–313). Oxford:
Blackwell.
Fitch, W. L., & Ortega, R. J. (2000). Law and the confinement of
psychopaths. Behavioral Sciences & the Law, 18, 663–678.
doi:10.1002/1099-0798(200010)18:5<663::AID-BSL408>3.0.CO;2-V.
Fleiss, J. L. (1981). Statistical methods for rates and proportions.
New York: Wiley.
Forouzan, E., & Cooke, D. J. (2005). Figuring out la femme fatale:
Conceptual and assessment issues concerning psychopathy in
females. Behavioral Sciences and the Law, 23, 765–778.
Gail, M. H., & Benichou, J. (2000). Encyclopedia of epidemiological
methods. Chichester: Wiley.
Guilford, M. C., Rona, R. J., & Chinn, S. (1992). Trends in body mass
index in young adults in England and Scotland from 1973 to 1988.
Journal of Epidemiology and Community Health, 46, 187–190.
doi:10.1136/jech.46.3.187.
Hájek, A., & Hall, N. (2002). Induction and probability. In P. Machamer
& M. Silberstein (Eds.), Blackwell guide to the philosophy of
science (pp. 149–172). Oxford: Blackwell.
Hanson, R. K., & Thornton, D. M. (1999). Static 99: Improving
actuarial risk assessments for sex offenders. Ottawa: Public Works
and Government Services Canada.
Hare, R. D. (1991). The Hare Psychopathy Checklist—Revised (1st
ed.). Toronto: Multi-Health Systems.
Hare, R. D. (1993). Without conscience: The disturbing world of the
psychopaths among us (1st ed.). New York: Pocket Books.
Hare, R. D. (1998). The Hare PCL-R: Some issues concerning its use
and misuse. Legal and Criminological Psychology, 3, 101–119.
Hare, R. D. (2003). The Hare Psychopathy Checklist—Revised (2nd ed.).
Toronto: Multi-Health Systems.
Hart, S. D. (1998). The role of psychopathy in assessing risk for
violence: Conceptual and methodological issues. Legal and
Criminological Psychology, 3, 121–137.
Hart, S. D. (2001). Forensic issues. In W. J. Livesley (Ed.), Handbook
of personality disorders: Theory, research, and treatment (pp.
555–569). New York: The Guilford Press.
Hart, S. D., Cox, D. N., & Hare, R. D. (1995). The Hare Psychopathy
Checklist: Screening version (1st ed.). Toronto: Multi-Health
Systems.
Hart, S. D., & Hare, R. D. (1997). Psychopathy: Assessment and
association with criminal conduct. In D. M. Stoff, J. Breiling, &
J. D. Maser (Eds.), Handbook of antisocial behavior (pp. 22–35).
New York: Wiley.
Hart, S. D., Michie, C., & Cooke, D. J. (2007). The precision of actuarial
risk assessment instruments: Evaluating the “Margins of Error” of
group versus individual predictions of violence. The British
Journal of Psychiatry, 190(Suppl 49), 60–65. doi:10.1192/bjp.
190.5.s60.
Hawthorne, V. M., Murdoch, R. M., & Womersley, J. (1979). Body
weight of men and women aged 40–64 years from an urban area
in the West of Scotland. Community Medicine, 1, 229–235.
Hemphill, J. F., Hare, R. D., & Wong, S. (1998). Psychopathy and
recidivism: A review. Legal and Criminological Psychology, 3,
139–170.
Hemphill, J. F., & Hart, S. D. (2002). Motivating the unmotivated:
Psychopathy, treatment, and change. In M. McMurran (Ed.),
Motivating offenders to change: A guide to enhancing engagement
in therapy (pp. 193–220). Chichester: Wiley.
Henderson, R., Jones, M., & Stare, J. (2001). Accuracy of point
predictions in survival analysis. Statistics in Medicine, 20, 3083–
3096. doi:10.1002/sim.913.
Henderson, R., & Keiding, N. (2005). Individual survival time
prediction using statistical models. Journal of Medical Ethics,
31, 703–706. doi:10.1136/jme.2005.012427.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction.
Psychological Review, 80, 237–251. doi:10.1037/h0034747.
Kennaway, R. (1998). Population statistics cannot be used for
reliable individual prediction. Retrieved October 12, 2006, from
http://citeseer.ist.psu.edu/328224.html.
Leistico, A. R., Salekin, R. T., DeCoster, J., & Rogers, R. (2008). A
large-scale meta-analysis relating the Hare measures of psychopathy
to antisocial conduct. Law and Human Behavior, 32,
28–45. doi:10.1007/s10979-007-9096-6.
Logan, C., & Johnstone, L. (2008). Personality disorders: Clinical and
risk formulations (Paper under review).
Lyon, D., & Ogloff, J. R. P. (2000). Legal and ethical issues in
psychopathy assessment. In C. B. Gacono (Ed.), The clinical and
forensic assessment of psychopathy (pp. 139–173). Mahwah, NJ:
Lawrence Erlbaum Associates.
Maden, A., & Tyrer, P. (2003). Dangerous and severe personality
disorders: A new personality concept from the United Kingdom.
Journal of Personality Disorders, 17, 489–496. doi:10.1521/pedi.
17.6.489.25356.
MATHCAD.13. (2005). Mathcad 13 user’s guide. Cambridge, MA:
Mathsoft Engineering and Education, Inc.
Michie, C., & Cooke, D. J. (2006). The structure of violent behavior: A
hierarchical model. Criminal Justice and Behavior, 33, 706–737.
doi:10.1177/0093854806288941.
Monahan, J., Steadman, H., Robbins, P. C., Appelbaum, P., Banks, S.,
Grisso, T., et al. (2005). An actuarial model of violence. Psychiatric
Services, 56, 810–815. doi:10.1176/appi.ps.56.7.810.
Monahan, J., Steadman, H., Silver, E., Appelbaum, P., Robbins, P. C.,
Mulvey, E. P., et al. (2001). Rethinking risk assessment: The
MacArthur study of mental disorder and violence (1st ed.). New
York: Oxford University Press.
Mooney, C. Z. (1997). Monte Carlo simulation. Thousand Oaks, CA:
Sage.
Murrie, D. C., Boccaccini, M. T., Johnson, J. T., & Janke, C. (2008).
Does interrater (dis)agreement on Psychopathy Checklist scores in
Sexually Violent Predator trials suggest partisan allegiance in
forensic evaluations? Law and Human Behavior, 32(4), 352–362.
doi:10.1007/s10979-007-9097-5.
Murrie, D. C., Boccaccini, M. T., Turner, D., Meeks, M., Woods, C.,
& Tussey, C. (in press). Rater (dis)agreement on risk assessment
measures in sexually violent predator proceedings: Evidence of
adversarial allegiance in forensic evaluation. Psychology, Public
Policy, and Law.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd
ed.). New York: McGraw-Hill.
Psychological Corporation. (1999). Wechsler Abbreviated Scale of
Intelligence (WASI) manual. San Antonio, TX: Psychological
Corporation.
Quinsey, V. L., Harris, G. T., Rice, M. E., & Cormier, C. A. (1998).
Violent offenders: Appraising and managing risk (1st ed.).
Washington, DC: American Psychological Association.
Richters, J. E. (1997). The Hubble hypothesis and the developmentalist’s
dilemma. Development and Psychopathology, 9, 193–229.
doi:10.1017/S0954579497002022.
Robert, C. P. (2004). Monte Carlo statistical methods. New York:
Springer-Verlag.
Rockhill, B. (2001). The privatization of risk. American Journal of
Public Health, 91, 365–368. doi:10.2105/AJPH.91.3.365.
Rockhill, B., Kawachi, I., & Colditz, G. A. (2000). Individual risk
prediction and population-wide disease prevention. Epidemiologic
Reviews, 22, 176–180.
Rorer, L. (1990). Personality assessment: A conceptual survey. In L.
A. Pervin (Ed.), Handbook of personality: Theory and research
(pp. 693–720). New York: The Guilford Press.
Rose, G. (1992). The strategy of preventive medicine. Oxford:
Oxford Medical Publications.
Salekin, R. T., Rogers, R., & Sewell, K. W. (1996). A review and
meta-analysis of the Psychopathy Checklist and Psychopathy
Checklist-Revised: Predictive validity of dangerousness. Clinical
Psychology: Science and Practice, 3, 203–215.
Scott, K. G. (2003). Commentary: Individual risk prediction, individual
risk, and population risk. Journal of Clinical Child and Adolescent
Psychology, 32, 243–245. doi:10.1207/S15374424JCCP3202_9.
Steel, R. G. D., Torrie, J. H., & Dickey, D. A. (1997). Principles and
procedures of statistics: A biometrical approach. New York:
McGraw Hill.
Tam, C. C., & Lopman, B. A. (2003). Determinism versus stochasticism:
In support of long coffee breaks. Journal of Epidemiology and
Community Health, 57, 478. doi:10.1136/jech.57.7.477.
Verona, E., & Vitale, J. (2006). Psychopathy in women: Assessment,
manifestations and etiology. In C. J. Patrick (Ed.), Handbook of
psychopathy (pp. 415–436). New York: The Guilford Press.
Wald, N. J., Hackshaw, A. K., & Frost, C. D. (1999). When can a risk
factor be used as a worthwhile screening test? British Medical
Journal, 319, 1562–1565.
Walsh, T., & Walsh, Z. (2006). The evidentiary introduction of the
Psychopathy Checklist-Revised assessed psychopathy in U.S.
courts: Extent and appropriateness. Law and Human Behavior,
30, 493–507. doi:10.1007/s10979-006-9042-z.
Walters, G. D. (2003). Predicting criminal justice outcomes with the
Psychopathy Checklist and Lifestyle Criminality Screening
Form: A meta-analytic comparison. Behavioral Sciences & the
Law, 21, 89–102. doi:10.1002/bsl.519.
Wechsler, D. (1991). Wechsler Intelligence Scale for Children (3rd
ed.). San Antonio, TX: The Psychological Corporation.
Zinger, I., & Forth, A. E. (1998). Psychopathy and Canadian criminal
proceedings: The potential for human rights abuses. Canadian
Journal of Criminology, 40, 237–276.