Inter-rater Reliability of Clinical Ratings: A Brief Primer on Kappa
Daniel H. Mathalon, Ph.D., M.D.
Department of Psychiatry, Yale University School of Medicine

Inter-rater Reliability of Clinical Interview Based Measures

Ratings of clinical severity for specific symptom domains (e.g., PANSS, BPRS, SAPS, SANS)
Continuous scales. Use intraclass correlations to assess inter-rater reliability.

Diagnostic Assessment
Categorical data / nominal scale data. How do we quantify reliability between diagnosticians? Percent Agreement, Chi-Square, Kappa.

Agreement table: Rater 1 by Rater 2, one row and one column per category.
Two raters classify $n$ cases into $k$ mutually exclusive categories.
$n_{ij}$ = number of cases falling into cell $(i, j)$ = frequency of joint event $ij$
$n$ = total number of cases
$p_{ij} = n_{ij} / n$ = proportion of cases falling into a particular cell

Reliability by Percentage Agreement $= \sum_i p_{ii} = \frac{1}{n} \sum_i n_{ii}$

Percent Agreement Fails to Consider Agreement by Chance
Assume that two raters whose judgments are completely independent (i.e., not influenced by the true diagnostic status of the patient) each diagnose 90% of cases as having schizophrenia and 10% of cases as not having schizophrenia (i.e., Other).
The expected agreement by chance for each category is obtained by multiplying the marginal probabilities together.
Schizophrenia: $.90 \times .90 = .81$
Other: $.10 \times .10 = .01$
Proportion agreement $= .81 + .01 = .82$
The raters can reach a Percentage Agreement of 82% strictly by chance.

Chi-Square Test of Association as Proposed Solution

Can perform a Chi-Square Test of Association to test the null hypothesis that the two raters' judgments are independent.
To reject independence, show that observed agreement departs from what would be expected by chance alone.
$\chi^2 = \sum_{\text{cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
Problem: in this slide's example, there is a perfect association between the raters with zero agreement.
Chi-Square is a test of association, not agreement. It is sensitive to any departure from chance agreement, even when the dependency between the raters' judgments involves perfect non-agreement.
So, we cannot use the Chi-Square Test to assess agreement between raters.

Kappa Coefficient (Cohen, 1960)

Agreement table: Rater 1 by Rater 2. Chance products $p_{i.} \times p_{.i}$ for the diagonal cells: .39, .075, .01.
High reliability requires that the frequencies along the diagonal be greater than chance and the off-diagonal frequencies be less than chance.
Use marginal frequencies/probabilities to estimate chance agreement.
Proportion agreement observed: $p_o = \sum_i p_{ii} = \frac{1}{n} \sum_i n_{ii}$
Proportion agreement expected by chance: $p_c = \sum_i p_{i.} \times p_{.i}$
$K = \frac{p_o - p_c}{1 - p_c}$

Interpretations of Kappa

$K = P(\text{agreement} \mid \text{no agreement by chance})$
$1 - p_c = 1 - .475 = .525$ of cases are those with no agreement by chance.
$p_o - p_c = .7 - .475 = .225$ of cases are the non-chance agreement cases where the observers agreed.
So $K = .225 / .525 \approx .43$.
Kappa is the probability that judges will agree given no agreement by chance.
Can test $H_0$ that Kappa = 0. With large samples Kappa is approximately normally distributed, so its significance can be tested using the normal distribution.
Can construct confidence intervals for Kappa.

Weighted Kappa Coefficient

Can assign weights, $w_{ij}$, to classification errors according to their seriousness, using ratio scale weights.
$K_w = \frac{p_{o(w)} - p_{c(w)}}{1 - p_{c(w)}}$, where $p_{o(w)}$ and $p_{c(w)}$ are the weighted observed and chance agreement proportions.

Kappa Rules of Thumb

$K \geq .75$ is considered excellent agreement.
$K \leq .46$ is considered poor agreement.

Weighted Kappa and the ICC

Weighted Kappa is an intraclass correlation coefficient (except for a factor of $1/n$) when the weights have the following property:
$w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2}$

Problems with Kappa

Affected by base rates of diagnoses.
Can't easily compare across studies that have different base rates, either in the population or in the reliability study.
Chance agreement is a problem? When the null hypothesis of rater independence is not met (which is most of the time), the estimate of chance agreement is inaccurate and possibly inappropriate.
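
As an illustration of the quantities defined in this primer (percent agreement, chance agreement, Kappa, and weighted Kappa), here is a minimal Python sketch. The function names and the count tables are hypothetical and do not come from the slides: the 2 x 2 table assumes n = 100 cases and independent raters with the 90% / 10% marginals described above, and the second table is an invented case of perfect non-agreement. The weighted form assumes agreement weights with a maximum of 1.

```python
def quadratic_weights(k):
    """Agreement weights w_ij = 1 - (i - j)^2 / (k - 1)^2 for k >= 2 ordered categories."""
    return [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]


def kappa_from_proportions(po, pc):
    """K = (po - pc) / (1 - pc); undefined when chance agreement is total."""
    if pc == 1:
        raise ValueError("kappa is undefined when pc == 1")
    return (po - pc) / (1 - pc)


def agreement_stats(table, weights=None):
    """Return (po, pc, K) for a square table of counts.

    Rows are Rater 1's categories, columns are Rater 2's categories.
    With weights=None, identity weights are used, so po is the plain
    percent agreement and K is the unweighted Kappa.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Marginal proportions p_i. (rows) and p_.j (columns)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    if weights is None:
        weights = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    # Weighted observed agreement and weighted chance agreement
    po = sum(weights[i][j] * table[i][j] / n for i in range(k) for j in range(k))
    pc = sum(weights[i][j] * row_p[i] * col_p[j] for i in range(k) for j in range(k))
    return po, pc, kappa_from_proportions(po, pc)


# Hypothetical counts: independent raters, 90% / 10% marginals, n = 100 (assumed)
independent = [[81, 9], [9, 1]]
po_ind, pc_ind, k_ind = agreement_stats(independent)
print("independent raters:", round(po_ind, 2), round(pc_ind, 2), abs(round(k_ind, 2)))

# Hypothetical counts: perfect association, zero agreement
opposite = [[0, 50], [50, 0]]
po_opp, pc_opp, k_opp = agreement_stats(opposite)
print("perfect non-agreement:", po_opp, pc_opp, k_opp)

# The slide's own proportions: po = .7, pc = .475
k_slide = kappa_from_proportions(0.7, 0.475)
print("slide example:", round(k_slide, 3))

# Weighted Kappa with quadratic weights on a hypothetical 3-category table
perfect3 = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
_, _, kw_perfect = agreement_stats(perfect3, quadratic_weights(3))
print("weighted kappa, perfect agreement:", round(kw_perfect, 3))
```

The first table shows high percent agreement alongside a Kappa near zero, and the second shows a Kappa of -1 for raters who are perfectly associated but never agree, which is the case the Chi-Square test cannot distinguish from agreement.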