Slides - Berkeley Law

```
Coding and Intercoder Reliability
Su Li
School of Law, U.C. Berkeley
2/12/2015
Outline
• Basics of data coding
• What’s intercoder reliability?
• Why does it matter?
• How to measure and report intercoder reliability?
• How to improve intercoder reliability?
• References
Data Coding Basics
• Start from a codebook
• Exhaustive and mutually exclusive value
options for each variable
• Use multiple variables to code overlapping
values or multiple values for one observation
Example Codebook (White collar
lawyer project)
highest degree, if not law degree)
999 if the information is not available
Note: type in the applicable year (YYYY)
• c_practice_area: Practice area
1. White collar (includes white collar defense, white collar crime, white collar
litigation, etc.)
2. Government or corporate investigations
3. White collar and government/corporate investigations (if the practice area
is described this way)
4. Criminal defense (if the practice area is described this way)
Note: choose from one of the above 4 choices and type in the number. If the
practice area has a different title, type in the title.
• See var14-18 in the WC project codebook
Input data in Stata
Label data in Stata
Recode data in Stata
What’s intercoder reliability
• Intercoder reliability is the widely used term for the extent to which
independent coders evaluate a characteristic of a message or
artifact and reach the same conclusion. It is also known as intercoder
agreement (Tinsley & Weiss, 2000).
• Intercoder reliability is not the same as a correlation coefficient,
which measures the degree to which "ratings of different
judges are the same when expressed as deviations from their
means."
• Rather, it measures only "the extent to which the different judges
tend to assign exactly the same rating to each object" (Tinsley &
Weiss, 2000, p. 98).
http://astro.temple.edu/~lombard/reliability/
Why does it matter?
• Coding may involve coders’ judgments, which
vary among individuals.
• The quality of research depends on the
consistency of coding judgments.
• Monitoring intercoder reliability also helps
control coding accuracy.
• Practically, it makes division of labor among
multiple coders possible.
Mathematical measures that are commonly
reported on intercoder reliability
• Popping (1988) identified 39 different "agreement indices" for
coding nominal categories.
• Commonly used ones:
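– Percent agreement: PA0 = (number of agreements) / n
– Scott's pi (π): π = (PA0 - PAe) / (1 - PAe), where PAe = Sigma(p_i^2) and
p_i is the proportion of all codes (both coders pooled) that fall in category i
– Cohen's kappa (κ): κ = (PA0 - PAe) / (1 - PAe), where
PAe = (1/n^2) * Sigma(pm_i) and pm_i is the product of the two coders'
marginal counts for category i
– Krippendorff's alpha (α): (Krippendorff's Alpha 3.12a software)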
• There is no consensus on a single, "best" one.
• Percent agreement is widely used but misleading: it does not correct for
chance agreement and so tends to overestimate reliability.
• Cohen’s kappa has been criticized but is still the most frequently used.
• Hand calculations:
http://astro.temple.edu/~lombard/reliability
Example: binary variable coding results of two coders

                     coder1
coder2         0         1     total
0             50         3        53
          94.34%     5.66%    89.83%
1              4         2         6
          66.67%    33.33%    10.17%
total         54         5        59
          91.53%     8.47%      100%
PA0 = (50+2)/59 = 52/59 ≈ 0.881; n = 59.
PAe (in Scott’s π) = ((53+54)/(2*59))^2 + ((6+5)/(2*59))^2 ≈ 0.831, so π ≈ 0.298.
PAe (in Cohen’s κ) = (53*54 + 6*5)/(59*59) ≈ 0.831, so κ ≈ 0.299.
Use SPSS to calculate Cohen’s Kappa
CROSSTABS
/TABLES=var1_coder2 BY var1_coder1
/FORMAT=AVALUE TABLES
/STATISTICS=KAPPA
/CELLS=COUNT
/COUNT ROUND CELL.
Use Stata to calculate Cohen’s Kappa
• kappa varlist (each variable holds the number of coders who assigned
a given value to the observation)
• kap coder1 coder2 … (each variable holds one coder's codes)
(see stata demo)
According to Landis and Koch (1977a, 165)
• below 0.0 Poor
• 0.00 – 0.20 Slight
• 0.21 – 0.40 Fair
• 0.41 – 0.60 Moderate
• 0.61 – 0.80 Substantial
• 0.81 – 1.00 Almost perfect
obs   coder 1   coder 2   coder 3   coder 4
 1       1         1         1         1
 2       1         1         1         1
 3       1         3         2         1
 4       2         2         2         2
 5       2         2         1         2
 6       2         1         1         3
 7       2         4         1         4
 8       3         3         3         3
 9       3         5         3         3
10       3         4         3         5
11       3         2         4         5
12       3         1         4         4
13       4         4         4         4
14       4         4         4         4
15       4         4         4         2
16       4         2         5         1
17       4         3         5         3
18       5         5         5         5
19       5         1         5         2
20       5         1         1         3
1. Coder 1 and coder 2; coder 1 and coder 3: differences are random.
2. Coder 1 and coder 3: differences are systematic (e.g., coder 3 always
codes 2 as 1 and 3 as 4, compared with coder 2).
Acceptance standard: Neuendorf
(2002)
• No coherent standard. Some rules of thumb:
– “Coefficients of .90 or greater would be acceptable to all,
– .80 or greater would be acceptable in most situations,
– Below .8, there exists great disagreement” (p. 145).
– The criterion of .70 is often used for exploratory research.
– More liberal criteria are usually used for the indices
known to be more conservative (i.e., Cohen’s kappa and
Scott’s pi).
Hughes, Marie Adele, Garrett Dennis E. (1990)
• Percent agreement
– Does not correct for chance agreement
– Recommend to use or not: NO
• Scott's pi (π)
– Acceptance level: 0.6
– Correction and systematic coding error problem
– Recommend to use or not: Acceptable
• Cohen's kappa (κ)
– Acceptance level (Landis & Koch 1977): <0.00 Poor; 0.00-0.20 Slight;
0.21-0.40 Fair; 0.41-0.60 Moderate; 0.61-0.80 Substantial;
0.81-1.00 Almost Perfect
– Correction and systematic coding error problem
– Recommend to use or not: Acceptable (most extensively discussed)
• Krippendorff's alpha (α)
– Correction and systematic coding error problem
– Recommend to use or not: Acceptable
• Pearson's correlation
– Does not consider systematic coding bias
– Recommend to use or not: NO
How to improve intercoder reliability
(Lombard et al. 2002)
In research design:
1. Assess reliability informally during coder training (detailed instructions, close monitoring, etc.).
2. Assess reliability formally in a pilot test.
3. Assess reliability formally during coding of the full sample.
4. Select and follow an appropriate procedure for incorporating the coding of the reliability
sample into the coding of the full sample (e.g., master coder quality control).
In results report:
1. Select one or more appropriate indices.
2. Obtain the necessary tools to calculate the index or indices selected.
3. Select an appropriate minimum acceptable level of reliability for the index or indices to be
used.
4. Report intercoder reliability in a careful, clear, and detailed manner in all research reports.
http://astro.temple.edu/~lombard/reliability/
References
• http://astro.temple.edu/~lombard/reliability/
• Lombard, M., Snyder-Duch, J., & Bracken, C. C. (2002). Content analysis in mass
communication: Assessment and reporting of intercoder reliability. Human
Communication Research, 28, 587-604.
• Tinsley, H. E. A., & Weiss, D. J. (2000). Interrater reliability and agreement. In H. E.
A. Tinsley & S. D. Brown (Eds.), Handbook of Applied Multivariate Statistics and
Mathematical Modeling (pp. 95-124). San Diego, CA: Academic Press.
• Popping, R. (1988). On agreement indices for nominal data. In W. E. Saris &
I. N. Gallhofer (Eds.), Sociometric Research: Volume 1, Data Collection and
Scaling (pp. 90-105). New York: St. Martin's Press.
• Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for
categorical data. Biometrics, 33, 159-174.
• Hughes, M. A., & Garrett, D. E. (1990). Intercoder reliability estimation
approaches in marketing: A generalizability theory framework for quantitative
data. Journal of Marketing Research, 27(2), 185-195.
```
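The hand calculation for the two-coder binary example can be checked with a short script. This is a minimal Python sketch, not part of the original slides: the function names (`percent_agreement`, `scotts_pi`, `cohens_kappa`) and the `cells` dictionary are illustrative assumptions, and the formulas follow the standard definitions (pooled category proportions for Scott's pi, products of the two coders' marginal counts for Cohen's kappa).

```python
from collections import Counter


def percent_agreement(a, b):
    """PA0: share of observations on which the two coders agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def scotts_pi(a, b):
    """Scott's pi: expected agreement from pooled category proportions."""
    n = len(a)
    pa0 = percent_agreement(a, b)
    pooled = Counter(a) + Counter(b)          # 2n codes in total
    pae = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (pa0 - pae) / (1 - pae)            # undefined if pae == 1


def cohens_kappa(a, b):
    """Cohen's kappa: expected agreement from each coder's own marginals."""
    n = len(a)
    pa0 = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pae = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / n ** 2
    return (pa0 - pae) / (1 - pae)            # undefined if pae == 1


# Rebuild the 59 paired codes from the 2x2 table: key = (coder2, coder1).
cells = {(0, 0): 50, (0, 1): 3, (1, 0): 4, (1, 1): 2}
coder1, coder2 = [], []
for (c2, c1), count in cells.items():
    coder2 += [c2] * count
    coder1 += [c1] * count

print(round(percent_agreement(coder1, coder2), 3))  # 0.881
print(round(scotts_pi(coder1, coder2), 3))          # 0.298
print(round(cohens_kappa(coder1, coder2), 3))       # 0.299
```

Percent agreement looks high (0.881) because both coders assign 0 to most observations; once chance agreement is removed, both pi and kappa fall to about 0.30, which is only "Fair" on the Landis and Koch scale.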