### Latent Class Modeling in Health Economics

```Some Applications of
Latent Class Modeling
In Health Economics
William Greene
Department of Economics
New York University

Outline
• Theory: Finite Mixture and Latent Class Models
• Applications
o Obesity
o Self Assessed Health
o Efficiency of Nursing Homes
Latent Classes
• A population contains a mixture of individuals
of different types (classes)
• Common form of the data generating
mechanism within the classes
• Observed outcome y is governed by the
common process F(y|x,θ_j)
• Classes are distinguished by the parameters, θ_j.
A Latent Class Hurdle NB2 Model
• Analysis of ECHP panel data (1994-2001)
• Two class Latent Class Model
o Typical in health economics applications
• Hurdle model for physician visits
o Poisson hurdle for participation and negative
binomial intensity given participation
o Contrast to a negative binomial model
How Finite Mixture Models Work
Find the ‘Best’ Fitting Mixture of Two Normal Densities
LogL = \sum_{i=1}^{1000} \log \sum_{j=1}^{2} \pi_j \frac{1}{\sigma_j} \phi\left( \frac{y_i - \mu_j}{\sigma_j} \right)
Maximum Likelihood Estimates
          Class 1                  Class 2
     Estimate   Std. Error    Estimate   Std. Error
μ    7.05737    .77151        3.25966    .09824
σ    3.79628    .25395        1.81941    .10858
π    .28547     .05953        .71453     .05953

\hat{F}(y) = .28547 \frac{1}{3.79628} \phi\left( \frac{y - 7.05737}{3.79628} \right) + .71453 \frac{1}{1.81941} \phi\left( \frac{y - 3.25966}{1.81941} \right)



[Figure: the estimated mixture (approximation) plotted against the actual distribution. Mixing probabilities .715 and .285.]
A Practical Distinction
• Finite Mixture (Discrete Mixture):
o Functional form strategy
o Component densities have no meaning
o Mixing probabilities have no meaning
o There is no question of “class membership”
o The number of classes is uninteresting – enough to get a good fit
• Latent Class:
o Mixture of subpopulations
o Component densities are believed to be definable “groups”
(Low Users and High Users in Bago d’Uva and Jones application)
o The classification problem is interesting – who is in which class?
o Posterior probabilities, P(class|y,x) have meaning
o Question of the number of classes has content in the context of the
analysis
Why Make the Distinction?
• Same estimation strategy
• Same estimation results
• Extending the latent class model
o Allows a rich, flexible model specification for
behavior
o The classes may be governed by different processes
Antecedents
• Pearson’s 1894 study of crabs in Naples – finite mixture
of two normals – seeking evidence of two subspecies.
• Some of the extensions I will note here have already
been (implicitly) employed in earlier literature.
o Different underlying processes
o Heterogeneous class probabilities
o Correlations of unobservables in class probabilities with
unobservables in structural (within class) models
• One has not been employed, and is not (yet) widespread
o Cross class restrictions implied by the theory of the model
Switching Regressions
• Mixture of normals with heterogeneous mean
o y ~ N(b0′x0, σ0^2) if d=0,
o y ~ N(b1′x1, σ1^2) if d=1, P(d=1) = Φ(c′z).
d is unobserved (latent switching).
• Becomes a latent class model when regime 0
is a demand function and regime 1 is a supply
function, d=0 if excess supply
• The two regression equations may involve
different variables – a true latent class model
Endogenous Switching (ca.1980)
Regime 0: y_i = x_{i0}'\beta_0 + \varepsilon_{i0}
Regime 1: y_i = x_{i1}'\beta_1 + \varepsilon_{i1}
Regime Switch: d* = z_i'\gamma + u, d = 1[d* > 0]
Regime 0 governs if d = 0, Probability = 1 - \Phi(z_i'\gamma)
Regime 1 governs if d = 1, Probability = \Phi(z_i'\gamma)
Endogenous Switching:
\begin{pmatrix} \varepsilon_{i0} \\ \varepsilon_{i1} \\ u \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_0^2 & ? & \rho_0\sigma_0 \\ ? & \sigma_1^2 & \rho_1\sigma_1 \\ \rho_0\sigma_0 & \rho_1\sigma_1 & 1 \end{pmatrix} \right]
? = Cov(\varepsilon_{i0}, \varepsilon_{i1}): Not identified. Regimes do not coexist.
This is a latent class model with different processes in the two classes.
There is correlation between the unobservables that govern the class
determination and the unobservables in the two regime equations.
Outcome Inflation Models
• Lambert 1992, Technometrics. Quality control problem. Counting defects
per unit of time on the assembly line.
• How to explain the zeros; is the process under control or not?
• Two State Outcome: Prob(State=0)=R, Prob(State=1)=1-R
o State=0, Y=0 with certainty
o State=1, Y ~ some distribution whose support includes 0, e.g., Poisson.
o Prob(State 0|y>0) = 0
o Prob(State 1|y=0) = (1-R)f(0)/[R + (1-R)f(0)]
o R = Logistic probability
• “Nonstandard” latent class model
• Recent users have extended this to “Outcome Inflated Models,” e.g., twos
inflation in models of fertility; inflated responses in health status.
Split Population Survival Models
• Schmidt and Witte 1989 study of recidivism
• F=1 for eventual failure, F=0 for never fail.
F is unobserved. P(F=1)=δ, P(F=0)=1-δ
• C=1 for a recidivist, observed. Prob(F=1|C=1) = 1.
• Density for time until failure actually occurs is δ × g(t|F=1).
• Density for observed duration (possibly censored)
o P(C=0) = (1-δ) + δG(T|F=1) (Observation is censored)
o Density given C=1 = δg(t|F=1)
o G=survival function, t=time of observation.
• Unobserved F implies a latent population split.
• They added covariates to δ: δ_i = logit(z_i).
• Different models apply to the two latent subpopulations.
Variations of Interest
• Heterogeneous priors for the class probabilities
• Correlation of unobservables in class probabilities
with unobservables in regime specific models
• Variations of model structure across classes
• Behavioral basis for the mixed models with implied
restrictions
Applications
• Obesity: Heterogeneous class probabilities,
generalized ordered choice; Endogenous class
membership
• Self Assessed Health: Heterogeneous
subpopulations; endogenous class membership
• Cost Efficiency of Nursing Homes: theoretical
restrictions on underlying models
Modeling Obesity with a
Latent Class Model
Mark Harris
Department of Economics, Curtin University
Bruce Hollingsworth
Department of Economics, Lancaster University
Pushkar Maitra
Department of Economics, Monash University
William Greene
Stern School of Business, New York University
300 Million People Worldwide. International Obesity Task Force: www.iotf.org
Costs of Obesity
• In the US more people are obese than smoke or use
illegal drugs
• Obesity is a major risk factor for non-communicable
diseases like heart problems and cancer
• Obesity is also associated with:
o lower wages and productivity, and absenteeism
o low self-esteem
• An economic problem. It is costly to society:
o USA costs are around 4-8% of all annual health care
expenditure - US \$100 billion
o Canada, 5%; France, 1.5-2.5%; and New Zealand 2.5%
Measuring Obesity
• An individual’s weight given their height should lie
within a certain range
o Body Mass Index (BMI)
o Weight (kg)/height (meters)^2
• WHO guidelines:
o Underweight: BMI < 18.5
o Normal: 18.5 < BMI < 25
o Overweight: 25 < BMI < 30
o Obese: BMI > 30
o Morbidly Obese: BMI > 40
Two Latent Classes: Approximately Half of European Individuals
Modeling BMI Outcomes
• Grossman-type health production function
Health Outcomes = f(inputs)
• Existing literature assumes BMI is an ordinal, not cardinal,
representation of individuals.
o Weight-related health status
o Do not assume a one-to-one relationship between BMI levels and
(weight-related) health status levels
• Translate BMI values into an ordinal scale using WHO guidelines
• Preserves underlying ordinal nature of the BMI index but recognizes
that individuals within a so-defined weight range are of an
(approximately) equivalent (weight-related) health status level
Conversion to a Discrete Measure
• Measurement issues: Tendency to underreport BMI
o women tend to under-estimate/report weight;
o men over-report height.
• Using bands should alleviate this
• Allows focus on discrete ‘at risk’ groups
A Censored Regression Model for BMI
Simple Regression Approach Based on Actual BMI:
BMI* = β′x + ε, ε ~ N[0,σ^2]
Interval Censored Regression Approach
WT = 0 if BMI* < 25        Normal
     1 if 25 < BMI* < 30   Overweight
     2 if BMI* > 30        Obese
• Inflexible reliance on WHO classification
• Rigid measurement by the guidelines
An Ordered Probit Approach
A Latent Regression Model for “True BMI”
BMI* = β′x + ε, ε ~ N[0,σ^2], σ^2 = 1
“True BMI” (a proxy for weight) is unobserved
Observation Mechanism for Weight Type
WT = 0 if BMI* < 0        Normal
     1 if 0 < BMI* < μ    Overweight
     2 if μ < BMI*        Obese
Heterogeneity in the BMI Ranges
• Boundaries set by the WHO are narrowly defined for all individuals
• Strictly defined WHO definitions may consequently push individuals into
inappropriate categories
• We allow flexibility at the margins of these intervals
• Following Pudney and Shields (2000), we therefore consider Generalised
Ordered Choice models: boundary parameters are now functions of
observed personal characteristics
Generalized Ordered Probit Approach
A Latent Regression Model for True BMI
BMI_i* = β′x_i + ε_i, ε_i ~ N[0,σ^2], σ^2 = 1
Observation Mechanism for Weight Type
WT_i = 0 if BMI_i* < 0               Normal
       1 if 0 < BMI_i* < μ_i(w_i)    Overweight
       2 if μ_i(w_i) < BMI_i*        Obese
Latent Class Modeling
• Several ‘types’ or ‘classes’. Obesity may be due to genetic reasons
(the FTO gene) or lifestyle factors
• Distinct sets of individuals may have differing reactions to
various policy tools and/or characteristics
• The observer does not know from the data which class an
individual is in.
• Suggests a latent class approach for health outcomes
(Deb and Trivedi, 2002, and Bago d’Uva, 2005)
Latent Class Application
• Two class model (considering FTO gene):
o More classes make class interpretations much
more difficult
o Parametric models proliferate parameters
• Endogenous class membership: Two classes allow us
to correlate the equations driving class membership
and observed weight outcomes via unobservables.
• Theory for more than two classes not yet developed.
Heterogeneous Class Probabilities
• π_j = Prob(class=j) = governor of a detached
natural process. Homogeneous.
• π_ij = Prob(class=j|z_i, individual i)
Now possibly a behavioral aspect of the process, no
longer “detached” or “natural”
• Nagin and Land 1993, “Criminal Careers…
Endogeneity of Class Membership
Class Membership: C* = \alpha'z_i + u_i, C = 1[C* > 0] (Probit)
BMI|Class=0,1: BMI* = \beta_c'x_i + \varepsilon_{c,i}, BMI group = OP[BMI*, \mu(\gamma_c'w_i)]
Endogeneity: \begin{pmatrix} u_i \\ \varepsilon_{c,i} \end{pmatrix} \sim N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho_c \\ \rho_c & 1 \end{pmatrix} \right]
Bivariate Ordered Probit (one variable is binary).
Full information maximum likelihood.
Model Components
• x: determines observed weight levels within classes
For observed weight levels we use lifestyle factors such as
marital status and exercise levels
• z: determines latent classes
For latent class determination we use genetic proxies such as
age, gender and ethnicity: the things we can’t change
• w: determines position of boundary parameters within classes
For the boundary parameters we have: weight-training
intensity and age (BMI inappropriate for the aged?)
pregnancy (small numbers and length of term unknown)
Data
• US National Health Interview Survey (2005);
conducted by the National Center for Health
Statistics
• Information on self-reported height and
weight levels, BMI levels
• Demographic information
• Split sample (30,000+) by gender
Outcome Probabilities
• Class 0 dominated by normal and overweight probabilities: ‘normal weight’ class
• Class 1 dominated by probabilities at top end of the scale: ‘non-normal weight’ class
• Unobservables for weight class membership are negatively correlated with those
determining weight levels
[Figure: outcome probabilities (Normal, Overweight, Obese) for Class 0 and Class 1]
Classification (Latent Probit) Model
BMI Ordered Choice Model
• Conditional on class membership, lifestyle factors
• Marriage comfort factor only for normal class women
• Both classes associated with income, education
• Exercise effects similar in magnitude
• Exercise intensity only important for ‘non-normal’ class
• Home ownership only important for ‘non-normal’ class, and negative: result of
differing socioeconomic status distributions across classes?
Effects of Aging on Weight Class
Effect of Education on Probabilities
Effect of Income on Probabilities
Inflated Responses in Self-Assessed Health
Mark Harris
Department of Economics, Curtin University
Bruce Hollingsworth
Department of Economics, Lancaster University
William Greene
Stern School of Business, New York University
Introduction
• Health sector an important part of developed countries’
economies: E.g., Australia 9% of GDP
• To see if these resources are being effectively utilized, we
need to fully understand the determinants of individuals’
health levels
• To this end much policy, and even more academic research, is
based on measures of self-assessed health (SAH) from survey
data
SAH vs. Objective Health Measures
Favorable SAH categories seem artificially high.
• 60% of Australians are either overweight or obese (Dunstan et al., 2001)
• 1 in 4 Australians has either diabetes or a condition of impaired glucose
metabolism
• Over 50% of the population has elevated cholesterol
• Over 50% has at least 1 of the “deadly quartet” of health conditions
(diabetes, obesity, high blood pressure, high cholesterol)
• Nearly 4 out of 5 Australians have 1 or more long term health conditions
(National Health Survey, Australian Bureau of Statistics 2006)
• Australia ranked #1 in terms of obesity rates
Similar results appear for other countries
SAH vs. Objective Health
Our objectives
1. Are these SAH outcomes “overinflated”?
2. And if so, why, and what kinds of
people are doing the overinflating/mis-reporting?
HILDA Data
The Household, Income and Labour Dynamics in Australia
(HILDA) dataset:
1. a longitudinal survey of households in Australia
2. well tried and tested dataset
3. contains a host of information on SAH and other health
measures, as well as numerous demographic variables
Self Assessed Health
• “In general, would you say your health is: Excellent, Very
good, Good, Fair or Poor?"
• Responses 1,2,3,4,5 (we will be using 0,1,2,3,4)
• Typically ¾ of responses are “good” or “very good” health; in
our data (HILDA) we get 72%
• Similar numbers for most developed countries
• Does this truly represent the health of the nation?
Recent Literature - Heterogeneity
• Carro (2012)
o Ordered SAH, “good,” “so so,” “bad”
o Two effects: Random effects (Mundlak) in latent
index function, fixed effects in threshold
• Schurer and Jones (2011)
o Heterogeneity, panel data
o “Generalized ordered probit:” different slope
vectors for each outcome.
Kerkhofs and Lindeboom, Health Economics, 1995
• Subjective Health Measures and State Dependent Reporting Errors
• Incentive to “misreport” depends on employment status: employed,
unemployed, retired, disabled
• Ho = an objective, observed health indicator
• H* = latent health = f1(Ho,X1)
• Hs = reported health = f2(H*,X2,S)
o S = employment status, 4 observed categories
o Ordered choice
o Boundaries depend on S,X2; Heterogeneity is induced by incentives produced by
employment status
A Two Class Latent Class Model
True Reporter
Misreporter
Reporter Type Model
r* = x_r′β_r + ε_r
r = 1 if r* > 0   True reporter
    0 if r* ≤ 0   Misreporter
r is unobserved
[Diagram: outcomes Y=4, Y=3, Y=2, Y=1, Y=0]
Pr(true,y) = Pr(true) * Pr(y | true)
• Mis-reporters choose either good or very good
• The response is determined by a probit model
m* = x_m′β_m + ε_m
[Diagram: outcomes Y=3, Y=2]
Observed Mixture of Two Classes
Pr(y) = Pr(true) Pr(y | true) + Pr(misreporter) Pr(y | misreporter)
Who are the Misreporters?
Priors and Posteriors
M = Misreporter, T = True reporter
Priors: Pr(M) = Φ(-x_r′β_r), Pr(T) = Φ(x_r′β_r)
Posteriors:
Noninflated outcomes 0, 1, 4
Pr(M | y = 0,1,4) = 0, Pr(T | y = 0,1,4) = 1
Inflated outcomes 2, 3
Pr(M | y = 2) = Pr(y = 2 | M)Pr(M) / [Pr(y = 2 | M)Pr(M) + Pr(y = 2 | T)Pr(T)]
General Results
[Figure: sample, predicted, and mis-reporting proportions by SAH category (Poor, Fair, Good, Very Good, Excellent); vertical axis from 0 to 0.4]
Whither the EM Algorithm?
• An Algorithm, not a model
o E step: Compute posterior class probabilities (individual i, class j)
o M step: In each class, estimate class specific
parameters using a (class and individual) weighted
log likelihood, using the posteriors as weights.
• Cannot impose cross class restrictions
• Cannot model endogeneity
Latent Class Efficiency Studies
• Battese and Coelli – growing in weather
“regimes” for Indonesian rice farmers
• Kumbhakar and Orea – cost structures for U.S.
Banks
• Greene (Health Economics, 2005) – revisits
WHO Year 2000 World Health Report
Studying Economic Efficiency
in Health Care
• Hospital and Nursing Home
o Cost efficiency
o Role of quality (not studied today)
• Agency for Healthcare Research and Quality
(AHRQ)
Stochastic Frontier Analysis
• logC = f(output, input prices, environment) + v + u
• ε = v+u
o v = noise – the usual “disturbance”
o u = inefficiency
• Frontier efficiency analysis
o Estimate parameters of model
o Estimate u (to the extent we are able – we use E[u|ε])
o Evaluate and compare observed firms in the sample
Nursing Home Costs
• 44 Swiss nursing homes, 13 years
• Cost, Pk, Pl, output, two environmental variables
• Estimate cost function
• Estimate inefficiency
Estimated Cost Efficiency
Inefficiency?
• Not all agree with the presence (or
identifiability) of “inefficiency” in market
outcomes data.
• Variation around the common production
structure may all be nonsystematic and not
controlled by management
• Implication, no inefficiency: u = 0.
A Two Class Model
• Class 1: With Inefficiency
o logC = f(output, input prices, environment) + σ_v v + σ_u u
• Class 2: Without Inefficiency
o logC = f(output, input prices, environment) + σ_v v
o u = 0
• Implement with a single zero restriction in a
constrained (same cost function) two class
model
• Parameterization: λ = σ_u/σ_v = 0 in class 2.
LogL = 464 with a common frontier model, 527 with two classes
Conclusion
Latent class modeling provides a rich, flexible
platform for behavioral model building.
Thank you.
```
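
The slides on "How Finite Mixture Models Work" and "Whither the EM Algorithm?" describe fitting a two-component normal mixture and the E and M steps. The sketch below is one minimal way to illustrate that idea. It is not the author's estimator: the data are simulated (the true values 0.3, 7.0, 1.5, 2.0, 1.0 are made up for illustration), the starting values are a crude assumption, and the function names `normal_pdf`, `log_likelihood`, `posteriors` and `em_two_normals` are hypothetical. The log likelihood has the same form as the LogL on the slide, with n = 1000 and J = 2.

```python
import numpy as np


def normal_pdf(y, mu, sigma):
    """Normal density (1/sigma) * phi((y - mu) / sigma)."""
    z = (y - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def log_likelihood(y, pi, mu, sigma):
    """LogL = sum_i log sum_j pi_j (1/sigma_j) phi((y_i - mu_j) / sigma_j)."""
    dens = pi * normal_pdf(y[:, None], mu, sigma)  # shape (n, J)
    return np.log(dens.sum(axis=1)).sum()


def posteriors(y, pi, mu, sigma):
    """Posterior class probabilities P(class = j | y_i), shape (n, J)."""
    dens = pi * normal_pdf(y[:, None], mu, sigma)
    return dens / dens.sum(axis=1, keepdims=True)


def em_two_normals(y, n_iter=500, tol=1e-8):
    """EM for a two-class normal mixture; returns estimates and the LogL path."""
    # Crude starting values (an assumption): split the sample at its median.
    med = np.median(y)
    pi = np.array([0.5, 0.5])
    mu = np.array([y[y <= med].mean(), y[y > med].mean()])
    sigma = np.array([y.std(), y.std()])
    path = [log_likelihood(y, pi, mu, sigma)]
    for _ in range(n_iter):
        # E step: posterior class probabilities given current parameters.
        w = posteriors(y, pi, mu, sigma)
        # M step: posterior-weighted estimates within each class.
        nj = w.sum(axis=0)
        pi = nj / len(y)
        mu = (w * y[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((w * (y[:, None] - mu) ** 2).sum(axis=0) / nj)
        path.append(log_likelihood(y, pi, mu, sigma))
        if path[-1] - path[-2] < tol:
            break
    return pi, mu, sigma, path


# Simulated data: 1000 draws from a made-up two-class population.
rng = np.random.default_rng(0)
n = 1000
is_high = rng.random(n) < 0.3
y = np.where(is_high, rng.normal(7.0, 1.5, n), rng.normal(2.0, 1.0, n))

pi_hat, mu_hat, sigma_hat, loglik_path = em_two_normals(y)
post = posteriors(y, pi_hat, mu_hat, sigma_hat)
print("pi:", pi_hat, "mu:", mu_hat, "sigma:", sigma_hat)
print("LogL:", loglik_path[-1])
```

The rows of `post` are the posterior probabilities P(class|y) that the latent class interpretation treats as meaningful. Under the pure finite mixture reading they are only a by-product of the fit. Because each class is updated separately in the M step, this sketch also shows why the plain EM iteration has no place for cross class restrictions.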