Educational Data Mining Overview

Using Data-Driven Discovery Techniques for the Design
and Improvement of Educational Systems
John Stamper
Pittsburgh Science of Learning Center
Human-Computer Interaction Institute
Carnegie Mellon University
4/8/2013
The Classroom of the Future
Which picture represents the
“Classroom of the Future”?
The Classroom of the Future
The answer is both!
Depends on how much money you have...
… but maybe not what you think…
The Classroom of the Future
Rich vs. Poor
– Poor kids will be forced to rely on “cheap” technology
– Rich kids will have access to “expensive” teachers
We are seeing this today!
– Waldorf school in Silicon Valley – no technology
– NGLC Wave III Grants
– MOOCs
– Growth of adaptive technology companies
– Online instruction
– … and more…
What does this mean?
My view is that we cannot stop this; I believe we
must accept that economics will force this route.
We should focus on improving learning technology
• New ways to improve teacher-student access
• Add more adaptive features to learning software
Adaptive learning, at scale, using data!
Educational Data Mining
• “Educational Data Mining is an emerging
discipline, concerned with developing
methods for exploring the unique types of
data that come from educational settings, and
using those methods to better understand
students, and the settings which they learn
in.”
– www.educationaldatamining.org
Types of EDM methods
(Baker & Yacef, 2009)
• Prediction
– Classification
– Regression
– Density estimation
• Clustering
• Relationship mining
– Association rule mining
– Correlation mining
– Sequential pattern mining
– Causal data mining
• Distillation of data for human judgment
• Discovery with models
Emerging Communities
• Society for Learning Analytics Research
– First conference: LAK2011
• International Educational Data Mining Society
– First conference: EDM2008
– Publishing JEDM since 2009
• Plus a growing number of great people
working in this area who are not (yet) closely
affiliated with either community
Emerging Communities
• Joint goal of exploring the “big data” now
available on learners and learning
• To promote
– New scientific discoveries & to advance learning
sciences
– Better assessment of learners along multiple
dimensions
• Social, cognitive, emotional, meta-cognitive, etc.
• Individual, group, institutional, etc.
– Better real-time support for learners
EDM Methods to discuss
• Prediction – understand what the student
knows
• Discovery with models – improve
understanding of the structure of knowledge
LearnLab
Pittsburgh Science of Learning Center (PSLC)
• Created to bridge the chasm between science &
practice
– Low success rate (<10%) of randomized field trials
• LearnLab = a socio-technical bridge between lab
psychology & schools
– E-science of learning & education
– Social processes for research-practice engagement
• Purpose: Leverage cognitive theory and computational
modeling to identify the conditions that cause robust student
learning
LearnLab: Data-driven improvement
infrastructure
Ed tech + wide use = Research in practice
Ed tech: Algebra Cognitive Tutor, Chemistry Virtual Lab, English Grammar Tutor, Educational Games
• 2004-14, ~$50 million
• Tech enhanced courses, assessment, & research
• School cooperation
• In vivo experiments
Interaction data is surprisingly revealing
Online interactions => state tests
• Accurate assessment during learning
• Detect student work ethic, engagement …
R = .82
Learning Curve Analysis
• Discover better models of what is hard to learn
Flat curve => improvement opportunity
DataShop
• Central Repository
– Secure place to store & access research data
– Supports various kinds of research
• Primary analysis of study data
• Exploratory analysis of course data
• Secondary analysis of any data set
• Analysis & Reporting Tools
– Focus on student-tutor interaction data
– Data Export
• Tab delimited tables you can open with your favorite
spreadsheet program or statistical package
• Web services for direct access
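A minimal sketch of reading a tab-delimited export with Python's standard csv module. The column names (student, problem, step, outcome, kc) and the three sample rows are assumptions made up for illustration; they are not the actual DataShop export schema.

```python
import csv
import io

# Hypothetical excerpt of a tab-delimited export. The column names and
# values are invented for illustration only.
SAMPLE_EXPORT = (
    "student\tproblem\tstep\toutcome\tkc\n"
    "s1\tp1\tstep1\tCORRECT\tcircle-area\n"
    "s1\tp2\tstep1\tINCORRECT\tcircle-area\n"
    "s2\tp1\tstep1\tHINT\tcircle-area\n"
)


def read_transactions(text):
    """Parse tab-delimited text into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def error_rate(rows):
    """Fraction of rows whose outcome is anything other than CORRECT."""
    if not rows:
        return 0.0
    errors = sum(1 for row in rows if row["outcome"] != "CORRECT")
    return errors / len(rows)


transactions = read_transactions(SAMPLE_EXPORT)
print(len(transactions), round(error_rate(transactions), 3))
```

For a real export file, the same reader could be pointed at an open file handle instead of the in-memory string.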
Repository
• Allows for full data management
• Controlled access for collaboration
• File attachments
• Paper attachments
• Great for secondary analyses
How big is DataShop?
Domain     Files  Papers  Datasets  Student Actions  Students  Student Hours
Language      64      11        78        6,237,523     6,499          6,877
Math         222      53       189       75,754,530    37,218        173,175
Science       92      19        93       13,849,756    16,939         45,465
Other         18      12        50        8,604,016    13,018         31,111
Total        396      95       410      104,445,825    73,674        256,630
As of April 2013
What kinds of data?
• By domain based on studies from the Learn Labs
• Data from intelligent tutors
• Data from online instruction
• Data from games
The data is fine-grained, at the transaction level!
Web Application
Getting to DataShop
• Explore data through the DataShop tools
• Where is DataShop?
– http://pslcdatashop.org
– Linked from DataShop homepage and learnlab.org
• http://pslcdatashop.web.cmu.edu/about/
• http://learnlab.org/technologies/datashop/index.php
DataShop Terminology
• KC: Knowledge component
– also known as a skill/concept/fact
– a piece of information that can be used to
accomplish tasks
– tagged at the step level
• KC Model:
– also known as a cognitive model or skill model
– a mapping between problem steps and knowledge
components
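As a rough illustration of a KC model as a mapping between problem steps and knowledge components, the sketch below stores the mapping as a dictionary and converts it into a binary step-by-KC matrix. The step names and KC names are hypothetical.

```python
# Hypothetical KC model: each problem step is tagged with one or more KCs.
kc_model = {
    "p1-step1": ["circle-area"],
    "p1-step2": ["compose-by-addition"],
    "p2-step1": ["circle-area", "compose-by-addition"],
}


def to_matrix(model):
    """Return (steps, kcs, matrix) where matrix[i][j] is 1 if step i uses KC j."""
    kcs = sorted({kc for labels in model.values() for kc in labels})
    steps = sorted(model)
    matrix = [[1 if kc in model[step] else 0 for kc in kcs] for step in steps]
    return steps, kcs, matrix


steps, kcs, matrix = to_matrix(kc_model)
for step, row in zip(steps, matrix):
    print(step, row)
```

Changing a KC model then amounts to editing the labels on individual steps and rebuilding the matrix.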
Getting the KC Model Right!
The KC model drives instruction in adaptive
learning
– Problem and topic sequence
– Instructional messages
– Tracking student knowledge
What makes a good KC Model?
• A correct expert model is one that is consistent with
student behavior.
• Predicts task difficulty
• Predicts transfer between instruction and test
The model should fit the data!
Good KC Model => Good Learning Curve
• An empirical basis for determining when a
cognitive model is good
• Accurate predictions of student task
performance & learning transfer
– Repeated practice on tasks involving the same skill
should reduce the error rate on those tasks
=> A declining learning curve should emerge
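One way to check for a declining curve is to compute the error rate at each practice opportunity for a given KC. The sketch below does this on a small, made-up set of observations; the tuple layout (student, KC, correct) is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical first-attempt observations in time order: (student, kc, correct).
observations = [
    ("s1", "area", False), ("s1", "area", False), ("s1", "area", True),
    ("s2", "area", False), ("s2", "area", True), ("s2", "area", True),
]


def learning_curve(obs, kc):
    """Error rate at the 1st, 2nd, 3rd, ... opportunity to practice one KC."""
    seen = defaultdict(int)     # opportunities so far, per student
    totals = defaultdict(int)   # observations at each opportunity number
    errors = defaultdict(int)   # incorrect observations at each opportunity
    for student, obs_kc, correct in obs:
        if obs_kc != kc:
            continue
        seen[student] += 1
        opportunity = seen[student]
        totals[opportunity] += 1
        if not correct:
            errors[opportunity] += 1
    return [errors[o] / totals[o] for o in sorted(totals)]


print(learning_curve(observations, "area"))  # [1.0, 0.5, 0.0]
```

A curve that stays flat or rises across opportunities would be a candidate for relabeling under a different KC model.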
A Good Learning Curve
How do we make KC Models?
Traditionally, Cognitive Task Analysis (CTA) has been used
But CTA has some issues…
– Extremely human driven
– It is highly subjective
– Leading to differing results from different analysts
And these human-discovered models are usually
wrong!
If human-centered CTA is not the answer
How should these models be designed?
They shouldn’t!
The models should be discovered not designed!
Solution
– We have lots of log data from tutors and other systems
– We can harness this data to validate and improve
existing student models
Human-Machine Student Model Discovery
DataShop provides an easy interface to add and modify
KC models and ranks the models using the Additive Factors Model (AFM)
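For readers unfamiliar with AFM, the sketch below computes a predicted probability of a correct response using the commonly published form of the model: the log-odds are a student proficiency term plus, for each KC on the step, an easiness term and a learning-rate term multiplied by the number of prior opportunities. The parameter values are invented, and the fitting code DataShop actually uses may differ from this illustration.

```python
import math


def afm_probability(theta, betas, gammas, q_row, opportunities):
    """Predicted probability of a correct response under AFM.

    theta         -- student proficiency
    betas         -- easiness of each KC
    gammas        -- learning rate of each KC
    q_row         -- 1 if the step uses the KC, else 0 (one entry per KC)
    opportunities -- prior practice opportunities the student had on each KC
    """
    logit = theta
    for q, beta, gamma, t in zip(q_row, betas, gammas, opportunities):
        logit += q * (beta + gamma * t)
    return 1.0 / (1.0 + math.exp(-logit))


# Invented parameters: one KC with easiness 0 and learning rate 0.5.
p_first = afm_probability(0.0, [0.0], [0.5], [1], [0])
p_third = afm_probability(0.0, [0.0], [0.5], [1], [2])
print(round(p_first, 3), round(p_third, 3))  # 0.5 0.731
```

Under this form, a KC with a learning rate near zero predicts a flat learning curve.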
Human-Machine Student Model Discovery
3 strategies for discovering improvements to the
student model
– Smooth learning curves
– No apparent learning
– Problems with unexpected error rates
A good cognitive model produces a learning curve
Without decomposition, using just a single “Geometry” skill, no smooth learning curve.
But with decomposition, 12 skills for area, a smooth learning curve.
Is this the correct or “best” cognitive model?
(Rise in error rate because poorer students get assigned more problems)
Inspect curves for individual knowledge components (KCs)
Many curves show a reasonable decline
Some do not => Opportunity to improve model!
No apparent Learning
Problems with Unexpected Error Rates
Inspect problems to hypothesize new KC labels
• Here scaffolding is originally absent, but other problems
have fixed scaffolding
– They start with columns for square & area
These strategies suggest an
improvement
– Hypothesized there were additional skills involved
in some of the compose by addition problems
– A new student model (better BIC value) suggests
splitting the skill.
Redesign based on Discovered Model
Our discovery suggested changes needed to be
made to the tutor
– Resequencing – put problems requiring fewer
skills first
– Knowledge Tracing – adding new skills
– Creating new tasks – new problems
– Changing instructional messages, feedback or
hints
Study: Current tutor is control
• Current fielded tutor only uses scaffolded
problems
Study: Treatment
• Scaffolded, given areas, plan-only, &
unscaffolded
• Isolate practice on problem decomposition
Study Results
• Much more efficient & better learning on
targeted decomposition skills
[Figure: Instructional time (minutes) by step type (area and other steps vs. composition steps). Control: original tutor; Treatment: model-based redesign.]
[Figure: Post-test % correct by item type (composition vs. area). Control: original tutor; Treatment: model-based redesign.]
Translational Research Feedback Loop
Design → Deploy → Data → Discover → (back to Design)
Can a data-driven process be
automated & brought to scale?
Yes!
• Combine Cognitive Science, Psychometrics,
Machine Learning …
• Collect a rich body of data
• Develop new model discovery algorithms,
visualizations, & on-line collaboration
support
DataShop’s “leaderboard” ranks discovered cognitive models
100s of datasets coming from ed tech in math, science, & language
Some models are machine generated (based on
human-generated learning factors)
Some models are human generated
Metrics for model prediction
• AIC & BIC penalize for more parameters,
fast & consistent
• 10-fold cross-validation
• Minimize root mean squared error (RMSE) on
unseen data
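The metrics above can be sketched with their standard definitions: AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L, and RMSE as the square root of the mean squared difference between predicted probabilities and 0/1 outcomes. The predictions and parameter count below are invented, and predictions are assumed to lie strictly between 0 and 1.

```python
import math


def log_likelihood(predictions, outcomes):
    """Log-likelihood of 0/1 outcomes given predicted probabilities of success.

    Assumes every prediction is strictly between 0 and 1.
    """
    return sum(
        math.log(p if y == 1 else 1.0 - p)
        for p, y in zip(predictions, outcomes)
    )


def aic(ll, num_params):
    """Akaike information criterion; lower is better."""
    return 2 * num_params - 2 * ll


def bic(ll, num_params, num_obs):
    """Bayesian information criterion; penalizes parameters more as data grows."""
    return num_params * math.log(num_obs) - 2 * ll


def rmse(predictions, outcomes):
    """Root mean squared error between predictions and 0/1 outcomes."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, outcomes)]
    return math.sqrt(sum(squared) / len(squared))


# Invented example: a one-parameter model that always predicts 0.5.
preds, actual = [0.5, 0.5], [1, 0]
ll = log_likelihood(preds, actual)
print(round(aic(ll, 1), 3), round(bic(ll, 1, len(actual)), 3), rmse(preds, actual))
```

For cross-validation, the same rmse function would be applied only to held-out observations that were not used to fit the model.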
Automated search for better models
Learning Factors Analysis (LFA)
(Cen, Koedinger, & Junker, 2006)
• Method for discovering & evaluating cognitive models
• Finds model “Q matrix” that best predicts student learning
data
• Inputs
– Data: Student success on tasks over time
– Factors hypothesized to explain learning
• Outputs
– Rank order of most predictive Q matrix
– Parameter estimates for each
Simple search process example: modifying Q matrix by input factor to get new Q’ matrix
• Q matrix factor Sub split by factor Neg-result
• Produces new Q matrix
• Two new KCs (Sub-Pos & Sub-Neg) replace old KC (Sub)
• Redo opportunity counts
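The split step in this example can be sketched as relabeling each Sub step according to the Neg-result factor and then recounting opportunities per student and KC. The row layout and the sample data are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical step records in time order for one student.
rows = [
    {"student": "s1", "kc": "Sub", "neg_result": False},
    {"student": "s1", "kc": "Sub", "neg_result": True},
    {"student": "s1", "kc": "Sub", "neg_result": False},
    {"student": "s1", "kc": "Add", "neg_result": False},
]


def split_kc(records, kc, factor, true_label, false_label):
    """Replace one KC with two new KCs according to a boolean factor."""
    result = []
    for record in records:
        new_record = dict(record)
        if record["kc"] == kc:
            suffix = true_label if record[factor] else false_label
            new_record["kc"] = f"{kc}-{suffix}"
        result.append(new_record)
    return result


def count_opportunities(records):
    """Number each record with the student's running practice count on its KC."""
    seen = defaultdict(int)
    result = []
    for record in records:
        key = (record["student"], record["kc"])
        seen[key] += 1
        result.append(dict(record, opportunity=seen[key]))
    return result


before = count_opportunities(rows)
after = count_opportunities(split_kc(rows, "Sub", "neg_result", "Neg", "Pos"))
print([r["opportunity"] for r in before])          # [1, 2, 3, 1]
print([(r["kc"], r["opportunity"]) for r in after])
```

After the split, the third Sub step counts as only the second opportunity on Sub-Pos, which is why the opportunity counts have to be redone before refitting.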
LFA: Best First Search Process
• Search algorithm guided by a heuristic: AIC
• Start with single skill cog model (Q matrix)
[Figure: search tree. Original model: BIC = 4328. Splits by Embed, Backward, and Initial produce candidate models (50+) with values from 4301 to 4325; 15 expansions later: 4248.]
Cen, H., Koedinger, K., Junker, B. (2006). Learning Factors Analysis:
A general method for cognitive model evaluation and improvement. 8th
International Conference on Intelligent Tutoring Systems.
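A generic best-first search over candidate models can be sketched as follows. This is an illustration of the search idea only, not the published LFA implementation: a model is represented as the set of factors it has been split by, and the scores come from an invented lookup table, whereas in LFA each score would come from fitting the statistical model to data and computing AIC or BIC.

```python
import heapq
import itertools

FACTORS = ("Embed", "Backward", "Initial")

# Invented scores (lower is better), one per set of applied splits.
TOY_SCORES = {
    frozenset(): 100,
    frozenset({"Embed"}): 90,
    frozenset({"Backward"}): 95,
    frozenset({"Initial"}): 97,
    frozenset({"Embed", "Backward"}): 85,
    frozenset({"Embed", "Initial"}): 92,
    frozenset({"Backward", "Initial"}): 96,
    frozenset({"Embed", "Backward", "Initial"}): 88,
}


def expand(model):
    """All models reachable by applying one more split."""
    return [model | {factor} for factor in FACTORS if factor not in model]


def best_first_search(start, score, max_expansions):
    """Repeatedly expand the best-scoring unexpanded model; return the best seen."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on score ties
    best_score, best_model = score(start), start
    frontier = [(best_score, next(tie_breaker), start)]
    visited = {start}
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, model = heapq.heappop(frontier)
        for child in expand(model):
            if child in visited:
                continue
            visited.add(child)
            child_score = score(child)
            heapq.heappush(frontier, (child_score, next(tie_breaker), child))
            if child_score < best_score:
                best_score, best_model = child_score, child
    return best_score, best_model


best_score, best_model = best_first_search(frozenset(), TOY_SCORES.__getitem__, 3)
print(best_score, sorted(best_model))  # 85 ['Backward', 'Embed']
```

The expansion budget plays the role of the "15 expansions" on the slide: the search stops after a fixed number of expansions and reports the best model found so far.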
Scientist “crowd”sourcing:
Feature input comes “for free”
Union of all hypothesized KCs in
human generated models
Scientist generated models
Validating Learning Factors Analysis
• Discovers better cognitive models in 11 of 11
datasets …
Koedinger, McLaughlin, & Stamper (2012). Automated student model improvement.
In Proceedings of the Fifth International Conference on Educational Data Mining.
[Conference best paper.]
Data from a variety of educational
technologies & domains
Statistics Online Course
English Article Tutor
Algebra Cognitive Tutor
Numberline Game
Applying LFA across domains
Variety of domains
& technologies
11 of 11 improved
models
9 of 11 equal
or greater learning
Can we go even bigger?
Competitions?
KDD Cup Competition
• Knowledge Discovery and Data Mining (KDD) is the most prestigious conference in the data mining and machine learning fields
• KDD Cup is the premier data mining challenge
• 2010 KDD Cup called “Educational Data Mining Challenge”
• Ran from April 2010 through June 2010
KDD Cup Competition
Competition goal is to predict student responses given tutor data provided by Carnegie Learning
Dataset                      Students  Steps       File size
Algebra I 2008-2009          3,310     9,426,966   3 GB
Bridge to Algebra 2008-2009  6,043     20,768,884  5.43 GB
KDD Cup Competition
• 655 registered participants
• 130 participants who submitted predictions
• 3,400 submissions
KDD Cup Competition
• Advances in prediction, cognitive modeling, new methods applied to EDM
• Spawned a number of workshops and papers
• The datasets are now in the “wild” and showing up in non-KDD conferences
• New competitions to continue momentum
Marigames.org
• Two stage competition with $100,000 in
prizes
– $50,000 Game Development
– $50,000 Educational Data Mining
• Goal is to go beyond individual datasets
• This requires common data formats
Takeaways
• The amount of data coming from educational
technology is growing exponentially
• Huge potential for EDM to improve educational
systems
• Optimal instructional design requires discoveries
(The student is not like me)
• These methods require common forms of data for
analysis (standards!)
Opportunities
• New Learning Science and Engineering
professional masters degree at Carnegie
Mellon University
• New concentration in Learning Analytics, MA
in Cognitive Studies in Education at Teachers
College, Columbia University
• Other programs in the works
Thank you
Special Thanks to:
Ken Koedinger, Director LearnLab
Ryan Baker, President IEDMS
Steve Ritter, Carnegie Learning
http://pslcdatashop.org
Questions?
[email protected]
http://dev.stamper.org
