Predictive Analytics in the Enterprise
Carlos Bossy
Principal Consultant
CBIP, Data Mining, MCTS, MCITP BI
Quanta Intelligence
Introduction
Experience
• 10 years Business Intelligence Consultant
• Data Warehouse, Integration, Data Mining, Predictive Models
• 15 years Software Development, from Programmer through CTO
Speaker/Author
• Speaker at PASS Summit in October
• Local and Nationwide Events and User Groups
• Published a chapter on Big Data in Pro SQL Server 2012 Practices
Contact Info
• Email: [email protected]
• Blog: www.carlosbossy.com
• Twitter: @carlosbossy
Today’s Session
Early Morning through Late Afternoon:
• Predictive Modeling Overview
• Tools and Algorithms
• SQL Server Data Mining
• Use Models with DMX – C#, SSIS, SSRS
• Demo – Text Mining with SVM
• Exercise 1
• Demo – Neural Network Code
• Case Studies
• Predictive Model by Class
• Demo\Lab 1 – Model Development
• Demo\Lab 2 – Model Development
• Exercise 2
• Demo 3 – R, WEKA, RapidMiner
Introduction
Predictive Analytics
• Data Mining
• Predictive Modeling
• Math! Just a little…
Predictive Analytics Defined
Predictive Analytics is the discovery of predictive information,
sometimes hidden, from databases (often large) using related
and unrelated data attributes by applying algorithmic analysis,
and the creation of models that generate predictive results.
Predictive Modeling is the process
of creating a model to best predict
the probability of an outcome.
Data Scientist
A Data Scientist incorporates techniques and theories from
many fields, including mathematics, statistics, data
engineering, pattern recognition and learning, advanced
computing, visualization, uncertainty modeling, data
warehousing, and high performance computing with the
goal of extracting meaning from data.
Data science: a novel term used
interchangeably with competitive
intelligence or business analytics.
Predictive Analytics
[Figure: Gartner Hype Cycle]
What do we want to Achieve?
Data → Model → Prediction
• Enterprise
• Embedded
• Demo
Predictive Analytics
A vital component in the BI Data Architecture, available in real-time to Applications, Reports, and ETL. Measures describe the past; predictions address the unknown.
ROI of Business Analytics Projects when Incorporating Predictive Analytics (Source: IDC): 145% vs. 89%
Yesterday <-----> Today
Legacy
• Manageable Volumes of Data
• The ease and power of SQL
• Domain Experts could grasp and analyze a complete Database
• Limited CPU Horsepower
• Finite Data
Today
• Explosion of Data: doubles every 3 years (Moore’s Law)
• Data Volumes can’t be comprehended by humans
• Uncover complex and difficult-to-find patterns for competitive edge
• Improve professional judgment of the Domain Expert (small but valuable)
• Knowledge Discovery: converting Data to Information
• N → Infinity
Applications
• Credit Risk Analysis
• Churn Analysis
• Customer Retention
• Targeted Marketing
• Market Basket Analysis
• Sales Forecasting
• Stock Predictions
• Medical Diagnosis
• Bioscience Research
• Surveys
• Insurance Rate Quotes
• Credit Card Fraud
• Web Site Events
• Loan Applications
• Hiring and Recruiting
• Cross-Marketing
• Attendance
• Anything!
Terms to Know
Class Variable is the data we are predicting.
Instance and Attribute are analogous to the rows and columns used as input to a predictive model.
Support refers to the number of instances of a particular pattern.
Overfitting (Overtraining) is a flaw in a predictive model where the model reflects the training data too closely.
Confusion Matrix (Classification Matrix) is a table showing predicted values and their accuracy.
Cross Validation is a way to develop and test a model by using one portion of the data to train and the remaining portion to test.
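The Confusion Matrix term above can be made concrete with a small sketch. This is plain Python with made-up labels; the function name and the Y/N classes are mine, not from any session tool:

```python
# Build the counts of a 2x2 confusion matrix for a binary class
# variable, then derive accuracy from it. Labels are illustrative.
from collections import Counter

def confusion_matrix(actual, predicted, positive="Y"):
    """Count true/false positives and negatives for one positive class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual    = ["Y", "Y", "N", "N", "Y", "N", "N", "Y"]
predicted = ["Y", "N", "N", "Y", "Y", "N", "N", "Y"]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm["TP"], cm["FP"], cm["FN"], cm["TN"])   # 3 1 1 3
print(accuracy)                                  # 0.75
```

Tools such as the Mining Accuracy Chart produce this same table automatically; the point here is only what the four cells mean.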
Modeling Methodology / Approach
Methodology / Approach
Problem Statement
• What do you want to predict?
• Is data available?
Set a Target
• Improve on current results
Develop Solution
• Integrate into Applications or Reports
Methodology / Approach
Gather Data → Explore Data → Prepare Data → Training / Test Sets → Choose Algorithms → Train → Test → Deploy → Monitor
Event Management Model
How many people will register for an Event?
• Simultaneous events managed: > 100
• Mean Error: 15%
• Target Error: 5%
• Pass 1: 60 days
• Pass 2: 30 days
• Pass 3: 7 days
Medical Case Management
Workplace Injuries
Can a case benefit from Nurse Case Management intervention?
• 95% of Claims < $2500
• 1% > $10,000
• Predict Cost at Intake
Gather Data
Predictive
Analytics DB
• Data Warehouse
• OLTP
• Cube
• External
• User Data
Prepare Data
• Transform the data to fit the
Problem/Algorithm/Solutions
• Get derivations from SMEs, Data Analysis, Groupings
• Discrete vs. Continuous
• Bins and Buckets (discretize)
• Analyze Quantities: logarithmic, binary, linear
• Moving Averages: simple, exponential, convergence,
divergence
• Input to Algorithm in Tabular Form
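Discretization (the "bins and buckets" step above) can be sketched in a few lines. Plain Python; the age values, bin edges, and labels are all hypothetical:

```python
# Discretize a continuous attribute into bins ("buckets") so it can
# feed algorithms that expect discrete inputs. Values are made up.

def discretize(value, edges, labels):
    """Map a continuous value to the first bin whose upper edge holds it."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]          # overflow bin: labels has one more entry than edges

edges  = [30, 50, 70]          # bin upper boundaries
labels = ["Young", "Middle", "Senior", "Elder"]

ages = [22, 35, 61, 48, 77]
binned = [discretize(a, edges, labels) for a in ages]
print(binned)   # ['Young', 'Middle', 'Senior', 'Middle', 'Elder']
```

In practice the edges come from SMEs or from data analysis (e.g., quantiles), exactly as the slide suggests.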
Select Training and Testing Sets
Training
Testing
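Selecting training and testing sets usually means a single shuffled split. A minimal sketch, assuming a 70/30 split and a fixed seed (both my choices, not the session's):

```python
# Shuffle once, carve off a testing set, and keep the rest for training.
import random

def split(rows, test_fraction=0.3, seed=42):
    rng = random.Random(seed)           # fixed seed so the split is repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training, testing)

rows = list(range(100))
train, test = split(rows)
print(len(train), len(test))   # 70 30
```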
Choose Algorithm(s)
• Time Series
• Random Forest
• Naïve Bayes
• Model Tree
• Association
• Genetic
• Clustering
• Support Vector Machine
• Regression Tree
• Logistic Regression
• Sequence Clustering
• Neural Networks
• Many more …
Choose Algorithms
• Clustering
• Classification
• Regression
• Association
Decision Tree (regression formula at a leaf):
Yearly Income = 64,387.383 + 5,984.161*(Cars - 2.125) + 9,372.900*(Age - 60.313) + 0.763*(ID - 17,759.125) + 20,740.055*(Children - 0.125)
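A leaf formula like the one above scores a new instance by simple substitution. A sketch in plain Python; the coefficients are from the slide, but the input row is hypothetical:

```python
# Evaluate the decision-tree leaf regression formula from the slide.

def predict_income(cars, age, id_, children):
    return (64387.383
            + 5984.161  * (cars - 2.125)
            + 9372.900  * (age - 60.313)
            + 0.763     * (id_ - 17759.125)
            + 20740.055 * (children - 0.125))

# Hypothetical instance near the leaf's mean values.
income = predict_income(cars=2, age=60, id_=17759, children=0)
print(round(income, 2))   # 58113.04
```

(That the row ID appears as a predictor is worth noticing: it is exactly the kind of attribute that can leak into a model and overfit.)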
Model Training and Testing
Apply Algorithms to Training Dataset → Test Model using Test Dataset → Iterate
Model Training and Testing
Predictive Model → Applications
SQL Server Data Mining
SQL Server 2005 and 2008
• Analysis Services
SQL Server 2012
• Analysis Services Multidimensional
Development
• Business Intelligence Development Studio
• Excel Data Mining Add-in
Event Registrations
Model Development
• 7-year record of every individual registration transaction: Personal, Business, Account/Payment, Other Events
• Transformed data to a daily tabular snapshot for modeling
• Algorithms tried: Time Series, Decision Trees, Neural Networks, Logistic Regression
• Defined Seasons, Moving Averages
• Best Algorithm: Decision Tree
• Daily DW ETL predicts each Event and stores the result
• Achieved 3.5% Mean Error Rate (Goal 5%)
Event Registrations
Daily Snapshot in Tabular Form

Events:
| Event ID | Season | Start Date | End Date | Initial Reg Date | Early Fee Date | Discount Fee Date | Standard Fee Date | Early Fee | Discount Fee | Standard Fee | Late Fee | Attendance |
| 1 | Spring | 3/17/2010 | 3/20/2010 | 10/15/2009 | 11/15/2009 | 1/1/2010 | 3/15/2010 | 895.00 | 1095.00 | 1295.00 | 1395.00 | 1225 |
| 2 | Winter | 2/5/2011 | 2/7/2011 | 12/1/2010 | | 1/5/2011 | 2/5/2011 | 995 | 1295 | | | 880 |
| 3 | Spring | 5/11/2010 | 5/14/2010 | 2/15/2010 | 3/1/2010 | 3/31/2010 | 5/11/2010 | 1250 | 1495 | 1795 | | 2410 |
| 4 | Winter | 11/14/2011 | 11/15/2011 | 9/1/2011 | | 8/1/2011 | 11/7/2011 | 695 | 850 | 950 | | 650 |
| 5 | Winter | 1/23/2011 | 1/23/2011 | 11/1/2010 | | 10/14/2010 | 1/23/2011 | 295 | 395 | | | 350 |

Registrations:
| Person ID | Event ID | Company ID | Job Title | Reg Date | Paid in Full | Events Attended | Travel Booked | Group |
| 1001 | 1 | 2510 | CTO | 1/19/2010 | Y | 3 | N | N |
| 1002 | 4 | 2010 | Consultant | 11/1/2011 | N | 0 | N | Y |
| 1003 | 2 | 2948 | Architect | 12/18/2010 | Y | 1 | Y | N |
| 1004 | 1 | 3992 | DBA | 2/15/2010 | Y | 1 | N | N |
| 1005 | 4 | 2899 | Consultant | 10/21/2011 | N | 4 | Y | N |

Daily snapshot with predictions (Event 1, actual attendance 1225):
| Date | Event ID | Registrations | Registrations To Date | 5 Day Average | 10 Day Average | Paid in Full | Average Events Attended | Travel Booked | Groups | Predicted Reg | Absolute Error | Percentage Error |
| 11/14/2009 | 1 | 9 | 110 | 4.0 | 2.9 | 94 | 1.4 | 45 | 8 | 1158 | 67 | 5.8% |
| 11/15/2009 | 1 | 23 | 133 | 8.1 | 5.0 | 115 | 1.6 | 56 | 12 | 1201 | 24 | 2.0% |
| 11/16/2009 | 1 | 2 | 135 | 8.0 | 4.9 | 116 | 1.6 | 56 | 12 | 1268 | 43 | 3.4% |
| 11/17/2009 | 1 | 5 | 140 | 8.6 | 5.2 | 121 | 1.5 | 57 | 12 | 1251 | 26 | 2.1% |
| 11/18/2009 | 1 | 4 | 144 | 9.0 | 5.1 | 125 | 1.5 | 60 | 13 | 1239 | 14 | 1.1% |
Event Registrations
Predictive Model Outcome
Event Attendance: 1500
30 Day Prediction
• Goal: 1425 – 1575
• Before: 1275 – 1725
• After: 1447 – 1553
More Terms to Know (Advanced)
Ensemble modeling combines the results of two or more
independent models to improve the quality of predictions.
Boosting is a form of ensemble modeling that trains subsequent models
by heavily weighting poorly classified instances.
Bagging (bootstrap aggregating) is also a form of ensemble modeling; it trains
each model in the ensemble using a random sample of the training data. The
models then vote to obtain the best prediction.
Stacking trains a model by using as input the output of predictions of
other models.
Feature Selection is a process of selecting a subset of relevant data
to reduce the amount of data a model must process to train. It can
generalize a model and help avoid overfitting.
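The bagging-and-voting mechanics can be shown with a toy sketch. Plain Python; the "learner" here is deliberately trivial (a majority-class guesser of my own invention), so only the sampling and voting are the point:

```python
# Toy bagging ensemble: train each "model" on a bootstrap sample of the
# training data, then let the trained models vote on a new row.
import random
from collections import Counter

def train_majority_model(sample):
    """Stand-in learner: always predicts the sample's majority class."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda row: majority

def bagging(training, n_models=5, seed=7):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap: sample with replacement, same size as the training set
        sample = [rng.choice(training) for _ in training]
        models.append(train_majority_model(sample))
    return models

def vote(models, row):
    return Counter(m(row) for m in models).most_common(1)[0][0]

training = [((x,), "Y" if x > 5 else "N") for x in range(10)]
models = bagging(training)
print(vote(models, (8,)))
```

Real ensembles swap in decision trees or other learners for the stand-in; the sampling/voting skeleton is the same.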
Worker Injury Claims
Data
• 15 Year History
• Closed Cases
• 18 Month Lag
Unbalanced
• 1 out of 20 is a high-cost Case
Objectives
• Identify High Cost Cases
• Rate of False Positives Unimportant
• Need to avoid False Negatives
Target
• 85% success identifying high-cost cases at Intake
Workers’ Compensation
Model Development
Initial Model
• Successfully identified 79% of Cases
• Used for Reporting
Subsequent Development
• Balanced Data for Training
Ensemble Model Development
• 3 predictions for each Claim: Intake, Allowance, Duration
• Separate RX / Non-RX (Prescription) models; 35% Trues
• Neural Network Prediction
• Mean Probability of each model; 10% Threshold
Claim Scoring
• Achieved 90% Success Rate
• Claim is scored anytime it is Updated
• Probability Score Always Stored
Worker Injury Claims
Tabular Data View

Cases:
| Case | Person ID | Injury Date | Open Date | Medical Date | Rx Date | Duration | Body Part Injured | Injury Type |
| 1 | 1004 | 3/17/2011 | 3/20/2011 | 3/22/2011 | 3/29/2011 | 8 | Shoulder | Sprain |
| 2 | 1001 | 2/5/2011 | 2/5/2011 | 2/5/2011 | 2/6/2011 | 75 | Ankle | Break |
| 3 | 1098 | 8/19/2011 | 8/30/2011 | 9/12/2011 | 9/15/2011 | 89 | Ribs | Bruise |

Persons:
| Person ID | Company ID | Job Title | Employment Date | Department | Birth Date | Gender | Height | Weight |
| 1001 | 1004 | Safety Technician | 1/23/2002 | Maintenance | 10/13/1972 | M | 5.9 | 175 |
| 1002 | 1098 | Assembly Supervisor | 2/25/2005 | Parts Dept | 8/21/1968 | F | 5.7 | 140 |
| 1003 | 1001 | Custodian | 12/2/2008 | Maintenance | 10/13/1981 | M | 6.1 | 190 |
More Terms and Things to Know
Balancing of the training dataset is done to ensure class values
occur in a similar proportion.
Scoring is the output of a model (the predicted value).
Model Disruption happens when an event occurs that causes a
model to no longer make accurate predictions.
Machine Learning is analogous to algorithmic predictive modeling.
Expert Systems are developed using expertise and rules, not
algorithms.
Unstructured data such as text and images can be modeled to
obtain information from the data.
Working Session
Group Model
A Predictive Model for Prospective Hires
Group Model
Developing our Predictive Model
1) Problem Statement
2) Gather Data
3) Explore Data
4) Prepare Data
5) Training / Test Sets
6) Choose Algorithms
7) Train
8) Test
Group Model
Problem Statement
• Should we hire a job candidate?
• What salary should we offer to a potential employee?
Group Model
Where’s the Data?
Group Model
Building the Model
Model Development
Exercise #1
New Project
- SSAS Multidimensional and Data Mining Project
Set Properties
- Server/Instance
- Add Data Source with appropriate Impersonation Mode
- Add Data Source View
Mining Structure
- Defines the data we will use
- Data must have a key
- Ensure data types and content types are correct for each column
- Column usage must be declared (Key, Input, Predict, PredictOnly)
Mining Model
- One created through wizard
- Create more mining models for each mining structure
Process and Run
- Builds Models
Model Development
Exercise #1
BI Studio (SQL Server Data Tools)
Visualization
• Viewer (different for each algorithm)
• Diagrams
• Mining Legend
• Mining Model Viewer
• Mining Accuracy Chart
o Lift Chart (Compare Models to random guesses and ideal model
and to each other)
o Classification Matrix (Confusion Matrix)
o Cross Validation (90-10)
• Mining Model Prediction
Model Development
Exercise #1
Predict class of Iris plant from sepal length and width, petal length and
width.
Dataset: iris table
Create: Mining Structure with 30% holdout
Add Mining Models:
• Decision Tree
• Neural Network
• Logistic Regression
• Cluster
Process / Test / Compare
Model Development
Exercise #1
Visualize and evaluate the quality of the models.
Dataset: iris table
Visuals
- Mining Model Viewer
- Mining Legend
Evaluate
- Classification Matrix
- Cross Validation
- Mining Model Prediction
Model Development
Exercise #1
Fine-tune the mining models by changing settings and parameters.
Dataset: iris table
Decision Tree
- Change Minimum Support and view Classification Matrix
- Explain Complexity Penalty for pruning and to prevent overfitting
Cluster
• Cluster Count
• Minimum Support
• Change cluster count to 3
• Add another cluster mining model with cluster count set to 0
• Add one more with cluster count = 0 and modeling cardinality = 50
Tools
What Analytics, Big Data, Data mining, Data Science software
have you used in the past 12 months for a real project?

| Software | 2013 | 2012 |
| Rapid-I RapidMiner/RapidAnalytics Free | 39.2% | 26.7% |
| R | 37.4% | 30.7% |
| Excel | 28.0% | 29.8% |
| Weka / Pentaho | 14.3% | 14.8% |
| Python | 13.3% | 14.9% |
| Rapid-I RapidAnalytics/RapidMiner Commercial | 12.0% | 12.0% |
| SAS | 10.7% | 12.7% |
| MATLAB | 9.9% | 10.0% |
| StatSoft Statistica | 9.0% | 14.0% |
| IBM SPSS Statistics | 8.7% | 7.8% |
| Microsoft SQL Server | 7.0% | 5.0% |
| Tableau | 6.3% | 4.4% |
| IBM SPSS Modeler | 6.1% | 6.8% |
| KNIME free edition | 5.9% | 21.8% |
| SAS Enterprise Miner | 5.9% | 5.8% |
Tools – Excel
Data Exploration and Preparation
• Explore Data (data profiling)
• Discretization of Continuous Data
• Identify\Exclude Outliers
• Re-label (rename\map columns)
• Random Sampling to reduce size of Training Set
• Over Sampling for Balancing
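Over-sampling for balancing (the last bullet above) simply duplicates minority-class rows until the classes occur in similar proportion. A sketch in plain Python; the 95/5 class split and the helper names are hypothetical:

```python
# Over-sample minority classes so every class reaches the size of the
# largest one. Row data is made up.
import random

def oversample(rows, label_of, seed=1):
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # duplicate random rows until this class reaches the target size
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced

rows = [("low", i) for i in range(95)] + [("high", i) for i in range(5)]
balanced = oversample(rows, label_of=lambda r: r[0])
print(len(balanced))   # 190 -- 95 of each class
```

Balance only the training set, never the test set, or the measured accuracy no longer reflects the real class distribution.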
Algorithm Differences
Unsupervised: Cluster Algorithm
Supervised: A class variable (prediction) is specified
Classification: Predict a discrete variable
Regression: Predict a continuous variable
Time Series: Date driven predictions
Association: Market Basket predictions
Lunch
1 Hour
Neural Network Algorithm
New Weight = Weight + Error * Input * Learning Constant
Multi-Layered…Feed Forward…Back Propagation
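The update rule above can be exercised on a single neuron. A sketch in plain Python (the session demo is in C#); the step activation, learning constant, and the toy AND dataset are my choices for illustration:

```python
# Apply the slide's rule -- new_weight = weight + error * input *
# learning_constant -- to a single neuron with a step activation,
# training on the logical-AND function (linearly separable).

def activate(weights, inputs):
    # inputs carry a constant 1.0 in front so weights[0] acts as the bias
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

def train(rows, learning_constant=0.1, epochs=20):
    weights = [0.0, 0.0, 0.0]          # bias, x1, x2
    for _ in range(epochs):
        for inputs, target in rows:
            error = target - activate(weights, inputs)
            weights = [w + error * x * learning_constant
                       for w, x in zip(weights, inputs)]
    return weights

rows = [([1.0, 0, 0], 0), ([1.0, 0, 1], 0), ([1.0, 1, 0], 0), ([1.0, 1, 1], 1)]
weights = train(rows)
print([activate(weights, inputs) for inputs, _ in rows])   # [0, 0, 0, 1]
```

A multi-layer feed-forward network applies the same idea per weight, with back propagation distributing each output error across the hidden layers.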
Neural Network Demo
Using C#
Vegetable Classification
If we know a vegetable is Orange, Green, Brown,
Yellow, Small, Large, then is it a Pumpkin, Carrot,
Zucchini, Yam, Potato?
Data Preparation
NFL Data – Game Average

| Team | G | Pts | Yds | Pass Yds | Rush Yds | 1stD | 3rdM | 3rdD% | 4thM | 4thD% | Pen | PYds | TOP |
| Denver Broncos | 3 | 42.3 | 486.7 | 374.7 | 112.0 | 26.0 | 21 | 53.8 | 0 | N/A | 26 | 233 | 29:53 |
| Green Bay Packers | 3 | 32.0 | 454.7 | 326.7 | 128.0 | 26.0 | 12 | 36.4 | 1 | 33.3 | 13 | 139 | 28:35 |
| Chicago Bears | 3 | 31.7 | 330.7 | 225.7 | 105.0 | 18.7 | 18 | 43.9 | 2 | 100.0 | 11 | 119 | 31:13 |
| Seattle Seahawks | 3 | 28.7 | 379.7 | 247.0 | 132.7 | 21.7 | 15 | 38.5 | 1 | 50.0 | 23 | 207 | 33:11 |
| Dallas Cowboys | 3 | 27.7 | 348.3 | 242.7 | 105.7 | 21.0 | 13 | 35.1 | 0 | N/A | 16 | 139 | 34:08 |
| Detroit Lions | 3 | 27.3 | 410.7 | 336.0 | 74.7 | 22.7 | 12 | 32.4 | 1 | 33.3 | 27 | 260 | 31:16 |
Using a Model
• SSRS
• SSIS
• .NET Framework
Common Mistakes
Lack of Relevant Data
- No history
- Data doesn’t predict the desired class value
- Poor quality data
Common Mistakes
Avoid Overfitting
- High quality training set
- Cross-validation
- Pruning
Increasing the Complexity
Penalty helps avoid Overfitting.
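Overfitting is easy to demonstrate with a deliberately bad model. A sketch in plain Python; the synthetic data, the "memorizer", and the majority-class stand-in for a pruned model are all my constructions:

```python
# Illustration of overfitting: a "memorizer" reflects the training data
# perfectly but cannot generalize, while a crude majority-class model
# (standing in for a pruned model) holds up on the test set.
# Synthetic data: labels are "Y" about 70% of the time, independent of x.
import random

rng = random.Random(0)
data = [((rng.random(),), "Y" if rng.random() < 0.7 else "N") for _ in range(200)]
train, test = data[:100], data[100:]

memorized = {x: label for x, label in train}   # one "rule" per training row

def overfit_model(x):
    return memorized.get(x, "N")               # unseen rows get a default

def pruned_model(x):
    return "Y"                                 # just predicts the majority class

def accuracy(model, rows):
    return sum(model(x) == label for x, label in rows) / len(rows)

print(accuracy(overfit_model, train))   # 1.0 -- perfect on its own training data
print(accuracy(overfit_model, test))    # poor -- no test key was ever memorized
print(accuracy(pruned_model, test))     # ~0.7 -- generalizes despite its crudeness
```

A higher Complexity Penalty pushes a decision tree away from the memorizer end of this spectrum by pruning branches that fit too few instances.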
Model Development
Exercise #2
Training Data Set: CollegePlans table
Test Data Set: CollegePlansTest table
Predict whether a student will attend college based on demographic
and family characteristics. When complete, turn it into a new problem
by predicting a continuous value.
Purpose: Learn how to work with a test data set.
Create Mining Structure with 0% holdout
Add mining model using a Neural Network
Process / Test
Model Development
Exercise #2
Test by making mining model predictions using DMX and test dataset.
Use Predict and Predict Probability functions in DMX.
Change Prediction: Add additional Neural Network mining model
predicting IQ.
Test continuous prediction of IQ.
Add mining models using Decision Trees and compare to each other and
to NN.
Competitive Tools
RapidMiner is gaining popularity as a full-featured
GUI to well-known algorithms.
WEKA is frequently used in academia.
R is a programming language with a large library of
statistical and predictive modeling algorithms.
Text Mining
What about text mining?
• Word List
• Document Categorization
• Novelty
• Sentiment
Quiz
What is the purpose of the training set?
What is the difference between Supervised and
Unsupervised learning?
What are some techniques to help avoid Overfitting?
Model Development
Exercise #3
Training Data Set: churn table
Test Data Set: churn_test table
Predict whether a phone customer will cancel their service.
Purpose: Learn how to discretize and balance input data.
Create Mining Structure with 0% holdout.
Add mining models: Decision Tree – Cluster – Logistic Regression
Process
Model Development
Exercise #3
Evaluate using the classification matrix with the training dataset. Note
the results and the percent correct each model produced.
Evaluate using Lift Chart.
Evaluate with Profit Chart by entering the following parameters:
• Population: 10,000
• Fixed Cost: 100,000
• Individual Cost: 50
• Revenue per Individual: 500
Repeat evaluations using the test data set with the settings above.
Balance the training data set. Re-create new mining structure, process
and test the models again.
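The profit chart parameters above reduce to simple arithmetic. A sketch in plain Python using the exercise's numbers; the contact and responder counts in the comparison are hypothetical model output, not exercise results:

```python
# Campaign profit with the exercise's parameters: fixed cost 100,000;
# 50 per individual contacted; 500 revenue per responder.

FIXED_COST = 100_000
COST_PER_CONTACT = 50
REVENUE_PER_RESPONDER = 500

def profit(contacted, responders):
    return (responders * REVENUE_PER_RESPONDER
            - contacted * COST_PER_CONTACT
            - FIXED_COST)

# Contact the whole population of 10,000 (1,000 responders assumed)
# versus contacting only the model's top-scored 4,000, assuming the
# model concentrates 900 of those responders there.
print(profit(contacted=10_000, responders=1_000))  # -100000
print(profit(contacted=4_000, responders=900))     # 150000
```

This is exactly what the Profit Chart plots: profit as a function of how deep into the model's ranked list you market.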
Deploy - Monitor - Use
The predictive models are in an Analysis Services database, so
deployment is similar to cube deployment.
Deploy to Server
1) Project – Properties – Deployment – Server
2) Build – Deploy
Backup\Restore Analysis Services Database
Execute XMLA Script
Deploy - Monitor - Use
Predictive models will degrade over time, and may be completely
disrupted by expected or unexpected events.
Track
When a prediction is made, save the input and output to a
database table.
Actuals
When the actual result is known, store it with the prediction.
Mean Error
Report the mean error periodically.
Replace\Refactor the model when the error is worse by a
statistically significant number or a negative trend appears.
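The track/actuals/mean-error loop above can be sketched in a few lines. Plain Python; an in-memory list stands in for the database table, and the event numbers are made up:

```python
# Log each prediction with its inputs, attach the actual result once it
# is known, then report the mean (percentage) error over scored rows.

predictions_log = []   # stand-in for a database table

def track(inputs, predicted):
    predictions_log.append({"inputs": inputs, "predicted": predicted, "actual": None})
    return len(predictions_log) - 1        # row id, used to attach the actual later

def record_actual(row_id, actual):
    predictions_log[row_id]["actual"] = actual

def mean_error():
    scored = [r for r in predictions_log if r["actual"] is not None]
    return sum(abs(r["predicted"] - r["actual"]) / r["actual"] for r in scored) / len(scored)

rid = track({"event_id": 1}, predicted=1158)
record_actual(rid, actual=1225)
rid = track({"event_id": 2}, predicted=900)
record_actual(rid, actual=880)
print(round(mean_error(), 3))   # 0.039
```

Running this report periodically is what exposes the degradation or disruption described above before it hurts downstream decisions.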
Deploy - Monitor - Use
Exercise: Use one of the models you’ve developed in a SSRS
report, an SSIS package, or in .NET code.
SSRS
Connect to your Analysis Services database and write a query
using DMX.
SSIS
Use a Data Mining Query data flow task to make predictions as
data is flowing through an ETL job.
.NET
Use ADOMD (Microsoft.AnalysisServices.AdomdClient) to query
the predictive model using a dynamic DMX query.
Model Development
Exercise #4
Training Data Set: movies, actors, tech tables
Test Data Set: none
Predict which movie to recommend based on a customer’s
demographic info, movie buying history, and favorite actors.
Purpose: Learn how to use the association algorithm and nested data.
Create Mining Structure with 0% holdout.
Add mining models: Association
Process / Evaluate
Model Development
More Labs
Predict auto characteristics
auto db table
Predict political affiliation
VotingRecords db table
Make your own model
BYOD
Summary
• Data Science is a changing field that is evolving very fast
• New Skill Sets are required on top of old ones
Thank You / Questions
Carlos Bossy
Quanta Intelligence
Email: [email protected]
Blog: www.carlosbossy.com
Twitter: @carlosbossy
