Extending and Customizing IBM SPSS Statistics with R, Python, and .NET
  Jon Peck, Senior Software Engineer, IBM
  [email protected]
  November, 2010
  Business Analytics software
  © 2010 IBM Corporation

IBM SPSS Statistics
  IBM® SPSS® Statistics has an extensive command language (syntax) for data acquisition, manipulation, and statistical and graphical procedures
  Programmability and scripting dramatically extend these built-in capabilities
  They allow custom user interfaces and output to be produced
  Converting large SAS applications is likely to require the use of programmability

Agenda
  Programmability introduction
  Four examples
    – Automating repetitive work: applySyntaxToFiles
    – Integrating programs and scripting: SPSSINC MODIFY TABLES
    – Adding a procedure from R: SPSSINC QUANTREG
    – Adding a procedure in Python: SPSSINC TURF

Programmability increases your power, flexibility, and productivity
  Generalization
    – React flexibly to metadata, results, and the environment
    – Benefit: Write fewer similar jobs
  Automation
    – Embed program logic in jobs
    – Benefit: Less manual work
  Extension
    – Tap existing R or Python statistical modules
    – Add your own or extend standard procedures and transformations
    – Benefit: More capabilities
  Integration
    – Connect IBM SPSS Statistics inputs and outputs to other agents
    – Benefit: Make IBM SPSS Statistics part of a larger production process
  More productivity and more fun

IBM SPSS Statistics embeds three programming languages
  Plug-ins let you extend capabilities using
    – Python
    – R
    – .NET languages (Windows only)
  Free plug-in downloads
  SPSS Developer Central web site provides articles, SPSS-written modules, plug-ins, and user contributions
    – New SPSS Community on IBM myDeveloperWorks

My first Python program
    GET FILE="c:/data/important.sav".
    BEGIN PROGRAM PYTHON.
    import spss
    print("Hello, IBM")
    END PROGRAM.
    DESCRIPTIVES ....
  Python or R program code goes in the normal Statistics syntax window

Programmability combines SPSS Statistics with Python, R, or .NET
  A program in the input stream can communicate with IBM SPSS Statistics, control it, and use Python or R facilities and modules (internal mode)
    spss.Submit("GET FILE='c:/data/cars.sav'.")
  A Python or .NET application can embed IBM SPSS Statistics inside itself (external mode)
    – User interface does not appear
  There is a lower-level C API available in an SDK

Programmability functionality is fully integrated into IBM SPSS Statistics
  Programs run in the regular syntax stream
  Users can define IBM SPSS Statistics syntax for programs and scripts via the Extension mechanism
  Users can create dialog boxes and menus using the Custom Dialog Builder
    – Not just for extensions or programs
  Python and R output appears in the Viewer
    – plain text
    – pivot tables
    – charts

Python and R Programmability APIs cover these areas
  State information of Statistics
  Get/Set variable dictionary information
  Get/Set data
  Get Viewer output (via xmlworkspace)
  Create tables/charts/text objects in Viewer
  Run Statistics commands (Python only)

Python and VB scripting APIs cover user interface and output
  Programmability is a backend (SPSS Processor) domain
  Scripting is mainly a frontend (user interface, including output) domain
  Managing output Viewer and objects
    – tables: formatting, pivoting, editing, …
    – objects: visibility, order, titles, outline text, …
  General user interface control
  Almost anything you can do via the user interface
  Not available for R

.NET plug-in embeds Statistics inside another program
  Example: Statistical Explorer
  Statistics, graphs, and data management via Statistics
  Two pages of VB.NET code

Python and R are open source software
  Programmability plug-ins are an optional installation
    – They are free (but require a Statistics license)
    – They make it possible to tap the work of the Python and R communities
    – Python and R have license agreements
    – IBM non-warranty license agreement
    – For R, GPL license

Extension commands eliminate the need for users to learn Python or R
  The Extension mechanism lets you define IBM SPSS Statistics-style syntax for programs
  IBM SPSS Statistics takes care of validation and parsing
  It passes user input to a program in an easy-to-digest form
  Extension commands are loaded automatically when IBM SPSS Statistics starts
    – They look to the user like built-in commands
  Easy to distribute to others

Some statistical extensions on Dev Central
  Extension Name                    Description
  PLS                               Partial least squares (P)
  PROPOR                            Confidence intervals for proportions (P)
  SPSSINC APRIORI                   Association rules (R)
  SPSSINC BREUSCH PAGAN             Residual heteroscedasticity tests (R)
  SPSSINC HETCOR                    Polychoric and polyserial correlation (P+R)
  SPSSINC MFP GLM                   Fractional polynomial generalized linear models (R)
  SPSSINC QQPLOT2                   Empirical Q-Q plots (R)
  SPSSINC QUANTREG                  Quantile regression (R)
  SPSSINC RAKE                      Adjust weights to control totals (P)
  SPSSINC RANFOR & SPSSINC RANPRED  Random forests (R)
  SPSSINC RASCH                     Rasch models (R)
  SPSSINC ROBUST REGR               Robust regression (R)
  SPSSINC TOBIT REGR                Tobit regression (R)
  SPSSINC TURF                      TURF analysis (P)

Some non-statistical extensions on Dev Central
  Extension Name                    Description
  FUZZY                             Case-control exact and approximate matching (P)
  GATHERMD                          Gather data file metadata (P)
  HIDECOLS                          Hide pivot table columns (P)
  SCRIPTEX                          SCRIPT commands with parameters (P)
  SETSMACRO                         Syntax for using variable sets (P)
  SPSSINC ANON                      Anonymize data (P)
  SPSSINC COMPARE DATASETS          Compare two sav files (P)
  SPSSINC CREATE DUMMIES            Create dummy variables for categories (P)
  SPSSINC GETURI DATA               Read data from the Internet (P)
  SPSSINC MERGE TABLES              Merge two pivot tables (P)
  SPSSINC MODIFY OUTPUT             Set Viewer outline titling and styling (P)
  SPSSINC MODIFY TABLES             Set pivot table cell and label styling (P)
  SPSSINC TRANS                     Apply Python functions to cases (P)
  SPSSINC TRANSLATE                 Translate Viewer output (P)
  TEXT                              Create block of text in Viewer (P)

You can create and share your own additions to IBM SPSS Statistics
    – Write Python or R functions to implement the functionality or tap existing packages
      • Use input APIs to get data to Python or R
      • Use output APIs to create pivot tables
      • Each can be a single line of code
    – For extensions
      • Define the syntax in an XML file
      • Use tools in extension.py (Python) or spsspkg (R) to receive parsed output and pass it to the implementing function
      • New in v18: R version of extension.py
    – Use the Custom Dialog Builder to create the interface
      • The CDB is not just for extensions
    – Test and document!
    – Package and distribute
    – Contributions to Developer Central are welcome
  Documentation is at SPSS Developer Central

Extension commands: validation and mapping from syntax to Python or R function parameters are handled for you
  Example: SPSSINC BREUSCH PAGAN – implemented using an R package
  SPSSINC_BREUSCH_PAGAN.xml specifies the syntax to the Statistics parser
  The R mapping code in SPSSINC_BREUSCH_PAGAN.R respecifies the syntax and invokes the executing routine with parsed parameters
    – overlaps with the XML syntax definition but provides additional features
    SPSSINC BREUSCH PAGAN DEPENDENT = salary ENTER = educ jobcat
      /OPTIONS MISSING=LISTWISE
      /SAVE RESIDUALSDATASET=resids COEFSDATASET=coefs.
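
The mapping from parsed syntax to function parameters can be pictured with a small, self-contained Python sketch. It is an illustration only: the `TEMPLATES` table, the `to_kwargs` helper, the stand-in `breusch_pagan` function, and the shape of the `parsed` dictionary are all hypothetical and are not the extension.py or spsspkg API.

```python
# Hypothetical table: (subcommand, keyword) -> (parameter name, takes a list?)
# "" stands for the unnamed leading subcommand. None of this is extension.py.
TEMPLATES = {
    ("", "DEPENDENT"): ("dependent", False),
    ("", "ENTER"): ("enter", True),
    ("OPTIONS", "MISSING"): ("missing", False),
    ("SAVE", "RESIDUALSDATASET"): ("residualsdataset", False),
    ("SAVE", "COEFSDATASET"): ("coefsdataset", False),
}


def to_kwargs(parsed):
    """Turn {subcommand: {keyword: [values]}} into keyword arguments."""
    kwargs = {}
    for subcommand, keywords in parsed.items():
        for keyword, values in keywords.items():
            if (subcommand, keyword) not in TEMPLATES:
                raise ValueError("unexpected keyword: %s %s" % (subcommand, keyword))
            name, is_list = TEMPLATES[(subcommand, keyword)]
            kwargs[name] = list(values) if is_list else values[0]
    return kwargs


def breusch_pagan(dependent, enter, missing="LISTWISE",
                  residualsdataset=None, coefsdataset=None):
    """Stand-in for the implementing routine: it only echoes its arguments."""
    return {"dependent": dependent, "enter": enter, "missing": missing,
            "residualsdataset": residualsdataset, "coefsdataset": coefsdataset}


# Assumed shape of the parser's result for the command shown above.
parsed = {
    "": {"DEPENDENT": ["salary"], "ENTER": ["educ", "jobcat"]},
    "OPTIONS": {"MISSING": ["LISTWISE"]},
    "SAVE": {"RESIDUALSDATASET": ["resids"], "COEFSDATASET": ["coefs"]},
}
result = breusch_pagan(**to_kwargs(parsed))
print(result)
```

The point of the design is that the implementing function sees ordinary named arguments and never has to parse command text itself.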

An XML file defines the syntax to the SPSS Universal Parser

Python or, in this case, R code gets the parsed syntax, which is turned into function arguments

Expand the audience by creating IBM SPSS Statistics syntax and dialog boxes

Example I: Generalize and automate work
  You have syntax files and need to process datasets not known in advance every day
  applySyntaxToFiles function applies a syntax file to each file in input specification

Use programmability to automate routine processes
  Apply standard processing to an unknown set of files
  Produce processed data and reports

Use a program to drive processing
    begin program.
    import spss, spssaux3
    spssaux3.applySyntaxToFiles(inputspec="c:/temp/parts/*.sav",
        syntax="c:/myjobs/dailychecks.sps",
        outputdatadir="c:/temp/processed",
        outputfiledir="c:/temp/processed",
        logfile="c:/temp/processed/report.txt")
    end program.
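
As a rough picture of what a driver like this has to do, the sketch below expands a wildcard and builds one batch of commands per data file. It is not the spssaux3 implementation: `commands_for` and `plan_jobs` are hypothetical helpers, and the choice of GET, INSERT, and SAVE as the per-file commands is an assumption. It only builds command strings, so it runs without IBM SPSS Statistics installed.

```python
import glob
import posixpath


def commands_for(datafile, syntax, outputdatadir):
    """Build the command batch for one data file (hypothetical helper)."""
    datafile = datafile.replace("\\", "/")  # keep forward slashes in the syntax
    target = posixpath.join(outputdatadir, posixpath.basename(datafile))
    return [
        "GET FILE='%s'." % datafile,      # open the data file
        "INSERT FILE='%s'." % syntax,     # run the standard syntax against it
        "SAVE OUTFILE='%s'." % target,    # write the processed copy
    ]


def plan_jobs(inputspec, syntax, outputdatadir):
    """Expand the wildcard and return one command batch per matching file."""
    return [commands_for(path, syntax, outputdatadir)
            for path in sorted(glob.glob(inputspec))]


# Paths are taken from the slide; on a machine without them nothing is printed.
for job in plan_jobs("c:/temp/parts/*.sav",
                     "c:/myjobs/dailychecks.sps",
                     "c:/temp/processed"):
    # Inside a BEGIN PROGRAM block, each batch could be passed to spss.Submit.
    print("\n".join(job))
```

Because the file list is resolved at run time, the same job handles whatever files happen to be present that day.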
  dailychecks.sps could apply data cleaning rules, modify data, and create reports
  Could be run daily through Production Mode or the C&DS job scheduler, or used interactively
  Extended version available as SPSSINC PROCESS FILES

Example II: Automate dynamic or static formatting of tables
  Use integrated scripting for better table presentation

SPSSINC MODIFY TABLES extension command manipulates table formatting and structure
  • TableLooks provide static formatting for entire areas of a table
    – data cells
    – row and column layers
  • You want tables with formatting beyond TableLooks
  • Many users copy tables to Excel and manually format them
  • Basic and Python Scripting provide a programmatic way to do formatting
  • SPSSINC MODIFY TABLES provides syntax for extensive formatting
    – Eliminates need to know scripting
    – Uses Extension mechanism for programs and Python scripting

Use dynamic highlighting to make a crosstab table easier to read
    SPSSINC MODIFY TABLES SUBTYPE='Crosstabulation' DIMENSION=ROWS
      SELECT='Std. Residual'
      /STYLES TEXTSTYLE=BOLD BACKGROUNDCOLOR=255 0 0 APPLYTO='abs(x) >2'.

Custom dialog boxes are easy to create
  Dialog created with Custom Dialog Builder
  Generates extension command syntax
  Easy to distribute

Use static formatting to call out parts of a table
    SPSSINC MODIFY TABLES subtype='variables in the equation' SELECT="B" "Sig."
      /STYLES TEXTCOLOR = 0 0 255 BACKGROUNDCOLOR=0 255 0.
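
The dynamic case, where a condition such as 'abs(x) >2' decides which cells are styled, can be sketched in plain Python. This shows only the idea of testing each cell value against an expression in x. The `matching_cells` helper and the residual values are made up for illustration, and nothing here reflects how SPSSINC MODIFY TABLES is implemented internally.

```python
def matching_cells(table, condition):
    """Return (row, column) positions whose value satisfies the condition.

    `condition` is an expression in x, for example "abs(x) > 2".
    Empty cells (None) are skipped.
    """
    code = compile(condition, "<condition>", "eval")
    allowed = {"__builtins__": {}, "abs": abs}  # expose only what the test needs
    hits = []
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            if x is not None and eval(code, allowed, {"x": x}):
                hits.append((i, j))
    return hits


# Made-up standardized residuals for a 3 x 3 crosstab.
residuals = [
    [0.4, -2.3, 1.1],
    [2.6, -0.2, None],
    [-1.9, 2.0, -3.1],
]
hits = matching_cells(residuals, "abs(x) > 2")
print(hits)
```

A formatting routine would then apply the bold and background styles only at the returned positions, which is what makes the highlighting follow the data.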

Format CTABLES totals to call them out
    SPSSINC MODIFY TABLES SUBTYPE="Custom Table" SELECT = "Total" DIMENSION=ROWS
      /STYLES BACKGROUNDCOLOR=255 255 88 TEXTSTYLE = BOLD.

Use custom functions for special effects
    SPSSINC MODIFY TABLES SUBTYPE='Report' SELECT="<<ALL>>"
      /STYLES APPLYTO=DATACELLS TEXTCOLOR=255 255 255 TEXTSTYLE=BOLD
      CUSTOMFUNCTION="customstylefunctions.washColumnsBlue".

    def washColumnsBlue(obj, i, j, numrows, numcols, section, more):
        # Shade cell (i, j) so that columns get progressively bluer, left to right.
        mincolor = 150.
        maxcolor = 255.
        # max(..., 1) avoids division by zero for a one-column table
        increment = (maxcolor - mincolor) / max(numcols - 1, 1)
        colorvalue = round(mincolor + increment * j)
        obj.SetBackgroundColorAt(i, j, RGB((mincolor, mincolor, colorvalue)))

It is possible to get carried away with this

Example III: Extend IBM SPSS Statistics by tapping the work of the R and Python communities
  Add R procedures seamlessly to IBM SPSS Statistics

R
  R is a programming language for statistics
    – leading edge statistics
    – many contributed statistics and graphics packages
    – free
  R is not so easy to learn
    – Documentation by experts for experts
    – Feels like a complex programming language – because it is
    – Syntax is a lot like C
    – Error in optim(rho, f, control = control, hessian = TRUE, method = "BFGS") : initial value in 'vmmin' is not finite
      • Good for programmers(?); bad for users
  R holds data in memory
  R for SAS and SPSS Users, Bob Muenchen, AddisonWesley, 2008

R procedures can be accessed from IBM SPSS Statistics using the R plug-in
  The R plug-in makes it easy to use R packages
    – IBM SPSS Statistics datasets and Viewer output can be processed by R using the plug-in
    – Graphical, text, and table output appear in the Viewer
      • Pivot tables can be created with R code
    – New IBM SPSS Statistics datasets can be created from R
    – R communicates with IBM SPSS Statistics via APIs in the plug-in
    – Integration requires writing a little R wrapper code
    – IBM SPSS Statistics can provide
      • dialog box interface
      • IBM SPSS Statistics-style syntax
      • pivot table output
  Plug-in is downloadable from Developer Central

Quantile regression models conditional quantiles
  Ordinary regression models the conditional mean
  Median regression is the 0.5 quantile (50th percentile)
  Estimating quantiles is useful with varying spread, asymmetries, outliers
  Areas of application include
    – empirical finance
      • value at risk
      • mutual fund investment styles
      • credit scoring
    – school quality
    – demand analysis
    – others

SPSSINC QUANTREG extension embeds R quantreg package

Pivot tables and plots appear in the Viewer

New datasets appear in Data Editor windows

Example IV: Extend IBM SPSS Statistics by adding procedures in Python
  TURF analysis

TURF Analysis is popular in market research
  Total Unduplicated Reach and Frequency (TURF)
  Find the highest coverage of positive responses for a small number of questions
  Example: How do you reach the largest audience by advertising on a few kinds of sports?
    • football, cricket, basketball, cycling, ...
  Example: What ice cream flavors should you offer in your shops that have three dispensing machines?
  Example: What phone features should you promote?
    – multi-line, voicemail, paging, internet ...
  Simple FREQUENCIES does not account for overlap

TURF calculations are demanding
  Must compute all possible set unions of positive responses (up to a maximum number of variables)
  Each set is a list of case IDs with a positive response on a question
  This problem is computationally explosive
  Calculations for best 10 combinations of variables
    Variables    Set Union Calculations
    3            4
    6            57
    12           4,070
    24           4,540,361
    48           8,682,997,422
  Is a scripting language like Python too slow?

Extension command SPSSINC TURF is implemented in Python
  Provides
    – Dialog box interface
    – IBM SPSS Statistics-style syntax
    – The computations
    – Pivot table output
  Fewer than 300 lines of Python code
    – Plus dialog box definition
    – Plus extension command syntax definition
  Executes requests involving a few million set comparisons in a few minutes
  Initial version written in two days

Analysis of phone data
  Telco survey (9 variables, 1000 cases)
  Dialog created with Custom Dialog Builder

Results show the combinations of features with the best reach
  Pivot table created from Python code
  Best singles are conference calling, call forwarding, and call waiting

The best three are not the top three one at a time
  Calculations completed in a few seconds

Where we have been today
  Python and R integration
  Unification of programs and scripts
  Custom Dialog Builder
  Extensions
  SPSS Developer Central is your friend

Questions?

Programmability increases your power, flexibility, and productivity with IBM SPSS Statistics
  Generalization and automation
    – applySyntaxToFiles
    – SPSSINC MODIFY TABLES
  Extension
    – SPSSINC QUANTREG using R
    – SPSSINC TURF using Python
    – Many new extension commands available
  Integration
    – applySyntaxToFiles as part of a process
  And it's still more fun

Contact
  Jon K Peck, Ph. D.
  Senior Software Engineer
  IBM SPSS
  [email protected]
  blog: insideout.spss.com
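
The TURF calculation described in Example IV can be illustrated with a short Python sketch. Two things here are assumptions rather than statements from the slides. First, the set-union counts quoted for 3 to 48 variables happen to equal the number of variable combinations of size 2 up to 10, which is how `union_count` computes them. Second, `best_reach` is a naive brute-force search over made-up response data, not the SPSSINC TURF code.

```python
from itertools import combinations
from math import comb


def union_count(n_vars, max_size=10):
    """Number of combinations of 2..max_size variables out of n_vars."""
    return sum(comb(n_vars, k) for k in range(2, min(n_vars, max_size) + 1))


def reach(responses, combo):
    """Number of distinct cases with a positive response on any variable in combo."""
    return len(set().union(*(responses[v] for v in combo)))


def best_reach(responses, size):
    """Brute force: try every combination of `size` variables, keep the best."""
    best = max(combinations(sorted(responses), size),
               key=lambda combo: reach(responses, combo))
    return best, reach(responses, best)


# Made-up data: each feature maps to the set of case IDs that responded positively.
responses = {
    "voicemail": {1, 2, 3, 4, 5},
    "paging": {1, 2, 3, 4},
    "internet": {6, 7, 8},
}

counts = [union_count(n) for n in (3, 6, 12, 24, 48)]
pair, covered = best_reach(responses, 2)
print(counts)
print(pair, covered)
```

In the made-up data the two most popular single features overlap almost completely, so the best pair combines the most popular feature with the least popular one. This is the same effect the results slide describes for the best three features.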