Using the *R* Actor in Kepler for quality control

John Porter, University of Virginia
R Basics
 R is an open source statistical language
 “Atomic” types: logical, integer, real, complex,
string (or character) and raw
 Data in R is stored in one of several types of objects
 Scalar : myVar <- 10
 Vectors: myVec <- c(10,20,30)
 Lists: myList <- list(10,"E",12.3)
 Matrix: myMat <- cbind(myVec1,myVec2)
 Data Frames: myDf <- data.frame(myVec1,myVec2)
 Factors: myFac <- as.factor(unlist(myList))
R Workspaces
 All the variables and functions defined during
a session are part of the “Workspace”
 R Workspaces can be saved for later use
 When you come back, everything is the same as
when the workspace was saved
Most Commonly Used Object
 Vectors – contain a single column of one of
the “atomic” types
 Often created using the concatenate function
myVec <- c(10,20,30)
 Individual elements can be accessed using an
index in square brackets: myVec[2] is 20
Data Frames
 Data Frames – table-style objects that
contain named vectors inside them
myDF$RAIN refers to the “RAIN”
vector, as does myDF[ ,2]
myDF[135,3] is 121.8
Reading Data into Data Frames
 A common way of creating data frames is to
read in a comma-separated-value (csv) file
myDf <- read.csv("C:/ft_monro.csv",header=TRUE)
Note, regardless of operating system, R wants “/” – not “\”
Sample R Program for QA/QC
# Select the Data File
infile1 <- file("C:/downloads/ft_monroe.csv", open="r")
# Read the data
# skip=1 drops the header line, so header=FALSE; col.names supplies the names
dataTable1 <- read.csv(infile1, header=FALSE, skip=1, sep=",",
quote='"', col.names=c("YEAR", "RAIN", "RAIN_CM", "NOTES"),
check.names=TRUE)
# Run basic summary statistics
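A minimal, self-contained sketch of the read-and-summarize steps. It does not use the real ft_monroe.csv: the header and the three data rows below are made-up placeholders, fed through a text connection in place of the file, and summary() is assumed to be the kind of summary statistics meant here.

```r
# Hypothetical stand-in for the CSV file: one header line plus made-up rows
csvText <- c('"Year","Rain (in)","Rain (cm)","Notes"',
             '1990,40.1,101.9,""',
             '1991,55.3,140.5,"wet year"',
             '1992,38.7,98.3,""')
infile1 <- textConnection(csvText)

# skip=1 drops the header line, so header=FALSE; col.names supplies the names
dataTable1 <- read.csv(infile1, header = FALSE, skip = 1, sep = ",",
                       quote = '"',
                       col.names = c("YEAR", "RAIN", "RAIN_CM", "NOTES"),
                       check.names = TRUE)
close(infile1)

# Run basic summary statistics on every column
print(summary(dataTable1))
```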
Quick Exercise – Run these in R
# anything after a # sign on a line is just a COMMENT - it
# won't do anything
varA <- 10 # sets up a vector with one element containing a 10
varA # listing an object's name prints out the values
varB <- c(10,20,30) # sets up a vector with 3 elements. c() is
# the concatenation function
varB[2] # now let's display ONLY the second element
# now let's do some math!
mySumAB <- varA + varB # adding them together.
# Note there is only 1 value in varA
mySumAB # note the single value in varA repeated in the addition
R Data Structures
 A lot of the “magic” in R is because of the
object-oriented approach used
 R objects contain a lot more than just the
data values
 A command that does one thing to a scalar
(single value) does something else with a
vector (a list of values) – all because R
functions “understand” the difference!
 Conversions are possible between different
modes or types of objects using conversion
functions
 as.numeric(varA)
 makes varA a number – if it can!
 as.integer( )
 as.character( )
 as.factor()
 as.matrix()
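A short sketch of what the conversion functions do, including the "if it can!" case. All values are illustrative; HOGI is the code used later in these slides and PIMA is a made-up second code.

```r
# Illustrative values only; none of these come from the workshop data
varA <- "10"                       # a character string that looks like a number
numA <- as.numeric(varA)           # converts cleanly to the number 10

# A value that cannot be converted becomes NA (R also issues a warning)
badA <- suppressWarnings(as.numeric("ten"))

intA <- as.integer(12.9)           # drops the fractional part
chrA <- as.character(12.3)         # number to character string
facA <- as.factor(c("HOGI", "HOGI", "PIMA"))   # categorical data with two levels

print(numA)
print(is.na(badA))
print(intA)
print(chrA)
print(levels(facA))
```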
Using Data Frames
A <- c(10,20,30)
B <- c(4,6,3)
C <- c('A','B','C') # put letters in quotes
Df <-data.frame(C,A,B)
Df # list whole data frame
Df$A # list the A vector
Df[,3] # list the 3rd vector (B)
Df[1,] # list all columns for row 1
Df[Df$A > 10,] # list rows where A>10
Data Frames
 Results of Data
Frame manipulations
R Help
R has a number of ways of calling up help
 ??sqrt - does a “fuzzy” search for functions
like “sqrt”
 ?sqrt – does an exact search for the function
sqrt() and displays documentation
 There are also manuals and extensive on-line
tutorials (but Google is frequently the best
way to find help)
R & Kepler
 Kepler uses the “RExpression Actor” to run R
code from inside Kepler
 Typically run with an SDF Director with a
single iteration for most analyses
 You only need them done once!
 Don’t forget to set the iteration count – the
default is to loop forever!
The default RExpression has no inputs
and two outputs:
graphicsFileName & output
Typical connections for basic
RExpression Actor
Adding Ports
 To make RExpression actors really useful, it is
helpful to be able to have them
intercommunicate with other Kepler actors
beyond simply listing output or showing graphs
 To allow this intercommunication we need to
add additional Input and Output ports
 The names of the ports will automatically be
connected to objects with the same name in the R code
Hook up some input and output actors
R Program to Test
Remember – names of
ports translate into names
of objects in R
Results of Running Workflow
R Listing Output
R for Checking EML Data
But there are some TRICKS you should know!
Trick 1 – select the right
object type for the EML actor
 By Default the EML Actor only connects to
the output ports the FIRST LINE OF DATA “as
 If you want to have an output port represent
the data as a VECTOR you need to select “As
Column Vector”
 If you want to get a Data Frame instead of
individual columns, you need to select “As
ColumnBased Record”
Setting Data Output Format
in EML actor
Trick 2 – Trap R errors
 Normally if there is a problem with your R
program you get a cryptic message from
try() and geterrmessage() in
Runs the “errorplot()”* function and
reports any error messages that occur
when you run it
* There is no “errorplot()” function in R
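A minimal sketch of the error-trapping pattern using try() and geterrmessage(). The wording of the printed message is an assumption for illustration; errorplot() is left undefined on purpose, as on the slide.

```r
# errorplot() is deliberately undefined, as in the slide
result <- try(errorplot(), silent = TRUE)

# try() returns an object of class "try-error" when the call fails
if (inherits(result, "try-error")) {
  errText <- geterrmessage()       # text of the most recent R error
  print(paste("R reported:", errText))
} else {
  print("No errors")
}
```

Because the error is caught, the rest of the R code keeps running and the message can be sent to an output port instead of stopping the workflow.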
Now we get an informative error message
Correct the command and see
the output
QA/QC – Quality Assurance
and Quality Control
 Error types
 Errors of Commission – data contains wrong values
 Errors of Omission – data that should be there is missing
 We will mostly be talking today about errors
of commission
Porter’s Rule of Data
 There is no non-trivial dataset that
does not contain some errors
 Goal of QA/QC: reduce errors to the
maximum possible extent, or at least to the
level that they don’t adversely affect the
conclusions reached through analysis of the data
QA/QC – Possible Tests
 Identification and removal of duplicates
 Correct Domain
 Numerical Range (e.g., -20 < Temperature < 50)
 Correct Codes (e.g., HOGI, not HOG1)
 Graphs
 Time-series plots
 Plots between variables
 Detection of “spikes” in time series
 Customized criteria (e.g., month specific
range checks)
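Several of the tests listed above can be sketched in a few lines of R. The data, the column names TEMP and SPECIES, the list of valid codes, and the spike threshold of 10 are all assumptions made up for this illustration.

```r
# Made-up example data; the column names TEMP and SPECIES are assumptions
obs <- data.frame(
  TEMP    = c(12.1, 12.4, 12.4, 75.0, 12.9, 13.1),
  SPECIES = c("HOGI", "HOGI", "HOGI", "HOGI", "HOG1", "HOGI"),
  stringsAsFactors = FALSE
)

# Duplicates: rows that exactly repeat an earlier row
dupRows <- obs[duplicated(obs), ]

# Numerical range: temperatures outside -20 < Temperature < 50
rangeRows <- obs[obs$TEMP <= -20 | obs$TEMP >= 50, ]

# Correct codes: anything not in the list of valid codes
validCodes <- c("HOGI")
codeRows <- obs[!(obs$SPECIES %in% validCodes), ]

# Spikes: positions where the value jumps by more than 10 from the previous
# one (this flags both the jump up to a spike and the jump back down)
spikeAt <- which(abs(diff(obs$TEMP)) > 10) + 1

print(dupRows)
print(rangeRows)
print(codeRows)
print(spikeAt)
```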
Exercise – A succession of
workflows for QA
 Open your Virtual Machine
 Open a Web Browser and go to:
 Open the file
 Extract All Files to directory C:\
 You should then have a C:\localData
directory containing the files for this
A dead-simple workflow
Kepler Stuff to Note
 Annotations allow you to add titles and other
useful instructions to your workflow display
Kepler Stuff to Note
 Parameters let you easily show and change
values that will be used elsewhere in the workflow
Kepler Parameters
 Customize Name lets you set the NAME of
the parameter and what should display on
the screen
 Remember the name – that is how you will
refer to the parameter later.
Using a Parameter Value
 Add a $ to the front of a parameter in a
Kepler settings box to insert the value of the
parameter – so the Data File: is
Brief Exercise
 Experiment with editing connections in this
workflow to display different graphs
Then open the 3_ft_monro_badData.kar workflow – it
has a corrupted version of this data
R stuff to Note
 This workflow uses both
a Data Frame (table) and
vectors (single columns)
 In the dataFrame you can subset lines using:
dataFrame[(dataFrame$RAIN < 0), ]
 Be sure to put the trailing comma!
 dataFrame$RAIN < 0 generates a logical vector of
TRUE and FALSE values – one for each line
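A small sketch of how the logical vector selects lines. The values are made up; RAIN matches the column name used in the workflow.

```r
# Made-up rainfall values; RAIN matches the column name used in the workflow
dataFrame <- data.frame(YEAR = c(1990, 1991, 1992, 1993),
                        RAIN = c(40.1, -9.0, 38.7, -1.5))

isNegative <- dataFrame$RAIN < 0      # one TRUE or FALSE per line
print(isNegative)

# Rows go before the comma, columns after; an empty column slot keeps all columns
badLines <- dataFrame[isNegative, ]
print(badLines)
```

Without the trailing comma, R treats the logical vector as a column selector rather than a row selector, which is why the comma matters.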
QA/QC in R
print("Here are Duplicated Data Lines")
print("now list out of range checks")
dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM
< 0),]
dataFrame[(dataFrame$RAIN > 150 |
dataFrame$RAIN_CM > 300),]
print("now list unit conversion errors")
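The statements that follow the duplicate and unit-conversion headings are not shown above. One possible way to write them is sketched here, assuming RAIN is in inches and RAIN_CM is in centimeters; the data and the 0.1 cm tolerance are made up for the illustration and are not taken from the workflow.

```r
# Made-up data; assumes RAIN is in inches and RAIN_CM is in centimeters
dataFrame <- data.frame(YEAR    = c(1990, 1991, 1991, 1992),
                        RAIN    = c(40.0, 50.0, 50.0, 30.0),
                        RAIN_CM = c(101.6, 127.0, 127.0, 30.0))

print("Here are Duplicated Data Lines")
dupLines <- dataFrame[duplicated(dataFrame), ]
print(dupLines)

print("now list unit conversion errors")
# Flag lines where RAIN_CM differs from RAIN * 2.54 by more than 0.1 cm
# (the 0.1 tolerance is an arbitrary choice for this sketch)
convErrors <- dataFrame[abs(dataFrame$RAIN * 2.54 - dataFrame$RAIN_CM) > 0.1, ]
print(convErrors)
```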
Examine the workflow on the
bad data and change it!
 Try setting different values for the range checks
 Try different graphs (as you did for the good data)
 Try listing all the data that was NOT
duplicated (note in R the “not” operator is “!”)
 use R help and Google as needed
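As a hint for the duplicate exercise, a minimal sketch on made-up data: duplicated() returns a logical vector, and putting R's negation operator in front of it flips each TRUE and FALSE.

```r
# Made-up data for illustration
dataFrame <- data.frame(YEAR = c(1990, 1991, 1991, 1992),
                        RAIN = c(40.0, 50.0, 50.0, 30.0))

# duplicated() marks repeats of an earlier line; "!" flips TRUE and FALSE
notDuplicated <- dataFrame[!duplicated(dataFrame), ]
print(notDuplicated)
```

Note that this keeps the first copy of each repeated line and drops only the later copies.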
R+Kepler vs. R Alone
 Given that “R” runs just fine alone, why use Kepler?
 Allows use of OTHER Kepler actors, Data Turbine
 E.g., EMLData, editors, graphical tools
 Allows code to be segmented for easier editing in
the future
 Reusability – ability to copy and paste parts of
Kepler workflows
 Use spatial arrangement to help guide the user
 Downsides
 Complicates debugging
A more complex and
general workflow
Workflow Steps
 Read an EML metadata file
 Convert it using an XSLT stylesheet into an R
program
 Edit the R program to point to the data
 Ingest the data into a data frame
 Summarize the data
 “Tweak” the data to add a date-time vector
for time plots and fix some conversion
problems and re-summarize the data
 Run some plots
Passing R Workspaces
 This workflow, instead of passing data from
actor-to-actor, passes the name of the R
workspace
 Subsequent actors re-open the R Workspace
without needing to ingest the data again
 This is very efficient, but this method only
works for connecting R actors
R code for passing on
R workspaces
Set Port Variable to
the name of the
Saving workspace
for later use
Remember to save
the workspace!
Loading the Saved
Name of Port connected to
WorkingDir port (above)
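A minimal sketch of the save-then-reload pattern in plain R, outside Kepler. The directory, the file name qaqc_workspace.RData, and the data are hypothetical stand-ins; in the workflow the directory would arrive through the WorkingDir port and the file name would be passed on through an output port.

```r
# --- First R actor: do the work once, then save every object ---
workDir <- tempdir()                 # stand-in for the WorkingDir port value
dataFrame <- data.frame(YEAR = c(1990, 1991),
                        RAIN = c(40.1, 55.3))        # made-up data
workspaceFile <- file.path(workDir, "qaqc_workspace.RData")  # hypothetical name

# Save all visible objects; save.image(file = workspaceFile) is the usual
# shortcut for this when running at the top level of an R session
save(list = ls(), file = workspaceFile)

# --- A later R actor: receives only the file name and reloads the objects ---
rm(dataFrame)                        # simulate starting without the data
load(workspaceFile)
print(dataFrame)
```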
A conversion problem
Temperature and
Humidity values have
some severe problems
reading in!
What happened?
R Factors
 Factors are the way R deals with categorical or
nominal data (e.g., typically, non-numeric data)
 Internally Factors are made up of two vectors:
 Values – the actual values stored in the factor – often
referred to as “levels”
 Indexes – an integer vector containing numbers that are
used to specify the ORDERing of the values
 DANGER – sometimes when you read in data from a
file, errors or odd characteristics of the data will
cause R to read a column of (mostly) numbers as a
Factor instead of as a numeric vector!
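A short sketch of the danger on made-up values. Max_T is used as the name only because it appears in this workflow; the numbers and the "missing" entry are invented. Note also that since R 4.0 read.csv() reads text columns as character rather than Factor by default, so the exact symptom depends on the R version in use.

```r
# Made-up temperature column with one non-numeric entry, as text from a file
rawText <- c("21.5", "19.0", "missing", "30.2")
Max_T <- factor(rawText)            # what R stores if the column becomes a Factor

# WRONG: as.numeric() on a Factor returns the integer indexes, not the values
indexes <- as.numeric(Max_T)

# RIGHT: convert to character first, then to numeric ("missing" becomes NA)
values <- suppressWarnings(as.numeric(as.character(Max_T)))

print(mean(indexes))                # mean of the indexes: not meaningful
print(mean(values, na.rm = TRUE))   # mean of the actual temperatures
```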
This is the mean of the
INDEXES not the values
 After
ranges are
 But Max_T
is still
Your Final Challenge
 As its name suggests this data file has some
corrupted data (plus the normal errors)
 Edit the “Tweaks” actor to add additional
checks or add additional plots to identify the
problems with the data
 If you don’t cause Kepler to abort the
workflow due to errors at least once, you
aren’t trying hard enough! So make additions
in a change-test-repeat cycle
