Using the *R* Actor in Kepler for quality control

John Porter, University of Virginia
R Basics
 R is an open source statistical language
 “Atomic” types: logical, integer, double (real), complex,
character (string) and raw
 Data in R is stored in one of several types of objects
 Scalar: myVar <- 10
 Vectors: myVec <- c(10,20,30)
 Lists: myList <- list(10,"E",12.3)
 Matrix: myMat <- cbind(myVec1,myVec2)
 Data Frames: myDf <- data.frame(myVec1,myVec2)
 Factors: myFac <- as.factor(myVec)
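A minimal sketch of the object types above, with illustrative values only. Note that list(), not c(), is what keeps mixed types apart: c() would convert everything to character.

```r
# Build one object of each kind (all values are illustrative).
myVar  <- 10
myVec1 <- c(10, 20, 30)
myVec2 <- c(4, 6, 3)
myList <- list(10, "E", 12.3)          # a list may mix types
myMat  <- cbind(myVec1, myVec2)        # 3 x 2 numeric matrix
myDf   <- data.frame(myVec1, myVec2)   # 3 rows, 2 named columns
myFac  <- as.factor(c("E", "W", "E"))  # categorical values

print(class(myVar))
print(class(myList))
print(class(myDf))
print(class(myFac))
print(dim(myMat))
```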
R Workspaces
 All the variables and functions defined during
a session are part of the “Workspace”
 R Workspaces can be saved for later use
 When you come back, everything is the same as
when the workspace was saved
Most Commonly Used Object
Types
 Vectors – contain a single column of one of
the “atomic” types
 Often created using the concatenate function
myVec <- c(10,20,30)
 Individual elements can be accessed using
indexes
myVec[2] is 20
Data Frames
 Data Frames – table-style objects that
contain named vectors inside them
myDF$RAIN refers to the “RAIN”
vector, as does myDF[ ,2]
myDF[135,3] is 121.8
Reading Data into Data
Frames
 A common way of creating data frames is to
read in a comma-separated-value (csv) file
read.csv
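myDf <- read.csv("C:/ft_monro.csv",header=TRUE)
Note: on any operating system, write paths in R with “/”. A single “\” starts an
escape sequence inside an R string, so a path written with backslashes needs “\\”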
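A self-contained sketch of the same idea, assuming a small made-up file written to a temporary location (the values are invented, not the Ft. Monroe data):

```r
# Write a tiny CSV to a temporary file; the values are made up for illustration.
tmpFile <- tempfile(fileext = ".csv")
writeLines(c("YEAR,RAIN", "1990,45.2", "1991,51.0", "1992,38.7"), tmpFile)

myDf <- read.csv(tmpFile, header = TRUE)
str(myDf)          # column types chosen by read.csv
print(myDf$RAIN)   # the RAIN column as a vector
print(myDf[2, 2])  # row 2, column 2
unlink(tmpFile)    # remove the temporary file
```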
Sample R Program for QA/QC
# Select the Data File
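infile1 <- file("C:/downloads/ft_monroe.csv", open="r")
# Read the data. skip=1 drops the header line, so header=FALSE
# and the column names are supplied explicitly
dataTable1 <- read.csv(infile1, header=FALSE, skip=1, sep=",",
    quote='"', col.names=c("YEAR", "RAIN", "RAIN_CM", "NOTES"),
    check.names=TRUE)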
attach(dataTable1)
# Run basic summary statistics
summary(as.factor(NOTES))
summary(as.numeric(YEAR))
summary(as.numeric(RAIN))
summary(as.numeric(RAIN_CM))
Quick Exercise – Run these in R
# anything after a # sign on a line is just a COMMENT;
# it won't do anything
varA <- 10 # sets up a vector with one element containing a 10
varA # listing an object's name prints out the values
varB <- c(10,20,30) # sets up a vector with 3 elements.
# c() is the concatenation function
varB
varB[2] # now let's display ONLY the second element
# now let's do some math!
mySumAB <- varA + varB # adding them together.
# Note there is only 1 value in varA
mySumAB
# note the single value in varA is repeated in the addition
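The repetition seen in the exercise is called recycling. A short sketch, where shortVec and longVec are hypothetical names added for illustration:

```r
varA <- 10
varB <- c(10, 20, 30)
mySumAB <- varA + varB      # varA is recycled: 10+10, 10+20, 10+30
print(mySumAB)              # 20 30 40

# Recycling also applies when the shorter vector has more than one element.
shortVec <- c(1, 2)
longVec  <- c(10, 20, 30, 40)
print(shortVec + longVec)   # 11 22 31 42
```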
R Data Structures
 A lot of the “magic” in R is because of the
object-oriented approach used
 R objects contain a lot more than just the
data values
 A command that does one thing to a scalar
(single value) does something else with a
vector (a list of values) – all because R
functions “understand” the difference!
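A small sketch of this behavior using summary(), which reports different things depending on the class of its argument. The data values and object names are hypothetical.

```r
rainValues <- c(12.5, 30.1, 8.4, 30.1)                # hypothetical numeric data
noteCodes  <- as.factor(c("ok", "est", "ok", "ok"))   # hypothetical codes

numSummary <- summary(rainValues)   # quartiles and mean for numbers
facSummary <- summary(noteCodes)    # counts per level for a factor
print(numSummary)
print(facSummary)
print(class(rainValues))
print(class(noteCodes))
```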
Conversions
 Conversions are possible between different
modes or types of objects using conversion
functions
 as.numeric(varA)
 returns varA as a number – if it can! Values that cannot be converted become NA (with a warning)
 as.integer( )
 as.character( )
 as.factor()
 as.matrix()
 as.data.frame()
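A brief sketch of what these conversions do, including the case where a value cannot be converted. The input strings are invented for illustration.

```r
rawText <- c("10", "20.5", "n/a")     # hypothetical input strings

# "n/a" cannot be parsed, so it becomes NA. R also issues a warning,
# suppressed here to keep the output clean.
asNum <- suppressWarnings(as.numeric(rawText))
print(asNum)

asInt  <- as.integer(20.9)      # truncates toward zero
asChar <- as.character(123)
print(asInt)
print(asChar)
print(which(is.na(asNum)))      # positions that failed to convert
```

Checking for NA after a conversion is a cheap first QA test on any column that should be numeric.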
Using Data Frames
A <- c(10,20,30)
B <- c(4,6,3)
C <- c('A','B','C') # put letters in quotes
Df <-data.frame(C,A,B)
Df # list whole data frame
Df$A # list the A vector
Df[,3] # list the 3rd vector (B)
Df[1,] # list all columns for row 1
Df[Df$A > 10,] # list rows where A>10
Data Frames
 Results of Data
Frame manipulations
R Help
R has a number of ways of calling up help
 ??sqrt - does a “fuzzy” search for functions
like “sqrt”
 ?sqrt – does an exact search for the function
sqrt() and displays documentation
 There are also manuals and extensive on-line
tutorials (but Google is frequently the best
way to find help)
R & Kepler
 Kepler uses the “RExpression Actor” to run R
code from inside Kepler
 Typically run with an SDF Director with a
single iteration for most analyses
 You only need them done once!
 Don’t forget to set the iteration count – the
default is to loop forever!
The default RExpression has no inputs
and two outputs
graphicsFileName & output
Typical connections for basic
RExpression Actor
Adding Ports
 To make RExpression actors really useful, they
need to exchange data with other Kepler actors,
not just list output or show graphs
 To allow this intercommunication we need to
add additional Input and Output ports
 The names of the ports will automatically be
connected to objects with the same name in the R
program
Hook up some input and output actors
R Program to Test
Remember – names of
ports translate into names
of objects in R
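A sketch of what the R code inside such an actor might look like, assuming an input port named myInValue (a hypothetical name) and an output port named myOutValue. Outside Kepler the input has to be defined by hand, as done here.

```r
# Outside Kepler there is no input port, so we define the value by hand.
# Inside an RExpression actor, an input port named myInValue (hypothetical)
# would supply this object automatically.
myInValue <- 5

# An object whose name matches an output port is sent out on that port.
myOutValue <- myInValue * 2
print(myOutValue)
```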
Results of Running Workflow
R Listing Output
“myOutValue”
displayed
R for Checking EML Data
But there are some TRICKS you should know!
Trick 1 – select the right
object type for the EML actor
 By default, the EML actor sends only the FIRST
LINE OF DATA to its output ports (“As Field”).
 If you want to have an output port represent
the data as a VECTOR you need to select “As
Column Vector”
 If you want to get a Data Frame instead of
individual columns, you need to select “As
ColumnBased Record”
Setting Data Output Format
in EML actor
Trick 2 – Trap R errors
 Normally if there is a problem with your R
program you get a cryptic message from
Kepler
try() and geterrmessage() in
R
Runs the “errorplot()”* function and
reports any error messages that occur
when you run it
* There is no “errorplot()” function in R
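One way this pattern can be written is sketched below. This is not necessarily the exact code on the slide, and the names result and errText are assumptions.

```r
# errorplot() is deliberately undefined, so the call fails.
result <- try(errorplot(), silent = TRUE)

if (inherits(result, "try-error")) {
  # geterrmessage() returns the text of the most recent error
  errText <- geterrmessage()
  print(paste("R reported:", errText))
} else {
  print("No errors")
}
```

Because try() catches the failure, the script keeps running and the message can be printed to the actor's output instead of stopping the workflow.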
Now we get an informative
message
Correct the command and see
the output
QA/QC – Quality Assurance
and Quality Control
 Error types
 Errors of Commission – data contains wrong
values
 Errors of Omission – data that should be there is
missing
 We will mostly be talking today about errors
of commission
Porter’s Rule of Data
Quality
 There is no non-trivial dataset that
does not contain some errors
 Goal of QA/QC: reduce errors as far as
possible, or at least to the level where
they don’t adversely affect the
conclusions reached through analysis of the
data
QA/QC – Possible Tests
 Identification and removal of duplicates
 Correct Domain
 Numerical Range (e.g., -20 < Temperature < 50)
 Correct Codes (e.g., HOGI, not HOG1)
 Graphs
 Time-series plots
 Plots between variables
 Detection of “spikes” in time series
 Customized criteria (e.g., month specific
range checks)
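A sketch of three of these tests on invented data. The column names, the list of valid codes and the spike threshold of 20 are all assumptions chosen for illustration.

```r
# Hypothetical observations, invented to trigger each check.
obs <- data.frame(
  SPECIES = c("HOGI", "HOG1", "HOGI", "HOGI", "HOGI"),
  TEMP    = c(12.1, 12.4, 75.0, 12.9, 13.0)
)
validCodes <- c("HOGI")               # assumed list of allowed codes

badCode  <- obs[!(obs$SPECIES %in% validCodes), ]    # code not in the list
badRange <- obs[obs$TEMP <= -20 | obs$TEMP >= 50, ]  # outside -20 < T < 50

# A simple spike test: flag jumps larger than the assumed threshold
# between consecutive readings. Both the jump up and the return are flagged.
jump   <- abs(diff(obs$TEMP))
spikes <- which(jump > 20) + 1        # index of the reading after each jump

print(badCode)
print(badRange)
print(spikes)
```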
Exercise – A succession of
workflows for QA
 Open your Virtual Machine
 Open a Web Browser and go to:
 http://tinyurl.com/7po5ffb
 Open the LocalData.zip file
 Extract All Files to directory C:\
 You should then have a C:\localData
directory containing the files for this
exercise
1_Ft_Monroe_simple_summary.kar
A dead-simple workflow
Kepler Stuff to Note
 Annotations allow you to add titles and other
useful instructions to your workflow display
Kepler Stuff to Note
 Parameters let you easily show and change
values that will be used elsewhere in the
workflow
Kepler Parameters
 Customize Name lets you set the NAME of
the parameter and what should display on
the screen
 Remember the name – that is how you will
refer to the parameter later.
Using a Parameter Value
 Add a $ to the front of a parameter in a
Kepler settings box to insert the value of the
parameter – so the Data File: is
c:/localData/ft_monro.csv
Brief Exercise
 Experiment with editing connections in this
workflow to display different graphs
Then open the 3_ft_monro_badData.kar workflow – it
has a corrupted version of this data
R stuff to Note
 This workflow uses both a Data Frame (table)
and vectors (single columns)
 In the dataFrame you can subset lines using:
dataFrame[(dataFrame$RAIN < 0), ]
 Be sure to put the trailing comma!
 dataFrame$RAIN < 0 generates a logical vector of
TRUE and FALSE values – one for each line
QA/QC in R
summary(dataFrame)
print("Here are Duplicated Data Lines")
dataFrame[duplicated(dataFrame),]
print("now list out of range checks")
dataFrame[(dataFrame$RAIN < 0 | dataFrame$RAIN_CM < 0),]
dataFrame[(dataFrame$RAIN > 150 | dataFrame$RAIN_CM > 300),]
print("now list unit conversion errors")
# flag rows where RAIN*2.54 differs from RAIN_CM by more than 0.1
dataFrame[(abs((dataFrame$RAIN*2.54)-dataFrame$RAIN_CM)>0.1),]
Examine the workflow on the
bad data and change it!
 Try setting different values for the range
checks
 Try different graphs (as you did for the good
data)
 Try listing all the data that was NOT
duplicated (note: in R the “not” operator is “!”)
 use R help and Google as needed
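One possible approach to the duplicate and unit checks, sketched on a small invented data frame rather than the workshop's corrupted file:

```r
# Hypothetical rainfall records: row 3 repeats row 2, and row 4 has a
# RAIN_CM value that does not match RAIN * 2.54.
dataFrame <- data.frame(
  YEAR    = c(1990, 1991, 1991, 1992),
  RAIN    = c(40.0, 50.0, 50.0, 30.0),
  RAIN_CM = c(101.6, 127.0, 127.0, 67.2)
)

dupRows    <- dataFrame[duplicated(dataFrame), ]    # second and later copies
uniqueRows <- dataFrame[!duplicated(dataFrame), ]   # "!" negates the test
unitErrors <- dataFrame[abs(dataFrame$RAIN * 2.54 - dataFrame$RAIN_CM) > 0.1, ]

print(uniqueRows)
print(unitErrors)
```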
R+Kepler vs. R Alone
 Given that “R” runs just fine alone, why use
Kepler?
 Allows use of OTHER Kepler actors, Data Turbine
 E.g., EMLData, editors, graphical tools
 Allows code to be segmented for easier editing in
the future
 Reusability – ability to copy and paste parts of
Kepler workflows
 Use spatial arrangement to help guide the user
 Downsides
 Complicates debugging
A more complex and
general workflow
4_BasicEMLQA
Workflow Steps
 Read an EML metadata file
 Convert it using an XSLT stylesheet into an R
program
 Edit the R program to point to the data
 Ingest the data into a data frame
 Summarize the data
 “Tweak” the data to add a date-time vector
for time plots and fix some conversion
problems and re-summarize the data
 Run some plots
Passing R Workspaces
 This workflow, instead of passing data from
actor-to-actor, passes the name of the R
Workspace
 Subsequent actors re-open the R Workspace
without needing to ingest the data again
 This is very efficient, but this method only
works for connecting R actors
R code for passing on
R workspaces
Set Port Variable to
the name of the
workflow
Saving workspace
for later use
Remember to save
the workspace!
Loading the Saved
Workspace
Name of Port connected to
WorkingDir port (above)
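A sketch of the save-and-reload idea in plain R, outside Kepler. The file name and the variable workspaceFile are hypothetical stand-ins for the value a Kepler port would carry. At the top level of a session, save.image(file = ...) is the usual shorthand for the save() call used here.

```r
# Hypothetical stand-in for the path a Kepler port would carry.
workspaceFile <- file.path(tempdir(), "qaWorkspace.RData")

# "First actor": ingest data, then save every object defined so far.
dataFrame <- data.frame(YEAR = c(1990, 1991), RAIN = c(40.0, 50.0))
save(list = ls(), file = workspaceFile)

# "Second actor": start without the data, then reload instead of re-reading.
rm(dataFrame)
load(workspaceFile)
print(summary(dataFrame$RAIN))
```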
A conversion problem
Temperature and
Humidity values have
some severe problems
reading in!
What happened?
R Factors
 Factors are the way R deals with categorical or
nominal data (typically non-numeric data)
 Internally Factors are made up of two vectors:
 Values – the distinct values that occur in the factor – often
referred to as “levels”
 Indexes – an integer vector containing numbers that
record which level each element holds
 DANGER – sometimes when you read in data from a
file, errors or odd characteristics of the data will
cause R to read a column of (mostly) numbers as a
Factor instead of as a numeric vector!
Factors
This is the mean of the
INDEXES not the
VALUES/Levels
 After conversion, data ranges are much better!
 But Max_T is still suspicious!
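The factor pitfall can be reproduced in a few lines. This sketch uses invented values and a hypothetical variable name (maxT); converting through as.character() is one common way to recover the real numbers.

```r
# A mostly numeric column with one stray text entry (hypothetical values).
rawColumn <- c("21.5", "19.0", "missing", "23.2")
maxT <- as.factor(rawColumn)      # what a read can produce for such a column

wrong <- as.numeric(maxT)         # integer indexes of the levels, not the data
right <- suppressWarnings(as.numeric(as.character(maxT)))  # convert via text

print(levels(maxT))
print(wrong)
print(right)
print(mean(right, na.rm = TRUE))
```

The stray entry becomes NA after conversion, which makes it easy to find with is.na().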
Your Final Challenge
 As its name suggests, this data file has some
corrupted data (plus the normal errors)
 Edit the “Tweaks” actor to add additional
checks or add additional plots to identify the
problems with the data
 If you don’t cause Kepler to abort the
workflow due to errors at least once, you
aren’t trying hard enough! So make additions
in a change-test-repeat cycle
