SAS Data Step Powerpoint Presentation

Understanding SAS Data Step Processing
Alan C. Elliott
stattutorials.com
Reading Raw Data
• Using the following SAS program:
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;
Alan C. Elliott, stattutorials.com
Overview of SAS Data Step
Compile Phase
(Look at Syntax)
Execution Phase
(Read data, Calculate)
Output Phase
(Create Data Set)
Compile Phase
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;
SAS checks the syntax of the program.
• Identifies type and length of each variable
• Does any variable need conversion?
If everything is okay, proceed to the next step.
If errors are discovered, SAS attempts to interpret what you mean. If SAS can’t correct the error, it prints an error message to the log.
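As a small illustration of the conversion question above, the sketch below uses the character variable ID in arithmetic, which SAS has to convert to numeric. The data set name CONVERT and the variables IDNUM1 and IDNUM2 are made up for this example; they are not part of the original program.

```sas
/* Sketch only: CONVERT, IDNUM1 and IDNUM2 are hypothetical names. */
DATA CONVERT;
  INPUT ID $ AGE TEMPC;
  /* ID is character, so using it in arithmetic forces an automatic   */
  /* character-to-numeric conversion, which SAS reports in the log.   */
  IDNUM1 = ID * 1;
  /* The same conversion written explicitly with the INPUT function.  */
  IDNUM2 = INPUT(ID, 8.);
DATALINES;
0001 24 37.3
0002 35 38.2
;
RUN;
PROC PRINT DATA=CONVERT; RUN;
```

Both new variables should hold the numeric values 1 and 2; only the implicit version produces a conversion note in the log.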
Create Input Buffer
• SAS creates an input buffer
• INPUT BUFFER contains data as it is read in
DATALINES;
0001 24 37.3
0002 35 38.2
;
INPUT BUFFER (first record, one character per column)
Column:     1  2  3  4  5  6  7  8  9  10 11 12
Character:  0  0  0  1     2  4     3  7  .  3
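One way to look at the input buffer is to write it to the log with PUT _INFILE_. This is a minimal sketch based on the program above; the PUT statement is an addition for illustration and does not change the data set that is created.

```sas
DATA NEW;
  INPUT ID $ AGE TEMPC;
  /* _INFILE_ refers to the current contents of the input buffer, */
  /* so this writes the raw record just read to the SAS log.      */
  PUT _INFILE_;
  TEMPF = TEMPC*(9/5) + 32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
RUN;
```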
Execution Phase
• The PROGRAM DATA VECTOR (PDV), set up during compilation, is initialized and holds the current value of each variable
_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0              .    .      .
• The PDV holds two automatic variables, _N_ and _ERROR_, and a position for each of the four variables in the DATA step.
• SAS sets _N_ = 1 and _ERROR_ = 0 (no initial error) and sets the remaining variables to missing.
Buffer to PDV
Buffer
Column:     1  2  3  4  5  6  7  8  9  10 11 12
Character:  0  0  0  1     2  4     3  7  .  3

PDV: reads 1st record
_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0        0001  24   37.3   .      (initially missing)

If there is an executable statement…
Processes the code TEMPF=TEMPC*(9/5)+32;
_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0        0001  24   37.3   99.14  (calculated value)
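The two PDV snapshots can be reproduced, at least approximately, by writing the variables to the log before and after the assignment. The sketch below adds two PUT statements to the original program; the label text in quotes is arbitrary.

```sas
DATA NEW;
  INPUT ID $ AGE TEMPC;
  /* Snapshot after the record is read: TEMPF has not been computed yet. */
  PUT 'After INPUT:      ' _N_= _ERROR_= ID= AGE= TEMPC= TEMPF=;
  TEMPF = TEMPC*(9/5) + 32;
  /* Snapshot after the assignment: TEMPF now holds the calculated value. */
  PUT 'After assignment: ' _N_= _ERROR_= ID= AGE= TEMPC= TEMPF=;
DATALINES;
0001 24 37.3
0002 35 38.2
;
RUN;
```

For each record, the first log line should show TEMPF as missing (.) and the second should show the calculated value.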
Output Phase
• The values in the PDV are written to
the output data set (NEW) as the first
observation:
PDV
_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0        0001  24   37.3   99.14

Write data to data set (from PDV).
ID    AGE  TEMPC  TEMPF
0001  24   37.3   99.14

This is the first record in the output data set named “NEW.”
Note that _N_ and _ERROR_ are dropped.
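Because _N_ is not written to the data set, its value has to be copied into an ordinary variable if it is needed later. The following sketch does that and also writes the OUTPUT statement explicitly. RECNUM is a made-up variable name, and the explicit OUTPUT is shown only to make the output step visible; it takes the place of the automatic output at the end of the step.

```sas
DATA NEW;
  INPUT ID $ AGE TEMPC;
  TEMPF = TEMPC*(9/5) + 32;
  /* RECNUM is a hypothetical name: an ordinary variable, so it is kept. */
  RECNUM = _N_;
  /* Explicit OUTPUT: writes the PDV to NEW at this point and replaces   */
  /* the automatic output that normally happens at the end of the step.  */
  OUTPUT;
DATALINES;
0001 24 37.3
0002 35 38.2
;
RUN;
PROC PRINT DATA=NEW; RUN;
```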
Exceptions to Missing in PDV
_N_  _ERROR_  ID    AGE  TEMPC  TEMPF
1    0              .    .      .      (initial values usually set to missing in PDV)
• Some data values are not initially set to missing in the PDV:
– variables in a RETAIN statement
– variables created in a SUM statement
– data elements in a _TEMPORARY_ array
– variables created with options in the FILE or INFILE statements
• These exceptions are covered later.
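As a preview of the first two exceptions, here is a minimal sketch using the same data. The data set name RUNNING and the variables MAXTEMP and COUNT are invented for this example.

```sas
/* Sketch only: RUNNING, MAXTEMP and COUNT are hypothetical names. */
DATA RUNNING;
  INPUT ID $ AGE TEMPC;
  /* RETAIN: MAXTEMP keeps its value from one iteration to the next. */
  RETAIN MAXTEMP;
  /* Sum statement: COUNT starts at 0 and is retained automatically. */
  COUNT + 1;
  /* MAX ignores missing values, so the first record sets MAXTEMP.   */
  MAXTEMP = MAX(MAXTEMP, TEMPC);
DATALINES;
0001 24 37.3
0002 35 38.2
;
RUN;
PROC PRINT DATA=RUNNING; RUN;
```

With these two records, COUNT should be 1 and then 2, and MAXTEMP should be 37.3 and then 38.2, because neither variable is reset to missing at the top of the step.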
Next data record read
• Once SAS finishes processing the first data record, it repeats the same process for the second record, sending the results to the output data set (named NEW in this case).
ID    AGE  TEMPC  TEMPF
0001  24   37.3   99.14
0002  35   38.2   100.76
• …and so on for all records.
Descriptor Information
• SAS creates and maintains descriptor information about each SAS data set:
– data set attributes
– variable attributes
– the name of the data set
– the member type, the date and time that the data set was created, and the number, names, and data types (character or numeric) of the variables.
Data Set Description
proc datasets;
contents data=new;
run;
quit;
Alternate program
proc contents data=new;
run;
Contents output… (abbreviated)
#  Name  Member Type  File Size  Last Modified
1  NEW   DATA         5120       20Nov13:08:59:32
Description output continued…
Data Set Name         WORK.NEW
Member Type           DATA
Engine                V9
Created               Wed, Nov 20, 2013 08:59:32 AM
Last Modified         Wed, Nov 20, 2013 08:59:32 AM
Protection
Data Set Type
Label
Data Representation   WINDOWS_64
Encoding              wlatin1 Western (Windows)
Observations          2
Variables             4
Indexes               0
Observation Length    32
Deleted Observations  0
Compressed            NO
Sorted                NO
Description output continued…
Alphabetic List of Variables and Attributes
#  Variable  Type  Len
2  AGE       Num   8
1  ID        Char  8
3  TEMPC     Num   8
4  TEMPF     Num   8
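The list above is alphabetical, with the # column giving each variable's position in the data set. To list the variables by position instead, one option is the VARNUM option of PROC CONTENTS, sketched here for the same data set:

```sas
/* VARNUM lists the variables by their position in the data set */
/* (ID, AGE, TEMPC, TEMPF) rather than alphabetically.          */
proc contents data=new varnum;
run;
```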
Original Program
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32;
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;
Program output
Obs  ID    AGE  TEMPC  TEMPF
1    0001  24   37.3   99.14
2    0002  35   38.2   100.76
Example of Error
DATA NEW;
INPUT ID $ AGE TEMPC;
TEMPF=TEMPC*(9/5)+32
DATALINES;
0001 24 37.3
0002 35 38.2
;
run;
proc print;run;
Missing semicolon at the end of the TEMPF= statement
proc datasets ;
contents data=new;
run;
Error found during compilation
76   DATA NEW;
77   INPUT ID $ AGE TEMPC;
78   TEMPF=TEMPC*(9/5)+32
79   DATALINES;
     --------
     22
80   0001 24 37.3
     ---
     180
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, -, /, <, <=, <>, =, >, ><, >=, AND, EQ, GE, GT, IN, LE, LT, MAX, MIN, NE, NG, NL, NOTIN, OR, ^=, |, ||, ~=.
ERROR 180-322: Statement is not valid or it is used out of proper order.
81   0002 35 38.2
82   ;
83   run;
ERROR: No DATALINES or INFILE statement.
Summary - Compilation Phase
• During compilation, SAS:
– checks syntax
– identifies the type and length of each new variable (is a data type conversion needed?)
– creates an input buffer if there is an INPUT statement for an external file
– creates the Program Data Vector (PDV)
– creates descriptor information for data sets and variable attributes
– handles other statements and options not discussed here: DROP; KEEP; RENAME; RETAIN; WHERE; LABEL; LENGTH; FORMAT; ARRAY; BY; ATTRIB; END=, IN=, FIRST, LAST, POINT=
Summary – Execution Phase
1. The DATA step iterates once for each observation being created.
2. Each time the DATA statement executes, _N_ is incremented by 1.
3. Newly created variables are set to missing in the PDV.
4. SAS reads a data record from a raw data file into the input buffer (there are other possibilities not discussed here).
5. SAS executes any other programming statements for the current record.
6. At the end of the DATA step statements (RUN;), SAS writes an observation to the SAS data set (OUTPUT PHASE).
7. SAS returns to the top of the DATA step (Step 3 above).
8. The DATA step terminates when there is no more data.
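The iteration and termination steps can be watched in the log with a small variation of the original program. This is a sketch; the two PUT statements and their label text are additions for illustration.

```sas
DATA NEW;
  /* Runs at the top of every iteration, before a record is read. */
  PUT 'Start of iteration ' _N_;
  INPUT ID $ AGE TEMPC;
  TEMPF = TEMPC*(9/5) + 32;
  /* Runs only when INPUT found a record to process. */
  PUT '  processed record for ID ' ID;
DATALINES;
0001 24 37.3
0002 35 38.2
;
RUN;
```

With two data lines, the log should show a third "Start of iteration" line with no "processed" line after it: the step begins a third iteration, INPUT finds no more data, and the DATA step stops without writing another observation.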
End