ISQS 6339, Data Management & Business
Intelligence
Data Preparation for
Analytics Using SAS
Zhangxi Lin
Texas Tech University
ISQS 6347, Data & Text Mining
Outline
An overview of data preparation for analytics
SAS Programming Essentials
Running SAS programs
Mastering fundamental concepts
SAS program debugging
Make use of SAS Enterprise Guide for programming
Structure and Components of
Business Intelligence
Overview: From Data
Warehousing to Data Analysis
Previous major topics in data warehousing (using SQL Server 2008)
Dimensional model design
ETL
Cubes design and OLAP
Data analysis topics (using SAS)
Data preparation
Analytic business questions
Data format and data conversion
Data cleansing
Data exploration
Data analysis
Data visualization
Components of the SAS System
Base SAS
Reporting and Graphics
Data Access and Management
User Interface
Analytical
Application Development
Visualization and Discovery
Business Solutions
Web Enablement
SAS Programming Essentials
Find more information from
http://support.sas.com
Data-driven Tasks
The functionality of the SAS System is built around
four data-driven tasks common to virtually any
application:
Data access
Data management
Data analysis
Data presentation
Turning Data into
Information
Process of delivering meaningful information
80% data-related
Access
Scrub
Transform
Manage
Store and retrieve
20% analysis
Turning Data into
Information
DATA -> DATA Step -> SAS Data Sets -> PROC Steps -> Information
Design of the SAS System
MultiVendor Architecture
90% independent
10% dependent
Platforms: PC, Workstation, Servers/Midrange, Mainframe, Super Computer
Design of the SAS System
MultiEngine Architecture
Data sources: DB2, Teradata, SAP, dBase, ORACLE, SYBASE, Microsoft Excel
SAS Programming Level I
Fundamentals (ch1-3)
Producing list reports (ch4)
Enhancing output (ch5)
Creating data sets (ch6)
Data step programming (ch7)
Reading data
Creating variables
Conditional processing
Keeping and dropping variables
Reading Excel files
Combining SAS data sets (ch8)
Producing summary reports (ch9)
SAS graphing (ch10)
Course Scenario
In this course, you work with business data
from International Airlines (IA). The various
kinds of data that IA maintains are listed below:
flight data
passenger data
cargo data
employee data
revenue data
Course Scenario
The following are some tasks that you will
perform:
importing data
creating a list of employees
producing a frequency table of job codes
summarizing data
creating a report of salary information
SAS Programs
A SAS program is a sequence of steps that the user
submits for execution.
Raw Data -> DATA Step -> SAS Data Set -> PROC Step -> Report
DATA steps are typically used to create SAS
data sets.
PROC steps are typically used to process
SAS data sets (that is, generate reports
and graphs, edit data, and sort data).
SAS Programs
DATA Step
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
PROC Steps
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
Step Boundaries
SAS steps begin with either of the following:
DATA statement
PROC statement
SAS detects the end of a step when it encounters
one of the following:
a RUN statement (for most steps)
a QUIT statement (for some procedures)
the beginning of another step (DATA statement
or PROC statement)
Step Boundaries
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
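The program above reads from a placeholder raw-data-file, so it cannot be submitted as written. A minimal self-contained sketch of the same step boundaries follows. It assumes in-stream data (DATALINES) and list input instead of the column input used in the slides; the five records are the staff values shown in these slides.

```sas
/* Sketch only: in-stream data replaces the 'raw-data-file' placeholder. */
data work.staff;
   /* List input is an assumption here; the slides use column input. */
   input LastName $ FirstName $ JobTitle $ Salary;
   datalines;
TORRES JAN Pilot 50000
LANGKAMM SARAH Mechanic 80000
SMITH MICHAEL Mechanic 40000
WAGSCHAL NADJA Pilot 77500
TOERMOEN JOCHEN Pilot 65000
;
run;

/* No RUN statement here: the PROC MEANS statement below marks the end of this step. */
proc print data=work.staff;

proc means data=work.staff;
   class JobTitle;
   var Salary;
run;   /* RUN ends the PROC MEANS step. */
```

List input works here because every name and job title is at most eight characters long, which is the default length for a character variable read this way.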
Running a SAS Program
You can invoke SAS in the following ways:
interactive windowing mode (SAS windowing
environment)
interactive menu-driven mode (SAS Enterprise Guide,
SAS/ASSIST, SAS/AF, or SAS/EIS software)
batch mode
noninteractive mode
Preparation of SAS
Programming
Data sets: \SAS-Programming
Create a user defined library reference
Statement
LIBNAME libref SAS-data-library <options>;
Example
LIBNAME ia 'c:\workshop\winsas\prog1';
Two-level SAS file names
libref.filename
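As a rough sketch of how a libref and a two-level name work together, the code below assigns the libref from the example, writes a data set into that library, and lists the library's contents. It assumes the folder exists on your machine; the data set name class_copy is hypothetical, and sashelp.class is a sample data set shipped with SAS that is used here only so the sketch has something to copy.

```sas
/* Assign the libref ia to the folder used in the slide (the folder must already exist). */
libname ia 'c:\workshop\winsas\prog1';

/* Two-level name: libref ia, data set name class_copy (hypothetical name). */
data ia.class_copy;
   set sashelp.class;
run;

/* List the data sets stored in the ia library; NODS suppresses per-data-set detail. */
proc contents data=ia._all_ nods;
run;

/* Read the data set back through its two-level name. */
proc print data=ia.class_copy;
run;
```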
SAS Programming Essentials
Demo: c02s2d1
Exercise: c02ex1
Browsing the Descriptor
Portion
General form of the CONTENTS procedure:
PROC CONTENTS DATA=SAS-data-set;
RUN;
Example:
proc contents data=work.staff;
run;
c02s3d1
SAS Data Sets: Data Portion
The data portion of a SAS data set is a rectangular table
of character and/or numeric data values.
LastName    FirstName   JobTitle    Salary
TORRES      JAN         Pilot        50000
LANGKAMM    SARAH       Mechanic     80000
SMITH       MICHAEL     Mechanic     40000
WAGSCHAL    NADJA       Pilot        77500
TOERMOEN    JOCHEN      Pilot        65000
The first row shows the variable names; the remaining rows are the variable values.
LastName, FirstName, and JobTitle hold character values. Salary holds numeric values.
Variable names are part of the descriptor portion, not the
data portion.
SAS Variable Values
There are two types of variables:
character
can contain any value: letters, numbers, special
characters, and blanks. Character values are
stored with a length of 1 to 32,767 bytes. One
byte equals one character.
numeric
stored as floating point numbers in 8 bytes
of storage by default. Eight bytes of floating point
storage provide space for 16 or 17 significant
digits. You are not restricted to 8 digits.
SAS Data Set and Variable Names
SAS names have these characteristics:
can be up to 32 characters long.
can be uppercase, lowercase, or mixed-case.
are not case sensitive.
must start with a letter or underscore.
Subsequent characters can be letters,
underscores, or numerals.
Valid SAS Names
Select the valid default SAS names.
data5mon
5monthsdata
data#5
five months data
fivemonthsdata
FiveMonthsData
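By the naming rules stated earlier (start with a letter or underscore, then only letters, underscores, or numerals), data5mon, fivemonthsdata, and FiveMonthsData qualify, while 5monthsdata, data#5, and five months data do not. One way to check candidates yourself is the NVALID function, sketched below. The data set name work.name_check and the variable names are hypothetical.

```sas
/* Sketch: test each candidate name against the default (V7) naming rules. */
data work.name_check;
   /* Formatted input reads the whole 32-column field, embedded blanks included. */
   input name $32.;
   /* NVALID returns 1 when the string is a valid SAS variable name, 0 otherwise. */
   is_valid = nvalid(trim(name), 'v7');
   datalines;
data5mon
5monthsdata
data#5
five months data
fivemonthsdata
FiveMonthsData
;
run;

proc print data=work.name_check;
run;
```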
Missing Data Values
A value must exist for every variable for each observation.
Missing values are valid values.
LastName: TORRES, LANGKAMM, SMITH, WAGSCHAL, TOERMOEN
FirstName: JAN, SARAH, MICHAEL, NADJA, JOCHEN
JobTitle: Pilot, Mechanic, Mechanic, Pilot (one of the five values is missing)
Salary: 50000, 80000, ., 77500, 65000
A character missing value is displayed as a blank.
A numeric missing value is displayed as a period.
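To see both kinds of missing value in output, a small sketch follows. The three records are illustrative only: which observation lacks a job title is chosen arbitrarily, and the data set name work.staff_missing is hypothetical. The DSD option is assumed so that two consecutive commas are read as a missing value.

```sas
/* Sketch: comma-delimited in-stream data with one missing numeric and one missing character value. */
data work.staff_missing;
   /* DSD: consecutive delimiters mean a missing value; TRUNCOVER: do not read past the record. */
   infile datalines dsd truncover;
   input LastName :$20. FirstName :$10. JobTitle :$8. Salary;
   datalines;
TORRES,JAN,Pilot,50000
SMITH,MICHAEL,Mechanic,
TOERMOEN,JOCHEN,,65000
;
run;

/* The missing Salary prints as a period; the missing JobTitle prints as a blank. */
proc print data=work.staff_missing;
run;

/* MISSING() is true for a missing character or numeric value. */
proc print data=work.staff_missing;
   where missing(JobTitle) or missing(Salary);
run;
```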
Browsing the Data Portion
The PRINT procedure displays the data portion
of a SAS data set.
By default, PROC PRINT displays the following:
all observations
all variables
an Obs column on the left side
Browsing the Data Portion
General form of the PRINT procedure:
PROC PRINT DATA=SAS-data-set;
RUN;
Example:
proc print data=work.staff;
run;
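The defaults listed for PROC PRINT (all observations, all variables, an Obs column) can each be overridden. The sketch below is one way to do it, assuming the sashelp.class sample data set that ships with SAS; the OBS= data set option, the NOOBS option, and the VAR statement go beyond what the slides cover.

```sas
/* Default behavior: every variable plus the Obs column (limited to 5 rows to keep output short). */
proc print data=sashelp.class(obs=5);
run;

/* NOOBS drops the Obs column; VAR selects and orders the variables to display. */
proc print data=sashelp.class(obs=5) noobs;
   var Name Age;
run;
```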
c02s3d1
SAS Data Set Terminology
SAS documentation and text in the SAS windowing
environment use the following terms interchangeably:
SAS Data Set = SAS Table
Variable = Column
Observation = Row
SAS Syntax Rules
SAS statements have these characteristics:
usually begin with an identifying keyword
always end with a semicolon
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
SAS Syntax Rules
SAS statements are free-format.
One or more blanks or special characters can
be used to separate words.
They can begin and end in any column.
A single statement can span multiple lines.
Several statements can be on the same line.
Unconventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc means data=work.staff;
class JobTitle;
var
Salary;run;
SAS Syntax Rules
Good spacing makes the program easier to read.
Conventional Spacing
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff;
class JobTitle;
var Salary;
run;
SAS Comments
Type /* to begin a comment.
Type your comment text.
Type */ to end the comment.
/* Create work.staff data set */
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
/* Produce listing report of work.staff */
proc print data=work.staff;
run;
c02s3d2
Syntax Errors
Syntax errors include the following:
misspelled keywords
missing or invalid punctuation
invalid options
daat work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff
run;
proc means data=work.staff average max;
class JobTitle;
var Salary;
run;
Debugging a SAS
Program
c02s4d1.sas
userid.prog1.sascode(c02s4d1)
c02s4d2.sas
userid.prog1.sascode(c02s4d2)
This demonstration illustrates how to submit a
SAS program that contains errors, diagnose
the errors, correct the errors, and save the
corrected program.
Recall a Submitted Program
Program statements accumulate in a recall buffer
each time you issue a SUBMIT command.
Submit Number 1
daat work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff
run;
proc means data=work.staff average max;
class JobTitle;
var Salary;
run;
Submit Number 2
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
class Jobtitle;
var Salary;
run;
Recall a Submitted Program
Issue the RECALL command once to recall the most
recently submitted program.
Recall buffer: Submit Number 1, Submit Number 2
Issue RECALL once.
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
class JobTitle;
var Salary;
run;
Submit Number 2 statements
are recalled.
Recall a Submitted Program
Issue the RECALL command again to recall Submit
Number 1 statements.
Recall buffer: Submit Number 1, Submit Number 2
Issue RECALL again.
daat work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff
run;
proc means data=work.staff average max;
class JobTitle;
var Salary;
run;
data work.staff;
infile 'raw-data-file';
input LastName $ 1-20 FirstName $ 21-30
JobTitle $ 36-43 Salary 54-59;
run;
proc print data=work.staff;
run;
proc means data=work.staff mean max;
class JobTitle;
var Salary;
run;
Exercise 8: Basic SAS
Programming
Define library IA and Out
Go through all SAS programs in Chapters 2-5.
Write a SAS program to read a dataset created by
yourself or simply use Person0.txt in
\\TechShare\coba\d\ISQS3358\OtherDatasets\ .
The dataset is output to your library Out.
Apply SAS features from Chapter 5 of Prog-I
to generate a nice-looking report.
Go through all exercises for Ch 2, 3, 4, 5, 6 (answer keys
are available, so no need to submit the results)
Hands-on exercise
Write a SAS program to calculate the number
of days elapsed in 2012 up to 3/3/2012. The
input dates are in the date9. format:
01JAN2012 03MAR2012
Answer: 62 days
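One possible solution is sketched below. It relies on SAS storing dates as day counts, so subtracting the two values gives the number of days between them (31 days in January, 29 in February 2012, plus 2 in March). The data set and variable names are hypothetical.

```sas
/* Sketch: read two dates with the DATE9. informat and subtract them. */
data work.days_passed;
   /* The colon modifier applies the informat while still using blank-delimited list input. */
   input start_date :date9. end_date :date9.;
   /* SAS dates are numeric day counts, so the difference is a number of days. */
   days_passed = end_date - start_date;
   format start_date end_date date9.;
   datalines;
01JAN2012 03MAR2012
;
run;

proc print data=work.days_passed;
run;
```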
Making Use of SAS Enterprise
Guide Code
Import a text file
Example: Orders.txt
Import an Excel file
Example: SupplyInfo.xls
Learn from Examples
SAS Help
Contents -> Learning to use SAS -> Sample SAS
Programs -> Base SAS
Base Usage Guide Examples
Chapter 3, 4
Import an Excel Sheet
proc import out=work.commrex
datafile ="C:\Lin\Shared\ISQS6339\Commrex_3358.xls" dbms=excel
replace;
sheet="Company";
getnames=yes;
mixed=no;
scantext=yes;
usedate=yes;
scantime=yes;
run;
proc print data=work.commrex;
run;
Excel SAS/ACCESS LIBNAME
Engine
libname xlsdata 'C:\Lin\Shared\ISQS6339\Commrex_3358.xls';
proc print data=xlsdata.New1;
run;
Exercise 8: SAS Data Step
Programming
http://zlin.ba.ttu.edu/6339/ExerciseSASProgramming.htm