LWDMTX5 001
Text Miner
Course Notes
Text Analytics Using SAS® Text Miner Course Notes was developed by Terry Woodfield, with a
significant contribution from Rich Perline, based on a previous edition of the course.
Additional contributions were made by Peter Christie and George Fernandez. Instructional design,
editing, and production support was provided by the Learning Design and Development team.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Text Analytics Using SAS® Text Miner Course Notes
Copyright © 2019 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States
of America. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise,
without the prior written permission of the publisher, SAS Institute Inc.
Book code E71427, course code LWDMTX5/DMTX5, prepared date 24May2019. LWDMTX5_001
ISBN 978-1-64295-272-8
Table of Contents
Lesson 1 Introduction to SAS® Enterprise Miner™ and SAS® Text Miner.............. 1-1
Demonstration: Text Analytics Illustrated with a Simple Data Set ....................... 1-8
Practice............................................................................................................ 1-120
Practice.............................................................................................................. 2-46
Practice.............................................................................................................. 2-71
Practice.............................................................................................................. 3-31
Practice.............................................................................................................. 4-28
Practice.............................................................................................................. 4-42
To learn more…
For information about other courses in the curriculum, contact the
SAS Education Division at 1-800-333-7660, or send e-mail to
[email protected]. You can also find this information on the web at
http://support.sas.com/training/ as well as in the Training Course
Catalog.
For a list of SAS books (including e-books) that relate to the topics
covered in these course notes, visit https://www.sas.com/sas/books.html or
call 1-800-727-0025. US customers receive free shipping to US
addresses.
Lesson 1 Introduction to SAS®
Enterprise Miner™ and SAS® Text
Miner
1.1 Data Mining and Text Mining
Course Data
Text analytics is a vigorous field of research with many applications. The purpose of this course is
to teach you how to solve analytic problems that include relevant textual data. This is done using
SAS Enterprise Miner (EM), a very general data mining product that incorporates text analytic tools
among many other statistical and machine-learning tools.
Access to real business data is always problematic. Because text fields often contain confidential
information, access to business data that include text is even more difficult. Most data sets used in
this course are publicly available. All data used in this course are either artificially created or
modified in some way. Modifications include the following:
• deletion of sensitive entries
• deletion of potentially embarrassing or misleading entries
• editing or deletion of entries with named individuals or business organizations
• editing of text fields having obscure or confusing references
• resolution of ambiguities that might be incorrect
• modification or deletion of entries to promote educational goals
Because of these modifications, the data should not be used for any purpose other than education.
All publicly available data sets are introduced with a reference to the source of the actual data. You
should acquire data directly from the source if you want to use the data for business or scientific
purposes.
SAS Enterprise Miner works with the hierarchy Project ⇒ Diagram ⇒ Process Flow ⇒ Node. A typical
organization will have multiple projects, even if the same data are used. For example, in an
insurance company, a medical cost containment project and a fraud detection project might share
data that contain features of an insurance accident claim. Many demonstrations in this course would
be classified as projects, yet there is only one Enterprise Miner project for the entire course. The
project has many diagrams. Think of each diagram as a separate project. This “best practice” for
teaching economizes class time by eliminating transition periods between opening and closing
projects. Whereas it is an education best practice to use a single Enterprise Miner project to
accommodate many actual projects, a better practice for a typical business setting would be to use
a new Enterprise Miner project for each new business project.
Objectives
• Describe what text analytics is.
• Describe how SAS Text Miner is used with SAS Enterprise Miner.
• Briefly describe concepts related to document analysis.
• Identify some of the main items in SAS Enterprise Miner, including
the SEMMA tools palette.
• Illustrate text analytics with a simple document collection.
Text Analytics
• The terms text analytics, text data mining, and text mining are used
synonymously in this course.
• Text analytics uses natural language processing techniques and other
algorithms for turning free-form text into data that can then be analyzed
by applying statistical and machine learning methods.
• Text analytics encompasses many subareas, including stylometry, entity
extraction, sentiment analysis, content filtering, and content
categorization.
• Text analytics spans many fields of study, including information retrieval,
web mining, search engine technology, and document classification.
This course focuses on the use of SAS Text Miner, a fully integrated component of SAS Enterprise
Miner. SAS has a rich set of other text analytic products. Among these, SAS Text Miner is the
most focused on discovery and prediction.
Visit http://support.sas.com for information about the latest text analytic offerings. Other courses
present topics related to products such as SAS Enterprise Content Categorization and SAS
Contextual Analysis.
The two major components of data mining are pattern discovery (or exploratory analysis) and
predictive modeling. Text analytics generally covers the two broad areas of information retrieval
and text categorization, and these two areas often map into one of the two data mining
components. Because the abundance of text mining application areas can be overwhelming, some
clarification can be achieved by looking at what all text mining projects have in common. Reference
material on text analytics often includes different sub-categories of text mining. For example, Miner
et al. (2012) list 20 specialty areas ranging from document matching to web content mining. You can
also think about the two process-based categories: (1) projects that are almost exclusively text
analytics projects, such as automatic classification of tech support calls, and (2) projects with a more
general purpose, such as predicting insurance fraud, where text analytics supplies some of the
component parts of the solution.
Text Mining
Text mining as presented here has the following characteristics:
• operates with respect to a corpus of documents
• uses one or more dictionaries or vocabularies to identify relevant terms
• accommodates a variety of algorithms and metrics to quantify the
contents of a document relative to the corpus
• derives a structured vector of measurements for each document relative
to the corpus
• uses analytical methods that are applied to the structured vector of
measurements based on the goals of the analysis (for example, groups
documents into segments)
If you are not familiar with text analytics, one of the first books you should read is by Berry and
Browne (2005), who describe the mathematics behind search engine technology. Chakraborty et al.
(2013) provide a practical guide to text analytics using SAS software.
Discussion
What corpus will be used for your first
text mining project?
A simple example using 47 artificial documents shows how SAS Text Miner turns documents into
numbers. Using your experience with search engines as an information retrieval tool will help you
understand how the process works.
This demonstration illustrates some text analytics concepts using a simple document collection.
Note: In this class, the project that you use and the diagrams are already set up, at least partially.
However, for each demonstration, you can rebuild your own version of each diagram. In
some cases, you can make additions to an existing diagram. Complete instructions are
provided even if the process flow has already been created for you.
If you have ever used a search engine, you have probably employed algorithms used by SAS Text
Miner. One of the more popular algorithms is called Latent Semantic Indexing. SAS Text Miner uses
a Latent Semantic Analysis algorithm, an extension of Latent Semantic Indexing developed to permit
applications beyond information retrieval.
During the development of this course, a search was initiated using the simple search keyword lions.
A specific commercial search engine was used, and the top 25 recommendations were considered.
This search engine provided the following top five recommendations.
1. A Wikipedia link on the word lions
2. The Twitter link for the Detroit Lions of the National Football League (NFL)
3. A link to the Detroit Lions website
4. A link to an ESPN web page dedicated to news coverage of the Detroit Lions
5. A link to a news article about a game played the day of the query featuring the Detroit Lions
Recommendations 10, 11, and 12 linked to articles related to African lions, as did most of the
remaining recommendations in the top 25. Other recommendations included links for the Lions Club,
a non-profit civic organization. Using a different commercial search engine, and examining the top 25
recommendations as before, the third recommendation was a link to an organization that promotes
humane treatment of captured African lions housed by zoos or animal preserves, and the fourth
recommendation was a link to the Lions Club. Recommendation 1 linked to Wikipedia, and
recommendations 2 and 5 linked to information about the Detroit Lions NFL team. The different
results can be attributed to many factors. Search engines are dynamic, and they dynamically react
to a user’s personal search history as well as to the dynamically changing Internet. One reason for
different results is that the browser used to initiate the searches had no history of any searches using
the second search engine, so the second search engine had no data to learn user preferences.
A general problem in text mining is word sense disambiguation. A search engine cannot
disambiguate lions into the proper category of African lions or any animal classified as a lion, the
professional sports team Detroit Lions, or the civic organization the Lions Club. The use of a search
history can help the search engine guess what information the user is seeking. If a high percentage
of recent searches were for NFL scores, American football scores, sports teams, or the sport of
American football, then results like those encountered above are likely to be satisfactory to the
searcher. In fact, on the day of the query, the browser had been used to make numerous searches
using the first search engine to get updates on scores and to find game schedules for the NFL.
To see how search engines dynamically learn from individual users, you can perform an experiment
with friends or colleagues. Search on a simple term like lions. Have others perform the same search
using the same search engine on their individual computers. The top 25 results should be different,
at least in ranking. Consider the top five results listed above. A search by the same person on the
same computer using the same browser, but on a different day produced different results. The
results also differed from those obtained by a colleague using a different computer. Langville and
Meyer (2006) provide some insight into search technology for the popular Google search engine.
As mentioned before, Berry and Browne (2005) introduce search engine mathematics, and their
book is highly recommended.
In the simple 47-document corpus described above, the term lion or lions appears in seven
documents. All seven documents are classified as animals documents.
Zooming in on the text, you can verify that these are lions documents.
The above results were obtained using the Filter Viewer in the Text Filter node. The Filter Viewer
accommodates queries and is explained later. SAS Text Miner provides several strategies for
retrieving information from a document collection. The simple example was chosen to illustrate basic
concepts and avoid the complexities of word sense disambiguation. Lion always refers to the animal
lion in the simple collection.
If you want to find information on the Internet, you might use Google or Bing or some other search
software. This demonstration shows you ways to find information in a document collection using
SAS Text Miner.
Your instructor will explain how to start SAS Enterprise Miner for your computing environment.
On Microsoft Windows clients, you can use Start ⇒ All Programs ⇒ SAS ⇒ SAS Enterprise Miner
Client. If you are using the workstation version, then select SAS Enterprise Miner Workstation.
The version number is included in the selection (for example, SAS Enterprise Miner Client 15.1).
Alternatively, there might be a desktop icon for Enterprise Miner. If so, simply double-click the icon,
and the SAS Enterprise Miner login window appears. There is no login window for the workstation
version of Enterprise Miner.
There might be differences in the SAS Enterprise Miner main menu depending on version number.
The version number is given in the menu.
SAS Enterprise Miner provides an analytical laboratory for solving business and scientific problems.
Project panel
Properties panel
Help panel
Diagram workspace
Process flow
Node
The available SAS Enterprise Miner tools are contained in the tools palette. The most commonly
used tools are arranged according to a process for data mining referred to as SEMMA. This
is an acronym for the following:
Sample You sample the data by creating one or more data tables. The samples should be large
enough so that you have confidence in the reliability of the results.
Explore You explore data to better understand relationships, anomalies, and problems.
Modify You modify the data by cleaning, selecting, and transforming the variables considered for
modeling.
Model You model the data using the available analytical tools.
Assess You assess and compare alternative models to find the best results that you can obtain.
Additional tools (nodes) are available on the Utility tab. Additional nodes can be licensed, such
as the Credit Scoring nodes and the Text Mining nodes.
The Help panel provides a terse description of a particular property. You can access the full
documentation for SAS Enterprise Miner and SAS Text Miner by selecting Help ⇒ Contents.
Details about features and algorithms, such as singular value decomposition, help you understand
how to use the software.
Having covered the basics of the SAS Enterprise Miner interface and SEMMA methodology, you
need to understand how to construct a diagram and a process flow. To create a new diagram, select
File ⇒ New ⇒ Diagram, or right-click the Diagrams entry in the project panel and select Create
Diagram. Enter the name of the diagram in the window that appears.
A process flow usually begins with one of the following: Input Data Source node, File Import node,
or Text Import node. These three nodes bring the data into a process flow.
A process flow is created by dragging nodes into the diagram workspace and then releasing the
node at the desired location. To connect two nodes, move the cursor to the right of the first node until
you see a pencil icon. Then click and drag the arrow until it touches the second node.
When the arrow touches the second node, you can release the mouse button.
Your instructor will provide additional information, such as how to move or delete groups of nodes.
Information about data sources in SAS Enterprise Miner will be provided after the conclusion of the
demonstration.
Consider the exported data property of the Text Cluster node. You can see the choices for browsing
or exploring the exported data.
In the above window, only the TRAIN data set exists. If you select the TRAIN data, you get access to
the three buttons at the bottom.
The browse button opens a scrollable view of the data. The explore button provides a common
interface for exploring data.
The guide Getting Started with SAS® Enterprise Miner™ is available from
http://support.sas.com. The guide provides examples of exploring data with SAS Enterprise Miner.
Because an exploration is dynamic, you do not want to copy a large amount of data over the network
for exploration. Consequently, the explore window shows only explorations based on a sample of the
data. Use Options ⇒ Preferences to control how explorations are performed. In particular, you
probably want to change Sample Method to Random and Fetch Size to Max. Following is an
example of a plot obtained using the Plot Wizard and the 3D Scatter Plot option for the TRAIN data
exported by the Text Cluster node.
(In the plot, the three visible groups of points correspond to the Sports, Animals, and Weather
clusters.)
Dictionary Any list of words, with an optional role or weight attribute (or both) for each word.
Term A token (character string) or group of tokens (multi-word terms) having a specific
meaning in a given language.
Start List A dictionary of relevant terms to be used in the analysis.
Stop List A dictionary of irrelevant terms to be ignored for the analysis.
Stemming Mapping variations of a term to a single parent term. Variations can arise from
verb tense (present/past), singular/plural forms of nouns and verbs, or
grammatical gender (in the so-called Romance languages).
Synonyms More general than the usual definition of synonyms used in a language arts class.
In text mining, a synonym is any child term mapped to a parent term, where the
mapping can result from true synonymy, from mapping misspelled words to
correctly spelled words, or from stemming.
For any corpus, you must specify either a start list or a stop list. A default stop list is provided
in the language that you are using for the analysis.
Some additional details, such as why certain properties are selected, are helpful but not required
to complete the project. For example, the Text Cluster node Max SVD Dimensions property
defaults to 100, but you will change this to 3 in the demonstration. Detailed explanations of the
properties are provided in subsequent chapters.
Weather-Animals-Sports Project
1. Insert a File Import node in the diagram. The File Import node is on the Sample tab. This is the
first node on the left that you see in the diagram above. In this demonstration, you use a data set
that is completely stored in a single Microsoft Excel spreadsheet. Importing a file is one way of
getting relatively small text mining data sets into SAS Enterprise Miner. On the properties panel
for the File Import node, specify the import file as the data set
D:\workshop\winsas\DMTX51\WeatherAnimalsSports.xls. Run the File Import node.
2. To see the data set after the File Import node is run, go to the Exported Data line of the
properties panel. Click the ellipsis button (…). Then select the Train data and click Browse near
the bottom of the window. You see the rows of the data set. The first seven rows are shown
below.
The data set has two fields: Target_Subject (with values A, S, W) and TextField, which consists
of short sentences. As indicated earlier, the sentences are about one of three subjects: animals
(A), sports (S), and weather (W).
Note: It is important to understand that the Target_Subject field was created by a person
interpreting the content of each TextField. It was not created automatically by the Text
Miner nodes. Consequently, the labeling of each document is subject to human error.
3. Read through a few of the rows and make sure that you understand the nature of the data set
and how it is structured. The variable TextField is what is referred to as a document. All the rows
of TextField together (47 rows of data) are referred to as the corpus.
4. You can save the SAS table derived from the Excel file by attaching a Save Data node from the
Utility tab to the File Import node. Use the properties of the Save Data node to name the table
and specify the library where the table is to be stored.
5. Attach a Text Parsing node to the File Import node. This node has the language processing
algorithms and has many different options that can be set by the user. For this demonstration,
use the default settings. Run the Text Parsing node.
6. Attach a Text Filter node to the Text Parsing node. Change the Minimum Number of
Documents value in the properties panel to 2. This option filters out terms that are not used
in at least two documents in the corpus collection. Because you use a very small data set,
the default value 4 is too large and eliminates too many terms. Run the Text Filter node.
7. Open the Filter Viewer in the properties panel. This is also called the Interactive Filter Viewer.
8. Look at the two main windows that open in the Filter Viewer. You see what is shown in the
display below.
The first window, labeled Documents, simply lists each document and any other variables
in the training data set (in this case, only the variable Target_Subject). The second window,
labeled Terms, gives information about each of the terms that came out of the Text Parsing
node. A term does not have to be a single word. The term table contains the corpus dictionary—
that is, it contains every term in the document collection defined by the training data set, after
certain parts of speech have been eliminated. To be technically accurate, the term table is
related to the corpus dictionary, but it actually represents a mapping of the corpus dictionary into
a set of terms influenced by Text Parsing and Text Filter properties. For example, you can force
certain terms, such as addresses, compound words, or dates, into the term table.
For this example, the entire 47-document collection is used as a training data set. SAS
Enterprise Miner accommodates the use of three data sets for analysis: train, validate, and test.
You can use the Data Partition node on the Sample tab to partition a raw data table into any or
all of the three analysis tables. The purpose of the three data tables is explained later. The
validate and test data sets are used to verify that scoring models or algorithms generalize to new
data. Because this is an exploratory analysis, only a training data set is used.
The Terms window contains the following information:
FREQ = number of times the term appears in the entire corpus.
#DOCS = total number of documents in which the term appears.
KEEP = whether the term is kept for calculations. The keep status reflects whether a
term is in the start list (KEEP=Y) or in the stop list (KEEP=N).
WEIGHT = a term weight. Term weights are explained later. The default term weight is
mutual information when a categorical target variable is present in the data.
ROLE = part of speech of the term (if the Different Parts of Speech property is set
to Yes).
ATTRIBUTE = one of abbr, alpha, mixed, num, or punct. Attributes num (number) and punct
(punctuation) are ignored by default.
Go to the Terms window and confirm that the word the is not listed. (If the TERM column
is not already in alphabetical order, you can sort a column by clicking on the heading.)
Why does the most common word in the English language not appear in the list? To understand
why, click the Text Parsing node so that the properties panel for that node is visible. Look
at the properties near the bottom. You can see that there is an Ignore Parts of Speech property.
By default, this excludes certain terms that are very common.
In particular, 'Det' represents Determiner, which is a class of common words and phrases such
as the, that, and an. These are eliminated unless you modify this property. Eliminating a word
because of the Ignore Parts of Speech property is different from adding a word to the stop list.
Words in the stop list appear in the term table, but are assigned a weight of zero. Words that are
ignored are excluded from the term table. The distinction is that words in the stop list can be
moved to the start list dynamically without re-parsing the document collection, whereas words
excluded from the term table can be added only by modifying properties in the Text Parsing node
and re-running the node. Because parsing typically consumes 80% to 90% of the processing
time for a Text Miner process flow, you want to avoid re-parsing, especially for very large
document collections.
Go back to the Text Filter node. Why are some of the terms kept (KEEP=Y, the check box is
selected) but others not kept (KEEP=N, the check box is cleared)? There are several reasons why a word
is not kept, and these can depend on settings in both the Text Parsing node and the Text Filter
node. One reason is that the word does not appear in enough documents, such as what
happens for the word antelope. You previously set the Minimum Number of Documents
property to 2 for the Text Filter node. Because antelope occurs in only one document, it is not
kept.
Another reason a term is not kept is if it appears on a stop list specified in the Text Parsing node.
The default stop list for the English language is SASHELP.ENGSTOP. If you open the stop list
by clicking the ellipsis icon, you see a list of many terms that are excluded from further
computations.
If you open SASHELP.ENGSTOP from the Text Parsing node, you see that all is listed as a term
not to be used, as in the display below. Therefore, all is selected as KEEP=N in the Text Filter
node.
The interface for accessing the stop list is dynamic. You can edit the table directly using the two
buttons on the upper left. Use the starburst button to add terms, and use the delete button
to delete selected terms. You can append to the table using the Add Table button, or you can
replace the table with a different table using the Replace Table button. The table interface
is common to the following properties.
Node Property
The variables included in a table might be different, but the interface is the same.
If you select Replace Table, you have the option to specify that no table is to be used.
9. Return to the Text Filter Viewer. In the query window, type lions. Click the Apply button.
Five documents exhibit the word lions. The TEXTFILTER_RELEVANCE score is a function of the
word frequency normalized by the highest frequency encountered. For example, the word lions
appears twice in the first document and once in each of the remaining documents. Thus, the last
four documents have a relevance score of 1/2 = 0.5. The calculation is more complicated for
compound queries.
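The exact formula is not documented, but the frequency-normalization idea can be sketched in a
short SAS program. The document names and counts below are hypothetical and simply mirror the
single-term query results described above.

data qfreq;
   /* hypothetical counts of the query term "lions" in the
      five documents that contain it */
   input doc $ freq;
   datalines;
docA 2
docB 1
docC 1
docD 1
docE 1
;
run;

proc sql;
   /* normalize each document's term frequency by the
      maximum frequency observed for the query term */
   select doc, freq, freq / max(freq) as relevance
   from qfreq;
quit;

The query produces a relevance of 1.0 for the first document and 0.5 for the rest, matching the
single-term behavior described above.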
Now change the query to the singular term lion and click Apply. Only two documents are returned.
This verifies that the query is based on searching for words, not for sequences of characters.
If the search were for any occurrence of the letters l-i-o-n, then seven documents would be found.
The search feature applies what could be called a token-based search, as contrasted with a
character-string-based search.
Note: A token is a string of characters separated from other tokens by a separator, where a
separator is usually a blank (space) or mark of punctuation. Words in a document are
tokens.
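As a rough illustration, the SCAN function in a SAS DATA step splits a string into tokens at blanks
and punctuation. This is only a sketch of the idea; the parser used by SAS Text Miner is
considerably more sophisticated.

data tokens;
   length token $ 32;
   text = "The lions roared, and the rain fell.";
   i = 1;
   token = scan(text, i, " ,.!?");   /* blanks and punctuation act as separators */
   do while (token ne "");
      output;                        /* one output row per token */
      i + 1;
      token = scan(text, i, " ,.!?");
   end;
run;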
The operator ># can be used to find documents containing the word or any synonym or stemmed
version of the word. Thus, >#lion returns all documents that contain lion or lions as words.
Seven documents contain one or both of lion or lions. The query is a compound query, looking
for multiple words. Note that all seven documents are classified as animals documents.
You can use the TEXTFILTER_RELEVANCE score to rank order documents with respect
to a query. Consequently, the Filter Viewer provides search capabilities similar to those found
in commercial Internet-based search engines.
Note: The TEXTFILTER_RELEVANCE score 0.987 for the third and fourth selected documents
shows that compound queries use a more complex calculation than simply comparing
word frequencies. Otherwise, a simple frequency-based calculation would produce
a score of 0.5. As mentioned above, the actual formula used is not documented.
The Filter Viewer supports fewer query operators than a typical Internet search engine.
Although the Filter Viewer provides a useful mechanism for information retrieval, because
it is based on searching for tokens, it is not as powerful or as efficient as a linear-algebra-based
search, such as Latent Semantic Indexing. The Text Topic node facilitates linear algebra type
queries.
Note: The linear algebra approach to information retrieval derives a numeric vector for each
document and translates the query into a numeric vector. Calculations, like vector inner
products, are used to evaluate how well a document satisfies a query. The highest
scoring documents as scored by the vector inner product are returned by the search
engine. If a commercial search engine uses a linear algebra approach, the actual search
software probably incorporates many other tools and features in addition to linear
algebra calculations. Translating the query into a numeric vector might use proprietary
algorithms that use recent search history information. Initial results can be reweighted
based on current search trends influenced by all users of the search engine. In a general
sense, many search engines “learn” by recording which links a user clicked and which
links were ignored. There might be success/fail measures related to how long it took a
user to abandon a link and return to the original search results.
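To make the inner-product idea concrete, the following PROC IML sketch scores two documents
against a query using made-up three-dimensional vectors. In practice, the vectors come from a
decomposition of the term-document matrix, such as the singular value decomposition used by
SAS Text Miner.

proc iml;
   /* hypothetical reduced-dimension vectors for a query and two documents */
   q  = {0.9, 0.1, 0.2};
   d1 = {0.8, 0.2, 0.1};
   d2 = {0.1, 0.9, 0.7};
   /* cosine similarity: inner product scaled by the vector lengths */
   cos1 = (q` * d1) / (sqrt(ssq(q)) * sqrt(ssq(d1)));
   cos2 = (q` * d2) / (sqrt(ssq(q)) * sqrt(ssq(d2)));
   print cos1 cos2;   /* the higher-scoring document is ranked first */
quit;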
Many text-based software products support a FIND function, and SAS Text Miner is no
exception. You can select Edit ⇒ Find and enter a search term. If you enter lion, the Find
feature steps through all seven documents found above. Find jumps to the document in the
Documents or Terms window, but it does not subset the document collection like the Search
window. The Find feature is a character-string-based find, as compared to a token-based find.
Thus, if you enter lion in the Find Text window, you will identify documents with the words lion,
lions, lioness, ganglion, and so on.
SAS programmers might suspect that the query is equivalent to using a SAS function such
as INDEX or FIND. See “Text Mining Basics for SAS Programmers” at the end of this chapter
to learn more about how the Text Filter Viewer finds documents and terms.
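For example, the following DATA step contrasts the character-string-based INDEX function with
the word-oriented FINDW function. Only FINDW behaves like the Filter Viewer query, because it
matches lion only when it occurs as a separate word.

data _null_;
   text = "A ganglion is not a lion.";
   pos_index = index(text, "lion");        /* nonzero: matches inside "ganglion" */
   pos_findw = findw(text, "lion", " .");  /* nonzero only for the whole word "lion" */
   put pos_index= pos_findw=;
run;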
11. The next few steps introduce the two main analytic text mining tools, the Text Cluster node
and the Text Topic node. Attach a Text Cluster node to the Text Filter node. The Text
Cluster node takes the 47 documents in the example data set and separates them into
mutually exclusive and exhaustive groups (that is, clusters). The number of clusters
to be used is under user control. You modify four of the default settings.
Use the indicated settings for the Text Cluster node.
Regarding the Text Cluster properties, remember that you are using a very small and simple data
set. You know that there are basically three types of documents (animals, sports, weather). It is
reasonable to think in terms of creating a small number of clusters (for example, 3 to 5). Use 3.
In practice, with real and complex text data, you want to experiment with these parameters. Run
the node.
12. Open the Text Cluster node results and examine the left side of the Clusters window as shown.
Exactly three clusters were created, as requested in the properties panel. The Descriptive
Terms column shows up to 15 terms that are given to help the user interpret the types of
documents that are put into each cluster. (The number can be changed.) These terms are
selected by the underlying algorithm as being the most important for characterizing the
documents placed into a given cluster. Reading these, you can see that Cluster 1, which has
16 documents, has terms such as favorite zoo, big cat, and so on. These documents are likely
about animals. The + indicates that a term has multiple versions either from stemming or from
having synonyms. Cluster 2 has 14 documents that are likely related to sports. Cluster 3 has
17 documents that likely deal with weather.
Note: Stemming is the process of mapping a collection of terms into a single term based on
verb tense or noun/verb singular/plural considerations. For example, you have already
seen that lions is a stemmed version of lion. Similarly, cooks, cooked, and cooking are
all stemmed versions of cook. The Text Parsing node distinguishes between all forms
of a word, but treats the term and all stemmed versions of the term as a single term
when counting words and exploring word associations.
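If you eventually maintain your own synonym list, it is an ordinary SAS data set. The sketch below
assumes the commonly used columns TERM, PARENT, and TERMROLE; check the Text Filter node
documentation for the exact variables that your version expects.

data work.mysynonyms;
   length term parent termrole $ 32;   /* assumed column names */
   input term $ parent $ termrole $;
   datalines;
lions    lion   Noun
cooked   cook   Verb
cooking  cook   Verb
;
run;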
13. To see the new variables that were generated by the Text Cluster node, close out of the results.
Select Exported Data from the properties panel.
The upper right window (Sample Statistics) shows a list of variables that were exported from
the Text Cluster node.
Several new variables have been added to the original variables Target_Subject and TextField:
14. SAS Enterprise Miner provides many ways to do further explorations. The StatExplore node
provides basic statistics and crosstabulations for input variables. On the Utility tab, drag
a Metadata node into the diagram and attach it to the Text Cluster node. Change the role
of Target_Subject and TextCluster_cluster_ to Input.
15. Run the Metadata node. On the Explore tab, drag a StatExplore node into the diagram and
attach it to the Metadata node. Select the Cross-Tabulation property and add the
Target_Subject*TextCluster_cluster_ crosstabulation. To do so, select each input, select
the right arrow, and then select Save when both inputs are selected.
This crosstabulation shows that cluster 1 (which was seen previously to have descriptive terms
such as favorite zoo, big cat, and so on) consists of 16 documents, all labeled animals (A) by
the human reader. Cluster 3 (hot weather, winter day, and so on) consists of 17 documents: 16
were labeled weather-related, and one was labeled animal-related. Cluster 2 (basketball team,
play, and so on) consists of 14
documents with a target value always equal to S. The three clusters line up almost perfectly
with the labels given to the documents. There is a single misclassified document. It would
be wonderful if real data worked out this well, but do not expect that!
Because the data set is so small, you can simply examine the exported data to find the single
misclassified document. In the Text Cluster properties panel, select Exported Data. Select the
TRAIN data set and then click Browse. Scroll to the bottom. Notice that document 44 is the
misclassified document (_TextCluster_cluster_=3, Target_Subject=A).
17. The Text Topic node is used to identify topics in a document collection. Although a cluster
is a mutually exclusive category (that is, each document can belong to one and only one cluster),
a document can have more than one topic or it can have none of the derived or user-specified
topics. Attach a Text Topic node directly to the Text Cluster node. Make one change to the
default properties by specifying 3 as the number of multi-term topics to create. Just as the
number of clusters created is a parameter with which you want to experiment when you use the
Text Cluster node, this parameter for the number of topics to create is typically something that
you might try with different values. In this example, the artificial data set was purposely created
with three different topics, so a reasonable value to start with would be 3 to 5 and not the default
value of 25. You use 3.
18. Run the node. Then click the ellipsis for Topic Viewer on the properties panel. The Topic Viewer
is an interactive group of three windows. The Topics window shows the topics created by the
node.
The three topics created by the algorithm also have key descriptive terms to guide interpretation.
The five most descriptive terms for each topic are shown. By default, the first topic is selected
when you open the viewer. In this example, the first topic has descriptive terms starting with
snow, hot, …, and seems to relate to weather. The second topic has descriptive terms lion, tiger,
…. This is evidently a topic related to animals. The descriptive terms for the third topic (baseball,
basketball, …) are interpretable as having to do with sports. With this simple data set, the
algorithm did very well in identifying what are known to be the three underlying topics in the
documents. However, the # Docs column indicates that the node did a poor job of classifying
documents. At most, 25 documents have been associated with one or more of the three topics,
leaving at least 22 documents with no topic assignment.
The Text Topic node adds a number of variables to the exported data set.
TextTopic_raw1 - TextTopic_raw3 – These are numeric variables that indicate the strength that
a particular topic has within a given document. Three topics were generated because this was
specified on the properties panel. These variables are the same as the topic weight values for
the documents given in the Documents window of the interactive Topic Viewer. Each of these
variables (topics) has a label (the five most descriptive terms) to identify it and help the user
interpret the topic.
TextTopic_1 - TextTopic_3 – These are binary variables defined for each document and
constructed from the TextTopic_raw1 - TextTopic_raw3 values based on the document cutoff
values given in the Topic table. For example, TextTopic_1 is set to 1 if a document has a
TextTopic_raw1 value greater than the cutoff value for this particular topic, which in the table
above is given as Cutoff=0.411. Otherwise, TextTopic_1 is set to 0. The labels for the TextTopic
variables are the same as for the TextTopic_raw variables, except that they have _1_0_ as
prefixes. This indicates that they are binary variables. Each label shows the five most descriptive
terms that are identified with that topic.
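The construction of the binary flags is a simple thresholding rule. A minimal DATA step sketch,
using the 0.411 cutoff reported above (the input table name is hypothetical):

data topic_flags;
   set exported;   /* hypothetical name for the Text Topic node's exported table */
   /* flag the document when its raw topic weight exceeds the document cutoff */
   TextTopic_1 = (TextTopic_raw1 > 0.411);
run;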
In the Topics window, there is a column labeled Term Cutoff. For each created topic, the
algorithm computes a topic weight for every term in the corpus. This measures how strongly the
term represents or is associated with the given topic. Terms that have topic weights above the
term cutoff appear in yellow in the Terms window shown below, and terms with topic weight
below the cutoff appear in gray.
The term cutoff for the first topic is 0.178. You can manually change this value, and you might
be compelled to do so when you see that several terms below the cutoff seem to be associated with
weather.
One choice is to select a term cutoff between the topic weights for winter day and animal.
For example, you could change the term cutoff from 0.178 to 0.080.
Another choice is to modify the topic weight for animal, perhaps changing it to zero because
the word is not associated with weather. After “zeroing out” unrelated terms, you could then
select a smaller cutoff. You could select a term cutoff value of 0.050, and use this value to
replace the original cutoff value of 0.178. The following screen capture reflects the latter choice.
When you click Recalculate, the weights and counts are updated.
If you make any change to a topic, the Topic Viewer changes the category from Multiple
(computer-software-generated multi-term topic) to User (custom user-defined topic). If you want
to be technically accurate, there are three categories: (1) software-derived topics; (2) user-
defined topics; and (3) user-influenced topics. The third category is not identified by SAS Text
Miner, but it is useful conceptually to differentiate between pure domain knowledge topics and
computer-generated topics that have been modified based on domain knowledge.
Pure domain knowledge topic dictionaries maintained by an organization are often independent
of any specific corpus, at least in the early stages of text mining integration. You should
recognize that language and knowledge are dynamic, so dictionaries must be viewed as
dynamic resources that should be routinely updated based on the latest data and domain
knowledge. When data contradict domain knowledge, you need to modify domain knowledge,
or improve the data, or both.
Note: Text analytics dictionaries, including synonym and topic dictionaries, should be treated
like software in a development environment. You should incorporate something like a
source code control system that permits changing dictionaries while keeping track of the
changes through something like software version control. You can use software metrics
for concepts such as “maturity” to see whether dictionaries have stabilized. Just like what
you often witness with new software projects, you are likely to see many changes in the
early stages of text analytics implementation. When text mining becomes a routine
activity, the number of changes to text analytic dictionaries should drop off substantially.
SAS does not provide version control software.
19. The manually entered changes add seven terms to the topic definition, but only increase
the document count from seven to nine. You know that there are 16 weather documents,
so additional changes are required to improve the topic definitions. The Document Cutoff for
the first topic is 0.411. Examine the document table. The 14 terms and their corresponding
weights do a good job of rank ordering the documents with respect to weather.
20. The last document selected to be associated with topic 1 has a topic weight of 0.414, and is
shaded in yellow. Documents below the cutoff are shaded in gray. If you changed the cutoff from
0.411 to 0.060, you would identify 17 documents as exhibiting topic 1, and all 17 would be in
cluster 3 derived by the Text Cluster node, including the ambiguous document that has been
labeled A. Change the cutoff to 0.060 and click Recalculate.
21. Because you are using domain knowledge to exploit the successful rank ordering of documents
into topic 1, the weather topic, you might as well edit the topic description. Click the cell for topic
1, and replace snow,+hot,weather,+cold,winter with Weather.
The following topic table reflects changes to Term Cutoff and Document Cutoff for the
remaining two topics. These changes improve the identification of documents related to the
sports and animals topics.
22. You can make one more change to get seemingly perfect results. Select the animals topic.
Change the topic weight for monkey to 0.150. You change a topic weight by clicking the topic
weight cell that you want to change, and then use the edit keys (Backspace and Delete)
if necessary to type the replacement value. When you recalculate, you will see that 17
documents are classified as exhibiting the animals topic. There is one document that is flagged
as both a weather topic and an animals topic. This ambiguous document is the same one that
was identified by the StatExplore node. Close the Topic Viewer, and save the changes that you
made.
23. To exploit the custom topic capabilities of the Text Topic Viewer, copy and paste the Text Topic
node and attach the copied node to the Text Cluster node. Rename the new Text Topic node
“Custom Topics.” Change the number of multi-term topics to zero. Run the new node, and then
open the Topic Viewer. The following table shows that the ambiguous document, the one
beginning with “If you like hot weather,” is just above the document cutoff.
24. Examine the weather topic. You can see that the ambiguous document receives a relatively high
topic weight.
25. Close the Topic Viewer. Select User Topics and observe the custom topic table created by your
original edits in the Topic Viewer.
You can see the name of the table: EMWS2.TextTopic2_INITTOPICS. The table is managed
by Enterprise Miner, and you have no easy way to export the table. However, because you know
the table name, you can use a SAS Code node to make a permanent copy of the table. If you
do so, then the table can be used for other projects. You specify the table with the User Table
interface. You can replace an existing table, or add (append) to an existing table.
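For example, code like the following in a SAS Code node would make a permanent copy. The
library path is illustrative; use a location appropriate for your site.

libname topiclib "D:\workshop\winsas\DMTX51";   /* illustrative location */

data topiclib.custom_topics;
   /* permanent copy of the table managed by Enterprise Miner */
   set EMWS2.TextTopic2_INITTOPICS;
run;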
Note: If you created the process flow exactly as described, the custom topics table name would
actually be EMWS2.TextTopic_INITTOPICS because the custom table is derived from
the original Text Topic node that was placed in the diagram.
26. The final part of this demonstration is to use the Score node to score a new data set. Following
the top part of the diagram shown at the beginning of this demonstration, bring in a new File
Import node. Rename it File Import Score Data. The import file for scoring is
D:\workshop\winsas\DMTX51\Score_WeatherAnimalsSports.xls. (Pathname variations are
possible as stated above.) In the properties panel, change the role of the data set to Score.
27. Run the node and look at the Exported Data window. This SCORE data set has 16 documents.
They are related to the three subjects (animals, sports, or weather). (As is usually the case with
a data set to be scored, there is no target field on this data set.) The object now is to classify
these documents using the previous analysis. To do that, bring in a Score node and connect
it to the output of the File Import Score Data node and also to the output of the last Text Topic
node (Custom Topics).
28. Run the Score node. Then go to the Exported Data window through the properties panel. Select
the SCORE data to view and click Browse.
29. When the Browse window appears, move the column headings so that TextField is the first
column heading and the other scored segment values are to the right of the text field. Recall that
the clusters lined up as 1=animals, 2=sports, 3=weather. The custom topic segment variables
are clearly labeled. Read through the 16 rows and check to see whether any of the
classifications looks incorrect to you.
Do any of the topic indicators disagree with the cluster segments? Which observations appear
to be misclassified? In particular, document 1 is classified as cluster 1 (animals), but the three
binary text topic variables are zero, indicating that the first document exhibits none of the three
topics. Document 7 is misclassified by the cluster ID and by the topic flags. Document 16 is
ambiguous and could be both weather and animals. It is classified as a weather document by the
cluster ID and the weather binary variable, but it could also be classified as an animals
document.
Because of language challenges such as word sense disambiguation, the underlying text mining
and modeling algorithms make mistakes. In this case, the very small number of training
examples that were used likely influenced the quality of the results. Overall, the results look very
promising.
The description of text mining provided earlier mentions dictionaries or vocabularies. The terms
dictionary and vocabulary are interchangeable in the literature on text mining. One author might
describe how “a person’s vocabulary is like DNA in uniquely identifying an individual.” Another author
might provide a dictionary of action verbs to help score documents to achieve some analytical
objective. The document collection has a dictionary or vocabulary that is the union of all the terms
contained in each document. Consequently, text mining references use dictionary or vocabulary
to refer to the collection of terms that are used in the analysis. For convenience, this course will tend
to use dictionary rather than vocabulary to refer to terms used in an analysis.
The demonstration provided some details about start lists, stop lists, and synonym tables.
These tables have various names in the text mining literature.
The corpus dictionary contains every word used in the corpus. A subset of the corpus dictionary
contains relevant terms—that is, terms that will aid in achieving the analytical objective of the text
mining project. This dictionary of relevant terms is called a start list. Terms not in the start list are
ignored, except possibly for use in determining relative frequencies or other calculations that require
a count of all words in a document.
Zipf’s Law, discussed in a later chapter, helps identify terms in a dictionary that should be included
in an analysis. In particular, Zipf’s Law suggests that terms appearing with low frequency and terms
appearing with high frequency are irrelevant. Identifying irrelevant terms with the aid of Zipf’s Law
and domain knowledge is often easier than constructing a dictionary of relevant terms. The
dictionary of irrelevant terms is called a stop list. By default, the Text Parsing node specifies a stop
list in the selected language. You decide whether to use the default stop list, create your own stop
list, or create a start list. You can also use existing stop or start lists created for your organization.
You specify a stop list or a start list, but not both, because the corpus dictionary will uniquely define
one of these lists when the other is supplied.
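To make this concrete, the following is a minimal sketch of a custom stop list built with a DATA step. The single term column is an assumption that mirrors the shape of the shipped English stop list (Sashelp.Engstop, described later in this chapter); check that table for the exact column names that your release expects before supplying the list to the Text Parsing node.
data work.mystoplist;
length term $32;
input term $;
datalines;
the
a
an
and
of
to
;
run;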
Text mining works with a collection of documents. The collection can be dynamic—that is,
documents can be added to the collection. You can use the collection to train a model, and you can
apply the model to new documents coming into the collection. New documents are scored relative
to how they compare to the original documents in the collection. If a new document contains a new
term, then text mining is ignorant of this new term until that document is used in a new training step.
Text mining training can be performed using only nodes within the SAS Text Miner group of tools.
However, SAS Text Miner nodes export data, and these data can be imported into pattern discovery
and predictive modeling nodes of SAS Enterprise Miner. Thus, a trained model can be obtained by
using a combination of SAS Text Miner nodes and SAS Enterprise Miner nodes. Although many
commercial text mining products have strong text-analytics capabilities, many of these lack data
mining capabilities beyond text analytics. The ability to score new documents using a decision tree
or a neural network presents new opportunities to improve text mining outcomes (for example,
making it possible to use variables derived from text analytics in predictive models).
Text Scores
The value or values associated with a document can be
• segment identifiers related to text categorization or more general
predictive modeling
• cluster identifiers related to grouping documents based on similarity
of content
• probabilities of membership in segments or clusters
• numeric values representing document content based on weighted
averages of transformed word frequencies.
As new documents appear, they can be scored using the model trained on the original corpus.
Eventually, the model can be updated by being retrained on the corpus with the new documents
added.
Some data mining references imply that scores are associated with predicting a target variable
in a supervised learning setting. The above slide makes it clear that any numeric or class variable
added to a data set by a Text Miner node can be treated as a score. This interpretation makes
it clear that you can add a Score node after a Text Cluster, Text Profile, or Text Rule Builder node.
The Score node will add the numeric or class variables to a score data set. This is appropriate for
supervised and unsupervised learning problems.
There are unique challenges in scoring new documents. A score data set in SAS Enterprise Miner
contains all of the features necessary for scoring, including the document. For supervised learning,
a score data set does not need to have a target variable, because the goal of supervised learning
is to predict a target variable when it is not known. To produce scores for a single observation, the
document must be parsed. Because score data is “new” data, it is possible that the score document
contains words that are not in the corpus dictionary. All new words must be treated as exclusion
words. Zipf’s Law suggests that any new words encountered will fall into the large collection of low
frequency terms. As mentioned above, low frequency words are usually added to the exclusion
dictionary (stop list).
SAS Text Miner uses documents in the training data only to modify the dictionaries supplied by the
user. Unlike some predictive methodologies in data mining, text mining does not use validation data
for tuning except in the Text Rule Builder node. The validation data have no impact on the scores
that are produced by the Text Cluster and Text Topic nodes. For comparison, decision trees are allowed to use the validation data for pruning; because pruning modifies the tree, the validation data affect the derived scoring mechanism and hence the scoring calculations. The Text Rule Builder node uses the validation data to influence rule derivation.
Hence, the Text Rule Builder node was not used in the first demonstration. Opinions vary, but you
probably want at least 1,000 documents with at least 100 documents in the smallest target category
before using the Text Rule Builder node.
Experts differ in how they characterize applications of text mining. For example, Miner et al. (2012)
describe the following three areas: information retrieval, information summary, and information
extraction, whereas the slide above lists only information retrieval.
If you have N documents, and M documents belong to the desired category, the possible outcomes from a query or classification request are given in the following table:

                            Retrieved   Not Retrieved   Total
In Selected Category        TP          FN              M
Not in Selected Category    FP          TN              N-M

Precision is given by

Precision = 100*TP/(TP+FP)

Recall is given by

Recall = 100*TP/(TP+FN) = 100*TP/M

The misclassification rate is

Misc = 100*(FP+FN)/N

The F1 score is the harmonic mean of precision and recall:

F1 = 1/(0.5*(1/Precision)+0.5*(1/Recall))
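As a worked example, the following DATA step computes these measures from the four cell counts; the counts here are illustrative values, not the results of an actual query.
data work.querymetrics;
TP=40; FP=10; FN=20; TN=130;   /* illustrative cell counts */
N=TP+FP+FN+TN;                 /* all documents */
M=TP+FN;                       /* documents in the desired category */
Precision=100*TP/(TP+FP);
Recall=100*TP/M;
Misc=100*(FP+FN)/N;            /* misclassification rate */
F1=1/(0.5*(1/Precision)+0.5*(1/Recall));
run;
proc print data=work.querymetrics;
run;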
The Text Cluster, Text Topic, and Text Rule Builder nodes add columns to the imported data sets that
are related to scoring. These columns will also be added to score data sets if scoring is requested
through the Score node. Scoring usually creates one or more columns with the role of segment or
prediction, but some of the columns that are added have roles of Input and Rejected. The
Enterprise Miner Metadata node can be used to change variable roles.
Note: Data mining analysts know how a predictive model scores new data. However, some
analysts might be unaware that unsupervised learning models (that is, data without a known,
available target) can also generate scores, and new data can be scored using the model.
For example, the Text Cluster node divides a document collection into mutually exclusive
clusters. For the expectation-maximization algorithm, a new document is scored by
calculating the probability of membership in each cluster, and then it is assigned to the
cluster associated with the highest probability.
Data mining is often described with respect to two general application areas: pattern discovery
(unsupervised learning – no target variable) and predictive modeling (supervised learning – target
variable). (Specific examples of these two application areas are presented in this course.)
For text mining, pattern discovery encompasses the area of information retrieval, and prediction
encompasses the area of text categorization. If you are using text mining to suggest groupings
of documents that do not have pre-assigned labels, you would be categorizing documents in an
unsupervised learning setting.
A text mining project usually falls into one of two areas: information retrieval or text categorization.
If you are working on a project that has text mining components, such as predicting insurance fraud,
then you will be using text mining in a support role so that the general problem is prediction
(supervised learning), and prediction can be enhanced by incorporating text mining input variables.
Anomaly detection can sometimes be a first step toward creating a target variable if none exists.
Users who are new to the world of analytics often have a naïve notion about noise. Science fiction
movies include computers and robots that speak and understand human languages. Television
police dramas have detectives who make perfect predictions about where crimes will occur. The
reality is that noise permeates existence. You might have an expectation that, after you master text
mining, you can perfectly predict customer behavior based on responses to an online survey. This
expectation is unrealistic.
Psychologists know that human beings might react differently to the same stimulus if sufficient time
elapses between exposures. On Monday, when you are hungry at lunchtime, you eat a sandwich.
Yet, on Tuesday when you are hungry, you opt for a salad. This tendency for different outcomes to
occur with similar inputs is attributed to noise, which is unpredictable. You can predict with almost
certainty that you will eat lunch next Thursday, but you cannot predict what you will eat with the
same certainty. (Of course, if you bet someone a million dollars that you will eat a spinach salad next
Thursday, then you will almost certainly eat a spinach salad!) Analytic experts expect errors in
prediction related to noise, so methods are developed to minimize errors in the presence of noise.
The incremental value that text mining can provide to your predictive models should be assessed by
comparing the quality of a model (accuracy, ROC index, and so on) without incorporating text mining
to that achieved after text mining is added.
(Slide 38 graphic: a scatter plot of two inputs in which the primary and secondary outcomes are perfectly separated by a straight line.)
The above graphic illustrates the pure signal situation. In this case, the training data can be perfectly
separated into primary or secondary outcomes using a linear decision boundary. You rarely expect
to see this in practice. Unfortunately, some people who are new to text mining are disappointed
when the methods do not perfectly categorize documents with this type of accuracy.
(Slide 39 graphic: a scatter plot of INPUT1 versus INPUT2 in which the primary and secondary outcomes are completely intermixed.)
At the other extreme is the pure noise situation. In this case, the training data appears to have no
patterns upon which to base a model that can separate the primary outcomes from the secondary
outcomes. This situation is more common than you might like. Although pure signal is very rare, pure
noise can actually occur in practice.
(Slide 40 graphic: a scatter plot of INPUT1 versus INPUT2 showing a mixture of signal and noise, with partial separation of the outcomes.)
The most common situation in practice is a mixture of signal and noise. You can predict more
accurately than randomly guessing. How well you predict depends on whether data is dominated
by systematic variation or random variation.
(Slide graphic: a scatter plot of X1 versus X2 in which one group of cases is well separated from the rest.)
When no target variable is available, you can still investigate whether a natural separation occurs
in the data with respect to the analytic objective. For example, fraud cases are often unusual in higher-dimensional space because the people who commit fraud have difficulty making the outcomes look normal in many dimensions at once. This example could represent insurance claims data for automobile accidents.
(Slide graphic: the same X1 versus X2 plot showing separation with noise; a few cases fall on the wrong side of the separation.)
Even with good separation, noise is usually present. For the fraud example, most of the claims with
a long distance from claimant to physician and a high ratio of bodily injury to property damage costs
are fraudulent (dark circle), but a few are legitimate cases (light circle). Other fraud cases are not
so separated, perhaps because the fraudulent physician had a practice near the claimant’s home.
In this example, BI is some quantitative measure of bodily injury and PD is a quantitative measure
of property damage.
(Slide graphic: documents plotted on the text mining variables SVD1 versus SVD2, labeled by terms such as physician, chiropractor, dentist, and podiatrist. The fraud documents separate from the no-fraud documents.)
Note: SVD1 and SVD2 are variables that are created in text mining.
A string of fraud rings operating in Southern California in the 1990s had the common elements of a
lawyer, a chiropractor, and a recruiter. The recruiter approached people who received unemployment
benefits and told them that they could obtain worker’s compensation benefits from their previous
employers. The recruiter referred a candidate to an unscrupulous lawyer, who scheduled treatments
with a chiropractor, a partner in the fraud ring. After a few weeks of treatments, the lawyer filed a
claim for three to five times the chiropractor bills (a fairly common practice in insurance litigation).
Claims adjusters often receive training in fraud prevention. When information about the fraud rings
was disseminated, a claims adjuster would often add a comment to the adjuster notes when unusual
activity involving claimant representation by a lawyer and incoming chiropractor bills became known.
Document   National Words   International Words   Subject
1          3                0                     National
2          5                0                     National
3          7                0                     National
4          8                0                     National
5          0                4                     International
6          0                5                     International
7          0                3                     International
8          0                7                     International

Perfect Separation: No Mixing of Subjects
Some document collections are well separated for analytic purposes. The hypothetical example
above shows eight documents, with four that describe national news items exclusively, and the
remaining four describing international news items exclusively. Suppose that you could identify a set
of terms that are associated with national news and another set of terms associated with
international news. These terms could then be used to classify the documents in the corpus.
Document   National Words   International Words   Subject
11         3                1                     National
12         8                2                     National
13         7                6                     Mixed
14         8                1                     National
15         1                4                     International
16         2                5                     International
17         3                3                     Mixed
18         1                7                     International

Good Separation: Little Mixing of Subjects
45
Copy ri ght © S AS Insti tute Inc. Al l ri ghts reserved.
With the same topic and analytic objective, another document collection has documents that might mention a heterogeneous set of news articles. You still get good separation, but noise creeps in because a document can include multiple subjects.
Document   National Words   International Words   Subject
21         3                4                     Mixed
22         8                2                     National
23         7                6                     Mixed
24         8                1                     National
25         4                4                     Mixed
26         6                5                     Mixed
27         3                3                     Mixed
28         1                7                     International

Poor Separation: Substantial Mixing of Subjects
Finally, the above example shows that if you have a collection of documents that mention many topics and mix subjects, then trying to classify documents into clean categories is difficult. However, if you can accommodate a classification system that assigns multiple categories to a document, such as the Text Topic node and topic identification, then you can still successfully apply text mining techniques.
1.2 Working with Data Sources
Objectives
• Describe SAS Enterprise Miner metadata and detail the types of roles
and measurement levels that are supported.
• Explain how to create data sources that can be used by SAS Enterprise
Miner projects.
• Provide examples of data sources that are relevant for text mining.
Workspaces
SAS Enterprise Miner organizes projects by placing components of the project in separate folders
or directories. The Datasources folder contains metadata for each data source. The Workspaces
folder holds all of the details about each diagram, including property settings of nodes used in each
process flow.
SAS Enterprise Miner can import data from many sources, including common PC file formats such
as Microsoft Excel and common commercial relational databases (for example, Sybase, Teradata,
and Oracle), as well as from SAS data sets. The functionality of SAS Enterprise Miner comes from
the assignment of roles and levels to variables in a data set. Initially assigning metadata roles makes
the building of process flows much easier. Data properties do not need to be repeated or copied
and pasted for each new task.
One of the first tasks in any project is to identify one or more relevant data sources. Although you
can merge tables inside SAS Enterprise Miner, a best practice is to use the query optimization
features of the native database to build the analysis table and then import this table into SAS
Enterprise Miner.
• Select a table.
• Define variable roles.
• Define measurement levels.
• Define the table role.
(Slide graphic: a data source is defined by selecting a table from the SAS Foundation server libraries.)
Variable Roles
• Assessment
• Censor
• Classification
• Cost
• Cross ID
• Decision
• Frequency
• ID
• Input
• Label
• Prediction
• Referrer
• Rejected
• Residual
• Segment
• Sequence
• Target
• Text
• Text Location
• Time ID
• Web Address
Additional variables in the data set usually have roles of ID, Input, Target, or Rejected. An ID
variable identifies the document uniquely. An input variable can be used for segmentation or
predictive modeling. Only input variables are used to derive segments or clusters. SAS Text Miner
converts each document into a collection of inputs. For predictive modeling, the goal is to predict
the value of a target variable. Only input variables are used to predict the target.
Any other variable in the data that has no purpose for the analysis has a role of Rejected.
Measurement Levels
• Categorical (Class, Qualitative)
  – Unary
  – Binary
  – Nominal
  – Ordinal
• Numeric (Quantitative)
  – Interval
  – Ratio*
* All methods that accommodate an interval measurement scale
in SAS Enterprise Miner also support a ratio scale.
Elementary statistics textbooks for social science majors usually describe four measurement levels:
• nominal
• ordinal
• interval
• ratio
Other statistics textbooks might speak only of categorical and numeric data.
A variable with a nominal measurement scale is purely categorical in nature. There is no numeric
interpretation, and there is no natural ordering. Examples include eye color, political party affiliation,
and country of origin. An ordinal variable is a categorical variable that has an inherent ordering.
Thus, ordinal variables are also called ordered categorical variables. Examples include course letter
grade, response on a Likert scale, or items on a top-10 ranking list.
Note: Nominal data can be ranked by frequency of occurrence, price, personal preference, and
so on. If the ranking is meaningful and exploited by the analysis, the nominal variable
becomes an ordinal variable.
A binary scale implies a nominal scale with only two distinct values.
A variable with an interval measurement scale has a numeric interpretation so that the difference
between two numeric values is meaningful. A variable with a ratio measurement scale is valid
as an interval-scaled variable, but in addition, the ratio of two numeric values is meaningful.
Temperature in degrees Celsius is on an interval scale, but not a ratio scale: 20 degrees divided by 10 degrees equals 2, yet 20 degrees is not twice as hot as 10 degrees.
Most numeric data are on a ratio scale, but most analytic methods require only that the data be on an interval scale. All of the methods in SAS Enterprise Miner that work for numeric data work for interval- and ratio-scaled values alike. Consequently, a separate ratio measurement level is not supported.
Different nodes expect specific table roles. The Score node scores raw, training, validation, test,
and score data sets. The Association node acts on transaction data sets.
Table Roles
• Raw
• Training
• Validation
• Test
• Score
• Transaction
Using the Data Partition node, you can split raw data into training, validation, and test data sets. This
is an important step in predictive modeling. You want to achieve good generalizability of the model
by avoiding the problem of overfitting—that is, creating a model that looks good on the training data
but does not generalize well to a holdout sample.
(Slide graphic: the Data Partition node splits the raw analysis data into training, validation, and test data sets.)
Score data can be scored by the Score node if all of the required data elements are present.
The role of the score produced by the Score node is Prediction or Segment, depending on how
the score is produced. Consequently, you need to be familiar with variable roles even if they are not
assigned by you.
(Slide graphic: the Score node converts a score data set into predictions.)
SAS Enterprise Miner and SAS Text Miner anticipate the need to create and modify data before
an analysis. For text mining, the Text Import node can be deployed in a SAS Enterprise Miner
process flow to process a document collection. (The Text Import node is discussed in the next
chapter.)
The next slide describes how text data is treated by SAS Enterprise Miner.
Text parsing is always required as the first step in a text mining flow. This step accepts data sets with
the role of Train, Validate, Test, or Score data. At least one data source must be a data set with the
role of Train or Raw.
The input data source must have at least one variable with a role of Text or Text Location. As stated
above, the Text variable can contain an entire document or a truncated piece of an entire document.
The Text variable is a character variable, and SAS can accommodate only character variables with
lengths up to 32K (32,767 bytes). If a document exceeds 32K in length, then SAS must read the
entire document from a location specified in the input data. If no location is specified, then the Text
Miner nodes process only the truncated documents.
To process documents that exceed 32K, a variable with the Text Location role must be included in
the input data. The text location must be the full pathname of the document folder with respect to the
Text Miner server. For example, a document might be visible on your Windows computer at this location:
S:\MyProject\MyDocuments\Doc1.txt
However, the SAS Text Miner server might resolve the same file through a different pathname (for example, a network path to the same shared folder). The server-resolvable form of the document location is the form that must be used in the input data.
The Text Filter node can access documents through the Interactive Filter Viewer. By default, the
Interactive Filter Viewer displays only the portion of the document stored in the Text variable. If you
want to see the entire document in the Interactive Filter Viewer, then you can include a variable with
the role of Text Location that provides the pathname of the file that contains the full document.
If the input data source contains two or more variables with a role of Text, and the Use status is Yes
for these variables, then the Text Parsing node chooses the variable with the largest length. If the
lengths are the same, then the variable that appears first in column order is selected. If your data
has two or more text variables, you should set the Use status to No for all Text variables except the
one to be included in the analysis.
If you want to include two or more Text variables in your text mining project, then you must connect
Text Parsing nodes in parallel and change the Use status of the variables as needed.
In many cases, you need to preprocess textual data before you can import it into a data source.
The Text Import node is designed for this purpose. The Text Import node can be used in file
preprocessing to extract text from various document formats or to retrieve text from websites by
crawling the web. The node creates a SAS data set that you can use to create a data source to
use as input for the Text Parsing node. Depending on which of the two structures described above you use, you must adjust the roles of the variables accordingly in the Data Source Wizard.
The software distribution of SAS Enterprise Miner includes the following sample data sets in the
Sashelp library:
Data Set    Description                               Used By
Engstop     Stop list for the English language        Text Parsing node
Engsynms    Synonym list for the English language     Text Parsing node
The keyword language is chosen to correspond to one of these supported languages: Arabic,
Chinese (simplified and traditional), Czech, Danish, Dutch, English, Finnish, French, German,
Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese,
Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, and Vietnamese. (However, only
English, French, German, Italian, Portuguese, and Spanish have built-in stop lists and multi-term
lists.)
Note: You can specify multiple languages in the Text Parsing node. One choice for a multi-
language stop list is to take the union of the individual language stop lists.
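For example, the following is a minimal sketch of building that union with PROC SQL. Sashelp.Engstop is the English stop list listed above; the French list name (Sashelp.Frchstop) is an assumption, so check the Sashelp library in your installation for the actual member names, and confirm that the tables share the same columns before taking the union.
proc sql;
create table work.multistop as
select * from sashelp.engstop
union
select * from sashelp.frchstop;   /* assumed member name */
quit;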
1.3 Using SAS Text Miner
Objectives
• Describe the SAS Text Miner nodes.
• Explain the SAS Text Miner node properties.
• Text Cluster
• Text Filter
• Text Import
• Text Parsing
• Text Profile
• Text Rule Builder
• Text Topic
Note: Custom entities are not discussed in this course. You can refer to SAS Text Miner 15.1
Reference Help for information about how to bring in results to the Text Parsing node from
SAS Concept Creation for SAS Text Miner in SAS Content Categorization Studio.
The stop list is typically used to remove low-information terms that add only noise to the analysis. Noisy data has no descriptive or predictive value.
The term table contains all terms parsed from the document collection. If a table with the role Raw
or Train is imported into the node, then the entire document collection is used. If more than one table
is imported (for example, train, validate, and test data), then the term table contains all terms found
in the train data. If you select the term table and then select File → Save As, you can save
a permanent copy of the table. This table is dynamic and can change based on properties specified
in successor nodes.
The three plots in the Results window are similar to plots found in the Text Filter Results window.
These plots are explained later.
If you do not specify a spelling dictionary, the spelling checker takes the terms in the bottom 5% with
respect to frequency as candidates for misspelling, and uses the terms in the top 95% as the
spelling dictionary. Although this can be successful, it can also lead to many incorrectly identified
misspellings. For example, in the Weather-Animals-Sports data set, this approach identifies baseball
as a misspelled version of basketball.
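The frequency heuristic itself is easy to emulate. The following is a minimal sketch that flags the bottom 5% of terms by frequency as misspelling candidates; the input table work.terms and its Freq column are assumptions, so substitute a term table exported from your own flow.
proc univariate data=work.terms noprint;
var Freq;
output out=work.cutoff pctlpts=5 pctlpre=P;   /* creates P5, the 5th percentile */
run;
data work.candidates;
if _n_=1 then set work.cutoff;   /* brings P5 into every observation */
set work.terms;
if Freq <= P5;                   /* bottom 5% by frequency: candidates */
run;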
Although it is true that most document collections of reasonable size can be approximated by Zipf's Law, there are exceptions. The rapidly decaying rank-frequency plot is typical for the English language, but it might be less typical for Chinese languages or languages such as Swahili. Lu, Zhang, and Zhou (2013) examine deviations from Zipf's Law for modern languages.
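You can check the pattern for your own collection. The following is a minimal sketch of a Zipf plot, log frequency against log rank; the input table work.terms and its Freq column are assumptions, so substitute the term table exported by your Text Parsing or Text Filter node. Under Zipf's Law, the points fall near a straight line with slope close to -1.
proc sort data=work.terms out=work.zipf;
by descending Freq;
run;
data work.zipf;
set work.zipf;
Rank=_n_;                 /* 1 = most frequent term */
LogRank=log(Rank);
LogFreq=log(Freq);
run;
proc sgplot data=work.zipf;
scatter x=LogRank y=LogFreq;
xaxis label="log(rank)";
yaxis label="log(frequency)";
run;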
The Text Filter node Results window facilitates investigation of data quality. Deviations from expected
relationships in the Zipf plot and in the Number of Documents by Frequency plot usually indicate
data problems.
Some insight can be obtained by understanding a document collection. For example, guidelines for
documents that might be used in legal proceedings suggest that certain adjectives and adverbs be
restricted to “objective” or “verifiable,” with recommendations to avoid terms that cannot be backed
by evidence. For example, you might be cautioned to avoid the adjective reckless or the adverb
recklessly. This can lead to different frequency distributions for parts of speech. Thus, there is no
overall expected frequency distribution for the Role by Freq table. Domain knowledge can suggest
expected distributions. However, the dominance of nouns, verbs, adjectives, and adverbs is a
universal characteristic of English documents.
The properties of the Text Parsing node also affect the Role by Freq distribution. If you clear all of
the choices from the Ignore Parts of Speech property, you might see prepositions or determiners
begin to dominate. The following plot was obtained by permitting the detection of all parts of speech
for the Medline data, which is described in a later chapter.
Even when all parts of speech are allowed, some might never appear in a document collection.
Three parts of speech are not detected in the Medline data.
The use of term weights enables shorter documents to have the same influence as longer documents in providing understanding of the document collection.
Query Limitations
• You cannot combine operators. For example, +dog ->#term is not supported
because it combines the - operator with the ># operator.
• Query length is limited to 100 characters.
In the Text Filter Viewer, you can select a term in the term table, and if you right-click the term, one
choice is View Concept Links. In the above slide, the analyst is investigating the term price. The
selected term is called the parent term, and the derived concept linked terms are called child terms.
Up to nine different terms will be displayed. The displayed terms are the terms with the strongest
association to the parent term.
The child term walmart is contained in 238 documents, and 39 of these documents contain the
parent term price.
If you right-click the child term walmart and select Expand Links, five new terms are identified.
These five terms are said to have a second-order association with the parent term price. These are
the so-called “friends of friends.” The identification of associated terms through concept linking helps
construct queries or custom topics to enhance information retrieval. For example, if you are looking
for documents that discuss price, adding the word discount to the query can identify more
documents that might be of interest.
Spell Checking
This brief demonstration shows how to use the spell-checking feature of the Text Filter node.
A table of correctly spelled English words is provided as DMTX51.Englishdictionary. You can easily
obtain a spelling dictionary by using a search engine and searching for “spelling dictionary table.”
SAS does not supply a spelling dictionary.
1. Create a diagram named Spell Checking.
2. Drag a File Import node from the sample tab into the diagram.
4. Attach a Text Parsing node to the File Import node. Use default settings.
5. Attach a Text Filter node to the Text Parsing node. Set Check Spelling to Yes, and specify
DMTX51.Englishdictionary as the dictionary. Run the Text Filter node.
6. Open the Spell-Checking Results table. It will be stored in the workspace library for the diagram.
Assume that the library is EMWS3. Then the spelling results table is
EMWS3.TextFilter_spellDS.
The table needs to be edited, but this can be accomplished through Text Parsing properties.
Some of the terms do not need to be in the table. You can use the Delete button to eliminate
them.
The last term, teem, was manually added using the starburst button.
The Text Cluster node divides a document collection into mutually exclusive clusters. By default,
15 terms are displayed that are most strongly associated with each of the clusters. These descriptive
terms help the analyst understand the types of documents that are in a given cluster. The Weather-
Animals-Sports demonstration produced a cluster with descriptive terms such as cold, rain, snow,
and winter. This highlights the fact that this cluster consists mostly of documents about the weather.
(It is possible that a descriptive term displayed for one cluster can also be important for describing
other clusters.)
The interpretive value of the descriptive terms becomes clear as you work through some hands-on
examples with the Text Cluster node.
From Help Contents:
“The Text Cluster node uses a descriptive terms algorithm to describe the contents of both EM
clusters and hierarchical clusters. If you specify to display m descriptive terms for each cluster, then
the top 2*m most frequently occurring terms in each cluster are used to compute the descriptive
terms.”
“For each of the 2*m terms, a binomial probability for each cluster is computed. The probability
of assigning a term to cluster j is prob=F(k|N, p). Here, F is the binomial cumulative distribution
function, k is the number of times that the term appears in cluster j, N is the number of documents
in cluster j, p is equal to (sum-k)/(total-N), sum is the total number of times that the term appears
in all the clusters, and total is the total number of documents. The m descriptive terms are those that
have the highest binomial probabilities.”
“Descriptive terms must have a keep status of Y and must occur at least twice (by default) in a
cluster.”
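A minimal sketch of this calculation in a DATA step follows; the four counts are illustrative values, not output from a real clustering run.
data work.descterm;
k=12;                          /* times the term appears in cluster j */
N=40;                          /* documents in cluster j */
sum=20;                        /* times the term appears in all clusters */
total=200;                     /* total number of documents */
p=(sum-k)/(total-N);           /* rate of the term outside cluster j */
prob=cdf('BINOMIAL',k,p,N);    /* F(k | N, p) */
run;
proc print data=work.descterm;
run;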
• Basically, documents that have similar term usage tend to be put in the same cluster. In this case, Document 2 and Document 3 look somewhat alike.
The use of singular value decomposition (SVD) is called the linear algebra approach to text mining,
information retrieval, web analytics, and so on. As mentioned in a previous section, this algebraic
operation is the foundation of an approach that has many names:
• Latent Semantic Indexing (LSI)
• Latent Semantic Analysis (LSA)
• Vector Space Model (VSM)
There are competing methodologies to LSA/SVD: Latent Dirichlet Allocation (LDA), Correlated Topic
Modeling (CTM), and Non-Negative Matrix Factorization (NNMF). NNMF is available in PROC
IMSTAT, a procedure in the SAS In-Memory Statistics suite of software products. Competing
methodologies exist because there is no universally optimal way to quantify text. LSA is popular
in commercial software and academic publications, but the competing techniques have been shown
to be useful for specific document collections. A demonstration in a later chapter shows how LSA
provides superior results to published findings for NNMF, but no technique can be expected to be
superior for all data sets. In general, the competing methodologies do not tend to give dramatically
different results. Arguments for a particular methodology often include considerations of computing
efficiency, dimensionality reduction, and ease of interpretation.
Weights can be any numeric value, positive or negative. Negative weights imply that the term
supports the negative, or opposite, of the concept. A 0-1 system is the easiest to use.
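As an illustration, the following is a minimal sketch of a 0-1 custom topic table built in a DATA step. The column names used here (_topic_, _term_, _role_, _weight_) are assumptions for illustration; consult the SAS Text Miner reference for the exact schema that your release expects (the role values follow the part-of-speech abbreviations listed in the Technical Details section later in this chapter).
data work.customtopics;
length _topic_ $40 _term_ $32 _role_ $16;
input _topic_ $ _term_ $ _role_ $ _weight_;
datalines;
Weather rain Noun 1
Weather snow Noun 1
Animals lion Noun 1
Animals tiger Noun 1
Sports baseball Noun 1
;
run;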
Active Learning
Requirements
• Target values are “fuzzy” or subject to error.
• Historic data and expert opinion might help discover observations where
the target value is likely to be incorrect.
• A team of domain experts is available to identify likely miscoded target
values. Experts can assign a “degree of belief” value to suspicious cases.
A cutoff can be derived to decide whether suspicious cases should
be recoded.
Relevance
• Changing target values that are likely to have been miscoded can lead
to discovery of better scoring rules.
Active learning can be explained using an actual text mining project for illustration. An insurance
company wants to automatically classify a claim as fraudulent or not. Data are collected, a model
is selected, and the model is used to score training, validation, and test data sets. Some cases with
high scores have a fraud flag set to zero. While investigating these anomalous cases, the researcher
notices that several of the cases have very suspicious attributes. She forwards the cases to trained
insurance fraud investigators and asks them to provide a value between zero and 100 indicating the probability that a claim experienced fraud. After deliberation, a cutoff of 75 is selected, so any reviewed case with a probability score above 75 is recoded as a fraud case. The model is then refit with the recoded data.
When using a methodology like that supported by the Text Rule Builder node, this form of active
learning can lead to the discovery of additional rules.
The scenario makes it clear that trained experts are required to participate in the active learning
process. If active learning is not done properly, it becomes an exercise in boosting model diagnostics
without really improving the model.
Technical Details
The following material is extracted from the Reference Help for SAS Enterprise Miner.
SAS Text Miner can identify the part of speech for each term in a document based on the
context of that term. Terms are identified as one of the following parts of speech:
• Abbr (abbreviation)
• Adj (adjective)
• Adv (adverb)
• Aux (auxiliary or modal)
• Conj (conjunction)
• Det (determiner)
• Interj (interjection)
• Noun (noun)
• Num (number or numeric expression)
• Part (infinitive marker, negative participle, or possessive marker)
• Pref (prefix)
• Prep (preposition)
• Pron (pronoun)
• Prop (proper noun)
• Punct (punctuation)
• Verb (verb)
• VerbAdj (verb adjective)
When you create a stop list table, start list table, synonym table, or custom topic table, the role
variable takes the value of one of the above abbreviations, or it will use a noun group or entity role.
Following is an example of a custom topic table that uses part of speech roles for the variable
_role_. The label for the variable _role_ is Role, which is displayed as the column heading.
SAS Text Miner can identify noun groups, such as clinical trial and data set, in a document
collection. Noun groups are identified based on linguistic relationships that exist within
sentences. Syntactically, these noun groups act as single units. Therefore, you can choose
to parse them as single terms.
• If stemming is on, noun groups are stemmed. For example, the text amount of defects is parsed
as amount of defect.
• Frequently, shorter noun groups are contained within larger noun groups; both the shorter
and larger noun groups appear in parsing results.
An entity is any of several types of information that SAS Text Miner can distinguish from general
text. If you enable SAS Text Miner to identify them, entities are analyzed as a unit, and they are
sometimes normalized. When SAS Text Miner extracts entities that consist of two or more
words, the individual words of the entity are also used in the analysis.
Out of the box, SAS Text Miner identifies the following standard entities:
• ADDRESS (postal address or number and street name)
• COMPANY (company name)
• CURRENCY (currency or currency expression)
• DATE (date, day, month, or year)
• INTERNET (email address or URL)
• LOCATION (city, country, state, geographical place/region, political place/region)
• MEASURE (measurement or measurement expression)
• ORGANIZATION (government, legal, or service agency)
• PERCENT (percentage or percentage expression)
• PERSON (person’s name)
• PHONE (phone number)
• PROP_MISC (proper noun with an ambiguous classification)
• SSN (Social Security number)
• TIME (time or time expression)
• TIME_PERIOD (measure of time expressions)
• TITLE (person’s title or position)
• VEHICLE (motor vehicle including color, year, make, and model)
You can also use SAS Content Categorization with Teragram Contextual Extraction to define
custom entities and import these for use in a Text Parsing node. When you create compiled
custom entity files, ensure that you specify September 14, 2009 as the compatibility date.
(Valid files have the extension .li.) Otherwise, the files cannot be used in SAS Text Miner.
When a document collection is parsed, SAS Text Miner categorizes each term as one
of the following attributes, which gives an indication of the characters that compose that term:
• Alpha, if characters are all letters
• Num, if term characters include a number
• Punct, if the term is a punctuation character
• Mixed, if term characters include a mix of letters, punctuation, and white space
• Entity, if the term is an entity
As you try to emulate SAS Text Miner with SAS code, you quickly realize that there are many
challenges related to stemming, part-of-speech tagging, and identifying multi-word terms. The code
presented below is purely educational to help illustrate some of the issues related to string-based
versus token-based querying. The goal is to engage in information retrieval using a simple SAS
program, and to compare how different SAS functions support querying documents to find specific
words.
The SAS language supports many character functions. The following program illustrates some
of the text mining features of the SAS language. It is stored in the course folder and has the name
SAS_Language_IR.sas. The comments describe the SAS functions that are used for information
retrieval. For example, illustration 1 shows that the INDEX function is a string-based search function.
libname DMTX51 "D:\workshop\winsas\DMTX51";
To illustrate that the INDEX function is string based, you can alter the query slightly by using
the singular form of lion.
title2 "Search on LION using INDEX";
data work.templion;
set DMTX51.WeatherAnimalsSports_train;
if (index(lowcase(TextField),"lion")>0) then output;
run;
proc print data=work.templion;
run;
The results follow.
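(The output listing and the intervening word-based search from SAS_Language_IR.sas are not reproduced here. A sketch of that intervening step, consistent with the surrounding code, follows: searching for lions with INDEXW and its default space delimiter.)
title2 "Search on LIONS using INDEXW";
data work.templions;
set DMTX51.WeatherAnimalsSports_train;
if (indexw(lowcase(TextField),"lions")>0) then output;
run;
proc print data=work.templions;
run;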
Why was the “House cats…” document not retrieved? The INDEXW function accepts an optional
third argument. Without this third argument, the default delimiter to identify words is a space.
title2 "Search on LIONS using INDEXW and Delimiters";
data work.templions;
set DMTX51.WeatherAnimalsSports_train;
if (indexw(lowcase(TextField),"lions",' ,')>0) then output;
run;
proc print data=work.templions;
run;
With the comma delimiter added to the space delimiter, the additional document is identified.
When you specify the singular lion query, documents with the word lions are not identified using
the INDEXW function.
title2 "Search on LION using INDEXW and Delimiters";
data work.templion;
set DMTX51.WeatherAnimalsSports_train;
if (indexw(lowcase(TextField),"lion",' ,')>0) then output;
run;
proc print data=work.templion;
run;
The following table is produced:
You can parse a document collection and calculate term statistics using the SAS language.
The following code produces a simple text relevance score to accompany a query, similar
to the results of the Text Filter Viewer:
title2 "COUNTW/SCAN Parsing for LIONS";
data work.templions;
set DMTX51.WeatherAnimalsSports_train;
attrib Word length=$32;
keep Target_Subject TextField NumWords WordFreq;
NumWords=countw(TextField);
FoundFlag=0;
WordFreq=0;
do WordNum=1 to NumWords;
Word=lowcase(scan(TextField,WordNum,
' ,.;:@*&|\/+=<>?!$%^()[]{}'));
if (Word="lions") then do;
WordFreq+1;
FoundFlag=1;
end;
end;
if (FoundFlag=1) then output;
run;
/* Get the normalizing constant: the largest term frequency found */
proc sql noprint;
select max(WordFreq) into :MaxFreq
from work.templions;
quit;
data work.templions;
set work.templions;
TermRelevance=WordFreq/(&MaxFreq);
run;
proc print data=work.templions;
run;
The SCAN function steps through a text string and identifies tokens based on the separators that are
provided. A later chapter briefly discusses elements of document parsing. PROC SQL and the MAX
function are used to get the normalizing constant for computing relevance. The term relevance score
is just the normalized term frequency for the selected term. The results are given below.
The program SAS_Language_IR.sas contains additional code illustrating queries for lion and for lion or
lions.
Practice
1.4 Chapter Summary
1.5 Solutions
Solutions to Practices
1. Changing Text Miner Properties to See How This Affects Results
If you use SAS Enterprise Miner 15.1 and you ran the exercise with the specified property settings,
the following Text Cluster results are obtained:
With the request for four dimensions and four clusters, it appears that clusters 1 and 3 are two
separate animal clusters.
You should have discovered that copying and pasting the Metadata node and the StatExplore node
requires some changes, because the Text Cluster results now come from the new TextCluster2 node
added by the copy-and-paste operation. Consequently, you must set the role of the new
TextCluster2_cluster_ variable to Input.
Similarly, with four topics, there now are two (Topic 2 and Topic 4) that involve animals.
The decision tree resulting from these settings is shown below and indicates one misclassification
on the training data.
Did you try some other property settings that give good results?
Lesson 2 Overview of Text Analytics
2.1 Using the Text Import Node
Objectives
• Describe how the Text Import node is used for processing document
collections and creating a single SAS data set for text mining.
• Show how the SAS data set created from Text Import can then be merged
with another SAS data set containing target information and other
non-text variables.
• Show how to compare two models, one using only conventional input
variables and another using the conventional inputs and some text mining
variables.
Often the most challenging part of the data mining process is obtaining and preprocessing the data.
SAS provides a rich set of tools for data preparation.
Perl regular expressions enable you to use terse scripts for complex data operations on text files.
For example, to preserve confidentiality, you might want to convert all postal codes to a generic
phrase.
%macro PrivateUSAPostalCode(TextVar);
   &TextVar = prxchange('s/\d{5}/_PRIVATE_USA_POSTAL_CODE_/', -1, &TextVar);
%mend PrivateUSAPostalCode;
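For example, the macro might be invoked inside a DATA step. This is a sketch only: the data set work.notes and the variable Note are hypothetical names, not part of the course data.

data work.masked;
   set work.notes;                /* assumed input with character variable Note */
   %PrivateUSAPostalCode(Note)    /* expands to the PRXCHANGE assignment above  */
run;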
SAS programs are used sparingly in this course. Preference is given to the use of the SAS Code
node for running SAS programs.
Because data preparation can be the most arduous task in text mining, SAS Text Miner includes
the Text Import node that facilitates reading all popular commercial document formats.
Previous versions of SAS Text Miner included the %TMFILTER SAS macro for reading document
collections. The %TMFILTER macro is available in the current release of SAS Text Miner. The Text
Import node provides the functionality of the %TMFILTER macro without requiring the user to create
and execute a SAS program.
A SAS macro is a script that can be compiled and executed to perform complex tasks.
At the simplest level, a SAS macro is a script that generates SAS code to be executed
by the SAS supervisor. These scripts are often stored in SAS catalogs that can be accessed
and viewed by users. Proprietary scripts are stored in compiled form and cannot be read by users.
Some of the SAS macros included with SAS Enterprise Miner and SAS Text Miner are compiled.
Sample SAS macros are stored in the catalog SASHELP.EMUTIL. For example, the SAS source file
SASHELP.EMUTIL.EXTDEMO.SOURCE provides examples of SAS Enterprise Miner functionality
that can be exploited using a SAS Code node. The catalog SASHELP.EMTEXT contains macros
related to text mining. Your ability to access and view macros in SASHELP catalogs depends
on the version of SAS that you are running and unique factors related to how your organization has
chosen to install SAS software. For example, your Information Technology (IT) or Management
Information Systems (MIS) Department might have restricted Read access to SAS installation files.
For more information about SAS programming and Enterprise Miner, see SAS® Enterprise Miner™
15.1 Extension Nodes: Developer’s Guide, which is the current programming guide for the SAS
version used for this course. The following link was active when this course was developed.
http://supportprod.unx.sas.com/documentation/cdl/en/emxndg/67980/PDF/default/emxndg.pdf
If the link is inactive, search for Enterprise Miner Extension Node Developer’s Guide at
http://support.sas.com.
You can use a SAS Code node to modify the SAS table produced by the Text Import node.
For example, you can choose to drop variables such as LANGUAGE, TRUNCATED, OMITTED,
and EXTENSION, because these variables are rarely used beyond the data preparation stage. Many
document collections use a naming convention so that the path given in the URI field or the filename
given in the NAME field can be used to derive ID or index variables.
data &EM_EXPORT_TRAIN;
   attrib ClaimNo length=$12 label="Claim Number"
          AdjusterNotes length=$256 label="Adjuster Notes";
   set &EM_IMPORT_DATA;
   AdjusterNotes=Text;           /* copy the converted document text       */
   ClaimNo=substr(Name,1,12);    /* derive the claim ID from the file name */
   keep ClaimNo AdjusterNotes Size;
run;
In the above example, the SAS Code node modifies the data produced by the Text Import node
so that it can be merged with claims data indexed by the variable ClaimNo. This code is used
in the demonstration below.
This demonstration starts with using the Text Import node to read in insurance adjuster notes for
an insurance subrogation modeling example. The Text Import node is set up and run differently from
the File Import node used in Chapter 1.
Note: The Text Import node uses the SAS Document Conversion Server. This server must
be running as a service under Microsoft Windows. Your instructor might provide additional
information about starting the SAS Document Conversion Server.
After the document collection is processed by the Text Import node, the exported data must be
merged with a claims features data set that includes the target variable and several input features.
You use a SAS Code node to pre-process the data for merging, and then you use a Merge node
to merge the two data sets. After the two data sets are merged into a single raw data set, you follow
that up with a typical text mining flow. This shows many of the steps that you might follow when you
work with your own data.
Note: Some background and definitions are helpful here. The term subrogation refers to a legal
right that an insurance company has to sue a third party in order to recover any
compensation payouts. For example, if you have a car accident that is caused by some other
person who was at fault for hitting you, your insurance company compensates you directly.
However, it can also try to recover money from the insurance company of the person who hit
you. This is called subrogation. Typically, it costs time and money for your company to use
a lawyer to initiate subrogation proceedings, so the company does not pursue this unless
it thinks that there is a good chance of winning a lawsuit or reaching a settlement. Most
subrogation cases are settled amicably when evidence clearly supports a subrogation claim.
In the demonstration that follows, you work with a data set in which the target variable
(SubroFlag) is defined by whether insurance claims were successfully subrogated.
2. Drag a Text Import node into the diagram. For the Import File Directory property, navigate to
the pathname for the insurance adjuster notes (D:\workshop\winsas\dmtx51\InsClaimsNotes).
In general, you also need to create a destination directory for the output of the Text Import node.
In this course, use the D:\workshop\winsas\dmtx51\InsClaimsNotesDestination folder.
(If it does not exist, create it in Windows.) Navigate to this destination directory so that it is
selected in the properties panel. Modify the Text Size property from the default 100 to the maximum
32000. An example of the completed properties panel appears below.
4. Open one of these documents to see what it contains. For example, the file 001924817308.txt
contains the sentence shown below.
The pie charts for Omitted/Truncated Documents and Document Languages are mostly solid
blue. There are few omitted or truncated files, and almost all documents are in English. The pie
chart for Document Types is not blue because there are multiple document types. However,
the frequency of the files that are not TXT files is small, so the three documents show up only
as a thin line in the pie chart.
Of particular importance are omitted and truncated documents, which are flagged by the binary
variables OMITTED and TRUNCATED created by the Text Import node. A file is omitted
if it cannot be converted. Following are reasons for a file being omitted:
• The file format is not supported by the Text Import node, or the file format has been specifically
excluded by the Extensions property.
• The file has security features that prevent it from being read or processed.
• The file is corrupted.
A file can be truncated if it exceeds the value given by the Text Size property, which is bounded
by the size of a SAS character variable. A SAS character variable cannot exceed 32,767
characters.
Note: It is important to be aware that truncation affects only how much of the document is
visible in certain windows. The full document is still analyzed downstream by all the
Enterprise Miner (including Text Miner) nodes. Whether a file is truncated for the analysis
depends on whether you properly use the text location role.
6. Click the ellipsis in the Exported Data line in the properties panel for the Text Import node.
Select Explore and look at the Sample Statistics window. All the variables that were created
by the Text Import node are shown.
Name – the name of the input file containing the particular adjuster note for that observation.
Filtered – the pathname for the converted file. This variable is given the role of text location.
URI – the pathname for the original file before conversion. When the Text Import node is used
as a web crawler, the pathname defines the location of the file extracted from the Internet and
copied to the Import File directory. The copied file is usually an HTML file, and if so, the role
of web address that is assigned signifies that HTML properties, such as font styles, should be
used by the Filter Viewer when displaying the document.
Here is a quick snapshot of the 19 documents and the first three variables. The Name variable
is important for match-merging with the claim features data set.
7. Use the Plot Wizard to construct a bar chart for the file extension.
8. Use the Plot Wizard to produce a scatter plot of the size variables. The unconverted files vary
widely in size because formats such as docx, PDF, and xlsx include formatting information.
The PDF file is 83,694 bytes before conversion, but only 31 bytes after being converted
to a TXT file.
10. Because you selected the square corresponding to the file, you can use the dynamic linking
feature of Enterprise Miner to find the actual file. This could be time-consuming for larger
document collections, so you might want to change the tip text to ClaimNo. Right-click in the plot
and select Data Options. Change the role of Name to Tip.
The claim is highlighted in a lighter color, making it easy to find. However, the Enterprise Miner
explorer is not set up to read documents, so you must wait until you can use the Filter Viewer
in the Text Filter node to read the entire document.
Examining document size can indicate problems in data processing. It can also provide insight
into choice of term weight as discussed in a later chapter.
14. Bring in a SAS Code node and attach it to the Text Import node.
15. Select the Code Editor ellipsis in the properties panel for the SAS Code node. In the Training
Code window, right-click and select Open. Then navigate to the SAS program
D:\workshop\winsas\dmtx51\sassrc\SCN_SubrogationText.sas and bring it in.
(If you have any problem finding the program, you can enter the seven lines of code manually.)
This code uses &EM_IMPORT_DATA, a macro variable that refers to the SAS data set created
by the Text Import node. The variable AdjusterNotes is defined from the Text field. (This renaming
is not strictly necessary because you could keep the name Text.) The tricky part is to define the variable
ClaimNo, which is obtained using the SUBSTR function by extracting the first 12 characters from
the variable Name. This is why it is essential to give each text file a name that is the same
as the claim number. Run the SAS Code node.
Note: The slide show provided source code for basic processing of exported data coming from
the Text Import node. One of the documents in the subrogation collection is stored in xlsx
format, and the converter adds the sheet number to the beginning of the document. The
code shown in the slide show has been modified to remove the unnecessary xlsx sheet
information. This could have been accomplished in many different ways. If you are going
to process a document collection having many different Microsoft Excel formats, then
you might want to use the Extension variable to decide how to modify the converted
document. If you are not a SAS programmer, you might need help with accomplishing
document post-processing.
16. Bring the SAS data set SUBROGATION_TARGET into the diagram. The data set was created
and is shown in the project panel under Data Sources. Look at the variables for this data set.
SubroFlag is the binary target variable (1=successful subrogation, 0=unsuccessful). ClaimNo
is the variable to use for matching against the adjuster notes data that were created with the Text
Import node. The remaining variables, together with the variables that the text mining nodes create,
are potential inputs for predicting the target.
17. Bring in the Merge node and connect it to the two data sets. They are matched by ClaimNo.
This node is found on the Sample tab at the top of the window.
18. The data set SUBROGATION_TARGET is already ordered by ClaimNo. It can be matched one-
to-one with the SAS data set created by the Text Import node and the program in the SAS Code
node. The default setting is to perform a one-to-one merge. However, you can also perform
a match merge using the BY variable ClaimNo. Set the Merging property to Match in the properties
panel, and select the Variables property to select the appropriate BY variable. Change Merge
Role for ClaimNo to By.
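Behind the scenes, a match merge is equivalent to the following DATA step logic. This is a sketch only: the Merge node generates its own code, and the table names here are illustrative.

proc sort data=work.adjuster_notes; by ClaimNo; run;
proc sort data=work.subro_target;   by ClaimNo; run;

data work.subro;
   merge work.adjuster_notes work.subro_target;
   by ClaimNo;                       /* match rows on the claim number */
run;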
19. Select the By Ordering property to verify that ClaimNo has been selected. The By Ordering
property is necessary when you have two or more BY variables.
Note: While match merging is less efficient than one-to-one merging, it avoids the problem
of mismatches when one of the source data sets is sorted in place by something other
than ClaimNo.
21. Connect a Save Data node to the Merge node. Specify DMTX51 for the SAS Library Name
property, and specify Subro for the Filename Prefix property. The Save Data node can be used
to create a permanent copy of a SAS data set created by SAS Enterprise Miner.
22. Run the Save Data node. A copy of the training data set exported from the Merge node will
be saved as DMTX51.Subro_train. If you archive or delete the Text Import diagram, you can still
access the subrogation data without rerunning the Text Import node and remerging the data.
23. Connect a Text Parsing node and a Text Filter node to the output of the Merge node. Run both
nodes with the default settings. For the Text Filter node, the default weightings on the properties
panel are Log for Frequency Weighting and Mutual Information for Term Weight if a target
variable is present. (There is one in this case.) It is usually a good idea to set these choices
explicitly rather than keep Default showing. Run the Text Filter node.
24. Open the Results window for the Text Filter node.
The three primary diagnostic plots do not uncover anything unusual about the subrogation data.
The Zipf plot shows the rapid decay in term frequency that is consistent with the power rule
formulation provided in a later chapter. The Number of Documents by Frequency plot is approximately linear. There
are more nouns, verbs, and adjectives than other parts of speech. The number of adverbs is low,
which is consistent with how objective reports are constructed. Adverbs tend to reflect opinions
rather than fact. For example, “The claimant recklessly operated the fork lift which caused the
collision with the stack of pallets” versus “The claimant struck a stack of pallets while operating
the fork lift.” Close the Results window.
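In symbols (a reminder only; the power rule is formalized in a later chapter), Zipf's law says that the frequency of the rth most frequent term decays roughly as a power of its rank:

f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1,

so plotting log frequency against log rank gives an approximately straight line, which is what the Zipf plot displays.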
25. Open the Filter Viewer. For an actual project, you can explore the quality of the data with
the Filter Viewer. Two explorations are illustrated, but many more are possible.
26. The first exploration compares the adjuster notes to other attributes of a claim. Sort the
document window by the variable Body (label=Body Part). Scroll down to head injuries.
Note: You can use the Edit ⇒ Find feature to quickly scroll down to the desired point. If you
select head, you will encounter a few documents that mention the head with respect
to other injuries (such as eye injuries), but this strategy might be faster than simply using
a scroll bar, especially for very large document collections.
One of the documents seems to be in disagreement with the assigned body part.
This actually appears to be a multiple body part injury. Although this technique of selectively
scanning the document collection can uncover data problems, it is very labor-intensive.
The second exploration is more systematic. Previous exploration reveals that back, finger, arm,
hand, and head injuries seem to dominate. The following plot was obtained from a variable
property window using the Explore feature.
27. The first exploration revealed a problem with a single claim. Because there are 233 cases
classified as head injuries, you can exploit the Subset Documents feature of the Text Filter node.
Drag a new Text Filter node into the diagram. Select the property Subset Documents. For
Column name, select Body (Body Part).
28. Under Value, click the ellipsis icon, and select Head.
29. Click OK, and observe how the query populates the Subset Documents property.
31. Open the Filter Viewer. Verify that 233 documents were retrieved. Move the mouse pointer over
a column heading in the Documents window.
You see that 233 documents are contained in the documents table.
32. In the Terms table, sort by # DOCS.
Of the 233 documents retrieved, 73 have the noun head, 34 have the verb head, and 22 have
the adjective head. It might be surprising that most adjuster notes do not contain the word head
for head injuries.
If you sort by TERM and scroll down to the first entry for head, you get the following view.
33. Because headache was a possible source of body part misclassification revealed above, extract
the eight headache cases. Select the headache row, right-click, and then select Add Term to
Search Expression. Select Apply.
Other explorations are warranted. For example, why would head used as a verb suggest a head
injury? Unfortunately, you cannot include the part-of-speech role in the query.
The subrogation data are revisited later to illustrate text mining and predictive modeling.
2.2 A Forensic Linguistics Application
Objectives
• Define stylometry and explain how it relates to text analytics.
• Illustrate how text mining can be used to support forensic linguistics
using stylometry techniques.
SAS users come from many different business and government organizations. Students in this class
are sometimes involved with various security and intelligence problems. The forensic linguistics
demonstration is intended to show how SAS Text Miner can be used for these types of problems.
Consider some background information for this example. Between 1978 and 1995, a person called
the “Unabomber” (now known to be Theodore Kaczynski) mailed bombs to selected individuals
associated with technology research. His bombs killed three people and injured 23. In 1995, he sent
a long article entitled Industrial Society and Its Future to the FBI and demanded that it be published
in a major newspaper or he would strike again. This long article was eventually published in the
New York Times and The Washington Post. The style and content of the writing was recognized
by Theodore Kaczynski’s brother, and this ultimately led to Kaczynski’s arrest.
Note: In 1995, text analytics software was a rarity. However, several universities had faculty
members who were actively involved in text mining research. The FBI actually received help
from several researchers, but due to the lack of good control data, a text mining solution was
not forthcoming. In particular, scores generated for persons of interest were not sufficiently
high for the FBI to have good cause for further investigation. (Personal communication,
M2006 Data Mining conference.)
This demonstration uses 232 paragraphs extracted from Kaczynski’s long article, and 1726
paragraphs extracted from the writing of five other authors. The latter are used as comparison
documents. There is a total of 1958 documents (paragraphs). You run both the Text Cluster and
Text Topic nodes on this training data and then create a decision tree model in order to attempt
to accurately classify the documents by their authors. Classification such as this is really a form
of prediction modeling. In addition, 11 documents are used as unknowns. You use the two models
to classify these 11 unknown paragraphs with regard to their likely authors. (Spoiler alert: In this
setup, all 11 of the unknown cases were written by Kaczynski.)
Stylometry
Stylometry is defined as the use of linguistic style to characterize written
language.
Applications:
• attributing authorship of anonymous or disputed literary works
• detecting plagiarism
• forensic linguistics (for example, identifying Theodore Kaczynski
as the Unabomber based on his writing style)
Forensic Linguistics
Special case: stylometry applied to forensics
Forensic linguistics typically uses predictive modeling to score a document of unknown, but
suspected, authorship. The score represents an estimate of the probability that the document was
written by a suspect. The value of text mining applied to forensic linguistics is that suspects can
be identified for investigation. The text mining results are rarely, if ever, used as evidence in
prosecuting a suspect, although testimony might include a discussion of techniques in describing
how the suspect was identified.
Forensic Linguistics
TK is a suspect.
Corpus: 1,958 paragraphs from six authors taken from written works and interviews
The data for this study are real, but the situation is hypothetical. Separation of documents was
enhanced for educational purposes. In actual forensic linguistic studies, there are rarely such pure
results as those achieved here.
The six authors in the training data are coded as AM, CD, DM, DO, FE, and TK. The initials were
changed for the first five authors. TK is Theodore Kaczynski, the so-called Unabomber. The TK
documents are paragraphs from the manifesto written by Kaczynski and published in The New York
Times and Washington Post. Obviously, when the manifesto was published, the author was not
known to be TK. The 11 unknown documents are excerpts from interviews with Kaczynski after
he was convicted of murder. Thus, although based on real data, this example is artificial.
Forensic Linguistics
Score Data Set: Eleven documents from the same unknown author
Problem: Build classification models on the known documents with six
different authors. Apply these models to the 11 unknown documents
to determine the likely author of each one.
This demonstration illustrates how to use text mining nodes and other Enterprise Miner nodes
to build classification (prediction) models in a forensic setting. You analyze writing samples from six
authors. For five authors, the writing samples in the training data have to do with technical material
about statistics and SAS courses. For one of the authors (TK), the writing samples are paragraphs
from his published manifesto.
7. Attach a Text Cluster node to the Text Filter node and run it with the default settings. Open
the results. Cluster 5 has descriptive terms, such as power and people, that are clearly
associated with the Unabomber’s long published manifesto. Close the window.
8. Attach a default Text Topic node to the Text Cluster node and run it. Open the Interactive Topic
Viewer and look for topics that are likely to be from the Unabomber’s writing. For example, select
the third topic shown below. Look at the Terms and Documents windows associated with this
topic.
11. Open the Forensics_score data set and designate it as a Score data set. This data set contains
the 11 paragraphs that were drawn from TK’s interview after he was captured. You want to see how
accurately the tree model classifies these paragraphs. To do this, open a Score node and attach it to
both the Forensics_score data set and the Decision Tree node. Your complete diagram should look
as shown below.
12. Run the Score node and open the exported data from the properties panel. Select the Test data
and click Explore. Use the Plot Wizard to construct a bar chart. Use the following properties:
The dominance of a single color for each author illustrates how accurate the decision tree model
is for scoring the test data. You can verify this by positioning the cursor over the dominant color
for any of the authors.
14. Scroll to the far right in the browsing window and look at the last two columns.
The last column gives the model’s predicted author category (Prediction for Author) and the
second-to-last column gives the model probability for this category (Probability of Classification).
All 11 paragraph extracts are correctly classified as written by TK. (Remember, the data for this
demonstration was enhanced to ensure such a clear-cut result!)
2.3 Information Retrieval
Objectives
• Describe information retrieval and explain how it is done in the interactive
Text Filter Viewer.
• Use the Medline medical abstracts data to illustrate an application
of information retrieval.
Information Retrieval
“Information retrieval (IR) is finding material (usually documents)
of an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).”
– Manning, Raghavan, and Schütze (2008)
One of the more publicized success stories in information retrieval concerns the discovery by
Don Swanson (1988, 1991) that magnesium deficiency could be a source of migraine headaches.
Swanson queried medical reports for articles about migraines and nutrition.
For a given corpus of documents, information retrieval (IR) groups documents based on the
similarity of contents. An IR query can be a Boolean query, a query based on latent semantic
indexing, or a query based on some other method of quantifying document content. The Text Filter
node uses a weighted cosine similarity measure to compute the similarity between a document
and the query. Documents that are most similar to the query are returned.
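To make the idea concrete, here is a minimal DATA step sketch of cosine similarity between a query vector and each document's term-weight vector. This is not the Text Filter node's internal code: the table work.doc_weights, the three weighted terms w1-w3, and the query vector are all hypothetical.

data work.similarity;
   set work.doc_weights;              /* assumed: one row per document     */
   array w{3} w1-w3;                  /* document term weights             */
   array q{3} _temporary_ (1 0 1);    /* query vector: terms 1 and 3       */
   dot=0; dlen=0; qlen=0;
   do i=1 to 3;
      dot  = dot  + w{i}*q{i};        /* accumulate the dot product        */
      dlen = dlen + w{i}**2;          /* squared length of document vector */
      qlen = qlen + q{i}**2;          /* squared length of query vector    */
   end;
   if dlen>0 then CosineSim=dot/(sqrt(dlen)*sqrt(qlen));
   drop i dot dlen qlen;
run;

Documents would then be ranked by descending CosineSim, with the most similar documents returned first.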
The Interactive Filter Viewer does not recognize ># operators mixed with the + operator.
Query:
+diabetes
You often want to explore a document collection by searching on various terms of interest. This does
not require a target variable and is efficiently done with the Text Filter node. As always, you first run
the Text Parsing node. This demonstration illustrates how to do this using medical information from
Medline data.
The MEDLINES data source contains a sample of 4,000 abstracts from medical research papers
that are stored in the MEDLINE data repository.
1. Create a new diagram and name it Medline Information Retrieval. Drag the MEDLINES data
source into the diagram. Look at the variables.
There is more than one variable with the role of Text. In cases like this, the Text variable with
the longest length is the one that is selected for analysis by the Text Parsing node. If two or more
Text variables have the same length, the one appearing first in alphabetical order is selected.
In this example, ABSTRACT (2730 bytes in length) is the longest of the Text variables and
is the one that is analyzed.
2. Attach a default Text Parsing node to the Input Data Source node. Notice that the default Text
Parsing node populates the properties panel with certain tables. For example, there is a default
Synonyms table named SASHELP.ENGSYNMS. (This actually contains only one row (one
synonym) and is present only as a template). There is also a default stop list named
SASHELP.ENGSTOP. (The use of such tables and others is discussed in a later chapter.)
3. Attach a Text Filter node to the Text Parsing node. The default frequency weighting is Log.
When there is no target variable, the default term weight is Entropy. It is a good idea to make
this explicit, as shown.
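For reference, one standard formulation of the entropy weight for term i (the details appear in a later chapter) is

w_i = 1 + \sum_{j=1}^{n} \frac{p_{ij}\,\log_2(p_{ij})}{\log_2(n)}, \qquad p_{ij} = \frac{f_{ij}}{f_i},

where f_ij is the frequency of term i in document j, f_i is the total frequency of term i, and n is the number of documents. Terms concentrated in a few documents receive weights near 1; terms spread evenly across the collection receive weights near 0.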
5. Select Filter Viewer from the properties panel. This accesses the Interactive Filter Viewer where
searching on terms in the documents is performed.
6. In the Terms window, right-click any term in the table. Select Find.
The table jumps to the portion of the table that contains the term glucose.
8. Expand to see the stemmed versions of glucose. The term occurred 263 times in its singular
form and one time as a plural.
9. Right-click on the first row of glucose and select Add Term to Search Expression.
The term glucose is added to the Search window. The preceding symbols (>#) indicate that
all stemmed versions (or synonyms if any were defined) of the term are searched for.
10. Click Apply. The following results appear in the Documents window:
The abstract is shown on the left. Stretch the column labeled TEXTFILTER_SNIPPET so that
you can see the term glucose in every row. This indicates the part of the abstract where glucose
appears. (This is the first occurrence if there are multiple occurrences.)
11. Place your mouse pointer above the TEXTFILTER_SNIPPET label. You see the following
message: “Left-click on column header to sort 93 rows of the table.” This indicates that 93
documents were selected because glucose (or glucoses, or both) is found at least once in each
document.
12. The TEXTFILTER_RELEVANCE column returns a measure of how strongly each document
is associated with the search term. This is a relative measure. The most relevant document
is given the highest value of 1. The calculation of this metric considers factors such as the
number of times a term (or its stemmed versions and synonyms) appears in a document. To get
an idea of this, click twice on the column heading for TEXTFILTER_RELEVANCE until you see
the most relevant document in the first row (the one with TEXTFILTER_RELEVANCE=1.0).
Then select that row.
13. Select Edit ⇒ Toggle Show Full Text to see the complete document with the highest relevance
score.
Reading through the full document, it is obvious that glucose is used many times. This explains
why this document has the highest relevance measure for a query based on this term. Select
Edit ⇒ Toggle Show Full Text to go back to the original way of viewing the documents.
14. Ninety-three documents were retrieved by the query. It is also useful to be able to retrieve
documents that contain one term, but do not contain another term. For example, in order to take
these 93 documents and eliminate any that contain the term diabetes, in the Search window,
enter -diabetes. (That is, precede the term with a minus sign as shown below.) Click Apply.
Practice
b. Connect the default Text Parsing and Text Filter nodes to this data set.
c. Identify the name of an actor or actress of interest to you. Using the Interactive Filter Viewer
of the Text Filter node, find all of the movies in the data set that have a synopsis that
mentions the name that you selected.
Frequency filtering is a methodology to create or add to a stop list. You can run the Text Parsing
node with the default stop list and then use frequency filtering to add terms to this list. Frequency
filtering specifies a cutoff frequency. Terms with a frequency below the cutoff are added to the
stop list. You can also specify a cutoff frequency at the high end so that terms with a frequency
above the cutoff are added to the stop list. For creating a start list, keep terms with frequencies
between the high and low cutoff values. The data set DMTX51.SASCOURSESTART contains
a start list that was obtained using domain knowledge and frequency filtering.
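A minimal sketch of the cutoff idea in SAS code, assuming a hypothetical table work.term_freqs with one row per term (columns Term and Freq) and illustrative cutoff values:

%let low=3;        /* low-frequency cutoff  */
%let high=5000;    /* high-frequency cutoff */

data work.extra_stop;
   set work.term_freqs;
   where Freq < &low or Freq > &high;   /* candidates to add to the stop list */
   keep Term;
run;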
a. Create a diagram named SAS Course Outlines. If you need to, create a data source for
the DMTX51.SASCOURSES data set. (The metadata is presented above.) Drag this data
source into the diagram.
b. Attach a Text Parsing node to the Input Data Source node. Change the Synonyms
property so that there is no synonyms table and add DMTX51.SASCOURSESTART
as a start list.
c. Attach a Text Filter node to the Text Parsing node. Change frequency weighting from
Default to Log. Change term weighting from Default to Inverse Document Frequency.
Log is the default frequency weight, but Entropy is the default term weight. Inverse
Document Frequency is recommended for documents larger than a paragraph. Run
the Text Filter node.
d. Open the Filter Viewer (also called the Interactive Filter Viewer). Determine how many
documents contain the term neural network by doing a search on this term. How many
documents did your search return? Why is this number not 22?
e. Select the document corresponding to the course with the code BDMT61. Select Edit ⇒
Toggle Show Full Text. You can read the course outline for BDMT61.
f. Select Clear and then Apply to return all of the documents in the collection. Navigate back to the
neural network row in the Terms table. Right-click the neural network TERM cell and select
View Concept Links. The concept link plot appears. What are some of the terms strongly
associated with neural network?
g. Close the Filter Viewer. Attach a Text Topic node to the Text Filter node. For User Topics
in the properties panel, select the data set DMTX51.SASTOPICS. Keep all of the other
default settings. Run the Text Topic node.
h. Access the Results window. Which topic contains the most documents?
i. Close the Results window for the Text Topic node. Look at the exported data and determine
what variables were created by this node.
j. Open the Topic Viewer from the properties panel and explore the results.
Note: A custom topic is similar to a predefined query. The topic weight shown in the
documents window determines whether the topic is present. (That is, the query
is satisfied.) If the topic weight exceeds the document cutoff, then the document
is classified as having the topic.
k. Close the Topic Viewer. Attach a Text Cluster node to the Text Filter node. Most users
attach the Text Topic node and Text Cluster node directly to each other, but they work
independently. Neither requires any results from the other.
l. Use the default setting and run the Text Cluster node. Open the Results window. How many
clusters were created? Can you interpret some of the clusters from the displayed descriptive
terms? How many SVD variables were created?
2.4 Text Categorization
Objectives
• Describe text categorization and explain how it can be accomplished
using the Text Rule Builder node.
• Use the Aviation Safety Reporting System (ASRS) to illustrate text
categorization.
Text Categorization
• Text categorization can be supervised or unsupervised.
• Supervised text categorization requires a nominal target variable.
• A training data set contains documents and one or more categorical target
variables, also referred to as labels.
• Labels are usually assigned by human judges, but they can be automated
labels assigned using a computer scoring method.
• If a human judge is used, one or more judges read and assess a document
to assign a label.
• The goal of text mining is to derive score code to accurately reproduce the
assigned label, and be able to score new documents that have not been
labeled.
Text Categorization
• Many text categorization problems are solved as complete self-contained
text analytics projects.
• The text provides all of the information used for scoring.
• Scoring documents is the primary purpose of the project.
• Some predictive modeling projects benefit from text analytics.
• There are several sources of information, including text.
• Scoring is in a more general area, such as fraud or direct marketing.
The predicted text categories provide inputs for the general area of interest.
This section focuses on text categorization as a self-contained text analytics project. A later chapter
addresses general predictive modeling with text analytics inputs.
This demonstration illustrates how to categorize documents with pre-assigned labels using SAS Text
Miner. In most text categorization problems, labels are assigned by human judges, so the labels are
often subject to error due to the usual problems of fatigue, environment, and so on. The labels are
classified as target variables.
The Aviation Safety Reporting System (ASRS) data set can be accessed from the following link:
http://asrs.arc.nasa.gov/
From the website:
“ASRS captures confidential reports, analyzes the resulting aviation safety data, and disseminates
vital information to the aviation community.
“More than 850,000 reports have been submitted (through October 2009) and no reporter’s identity
has ever been breached by the ASRS. ASRS de-identifies reports before entering them into the
incident database. All personal and organizational names are removed. Dates, times, and related
information, which could be used to infer an identity, are either generalized or eliminated.”
As with other data sets used in this course, data sets derived from ASRS have been modified.
The original data for this demonstration was extracted from the ASRS, pre-processed, and provided
to competitors in a text mining competition sponsored by SIAM and the NASA Ames Research
Center. The competition results were presented at the Seventh SIAM International Conference
on Data Mining held in 2007 in Minneapolis, Minnesota. Participants were prohibited from using the
R language, SAS software, and most commercial software. A link that provides access to the original
data follows.
https://c3.nasa.gov/dashlink/resources/138/
A single report in the ASRS database can be a composite derivation of two or more reports filed for
the same incident. For example, one runway incursion incident can result in three reports: one from
the pilot, one from the copilot, and one from an air traffic controller. An incident involving two or more
aircraft can have reports filed from pilots of all aircraft involved, as well as from air traffic controllers.
In both examples, there will be only one ASRS report, but that report will be prepared by NASA
professionals based on all reports submitted.
Reports can be submitted by aviation professionals, such as pilots, flight attendants, and mechanics.
Reports can also be submitted by non-professionals, such as private pilots.
A report in the ASRS database has many fields, with one field representing a primary narrative
describing the incident. This primary narrative is stored in the Text variable. All of the other fields
have been omitted to simplify the text mining component of the analysis. In practice, an automated
labeling system would attempt to use all fields.
If you examine an individual report, you will see rather unusual terms (for example,
instrumentlandingsystem as one word, rather than instrument landing system). The ASRS was
introduced in 1976, and the technology of that period was limited to the type of searches similar
to find and search features of the Text Filter Viewer. You can have pattern searches, looking for
specific patterns of characters regardless of the identification of terms, and you can have term
searches that require an exact match to a parsed term (token). To facilitate matching known
systems, factors, or events, NASA constructed a dictionary of keywords to facilitate rapid search
and retrieval. Reports were edited with the keyword dictionary in mind. Thus, instrument landing
system and ILS were replaced with instrumentlandingsystem. With modern tools like Latent
Semantic Indexing, the use of the keyword dictionary has less value, and the labor involved
in editing reports to match the keyword dictionary could be difficult to justify.
NASA manually assigns to each report 1 or more of 54 anomalies, 1 or more of 32 results, 1 or more
of 16 contributing factors, and 1 or more of 17 primary problems. For example, the report might
describe an event that was a “runway ground incursion” anomaly, with a “took evasive action” result,
that was a “human factor” contributing factor, and a “human factor” primary problem. These fields are
not available in the contest data. Instead, the contest data has 22 labels, with a value of 1 “if
document i has label j.” Otherwise, the label has a value of -1. Labels correspond to the topics
identified by NASA to aid in the analysis of the reports. The labels are not defined in the competition.
For the course data, the 22 labels are named Target01 through Target22, and the original coding
of (-1,1) has been changed to (0,1), with a code of 1 indicating the presence of the label in the
document. A document can be associated with one or more labels.
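A minimal DATA step sketch of that recoding (the input table name work.asrs_raw is hypothetical):

data work.asrs_recoded;
   set work.asrs_raw;             /* assumed table with (-1,1) labels */
   array t{22} Target01-Target22;
   do i=1 to 22;
      t{i}=(t{i}=1);              /* -1 becomes 0; 1 stays 1 */
   end;
   drop i;
run;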
The 22 target events vary considerably with respect to the difficulty of modeling them. An analysis
of the ASRS data was provided in the E.G. Allan et al. article “Anomaly Detection Using Nonnegative
Matrix Factorization” (2008). Descriptions of several of these target events, along with published
ROC index values for models that Allan et al. obtained using an analytic method known
as Nonnegative Matrix Factorization, are shown in the table below.
You will create a diagram flow that tests out the effect of trying to predict Target02 and Target05
using the Text Rule Builder node and various predictive modeling nodes from SAS Enterprise Miner.
Target02 will be modeled in a demonstration, and Target05 will be left as an exercise.
The ASRS training data contains columns that indicate which of the 22 manually assigned labels
relates to a given report. The goal is to develop a system to automatically detect incidents to avoid
the time, cost, and error associated with manually labeling the reports. In other words, you will be
building a model on a data set where experts have already read the reports and made evaluations.
(This is the sort of process that many people will use for sentiment analysis: create a data set of
labeled cases and then build an automatic classification/prediction system based on these known
cases.) In an actual operational system for this example, you would build 22 models to evaluate
whether each of these 22 types of events occurred. This collection of models would provide 22
predicted values, one for each target.
Note: Running the diagram setup for this demonstration will take several minutes.
Approximately 51% of the ASRS data exhibits a value of Target02=1. The balanced nature of the
data will help prevent problems often associated with a rare target. For binary targets with values in
(0,1), SAS Enterprise Miner chooses the value 1 as the primary event and 0 as the secondary event.
1. Create a new diagram and name it Aviation Safety Reporting System. (As we have stated
throughout, in virtually every case this diagram should have been already created for you
and the relevant nodes run.) The diagram will look like this when completed:
Enlargements of the process flow make it easier to identify the nodes that are used. Note that
the single Text Cluster node is in both flows.
The last Text Rule Builder node at the bottom has the name Text Rule Builder — Aggressive
to indicate that additional computing time is allowed to try to find more complex rules. The full
name is not displayed because of space limitations.
2. Create a data source for the ASRS training data using DMTX51.ASRS_TRAINING.
Use the following metadata:
The data set contains the 22 variables with the names Target01 through Target22 that were
defined above. We will use Target02 for this demonstration, so all the others should be rejected.
From a table in E.G. Allan et al. (2008, p. 215), the incident for Target02 has to do with
noncompliance with policy procedures. The variable Size is just the length of the report in bytes
and will not be used. Only the report itself (Text) is needed, but ID can be left as an ID variable.
3. Drag the ASRS data source onto the diagram.
4. To investigate the robustness of the automated assignment, add a Data Partition node
to partition the data set DMTX51.ASRS_TRAINING. Use a 50/30/20 partition.
5. Attach a Text Parsing node to the Data Partition node. Leave all the defaults as is and run
the node.
6. Attach a Text Filter node to the Text Parsing node. Change the default weightings to Log and
Mutual Information. (In this case, these are the default properties, so they are set to specific
values to highlight the properties used for default settings.) Leave all else in default mode
and run the node.
7. Attach a Text Cluster node to the Text Parsing node. Change the Transform settings as follows:
A preliminary exploration was performed to decide what properties to use. There might
be a diagram on the virtual lab server named ASRS – Preliminary that shows some of the
explorations that were performed. Exactly 22 clusters are requested to see whether the derived
clusters might be highly correlated with the 22 target values. This would be unlikely given that
clusters are mutually exclusive and the 22 target values are not mutually exclusive—that is,
some documents contain more than one target variable having a value of 1.
8. SVD Resolution set to High and Max SVD Dimensions set to 50 generates a 50-dimensional
SVD solution. This means that 50 SVD variables will be added to the data. Run the node and
go to Exported Data in the properties panel. Select the TRAIN data set and then click Explore.
Verify that you have a 50-dimensional solution by looking at the number of TextCluster_SVD
variables.
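One way to check is to list the exported columns. This is a sketch only; the exported table name varies by diagram, so work.textcluster_train is illustrative.

proc contents data=work.textcluster_train out=work._cols(keep=name) noprint;
run;

proc print data=work._cols;
   where upcase(name) like 'TEXTCLUSTER_SVD%';   /* the 50 SVD columns */
run;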
9. Attach a Decision Tree node to the Text Cluster node. Rename the node using the name
DT – Entropy – 50. This reminds you that you are using Entropy term weights and 50 SVD
dimensions.
10. Attach a Memory Based Reasoning (MBR) node to the Text Cluster node. Change
the Number of Neighbors property to 30. MBR implements a nearest neighbor algorithm,
so predictions are average target values for the 30 nearest neighbors to the observation being
predicted. Nearest neighbors are selected from the training data only.
The MBR node requires numeric inputs, and it assumes that the numeric inputs are orthogonal—
that is, the Pearson product moment correlation between all pairs of inputs is zero. The SVD
inputs are derived to be orthogonal, so this condition is satisfied. For a training data set with
more than 10,000 observations that have 50 input variables, larger values than the default 16
are recommended. Unfortunately, larger values can dramatically increase the execution time for
the node. It is common to try to find an optimal neighborhood size, but this can be very time
consuming.
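In symbols, the MBR prediction for an observation x is the average target value over its k nearest training neighbors N_k(x):

\hat{p}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i, \qquad k = 30.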
11. Attach a Regression node to the Text Cluster node. Change Selection Model to Stepwise,
and change Selection Criterion to Validation Error. Change Use Selection Defaults to No.
The use of validation error as a criterion causes the Regression node to apply the stepwise
selection algorithm until a stopping rule is satisfied and then select the particular step where
the minimum validation error is achieved.
13. Attach a Neural Network node to the Regression node. This causes the neural network model
to use only the subset of the 50 SVD values that were selected by the stepwise selection
algorithm in the Regression node. Change Model Selection Criterion to Average Error.
14. Attach a Text Rule Builder node to the Text Filter node. The Text Rule Builder does not use
the SVD inputs from the Text Cluster node. Use the default settings.
15. Attach a second Text Rule Builder node to the Text Filter node. Use the following properties:
Rename the node Text Rule Builder – Aggressive. The term aggressive reflects that you are
willing to sacrifice processing time to derive more complex and less pure rules. The Very High
generalization error attempts to offset the potential overfitting problem inherent in using a very
low setting for Purity of Rules and a very high setting for Exhaustiveness.
16. Now connect all six prediction nodes to the Model Comparison node. In the properties panel
for the Model Comparison node, set the Selection Statistic property to ROC and the Selection
Table property to Test. As a consequence of these changes, the ROC index for the test data will
be shown at the very beginning of the Model Comparison node Results window. Run the Model
Comparison node.
The Fit Statistics table shows that the neural network and regression scores achieve the highest
ROC index for the test data set. The aggressive Text Rule Builder results are slightly better than
those produced using the default settings.
The worst ROC index in the table above is better than the value reported by Allan (2008), but
to properly compare the techniques presented here and those described in the Allan paper,
the same evaluation data set must be used. The ROC curves follow.
In color, you can see that there is little difference between the MBR model, the neural network
model, and the regression model. Although the overall ROC index is similar for all models, the
decision tree and the default Text Rule Builder model appear to be uniformly inferior, whereas
the aggressive Text Rule Builder model seems to consistently beat the bottom two models.
Although the ROC index might be preferred, the misclassification rate is easy to interpret. Here
is the Fit Statistics table with misclassification rate displayed:
Note: The ROC index also has a rather easy interpretation, although not quite as obvious
as the misclassification rate. Suppose that you randomly draw one Target02=1
observation and one Target02=0 observation and score both observations with the
model used to construct the ROC curve. The ROC index (the area under the ROC
curve) is the probability that the score for the Target02=1 case will be higher than
the score for the Target02=0 case. Thus, the ROC index reflects the quality of the
model in discriminating between the two cases. If the ROC index is 50%,
then you could just as easily flip a coin to decide how to classify an observation.
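The following DATA step sketch (not part of the course software) estimates the ROC index directly from this pairwise definition. The data set name SCORED and the variables Target02 and P_Target02 are assumptions chosen to mirror the discussion; substitute the names from your own flow.

proc sql;
   /* all (event, nonevent) score pairs */
   create table pairs as
   select a.p_target02 as score1, b.p_target02 as score0
   from scored a, scored b
   where a.target02=1 and b.target02=0;
quit;

data _null_;
   set pairs end=last;
   concordant + (score1 > score0);   /* event scored higher */
   tied       + (score1 = score0);   /* ties count half     */
   n + 1;
   if last then do;
      rocIndex = (concordant + 0.5*tied) / n;
      put "Estimated ROC index: " rocIndex 6.4;
   end;
run;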
There might be compelling reasons to use a set of text-based rules for scoring. The SVD
variables are difficult to interpret, whereas the Text Rule Builder rules are easy to explain. With
a misclassification rate of about 32%, the aggressive Text Rule Builder model does not seem
to be very good, but we will explore the model to see why it provides a powerful alternative
to conventional predictive modeling nodes.
18. Open the Results window for the Text Rule Builder – Aggressive node.
The Rule Success window provides a color-coded graphical display to help visualize how well
the rules classify documents.
In color, the blue (target=1) and red (target=0) bar segments show whether a particular
rule tends to favor a particular outcome. The most accurate rules tend
to be dominated by one color. Unfortunately, when many rules are generated, which happens
when you use aggressive property settings, the chart is truncated to fit into the Results window
and to have sufficient resolution for reading. The above chart displays only rules 114 through
187. A total of 187 rules were obtained. For comparison, the default setting produced 162 rules.
19. Examine the Rules Obtained window for the aggressive settings.
The first rule is temporaryflightrestriction. (Recall, NASA edits safety reports to match certain
predefined keywords, which can be concatenated collections of words.) A temporary flight
restriction is exactly what you might think: it is a mandatory restriction on flight that is temporary
in nature, such as prohibiting flight over a sports arena when a game is in progress. Whereas
many temporary restrictions are expected, such as restricted flight over sporting events, others
might be so random as to be unexpected, such as visits by important government officials. When
a government leader (such as the President of the United States) or an important public figure
(such as the Pope of the Catholic Church) makes a public appearance, flight over the event is
usually restricted. Even flights over the transition route to the event might be restricted. Flights
might also be temporarily restricted because required navigational or communication equipment
is temporarily out of service. Even if a temporary flight restriction is unexpected, it is published in
a timely fashion, and all pilots are expected to be aware of all restrictions relevant for a planned
flight. The following two images show temporary flight restrictions (TFRs) being used in Northern
California related to three events.
A temporary flight restriction is precisely specified by a textual description of the geographic
boundaries and altitudes that are restricted. The above TFR for OAK (OAK is the three-letter
airport code for Metropolitan Oakland International Airport) and surrounding areas, represented
by the shaded circle on the map, restricts aircraft from the surface to 3,000 feet AGL
(above ground level). If a pilot flies into the airspace defined by the TFR without having been
given clearance by air traffic control (ATC), the pilot is violating a Federal Aviation Regulation
and is subject to penalties, including loss of pilot’s license. Such violations can be caused by
poor communication with ATC, misreading a chart, or inadequate preflight planning. Whatever
the cause, filing an ASRS report helps NASA and the FAA improve safety, and also immunizes
the pilot from civil penalties imposed by the FAA, assuming there was no injury or physical
damage related to the flight in the TFR airspace.
The two shaded circles over San Francisco are for the Fleet Week Air Show and a Major League
Baseball (MLB) playoff game between the Chicago Cubs and the San Francisco Giants. Date
and time values are given for both events in the TFR information box, as well as altitude
restrictions. These two TFR areas are more confusing because they overlap, and a pilot might
fail to notice the lighter shaded area indicated in the above graphic.
For the ASRS training data, 94 documents exhibited the term temporaryflightrestriction, and 93
of these documents were flagged as an operation noncompliance event (Target02=1). This
produces a precision value for this rule of 93/94=98.94%. On the other hand, the training data
has 6,437 training observations with Target02=1, so recall for this rule is 93/6437=1.44%. When
the term is present in a document, it is almost certainly a noncompliance event, but there are
many noncompliance events that do not use the term. The F1 statistic is the harmonic mean
of precision and recall. SAS code to calculate this quantity would look like the following.
F1=1/(0.5*(1/Precision)+0.5*(1/Recall));
The F1 value measures the trade-off between precision and recall, and gets larger as precision
and recall get closer to each other in value. Because recall is so small, the F1 value for the first
rule is small: F1=2.85%.
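As a quick check of this formula, the following DATA step (my own, not from the course files) reproduces the precision, recall, and F1 values quoted above for the first rule.

data _null_;
   precision = 93/94;     /* flagged documents that contain the term */
   recall    = 93/6437;   /* all Target02=1 documents in training    */
   F1 = 1/(0.5*(1/precision) + 0.5*(1/recall));
   put precision= percent8.2 recall= percent8.2 F1= percent8.2;
run;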
To score a document, the rules are applied in order. If a rule is satisfied, then the target value
associated with the rule is assigned, and the variable w_Target02 (Why Into: Target02) is
assigned the rule number. If no rule is satisfied, the secondary event target value is assigned.
As stated above, Enterprise Miner defines the secondary event value to be 0 for (0,1) binary
targets. If no rule applies to a document, then w_Target02 is assigned a missing value,
and I_Target02 is assigned a value of 0.
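The scoring logic is easy to sketch as an ordered chain of IF/ELSE IF statements. The following DATA step is a hypothetical illustration only: the input data set DOCS and the term flags hasTFR and hasRunwayIncursion are invented names, and the actual score code is generated by the node.

data scored;
   set docs;                              /* hypothetical input          */
   if hasTFR then do;                     /* rule 1 fires                */
      I_Target02 = 1; w_Target02 = 1;
   end;
   else if hasRunwayIncursion then do;    /* rule 2 fires                */
      I_Target02 = 1; w_Target02 = 2;
   end;
   /* ... remaining rules, applied in order ... */
   else do;                               /* no rule fires               */
      I_Target02 = 0;                     /* secondary event value       */
      w_Target02 = .;
   end;
run;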
20. Close the Results window and select Exported Data in the properties panel for the Text Rule
Builder – Aggressive node. Click the training data set, and select Explore. Select the Plot
Wizard, and create a bar chart. Use the following properties:
The frequency 182 is for the random sample of documents selected for exploration. Recall that
for dynamic exploration, SAS Enterprise Miner pulls a subset of the data from the server to the
client.
Rule 107 is shown in the table that follows with columns rearranged.
Of the remaining documents, a total of 388 documents do not satisfy the first 106 rules but
do satisfy rule 107. Of these 388 documents, 338 have a correctly identified target value of 0.
These 388 documents are then removed from the data, and the remaining rules are then created.
2.4 Text Categorization 2-71
Practice
2.6 Solutions
Solutions to Practices
1. Finding a Text String in a Movies Data Set
Set up the flow diagram in the standard way.
The data set was created with the following variable definitions:
Synopsis is the Text variable to analyze. Run the nodes. (This takes a few minutes to do.)
Go to the Interactive Filter Viewer. Suppose you are interested in finding movies that star
Sandra Bullock. There are at least two ways to search on her name. One way is simply to enter
“Sandra Bullock” (with quotation marks, but uppercase or lowercase makes no difference)
in the Search window. This retrieves 15 documents.
Attach a Text Topic node to the Text Filter node. Specify the following user topics. Note that
the topic BPV stands for “Brad Pitt Vampire.”
There are 46 movies that satisfy this linear algebra query. Only one document is relevant,
and fortunately, it is the top-scoring document. The movie Seven mentions Interview with
the Vampire, but it is not a vampire movie. The following movies get a high score for brad
and pitt but not for vampire: Legends of the Fall, Fight Club, Troy, and Seven Years in Tibet.
If you want to emphasize the term vampire more than the proper nouns brad and pitt, you can
modify the topic weight for the term. Suppose you double the weight for vampire.
You eliminate a number of the non-vampire movies listed above, and in doing so you also
eliminate Brad Pitt movies. Legends of the Fall still finishes in the top 10. Because your goal
is information retrieval, and you have found the information that you were looking for, the false
positives do not pose a problem. The linear algebra approach to information retrieval can be
superior to a simple keyword search because you can use different weightings for different
words.
2. Text Mining SAS Course Descriptions (Optional Exercise and Self-Guided Demonstration)
Below are the major steps of this exercise and demonstration. You are also introduced to some
features that were not previously mentioned.
After you set up the diagram and run the Text Filter node as specified earlier, the Interactive
Filter Viewer should resemble the following:
If you sort the TERM column in the Terms table by clicking the heading cell that contains the
word TERM, then you can use a quick-find feature. Select any term in the TERM column. Then
enter the first letter of the term that you want to find. The window moves to the first term starting
with that letter. You can also select Edit ⇒ Find to go directly to a desired term.
In the Filter Viewer, select Edit ⇒ Find, and enter the two-word phrase neural network. After
you click OK, you are taken to the first cell in the TERM column that contains the phrase neural
network. Notice that there are 13 documents containing this phrase. Right-click in the cell
containing the phrase neural network, and select Add Term to Search Expression. The Search
window contains "neural network". The quotation marks are required if the search expression
has more than one word. If you look at the filter rules, you see that this expression searches for
documents containing neural network and any of the synonyms of neural network.
Stretch the TEXTFILTER_SNIPPET column so that you can see the term neural networks for
all the listed documents. The TEXTFILTER_RELEVANCE column contains the result of the inner
product calculation described in the previous discussion of a Boolean query. The cutoff is not
displayed, but all documents that produced a result above the cutoff are returned. Typically,
the courses with lower relevance scores include neural network material, but the courses include
other material as well. The neural network portion is a small section of the course. You can
determine that 17 documents were returned by placing your mouse pointer over any of the
column headings.
On the other hand, when you look at the Terms window, you see that neural network appears
sometimes as a noun group and sometimes as a simple noun. These are treated as two
separate types of terms. Consequently, the number of documents in which these two types
of terms appear (13+9=22) does not have to agree with 17 from above. A single document could
contain neural network at least once as a noun group and then neural network appears
elsewhere in the document as a noun.
You can also look at a full document. Select the document corresponding to the course with
the code BDMT61. Select Edit ⇒ Toggle Show Full Text. You can read the course outline for
BDMT61.
The relevance score for this document (.25) is in the low end of the relevance values for returned
documents. It is not an extremely high value because most of the course outline discusses topics
that are not related to neural networks. You can select Edit ⇒ Toggle Show Full Text again to
return to one row per document.
Of the 17 returned documents, most appear to be legitimately related to neural networks,
so the precision of this query (percent of the documents returned that are relevant to the search
query) is approximately 100%.
Select Clear and then Apply to retrieve all of the documents in the collection. Navigate again to the
neural network row in the Terms table. Right-click the neural network cell. Select View Concept
Links. The concept link plot appears.
The Concept Linking window appears. You can see the terms most strongly associated with
neural network. You can also right-click on any of these terms and select Expand Links to look
at indirectly associated terms.
After you attached a Text Topic node to the Text Filter node, you were asked to go to the
properties panel. Under User Topics, open the customized topic list, DMTX51.SASTOPICS.
After you run the Text Topic node with the other settings retained as defaults, open the Results
window.
The three bars to the far left in both bar chart windows relate to the three custom topics: data,
programming, and statistics, which were specified in the DMTX51.SASTOPICS data set. The
remaining bars relate to (automatically) derived topics. The Number of Terms by Topics bar chart
reveals that only a few terms were used to define the custom topics. Perhaps more terms should
be used. The Number of Documents by Topics bar chart reveals that the custom topics are more
prevalent than the derived topics. The smallest custom topic, programming, appears in 127
documents. The most popular automatically derived topic appears in 96 documents. (See the
arrow pointing to the bar.) You can get the topic frequencies by positioning the cursor over the
bar related to a topic. The plots are dynamic.
Close the Results window. You were asked to determine what variables were created by the Text
Topic node. Opening the exported data for this node shows that the variables TextTopic_1 to
TextTopic_28 and TextTopic_raw1 to TextTopic_raw28 were created. Twenty-eight topics were
generated (3 user topics + 25 derived topics).
Open the Interactive Topic Viewer through the properties panel. The following window appears:
A custom topic is similar to a predefined query. The first three topics are always the user-defined
ones (in this case: data, programming, and statistics). The topic weight shown in the Documents
window determines whether the topic is present. (That is, the query is satisfied.) If the topic
weight exceeds the document cutoff, then the document is classified as having the topic. Close
the Topic Viewer.
Running the Text Cluster node with default settings leads to a 25-cluster solution. Maximize
the cluster table and examine the descriptive terms for each cluster.
The descriptive terms help identify the courses that appear in each cluster. It would also be
useful to read some of the documents in each cluster to better understand what types of
documents belong to a cluster. Because courses often contain material from several subjects,
clustering into mutually exclusive categories might be less useful than the topics created from
the Text Topic node.
Looking at the exported data for the Text Cluster node shows that 36 SVD variables were
generated.
b. Use a Metadata node to change the role of Comedy to Target. Even though you will only
be using the synopsis to create input variables, you should set the other genre binary flags
to Rejected.
c. Set up a process flow for predictive modeling, including a Data Partition node, appropriate
predictive modeling nodes, and a Model Comparison node. The solution will use a Text Rule
Builder node and a Decision Tree node.
1) Attach a Data Partition node to the Metadata node. Set the partition to 75/25. (You might
have chosen a different partition, which is acceptable, but results will not match if you do
not use a 75/25 split.)
2) Attach a Text Parsing node to the Data Partition node. Use default settings.
3) Attach a Text Filter node to the Text Parsing node. Use default settings.
4) Attach a Text Cluster node to the Text Filter node. Use default settings, except to speed
execution, choose exactly 10 clusters.
5) Attach a Text Topic node to the Text Cluster node. Use default settings.
6) Attach a Decision Tree node to the Text Topic node. Change Leaf Size to 25,
and change Assessment Measure to Average Square Error.
7) Attach a Text Rule Builder node to the Text Filter node. Attach the Decision Tree node
and the Text Rule Builder node to a Model Comparison node.
8) Run the Model Comparison node.
The decision tree and the rules produced by the Text Rule Builder node provide very
similar results based on the ROC curve. The decision tree has a slight edge with respect
to misclassification rate.
For further study: Can you use the Text Topic node to obtain a custom comedy topic that
competes with the results above?
Lesson 3 Algorithmic and
Methodological Considerations in
Text Mining
Objectives
• Explain tokenization and describe the transition from tokens to words
in a language.
• Define frequency (local) weights and term (global) weights and describe
how they are used.
• Provide guidelines for choosing weights.
• Explain the basic vector (metric) space model for representing documents
and terms.
• Explain how singular value decomposition projects documents and terms
into a smaller dimensional metric space.
The approach for parsing and quantifying your text data varies based on the task that you want
to perform. In this chapter, we discuss what is available in SAS Text Miner. In other text analytic
products, other techniques are used. For example, SAS Sentiment Analysis provides capabilities
for atomic fact extraction. However, many atomic fact extraction exercises must be customized.
For example, you could train a predictive model to mimic how a domain expert assigns categories.
Supervised classification often requires problem-specific tasks related to data preparation and model
building.
Characteristics of a Document
A document consists of the following elements:
• letters
• words
• sentences
• paragraphs
• punctuation
• possible structural items (chapters, sections)
Various weighting strategies are introduced later to modify simple counts for the terms.
Zipf’s Law
Let t1,t2,…,tn be the terms in a document collection arranged in order from
most frequent to least frequent.
Let f1,f2,…,fn be the corresponding frequencies of the terms. The frequency
fk for term tk is proportional to 1/k.
Zipf’s law and its variants help quantify the importance of terms
in a document collection. (Konchady 2006)
“The product of the frequency of words (f) and their rank (r)
is approximately constant.”
In practice, Zipf's Law is derived as a Power Law, with free parameters that can be estimated based
on the document collection. The general formula is shown here:

f_k = C / (α + k)^β

where C is a constant such that, for given α and β, Σ_{k=1}^{n} f_k = T, the total number of words in the
document collection. The parameters α and β are estimated for a given document collection.
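To see the power law in action, the following DATA step (an illustration only; the parameter values alpha, beta, and C are arbitrary assumptions, not estimates from any collection) generates frequencies from the formula and shows that the product of frequency and rank settles toward a constant as rank grows, in line with the quotation below.

data zipf;
   alpha = 0.5;  beta = 1;  C = 1000;    /* assumed parameter values */
   do rank = 1 to 20;
      f = C / (alpha + rank)**beta;      /* power-law frequency      */
      product = f * rank;                /* approximately constant   */
      output;
   end;
run;

proc print data=zipf noobs;
   var rank f product;
run;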
Konchady (2006) relates Zipf’s Law to quantifying the importance of a term: “…the number of
meanings of a word is inversely proportional to its rank.” (Konchady 2006, page 87)
Application of Zipf’s Law permits identification of important terms for purposes such as describing
concepts or topics. You will not encounter Zipf’s Law (or similar theoretical laws) directly, but you can
see the results of Zipf’s Law in text mining applications (for example, in the list of terms used to
define a topic). Along with methods such as Hidden Markov Models (HMM), the implementation is
often hidden from the user. Only the results of the methodology are visible.
Note: General methodologies like HMMs have specific applications, like part-of-speech tagging.
An HMM views a document as a collection of states defined by words. There might be many
paths that could take a document from one word to another, but a specific document has
only a single realized path. If you can calculate probabilities associated with the states
(terms) and paths (intermediate terms) for a document collection, you can start answering
questions like, “Given a term and the words that come before and after the term, what
is the probability that the term is a noun?” General HMM software gives the user the ability
to define, for example, the maximum length of a path with respect to the number of states
contained in the path. For part-of-speech tagging and multi-word term identification, the
software contains “hardcoded” values, such as how many words to examine before the term
of interest, and how many words after the term can be used to calculate probabilities.
Quantification Steps
The basic strategy for the quantification of free-form text with the Text
Miner nodes involves the following:
• obtaining the corpus of terms that will be used after applying stemming,
synonym creation, filtering, and so on
• representing each document and each term in a vector space via the
document by term (or term by document) matrix
• projecting the documents and terms into a lower dimensional vector
space
• conducting clustering and topic generation for the documents in this lower
dimensional vector space
• For the table in this slide, each document is represented by a row vector of 5000 frequencies.
• Doc 1 has the row vector (1, 1, 2, 2, 0, 1, …, 0, 0).
• Notice that Doc 1 and Doc N have somewhat similar vector values, as do Doc 2 and Doc 3.
Obtaining document by term frequencies shows how documents can be represented in a vector
space whose elements are the frequencies of each term. However, this is likely to be a high-
dimensional space with many 0 values. The dimensionality can be reduced by language processing
steps such as stemming, synonym creation, and filtering out low-frequency terms.
The basic data table of document by term frequencies can of course be transposed into the term
by document frequency matrix.
Transposing: Term by Document Matrix after Stemming, Synonyms, and so on
However, there are various problems with this type of data. Even after stemming and filtering, there
are often a large number of terms remaining, so there is still the difficulty of a high-dimensional
vector space. Also, the data matrix is very sparse: typically, 90% or more of the document-term
frequencies are 0. Furthermore, by Zipf's law, the frequency counts of terms are very long tailed.
That is, there is a small number of very common terms that are used over and over again in most
of the documents.
The dimensionality and sparseness problems will be addressed by projecting the document and term
vector spaces into a lower-dimensional space by means of a key theorem from linear algebra
referred to as the singular value decomposition (SVD). Before applying SVD, however, it has been
found that weighting the raw document-term cell counts usually produces better text mining results.
Weighting also helps alleviate the problem of the skewness of the higher frequency terms by making
them less influential.
Frequency weights, which are often called local weights in the text mining and information retrieval
literature, are the first step in transforming the raw cell counts. (Actually, frequency weights are
a function of the raw cell counts, and the following three functions can be chosen by the user.)
Term weights, often called global weights in the literature, modify frequency weights to adjust for
document size and term distribution.
A brief discussion of the formulas behind the weights begins below. Although you might gain some
insight by looking at the mathematics, experimentation rather than intuition is often the best strategy
for choosing weights. Experience with similar text analytic problems can help you develop your own
guidelines.
The entropy term weight is slightly misnamed here. The usual definition of entropy from Shannon's
(1948) information theory is the expression −Σ_{j=1}^{d_i} p_ij log2(p_ij), so a better way to describe the term
weight used here would be 1 − normalized entropy.
Because the logarithm of zero is undefined, the product in the numerator is taken to be zero
if the proportion pij is zero. Two simple cases illustrating the calculation of this term weight are shown
below.
Simple Case 1: Term i occurs one time in exactly one of the total n documents.

G_i = 1 + [(1/1) log2(1/1)] / log2(n) = 1 + [(1)(0)] / log2(n) = 1

Simple Case 2: Term i occurs one time in each of the total n documents.

G_i = 1 + [n (1/n) log2(1/n)] / log2(n) = 1 + [−log2(n)] / log2(n) = 1 − 1 = 0
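A one-line DATA step (my own sketch; n=16 is an arbitrary collection size) verifies the two simple cases, using the form G_i = 1 + Σ_j p_ij log2(p_ij) / log2(n) implied by the calculations above.

data _null_;
   n = 16;                                     /* assumed collection size */
   /* Case 1: the term occurs once, in exactly one document */
   G1 = 1 + ( (1/1)*log2(1/1) ) / log2(n);     /* equals 1 */
   /* Case 2: the term occurs once in each of the n documents */
   G2 = 1 + ( n*(1/n)*log2(1/n) ) / log2(n);   /* equals 0 */
   put G1= G2=;
run;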
If a term appears in every document, then the IDF weight is 1 because then d_i = n. The maximum
weight for a fixed document collection occurs when the term appears in exactly one document,
and the weight becomes 1 + log2(n). No upper limit exists because the number of documents n
in a collection can be arbitrarily large.
Entropy and IDF weights achieve a maximum when a term occurs exactly one time in exactly one
document. This implies a very discriminating term, but not a very useful one, because it occurs
in only one document. In fact, by default, the Text Filter node removes terms that do not occur
in at least four documents, although this is under user control. Both weights are at minimum or near
minimum if a term appears exactly one time in every document. In this case, the term is not very
discriminating because it occurs everywhere throughout the collection of documents.
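The endpoints just described are consistent with an IDF weight of the form G_i = 1 + log2(n/d_i). The following sketch (the names are mine, and the node's exact implementation might differ in details) evaluates that form for several document frequencies.

data idf;
   n = 100;                      /* assumed number of documents      */
   do d = 1, 4, 25, 100;         /* document frequency d_i           */
      G = 1 + log2(n/d);         /* 1+log2(n) at d=1; 1 at d=n       */
      output;
   end;
run;

proc print data=idf noobs; run;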
The mutual information term weight is defined as

G_i = max over k of log10[ P(t_i, C_k) / (P(t_i) P(C_k)) ]

where

C_1, C_2, …, C_k are the k levels of a categorical target variable.
P(t_i) is the proportion of documents containing term i.
P(C_k) is the proportion of documents having target level C_k.
P(t_i, C_k) is the proportion of documents where term i is present and the target is C_k.

(Note that 0 ≤ G_i < ∞ and the log is base 10.)
Although G_i is theoretically unbounded, in practice it is usually less than 1. Here is a simple example
showing how it is calculated for the case of a binary target, where k=2:

                                        Target level C1   Target level C2   Total
Documents where term t_i is present           10                50            60
All documents                                110                75           185

From this crosstabulation, we get P(t_i) = 60/185, P(C1) = 110/185, P(C2) = 75/185,
P(t_i, C1) = 10/185, and P(t_i, C2) = 50/185. The two log ratios are
log10[(10/185)/((60/185)(110/185))] ≈ −0.55 and log10[(50/185)/((60/185)(75/185))] ≈ 0.31,
so G_i = max(−0.55, 0.31) ≈ 0.31.
Generalizing this to the case of a categorical target with k>2 merely requires extending
the crosstabulation to a 2 by k table and then computing the individual factors in the same way
as above.
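The same calculation is easy to script. The following DATA step (my own check, not course code) reproduces the numbers from the crosstabulation above.

data _null_;
   n = 185;
   p_t   = 60/n;                     /* term present            */
   p_c1  = 110/n;   p_c2  = 75/n;    /* target level marginals  */
   p_tc1 = 10/n;    p_tc2 = 50/n;    /* joint proportions       */
   g1 = log10(p_tc1 / (p_t*p_c1));   /* about -0.55 */
   g2 = log10(p_tc2 / (p_t*p_c2));   /* about  0.31 */
   G  = max(g1, g2);
   put g1= g2= G=;
run;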
Multiplying the local and global weights produces an adjusted count that is often superior to using
raw counts alone.
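As a sketch of that multiplication (the input data set CELLS and its variables are assumptions for illustration, and the Log frequency weight shown is the usual log2(count+1) form), the adjusted value is simply the product of the local and global weights:

data weighted;
   set cells;                        /* assumed: doc, term, count, G  */
   localWeight = log2(count + 1);    /* Log frequency (local) weight  */
   adjusted    = localWeight * G;    /* local weight x term weight    */
run;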
The term-document frequency matrix, weighted or unweighted, is the foundation of the linear algebra
approach to text mining.
Term Weight Guidelines
• When a target is present, Mutual Information is the default. It is a good
choice when it can be used.
• Entropy and IDF weights give higher weights to rare or low-frequency
terms.
• Entropy and IDF weights give moderate to high weights for terms that
appear with moderate to high frequency but in a small number of
documents.
• Entropy and IDF weights vary inversely to the number of documents
in which a term appears.
• Entropy is often superior for distinguishing between small documents that
contain only a few sentences.
• Entropy is the only term weight that depends on the distribution of terms
across documents.
A simulation study artificially creates a document collection and distributes terms across the
documents using various strategies (for example, creating rare terms and creating terms with
frequency counts that follow a certain distribution). Even though the data set is completely artificial
and simple, it is informative to examine these results.
You can verify the IDF calculations using the Doc Freq column and noting that there are 100
documents in the simulation. For example, for the term armadillo, the IDF term weight is as follows:
The gray scale version is difficult to interpret, but the color version of the table highlights low
information terms with a red background and high information terms with a green background.
The target variable emulates a human judge assigning a label to a document that is perceived
to be about marine mammals. For the judge to assign a value of 1, there must be sufficient
information about marine mammals to warrant the document being labeled as a marine mammal
document. The presence of a marine mammal term is not sufficient, as evidenced by otter having
a mutual information weight of zero. The term otter only appears one time in a document that
mentions no other marine mammals. The term raccoon gets the third highest mutual information
score because it happens to appear with high frequency in some of the marine mammal documents.
The results show that entropy and IDF weights tend to produce similar results. IDF is recommended
for larger documents, whereas entropy might be more appropriate for smaller documents. Of course,
these results cannot be extrapolated to all document collections. In particular, a typical document
in the simulated collection is small, so the results would be more useful for document collections
such as the Medline data, but less useful for multi-page reports.
Objectives
• Sketch how singular value decomposition (SVD) is used to project the
high-dimensional document and term spaces into a lower-dimension
space.
• Illustrate what is happening with a simple example.
• Discuss Text Topic and Text Cluster results in light of the SVD.
SVD Example
• Let's look at Russ Albright's example consisting of three documents:
  Doc 1: Error: invalid message file format
  Doc 2: Error: unable to open message file using message path
  Doc 3: Error: unable to format variable
• These three documents generate the following 11 x 3 term-document matrix A:

                       doc 1   doc 2   doc 3
  Term 1   error         1       1       1
  Term 2   invalid       1       0       0
  Term 3   message       1       2       0
  Term 4   file          1       1       0
  Term 5   format        1       0       1
  Term 6   unable        0       1       1
  Term 7   to            0       1       1
  Term 8   open          0       1       0
  Term 9   using         0       1       0
  Term 10  path          0       1       0
  Term 11  variable      0       0       1
SVD Example
• With the right software (for example, PROC IML), it is very easy to
compute the SVD decomposition for this little example and obtain the
separate matrices U, Σ, and V.
• The product U^T A produces the SVD projections of the original document
vectors. These are the document SVD input values that you have seen,
which are produced by the Text Cluster node (except that they are
normalized for each document as explained on a later slide).
• This amounts to forming linear combinations of the original (possibly
weighted) term frequencies for each document.
SVD Example
• First project the first document vector d_1 into a three-dimensional SVD
space by the matrix multiplication U^T d_1.
(U^T was obtained using the SVD matrix function in PROC IML applied to the A matrix.)
SVD Example
• The SVD dimensions are ordered by the size of their singular values
(“their importance”). Therefore, the document vector can simply
be truncated to obtain a lower-dimensional projection.
• The 2-D representation for doc 1 is (1.63, .49).
• As a final step, the Text Cluster node then normalizes the coordinate
values so that the sums of squares for each document are 1.0.
• Using this document’s 2-D representation, 1.63² + .49² = 2.897 and √2.897 = 1.70.
• Therefore, the final 2-D representation for doc 1 would be
(1.63/1.70, .49/1.70) ≈ (.96, .29).
• These are the SVD1 and SVD2 values that you would see for this document
by looking at the exported data coming out of the Text Cluster node.
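If you want to reproduce these numbers yourself, the following PROC IML sketch applies CALL SVD to the A matrix above and carries doc 1 through the projection, truncation, and normalization steps. (The signs of singular vectors are arbitrary, so your coordinates might differ from the values above by a sign.)

proc iml;
   /* the 11 x 3 term-document matrix A from the example */
   A = {1 1 1, 1 0 0, 1 2 0, 1 1 0, 1 0 1,
        0 1 1, 0 1 1, 0 1 0, 0 1 0, 0 1 0, 0 0 1};
   call svd(U, Q, V, A);          /* A = U*diag(Q)*V`              */
   d1    = A[,1];                 /* document 1 as a column vector */
   proj  = U`*d1;                 /* 3-D SVD projection            */
   p2    = proj[1:2];             /* truncate to 2-D               */
   svd12 = p2 / sqrt(ssq(p2));    /* normalize to unit length      */
   print proj p2 svd12;
quit;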
Dimensionality Reduction
• The tiny example given here has a term-document matrix of rank r=3.
(The rank is always less than or equal to the minimum of the number
of documents and the number of terms.)
• In actual practice, the rank of the term-document matrix is usually
in the thousands, so the SVD algorithm is used to dramatically reduce
the dimensionality of the data.
• The SVD algorithm derives SVD dimensions in order of “importance”
(based on the singular values σ_i).
• The number of SVD dimensions to keep is based on looking at these
singular values and establishing a cutoff value k.
Dimensionality Reduction
• The user specifies a maximum dimension M (default=100 and highest
allowed value=500) for the number of SVD dimensions to keep.
• The SVD algorithm produces the M singular values in decreasing order.
• The sum of the M singular values (squared) acts as a metric for the
amount of information in the document collection. Treating the sum
of the top M squared values as the “total information” is useful for arriving
at a reasonable cutoff.
Dimensionality Reduction
• The user also specifies an SVD Resolution value:
• High=100%
• Medium=5/6=83.3%
• Low=2/3=66.67% (the default)
• Based on these two settings, the Text Cluster node uses a simple algorithm
to decide on the final number of SVD dimensions to use.
Dimensionality Reduction
• To illustrate the logic for deciding the number of dimensions to use:
• Suppose the user sets Max SVD Dimensions=100 and SVD Resolution=Low
(66.7%).
• Assume that you are working with a big document collection so that the rank
of the term-document matrix is much larger than 100.
• Let the sum of the first 100 squared singular values be given as Σ_{i=1}^{100} σ_i² = C.
• The algorithm determines the minimum dimension k ≤ 100 such that Σ_{i=1}^{k} σ_i² / C ≥ .667.
The problem of dimensionality reduction is challenging in a text mining setting. The default settings
of the Text Cluster node often work very well, but you should be prepared to experiment with the
maximum number of SVD dimensions kept, as with many other parameter settings.
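The selection logic itself is a short cumulative-sum calculation. The following PROC IML sketch (made-up singular values, chosen only to show the mechanics) finds the smallest k whose cumulative squared singular values reach the resolution percentage of the top-M total.

proc iml;
   sv = {9.5, 6.2, 4.1, 3.0, 2.2, 1.5, 1.1, 0.8};  /* assumed, largest first */
   resolution = 0.667;                    /* Low                           */
   C   = ssq(sv);                         /* sum of the squared values     */
   cum = cusum(sv##2) / C;                /* cumulative proportion         */
   k   = min(loc(cum >= resolution));     /* smallest qualifying dimension */
   print cum k;
quit;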
1. Create a new diagram and name it SVD Dimensions. (As we have stated throughout, in virtually
every case this diagram should have been already created for you and the relevant nodes run.)
The diagram will look like this when completed:
2. Create a data source for the ASRS training data using DMTX51.ASRS_TRAINING. Use the
following metadata:
The data set contains 22 variables with the names Target01 through Target22. We use Target02
for this demonstration, so all the others should be rejected. From a table in the E.G. Allan et al.
article “Anomaly Detection Using Nonnegative Matrix Factorization” (2008, p. 215), the incident
for Target02 has to do with noncompliance with policy procedures. The variable Size is just the
length of the report in bytes and will not be used. Only the report itself (Text) is needed, but ID
can be left as an ID variable.
5. Attach a Text Parsing node to the Data Partition node. Leave all the defaults as is and run.
6. Attach a Text Filter node to the Text Parsing node. Change the default weightings to Log
and Mutual Information. (Although in this case, these are the defaults.) Leave all else in default
mode and run.
7. Attach a Text Cluster node to the Text Filter node and rename it Text Cluster – 5. Change the
Transform settings as follows:
8. This generates a 5-dimensional SVD solution. Run the node and go to Exported Data
in the properties panel. Select the TRAIN data set and then click Explore. Verify that you have
a 5-dimensional solution by looking at the number of TextCluster_SVD variables.
9. Attach a Decision Tree node to the Text Cluster–5 node. Rename it to DT 5 Clus. Change
the Assessment Measure property to Average Square Error and Leaf Size to 25. (These are
fairly routine changes that are often found to produce better results with trees. Obviously 25
is not some magic number, but the default of 5 for Leaf Size is considered by many analysts
to be too small.) Run the node, but it is not necessary to look at the results yet.
10. Attach another Text Cluster node to the Text Filter node and rename it Text Cluster 10.
Change the Transform settings as shown below in order to get a 10-dimensional SVD solution.
Run this and verify that you see a set of 10 TextCluster_SVD variables.
11. Then copy down your previous DT 5 Clus node, rename it DT 10 Clus, and connect it to Text
Cluster 10. Run this tree, but there is no need to look at the results yet.
12. Repeat the previous step with a new Text Cluster node renamed Text Cluster - Default. Run
this default Text Cluster node and determine that 34 TextCluster_SVD variables were
produced. Rename this third decision tree to DT default Clus and run it.
13. Now connect all three decision trees to the Model Comparison node. In the properties panel for
the Model Comparison node, set the Selection Statistic property to ROC and Selection Table
to Validation. As a consequence of these changes, the ROC index for the validation data will
be shown at the very beginning of the Model Comparison node Results window. Run the Model
Comparison node and view the results.
14. Open the results and look first at the ROC charts.
In color, it is clear that DT 5 Clus is the inferior decision tree model. Examining the Fit Statistics
window confirms this. The DT 5 Clus tree has a validation ROC index of just .65 compared
to .71 and .70 for the other two models.
Tree2 and Tree3 have similar validation ROC index values. Tree2 (DT 10 Clus) was generated
from a Text Cluster node specifying a 10-dimension solution, whereas Tree3 (DT default Clus)
was calculated from the default Text Cluster node that generated 34 dimensions.
Remember that the decision tree algorithm itself incorporates variable selection logic. So even
though Tree2 was working with 10 SVD variables as candidate variables generated by the Text
Cluster node, it actually found only 9 of them to be useful in the tree. Tree3 had 34 candidate
SVD variables to work with, but only kept 17 for the final tree. (This information can be found
by going to the tree results, selecting View ⇒ Model, and then viewing the Variable Importance
windows.)
The lesson here is that when you have a target variable and are using a modeling method with
variable selection logic, the decision about how many SVD variables to keep is less critical. It is
better to err on the high side and then let the variable selection algorithm from your model make
the choice. Typically, the default settings of SVD Resolution=Low and Max SVD
Dimensions=100 will work well in this situation. (You do not want to choose too few dimensions,
as we did with Tree1 (DT 5 Clus), and wind up with an inferior model.)
Practice
1. Comparing the Effect of the Number of SVD Dimensions Using Regression Models
For this exercise, repeat all the steps done in the previous demonstration, but now use three
forward selection regression models. The easiest way to do this without having to rerun any
of the previous nodes is as follows:
a. Bring down a regression model and connect it to the Text Cluster-5 cluster node. Rename
the regression model to Regr 5 Clus. Set it up to do forward selection with assessment on
the validation error. Then connect the Regr 5 Clus node to a new Model Comparison node.
(Just copy down the previous Model Comparison node.) The reason for using a second
Model Comparison node is because with more than three models, the graphs become a little
too cluttered to interpret easily.
b. Copy the Regr 5 Clus node down and connect it to the Text Cluster-10 cluster node.
Rename this regression model Regr 10 Clus and connect it to the second Model
Comparison node in order to compare it to the first regression model.
c. Copy down the first regression node again. Connect this copy to the Text Cluster-Default
cluster node. Rename the regression node to Regr Default Clus. Again, connect it to the
second Model Comparison node as with the other two regression models. The right part
of your diagram should now look like this:
d. After running everything through the second Model Comparison node, answer these
questions:
1) How many SVD variables were selected by each of the three different regression
models?
2) What was the validation ROC index for each of these three models? How do these index
values compare with values from the previous three decision tree models?
2. (Optional) Details of the SVD Calculations Performed by the Text Cluster Node
This optional exercise is for those students who are interested in the details of how the document
SVD variables are calculated. Although not all users of the SAS Text Analytics software want to
know this much, those who do can work through this exercise. In the first step, a very simple text
mining project is run to produce the SVD variables computed in the Text Cluster node. In the
second step, you use a PROC IML (Interactive Matrix Language) program
to explicitly see how the term-document frequency matrix is analyzed using the SVD algorithm
from linear algebra. In the end, you will be able to compare the SVD values from the Text Cluster
node to those computed by the PROC IML code and see that they are the same.
You are not expected to write the PROC IML program that is used. It is supplied for you.
If you are familiar with matrix algebra and can follow programming logic, you might be able to
understand the major parts of the internally documented PROC IML program.
a. Create a diagram named Optional Exercise SVD Calculations. (This diagram has already
been created for you, but you are encouraged to set it up on your own as you have been
doing in class.)
The data consist of 18 records (documents). Each record contains the names of some cat-
like (feline) animals, some dog-like (canine) animals, or both. If you look carefully, you will
see that the first 16 documents contain either pure feline or pure canine animals, but not
both. However, in documents 17 and 18, there is a mix of both types of animals.
c. Attach a Text Parsing node and change the defaults so that all the language algorithms are
turned off. The properties panel will look like this:
The purpose of turning off all the language algorithms in Text Parsing here (such as Different
Parts of Speech, Noun Groups, and Stop List) is to make this example as simple as possible
to follow.
d. Attach a Text Filter node to the Text Parsing node. Change the defaults on the properties
panel to those shown below:
Note that we are setting the Frequency Weighting and Term Weight properties to None.
Again, the reason for this is to construct a simplified example. This produces a term-
document frequency matrix with raw counts rather than weighted counts. Also, remember
to change the Minimum Number of Documents property to 2.
e. Attach a Text Cluster node and set up the properties panel in this way:
The settings SVD Resolution=High and Max SVD Dimensions=2 will make Text Cluster
produce exactly two SVD variables.
f. Run this flow from the Text Cluster node and look at the exported data. You can move the two
TextCluster_SVD columns by dragging them to the left as shown below.
Although in most realistic settings, the TextCluster_SVD variables are not very interpretable
(the document clusters are used for interpretation), in this simple example, you can interpret
the results quite easily this way:
• High TextCluster_SVD1 values are associated with documents containing the names
of canine animals.
• TextCluster_SVD2 values that are large in absolute value (here, very negative
values) are associated with documents containing the names of feline animals.
g. Now that you have the TextCluster_SVD values generated from the Text Cluster node,
you will also obtain them by doing some calculations with PROC IML on the term-document
frequency matrix for this example.
With some special SAS data set programming, the 15 (terms) x 18 (documents) term-
document frequency matrix for this example could be obtained from the Enterprise Miner
project flow. However, in the interest of simplicity, for this exercise you can easily obtain the
relevant numbers by hand and then verify the entries in the matrix below. These are the raw
term-document frequencies (unweighted because the weighting parameters in the properties
panel of the Text Filter node were turned off). This is what was called the A matrix in an
earlier slide.
h. Bring in a SAS Code node. There is no need to connect it to any of the other nodes because
you will be running only a self-contained PROC IML program. Your diagram should now look
like this:
i. Go into the Code Editor on the properties panel for the SAS Code node. In the Training
Code window, right-click and select Open. Then navigate to the directory
D:\workshop\winsas\DMTX51\sassrc and then select the program
Proc_IML_Optional_Exercise_Chapter3.sas. Here is a listing of this program, in two parts
to fit across pages:
j. Run this program in the SAS Code node and look at the results. The final normalized two
SVD variables for the 18 documents are at the bottom of the output:
These SVD values are the same as the values shown coming out of the Text Cluster node
except for an arbitrary and unimportant sign change for SVD2. That is, the SVD2 values
from the Text Cluster node are -1 times the values in this listing.
3.3 Chapter Summary
SAS Text Miner nodes provide numerous strategies for completing the above steps.
The linear algebra approach to text mining, using the singular value decomposition, creates
variables (the SVD document vectors) that are used for clustering the documents and for predictive
modeling. This approach also reduces the dimensionality of the space of documents. This same
approach (modified slightly) generates topics within documents.
References
Konchady, Manu. 2006. Text Mining Application Programming. Boston: Charles River Media.
Manning, Christopher D., and Hinrich Schütze. 2002. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: The MIT Press.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York: Cambridge University Press.
Shannon, C.E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal.
Vol. 27, pp. 379–423 and 623–656.
Strang, Gilbert. 1993. “The Fundamental Theorem of Linear Algebra.” The American Mathematical Monthly. Vol. 100, No. 9, pp. 848–855.
Thisted, Ronald A. 1988. Elements of Statistical Computing. New York: Chapman and Hall.
Wakefield, Todd. 2004. “A Perfect Storm is Brewing: Better Answers are Possible by Incorporating
Unstructured Data Analysis Techniques.” DM Direct, August 2004.
Wicklin, Rick. 2010. Statistical Programming with SAS/IML Software. Cary, NC: SAS Institute Inc.
Solutions to Practices
1. Comparing the Effect of the Number of SVD Dimensions Using Regression Models
The number of SVD variables actually used by the three regression models:
The easiest way to get these numbers is to go to the Results window for each regression
and look at the Effects Plot window. You have to remember to subtract the intercept term
to arrive at the values shown above.
Looking at the ROC charts from the Model Comparison tool used to assess the three regression
models, it is clear that the three models can easily be rank ordered by their predictive strength.
From the Fit Statistics window, the validation ROC index values are .651, .721, and .745.
This confirms that the Regr default Clus model outperformed the other two models and that
the default choice of selecting candidate SVD variables led to better results than underspecifying
the number of candidate SVD variables as either 5 or 10.
Referring to the earlier Model Comparison results for the trees and comparing them to their
equivalent regression models in terms of their validation ROC index values gives the following:
The best results are for the Regr default Clus model, and this holds up when other fit statistics are examined as well.
a. IDF
b. mutual information
c. entropy
d. None (G=1)
Lesson 4 Additional Ideas and Nodes
4.1 Some Predictive Modeling Details
Objectives
• Describe predictive modeling data sets.
• Explain predictive modeling projects and features of SAS Enterprise Miner
related to predictive modeling.
• Explain the trade-off between predictive power and interpretability.
• Discuss how the Text Cluster and Text Topic nodes can be set up to affect
this trade-off.
• Emphasize the need for experimenting with different predictive modeling
and text miner settings—that is, the “workbench” idea for Enterprise
Miner.
SAS Enterprise Miner has many predictive model nodes. Some nodes are general purpose, such
as the Decision Tree, Neural Network, and Regression nodes. Some nodes are specialized, such
as the MBR, Rule Induction, and Partial Least Squares nodes. This course has illustrated the use
of the Decision Tree and Regression nodes. The availability of many different modeling techniques
makes it easy to try out different approaches to find the best results for your data. Think of the
Enterprise Miner and its many different nodes as a workbench for analytic experimentation.
Note: On the High-Performance Data Mining (HPDM) tab, you also have access to random forests
and support vector machines. For specialized predictive models, the Applications tab
supports the Survival Data Mining node and the Incremental Response node.
The minimum requirement for data mining predictive modeling is at least one target variable and
at least one input variable. A predictive model is constructed using a training data set. The model
attempts to predict the value of the target variable using only the values of a set of input variables.
For example, input variables can measure customer attributes such as gender, age, income, location
of primary residence, and average purchases to try to estimate the probability that a customer will
respond to a particular promotion, such as a 20% off discount on purchases of $100 or more.
Predictive Model
A predictive model is a concise representation of the association between the inputs and the target in the training data.
After a model is constructed using training data, the performance of the model can be assessed
using a holdout data set. When a final model is selected, it can be used to score new data to
determine, for example, which customers should be selected for a promotional offer. The term score
is synonymous with predict.
Predictive Model
Predictions are the output of the predictive model given a set of input measurements. The model is built on training data and can then be applied to validation, test, and score data.
To choose from a variety of models, a holdout data set called a validation data set is used
to determine how well models will extrapolate to new data. This helps overcome the problem
of overfitting, which occurs when a model is constructed to fit the training data set so well that
it does not fit any other data well. For a model that has been selected for deployment, a second
holdout data set, called a test data set, is used to get an unbiased estimate of the accuracy of the
model in the live environment. A predictive model can score any data set that has the inputs used
by the model. It is important to ensure that models are applied to data commensurate with how
a model was constructed. For example, a model constructed using only customers who reside
in California might not be appropriate for scoring customers in Florida.
On the slide above, the table has nine input variables that will be used as candidate predictor
variables, and two target variables that can be used to derive predictive models. The two target
variables were automatically recognized because they contain the prefix Target. Some predictive
modeling nodes can accommodate only one target variable, so you must reject a target variable
when using one of those nodes.
When you create a data source in SAS Enterprise Miner, the Data Source Wizard displays all the variable roles so that you can check that the target and input variables are specified the way you need them to build a predictive model.
Decision Tree
A predictive model will provide a predicted value. In the case of a binary target variable, the
predicted value is the posterior probability of the primary event given the inputs. You can use this
probability to derive a decision rule. For example, if the probability exceeds a (selected or derived)
cutoff value of 0.37, send the promotion to the customer. Otherwise, do nothing. SAS Enterprise
Miner model nodes will add a variety of columns to the imported data when creating the exported
data. The nature of the added columns depends on the model used. The above table is displayed
by selecting the Variables property in a SAS Code node attached to a Decision Tree node. The
predicted value in this case is called P_SubroFlag1. This is the probability that the target variable
(SubroFlag) has the value 1. There is also a variable named P_SubroFlag0, which is the
complementary value, 1-P_SubroFlag1.
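As a small illustration, such a decision rule could be applied to the node's exported data with a short DATA step. The data set names here are hypothetical:

/* Apply a 0.37 probability cutoff to the tree's exported data */
data work.promo_decisions;
   set work.tree_exported;          /* hypothetical exported data set */
   length decision $10;
   if P_SubroFlag1 > 0.37 then decision = 'SEND';
   else decision = 'DO NOTHING';
run;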
Regression
The above slide summarizes the prediction variables that are usually of interest. A Decision Tree
node would produce, for example, P_Target1, which would be the posterior probability derived from
the tree after correcting for oversampling. Note that if the binary target variable has the name
WIDGET, then the posterior probability that WIDGET=1 is given by the variable P_WIDGET1.
Although a lack of input variables can be a problem, more often the problem is that you have a very large number of candidate input variables to select from. Input selection, or variable selection,
is an important topic in predictive modeling. SAS Enterprise Miner facilitates input selection
in a number of different ways.
A Decision Tree node performs input selection by deciding which subset of input variables is used to partition the data into separate leaves. If a variable is not useful in separating the data into more
pure leaf nodes, then the variable is discarded. In the above plot, a tree with 40 leaves is derived.
A 59-leaf tree was pruned to remove leaves that did not improve overall model accuracy. The pruned
subtree is used to choose the input variables that are useful for prediction. These variables will
be passed to successor nodes, whereas the other input variables will have their roles changed
to Rejected.
The Regression node also has options for performing variable selection. The above iteration plot
reveals that the Regression node tried a series of models, culminating in an eight-variable model.
But for a final model, it chose one that has only three variables.
Model Assessment
• Fit Statistics
• Average squared error
• Misclassification rate
• Information criteria
• Others
• Charts and Plots
• ROC chart
• Gains chart
• Lift chart
• Others
Model assessment is performed using results from the model nodes and the Model Comparison
node.
Fit statistics, such as average squared error (ASE), are defined in the SAS Enterprise Miner documentation, which you can also access by selecting Help → Contents.
The Model Comparison node produces additional fit statistics and charts and plots that allow the
direct comparison of the performance of different models. Some plots are not available directly
through the GUI, but SAS Enterprise Miner provides extensive functionality through the SAS Code
node.
In this class, we have focused on using text mining tools to derive new input variables from free-form text data. Therefore, it is valuable to make users of these tools aware of some of the strategies and “tricks” that can enhance your predictive models.
An important point to understand is that there is usually a trade-off between predictive power and the
interpretability of a model. That is, if you try to obtain the most powerful predictive model that you
can, you often end up with a more complicated and less interpretable model. On the other hand, for
many purposes, it is often important to obtain models that can be explained and understood by
others, including senior management and clients who will be using the model.
A practical strategy for handling this trade-off is to create both types of models: one that has the
strongest predictive power that you can obtain and another one that is more explainable. Then you
can present both of these models to your audience, and they can participate in the decision of which
one to use. This requires a fair amount of experimenting with settings and also understanding what
choices you have when you use the Text Cluster and Text Topics nodes. These choices can directly
make your model more likely to have greater predictive power or more easily interpretable.
As we saw earlier in Chapter 1, there are several variables that are produced by the Text Cluster
and Text Topic nodes. Text Cluster generates the document SVD vectors, which are then used to
cluster the documents. The SVD variables are automatically set to the role of Input, so they are
ready to be used for a prediction model. However, the actual clusters that are produced by these
SVD variables are set to the role of Segment and therefore are not immediately available for
prediction purposes. Nevertheless, all that is required to make this change is to convert the
TextCluster_cluster_ variable from the role of Segment to the role of Input. Because the
TextCluster_cluster_ variable is intended to help the analyst in interpreting results, many analysts
will consider using the TextCluster_cluster_ variable for modeling purposes instead of the SVD
variables. This is a deliberate trade-off that might mean sacrificing some predictive power
for greater interpretability.
Another “trick” that analysts sometimes use in prediction modeling is to use the Text Cluster node
segment probabilities as input variables. These are the TextCluster_probN variables that are
created when the Expectation-Maximization clustering algorithm is used. Using this algorithm,
if there are three clusters created (so TextCluster_cluster=1, 2, or 3), then a document will have
a probability TextCluster_prob1, TextCluster_prob2, and TextCluster_prob3 associated with
each of the clusters. The document will be assigned to the cluster for which it has the highest
probability, but these probabilities can be directly used as input variables if you want.
In either case, if you want to swap in the TextCluster_cluster_ or TextCluster_prob variables,
you have to use a Metadata node to do this. You would then reject the TextCluster_SVD variables
so that they are not used in the analysis.
An analyst can also manage the Text Topic node in a way that trades off between better predictive power and clearer interpretability.
Experiment!
• Regardless of whether you are aiming for better predictive power
or greater interpretability, you should adopt an experimental approach
to modeling.
• Enterprise Miner is like a workbench where you can easily try out different
approaches and compare their results.
• The default settings are meant to work well across a wide variety
of situations, but your particular analytic problem can very often
be improved by testing out different parameter settings.
• Do not take the defaults for granted as always producing the best results.
This demonstration further shows how to approach text mining analytics and predictive modeling
experimentally by trying out different parameter settings or different modeling techniques. In this
case, we experiment with different term weight (global weight) settings and use the Model
Comparison node to compare the models. The comparison produces somewhat surprising results regarding the weights and provides motivation for trying out different approaches.
We will again use the ASRS data set. This time, though, change the target from Target02 (which has
to do with noncompliance events and was used in the earlier demonstration of Chapter 3)
to Target05 (which has to do with the occurrence of a collision hazard event).
The 22 target events vary considerably with respect to the difficulty of modeling them. Descriptions
of several of these target events (as given in E.G. Allan et al. 2008), along with published ROC index
values for models that Allan et al. obtained using an analytic method known as Nonnegative Matrix
Factorization, are shown in the table below. Note that we were able to improve on their ROC index
for Target02 in the last chapter, where we obtained .711 with a decision tree using the default SVD
variables generated from the Text Cluster node. In this demonstration, using Target05, we are also
able to improve on the reported Allan et al. results.
Some of the 22 ASRS target events with their ROC index values from published model results:
We will create a diagram flow that tests out the effect of trying the four different global weight (term
weight) options available in the Text Filter node. The flow will look like this when finished:
1. Create a diagram named Global Weight Experiment. Drag in the ASRS training data. This time
make Target05 the target variable, with all the other targets rejected.
2. Bring in the Data Partition node and leave it at the default settings
of Training/Validation/Test=40/30/30.
Remember that Mutual Information would be the default here because a target variable
(Target05) has been defined.
Rename each of the four Text Filter nodes to identify which global weight has been used,
such as Text Filter - Mutual Info and Text Filter - IDF.
5. Connect each Text Filter node in the series to its own default Text Cluster node and then
its own Text Topic node.
6. The output of each Text Topic node is then connected to a Regression node and the following
settings are selected:
7. Rename each of the Regression nodes to indicate the term weight that was used earlier
by its Text Filter node—that is, Regression-Mutual Info, Regression-Entropy, and so on.
(The reason for renaming each of these Regression nodes is that it will make comparing results
easier when we look at the Model Comparison node.)
8. Connect each individual Regression node to a single Model Comparison node. Change
the Model Comparison node to make the ROC index on the test data the selection statistic:
9. Check to see that you have set things up as shown in the display capture at the beginning
of this demonstration and then run the entire flow from the Model Comparison node.
Note: If you are running this flow from scratch, it will take about 12 or 13 minutes using
the Virtual Machine image provided for this class.
10. Open the Model Comparison results and look at the Fit Statistics window to compare the
performance of the four models using different term weights:
First, note that all the models are producing very high ROC index values on the test data, as well
as on the training and validation data sets. This is also obvious from the dramatically high curves
in the ROC chart for Target05. Other statistics tell the same story. So in general, we have been
quite successful at classifying ASRS reports as either indicating a collision hazard event or not.
However, what is very surprising is that the Mutual Information term weight in this case
produced the worst results (Index=.929), even worse than using no weight (Index=.954).
The lesson here is to be experimental and try out different parameter settings! The default settings
generally work well, but the analysis of each data set should be explored in a number of ways for
best results.
a. None
b. Entropy
c. Inverse Document Frequency
d. Mutual Information
This demonstration illustrates how to use the Metadata node to change the roles of variables
produced by the Text Cluster and Text Topic nodes to create interpretable inputs for predictive
modeling.
All of the models derived in the previous demonstration are of high quality, but all suffer from a lack
of interpretability. For example, for the entropy model, what does the fact that TextCluster2_svd1
has a coefficient of 2.6003 tell you about the predictions? (Some analysts would be concerned that
the standard errors for four inputs are larger than 2, but that is beyond the scope of this course.)
Using the insight provided earlier about interpretable text mining inputs, you can craft a predictive
model that uses only inputs that have a relatively clear interpretation.
1. Attach a Text Cluster node to the Text Filter – Entropy node.
10. Set the new role for all of the TextCluster5_prob variables to Input.
11. Set the new role for all of the TextTopic5_n variables to Input.
12. Set the new role for all of the TextTopic5_raw variables to Rejected.
16. Connect the Regression node to the Model Comparison node and run the Model Comparison
node. The following table is produced:
The interpretable model is competitive with the other models but, as expected, it is not as good as the best model. This reflects the trade-off that will usually be required in going from a “black box” model to an interpretable model.
17. Is the model interpretable? Open the Results window for the Regression – Interpretable node.
In the Output window, scroll down to the Analysis of Maximum Likelihood Estimates table.
The results from the Text Cluster node show the descriptive terms for each cluster.
Cluster 7 uses terms that describe activity on the ground (runway, taxiway, ground), which
is where incursions often occur. Thus, when a report falls into cluster number 7 with a high
unambiguous probability, it has a higher probability of having a Target05=1 label. In fact,
TextCluster5_prob7 has the largest regression coefficient. Not all clusters and topics have
such obvious interpretations, but overall, the model is clearly easier to understand.
Because this demonstration is optional, you do not need to use the derived model when
completing the exercises related to this diagram.
Practice
1. Adding a Decision Tree Node to the Flow to Compare to the Previous Four Regression
Models
a. Add a Decision Tree node to the flow used in the demonstration. Connect this to the
Text Topic node that is part of the flow using the Entropy term weight (because we
previously found this term weight to produce good results).
2. Adding a Text Rule Builder Node to the Flow to Compare to the Previous Five Models
a. Add a Text Rule Builder node to the flow used in the demonstration. Connect this to the
Text Filter node that is part of the flow using the Entropy term weight (because we
previously found this term weight to produce good results).
c. Use the default setting for the Text Rule Builder node.
d. Connect the Text Rule Builder node to the Model Comparison node and run it from there.
e. How does the Text Rule Builder compare to the other five models in terms of the ROC index
on the test data?
4.2 Text Profile Node
Objectives
• Provide introductory details about the Text Profile node.
• Identify the property settings of the Text Profile node.
• Illustrate how to use the Text Profile node with a news data set.
The documentation for the Text Profile node describes the methodology as follows:
“The Text Profile node enables you to profile a target variable using terms found in the documents.
For each level of a target variable, the node outputs a list of terms from the collection that
characterize or describe that level.
“The approach uses a hierarchical Bayesian model using PROC TMBELIEF to predict which terms
are the most likely to describe the level. In order to avoid merely selecting the most common terms,
prior probabilities are used to down-weight terms that are common in more than one level of the
target variable. For binary target variables, a two-way comparison is used to enhance the selection
of terms. For nominal variables, an n-way comparison is used. Finally, for ordinal and time variables
(which are converted to ordinal internally), a sequential, two-way comparison is done. This means
that the reported terms for level n are compared to those at level n-1. The exception for this is the
first level, which is compared to level 2 since there is no preceding level to compare it to.”
There are only two properties that you specify, and if you do not have a numeric variable with the
role Time ID, then the Date Binning Interval property is irrelevant. The SAS Enterprise Miner 15.1
Reference Help provides an example of profiling with a Time ID variable.
The Maximum Number of Terms property defines how many terms you want to use to profile each
category.
For nominal target variables with more than two categories, a Beliefs by Value plot is produced.
If you request M terms per category, and if there are K categories, K>2, you will get an (M*K) by K
crosstabulated color map chart. If M or K (or both) is too large, SAS Enterprise Miner graphing
algorithms will reduce the dimensions to produce a viewable plot. If categories share high-belief
terms, then you will get fewer than M*K vertical axis cells.
The eight (default) terms along with the term role are
given in the Profiled Variables table.
The terms with the highest belief scores are given in the Profiled Variables table.
Each term has a belief score for each category. The top eight (default) scoring terms for each category are kept. A crosstabulation table for terms by category is color coded based on belief score to show separation between categories.
The above three categories are well separated, as evidenced by the red diagonal elements
corresponding to the eight terms for the category, and the blue off-diagonal elements, corresponding
to the terms for the other categories. The above 24 by 3 table would have fewer than 24 vertical cells
if categories shared high-belief terms. This is unlikely because terms are down-weighted if they are
common to more than one category. However, if categories are independent of terms, then the color
map appears as a somewhat random mixing of various shades of red and blue. The diagonal will
always be a darker red because diagonals correspond to the highest belief terms, even if the
difference in beliefs is small.
For a target variable that has categories assigned randomly, independent of terms, off-diagonal
elements are more washed out, with only a few cells having a distinct blue color.
A projection of the eight (default) term beliefs into a two-dimensional space provides an idea of how
well the target values are separated. The terms in the document collection allow for better separation
between medical news stories and hockey news stories than between medical news stories and
graphics news stories. Furthermore, graphics news stories and hockey news stories exhibit the best
separation based on the terms in the collection.
This demonstration illustrates the capabilities and applications of the Text Profile node applied
to a document collection of news articles.
Note: This demonstration is based on a demonstration given in the SAS Text Miner 15.1 Reference
Help.
The SAMPSIO.NEWS data set contains 600 brief news articles. For convenience, and to avoid
problems with user access or data modifications to the SAMPSIO library, the data has been copied
to the course data folder. It can be accessed using the name DMTX51.News. The data set contains the following variables:
TEXT        a nominal variable that contains the text of the news article
graphics    a binary variable that indicates whether the document belongs to the computer graphics category (1=yes, 0=no)
hockey      a binary variable that indicates whether the document belongs to the hockey category (1=yes, 0=no)
medical     a binary variable that indicates whether the document belongs to the medical category (1=yes, 0=no)
newsgroup   a nominal variable that contains the group that a news article fits into
Follow these steps to create a diagram and process flow for profiling a document collection.
1. Create a diagram named News Profiling.
2. Create a data source for DMTX51.News. Set the measurement level for graphics, hockey,
and medical to Binary. Set the role of TEXT to text, and set the role of hockey to Target.
3. Drag the News data source into the News Profiling diagram. Edit the metadata to correspond
to the following table:
4. Attach the following nodes, in order, to the Input Data Source node: Text Parsing, Text Filter,
and Text Profile. Here is what the process flow should look like.
The pie chart reveals that there are 200 documents classified as hockey news articles.
The number of documents for each category is also given in the Profiled Variables table.
The Profiled Variables table reveals eight key terms that help identify whether an article
is a hockey article.
An appealing feature of the Text Profile node is that it captures part-of-speech/entity categories.
These are presented in the term table in the Text Filter Results window as a role characteristic,
but in clustering, cluster terms do not provide a role, and in the Filter Viewer, you cannot query
by role.
The default setting for the Text Profile node is Maximum Number of Terms=8. You can use
the eight derived terms to create a custom Hockey topic.
7. Attach a Text Topic node to the Text Filter node. Create a custom Hockey topic. Use the
following table:
The custom table specifies the eight terms derived by the Text Profile node.
Note: You can use the belief score derived by the Text Profile node for each term, but if
separation is good, belief scores will be near 1, which is the weight used for all terms
in the above table.
8. Run the Text Topic node.
The Hockey topic appears first because it is the only custom topic. Examine the Documents
window. The custom topic seems to do a good job of rank ordering documents with respect
to the hockey target variable. To verify this, a SAS Code node is attached to crosstabulate
hockey by TextTopic_1, where TextTopic_1 is the binary variable for the Hockey custom topic.
The program is called PROC_FREQ_News_Profiling_1.sas and is in
D:\Workshop\Winsas\DMTX51\sassrc. The results follow.
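The program itself is short. A minimal sketch of what it might contain follows (&EM_IMPORT_DATA is the standard macro variable that resolves to the data set imported by a SAS Code node; the actual course program may differ):

/* Crosstabulate the hockey target against the custom Hockey topic flag */
proc freq data=&em_import_data;
   tables hockey*TextTopic_1 / nopercent norow nocol;
run;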
Precision is given by
Precision=157/(157+36)=81.3%
Recall is given by
Recall=157/(157+43)=78.5%
The misclassification rate is
Misc=(43+36)/600=13.2%
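In general, with TP true positives, FP false positives, FN false negatives, and N total documents, precision = TP/(TP+FP), recall = TP/(TP+FN), and the misclassification rate = (FP+FN)/N. Here TP=157, FP=36, FN=43, and N=600.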
The custom topic derived using the Text Profile results does a pretty good job of classifying news
articles with respect to hockey.
10. The Text Profile node contains additional diagnostic results when a nominal target variable has
more than two categories. In the same diagram, drag a News data source into the diagram.
Reject all target variables except newsgroup.
11. Add Text Parsing, Text Filter, and Text Profile nodes to the data source having target variable
newsgroup. Run the Text Profile node.
12. Open the results for the Text Profile node.
With more than two categories, the results contain a Beliefs by Value matrix. If the categories are
well separated, you will see a red diagonal with blue off-diagonal cells. If the categories are not
well separated, there will be a less dramatic color change for off-diagonal elements, resulting
in more neutral colored cells, and few cells having a blue color.
The plot shows that the eight terms associated with each category have a high belief score,
whereas terms identified for other categories have a low belief score for the given category.
If you see belief scores near zero, indicated by a darker blue color, in the off-diagonal cells, then
the categories are well separated based on terms used in the documents. If cell belief values
in the off-diagonal are closer to 0.5, then a more washed-out or neutral color will indicate a lack
of separation.
Here is a Beliefs by Value table derived using a target value that is independent of document
contents.
Note that the eight high-belief terms for the newsgroup=hockey category in the previous plot are the same terms found for the hockey=1 category in the first process flow.
Practice
4.3 High-Performance (HP) Text Miner Node (Optional)
Objectives
• Identify the benefits of the HP Text Miner node.
• Run a process flow using HP components.
• State the capabilities within the HPTMINE procedure.
You must use a high-performance configuration to effectively take advantage of this capability.
Single-Machine Mode
• With local data
• Does not use the MPP (massively parallel processing) environment
The book: Base SAS® 9.4 Procedures Guide: High-Performance Procedures, Second Edition
The course: SAS® Enterprise Miner™ High-Performance Data Mining Nodes
Performance Results: SAS Global Forum Paper 400-2013
(A bar chart compares elapsed time in seconds for traditional versus HP text mining, broken out by total time, text mining, and predictive modeling.)
The results above were presented in the SAS Global Forum 2013 paper 400-2013.
http://support.sas.com/resources/papers/proceedings13/400-2013.pdf
The data consisted of more than 680,000 paragraphs from the Consumer Complaints data set.
The traditional elapsed times were run on a Windows server with two CPUs and 128 GB of memory.
The HP nodes were run on a cluster containing 16 computing nodes, each with two CPUs and 64
GB of memory. The high-performance text mining procedures can reduce a 30-minute task to less
than a minute in a grid computing environment according to findings presented in this paper.
High-performance results can vary and depend on the configuration of your SAS environment
and your specific parallel processing machine configuration.
High-Performance Nodes
The HP Text Miner node (HPTM) is one of many tools available on the HPDM tab of the SEMMA palette shown here. HP Text Miner executes two phases:
• text parsing
• transformation
HPTM Properties
The HP Text Miner node property selections include Detect, Filter, and Transform. Text parsing,
natural language processing, and entity detection are all supported.
Your own pre-existing customized tables can be specified for multi-word terms, synonyms, and stop
lists just as in the regular Text Mining nodes. The SVD (Singular Value Decomposition) resolution
can be any number from 2 to 500. A higher number generates a better data summary but takes more
computing power to finish.
The Max SVD Dimensions property (maxdim) accounts for p% of the total variance. High resolution
always generates the maximum number of SVD dimensions (maxdim). For medium resolution, the
recommended number of SVD dimensions accounts for 5/6*(p% of the total variance). For low
resolution, the recommended number of SVD dimensions accounts for 2/3*(p% of the total variance).
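For example, if p% corresponds to capturing 90% of the total variance, medium resolution recommends enough dimensions to account for 75% (5/6 × 90%) and low resolution enough to account for 60% (2/3 × 90%).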
The property shown here is from the regular Text Filter node. This option was not made available
in the High-Performance Text Miner node.
Note: The role Key must be set when the data source is defined to Enterprise Miner. It is not
available in the Metadata node. There are seven variables in this data set, including four
inputs, a text, a key, and a target. A target variable is not necessary for unsupervised text
mining.
Process Flow
A data source is connected to the HP Text Miner node in this diagram.
It has been run, and the results are on the next slide.
All properties in this example were left as their defaults.
The Topics and Terms table results are contained in the node results. The maximum number
of terms shown in the table is 20,000 by default. The other plots in this window provide the same
insight into the document collection as described in previous chapters.
Output Window
The Output window lists the tasks that were run. The HPTMINE procedure consolidates the actions that would be performed by several non-high-performance nodes.
Topics
The SVD process was used to derive the list of topics from the document
collection. This list used the default low-resolution setting.
This demonstration illustrates the capabilities and results of the High-Performance Text Miner node
in a single-machine environment.
3. From the HPDM SEMMA tab, pull an HP Text Miner node into the diagram and connect the data
source to it.
4. Run the node with all the default settings, and open the results.
5. Maximize the Output window. The results indicate that the HPTMINE procedure ran in single-
machine mode. The mode depends on the type of Enterprise Miner implementation. The Output
window shows that the document collection was parsed, terms were analyzed and filtered,
and singular value decomposition was done.
6. Examine the fields of the Terms and Topics tables that were created. Note how similar
(and familiar) they are compared to the Text Mining nodes that we ran in previous chapters.
Declare the data set, ID (or key) variable, and the name of the text variable. The data set that you
use in batch mode will likely not have the Data Mining metadata attributes, so you have to identify
these variables in the procedure as specified.
These lines state that we want to parse the document collection. Stemming, tagging, and noun grouping operations can be performed or suppressed by including or excluding the corresponding options.
Term and cell weights are applied to the compressed term-by-document matrix. Term weight can
be Entropy, MI, or None. Cell weight can be Log or None. Terms appearing in fewer than the
reduced number of documents will be excluded from analysis. Entities can either be identified
or ignored.
Term number, document, and count variables will be produced in these output data sets.
The OUTCONFIG= data set is used if a subsequent HPTMSCORE procedure will be run against
the document collection results.
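Putting the pieces together, a minimal sketch of such a PROC HPTMINE call might look like the following. The data set and variable names are hypothetical, and the option values are illustrative:

proc hptmine data=mylib.docs;         /* document collection */
   doc_id key;                        /* ID (key) variable */
   var text;                          /* text variable to parse */
   parse notagging nonoungroups      /* suppress tagging and noun grouping */
         termwgt=entropy              /* term weight: ENTROPY, MI, or NONE */
         cellwgt=log                  /* cell weight: LOG or NONE */
         reducef=4                    /* drop terms in fewer than 4 documents */
         entities=std                 /* identify standard entities */
         stop=sashelp.engstop         /* a customized stop list could go here */
         outterms=work.terms          /* term data set */
         outparent=work.parent        /* compressed term-by-document matrix */
         outconfig=work.config;       /* settings reused by PROC HPTMSCORE */
   svd max_k=50 res=low
       outdocpro=work.docpro;         /* SVD document projections */
   performance details;               /* report task timing in the output */
run;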
3. Submit the code by pressing F3 or clicking the Run icon. Check the log.
5. Observe the contents of the Output window to see the task timing and the OUTTERMS= data set
results.
6. (Optional) Open SAS Explorer, select the Show Project Data box, and select the Work library.
You will be able to see the output data sets created by the HPTMINE procedure. The
OUTPARENT= data set is shown below.
Note: If you are running in single-machine mode as in the classroom environment, the original Data
Partition node could be used and will not generate any errors. You could use either Data
Partition node in this specific case.
However, it is a best practice to use the HP Data Partition node with the HP nodes because
this is necessary in distributed mode.
HP Partition Properties
These are the selections available in the HP partition node:
The HP Data Partition node supports two types of partitioning: simple random and stratified. With
a class target variable, the default method is stratified partitioning. Otherwise, simple random
partitioning is used. The node supports up to two stratification variables.
HP Tree
There is a high-performance decision tree node that runs in a high-
performance environment. Find it on the HPDM tab of the SEMMA palette.
Although additional high-performance modeling nodes are available, we will look at only the HP Tree
node. Explore the other modeling nodes and compare the results to select your champion model.
HP Tree Properties
A few new properties are in the HP Tree node:
• Nominal Target Criterion: Fast CHAID
• Interval Bins: for interval variables
• Minimum KS distance: for Fast CHAID trees
There is also a property to create a validation sample from the training data set. If you have
a validation data set available, it could be used in lieu of a sample.
Notable properties that are unavailable with this node include the Interactive Tree Viewer, Decisions
and Priors, and Cross Validation.
Process Flow
In this demonstration, copy the previous process flow and add the HP
Partition and HP Tree nodes with the default properties. The copied and
modified flow is shown below. The data source has a binary target variable:
SubroFlag.
Results
The fit statistics from the results of running the HP Tree node with default
settings are below. This run of the node used the Validation partition from
the HP Part node because we did not request that it create its own
validation set.
This demonstration illustrates how to perform predictive modeling using high-performance nodes.
1. Open the High-Performance Text Mining diagram from the previous demonstration.
2. Copy the two nodes in the diagram (HPDMINE and HP Text Miner), and paste them in the
diagram below the originals.
3. Drag an HP Data Partition node and an HP Tree node into the diagram from the HPDM tab.
4. Connect the nodes in the order shown below. Run the process flow from the end with default
settings.
6. Examine the properties and the results of the HPPart node and verify that 70% of the data was allocated to the Train role. (The remaining 341 of the 1135 observations were allocated to Validation: 341 / 1135 = .30.)
7. Look at the results of the HP Text Miner node. Notice that more topics were derived compared
to the previous run.
8. Open the results of the HP Tree node. Are there any property settings that you would consider
testing to possibly create an even better predictive model? Look at the Leaf Statistics plot for a hint.
4.4 Chapter Summary
The Text Rule Builder node provides a stand-alone predictive modeling solution for data having
a text variable and a categorical target variable. This node creates Boolean rules from small subsets
of terms to predict a categorical target variable. The node must be preceded by Text Parsing and
Text Filter nodes.
The Text Profile node derives belief scores based on word association with the levels of a categorical
target variable. The terms with the top belief scores help profile the levels of a categorical variable.
The derived terms can subsequently be used to enhance querying or to facilitate custom topic
identification.
High-performance text mining and predictive modeling procedures are designed to take advantage
of specially configured computing environments and technology. Distributed configurations can
enhance the speed of analysis by running multi-threaded parallel processing and I/O operations.
The analysis path is shortened even further when analytical processes run either in the database or alongside the database. Reducing the number of passes through the data results in less total read time, and keeping data in memory provides fast analysis when it is needed.
SAS Enterprise Miner includes high-performance nodes on the HPDM tab. The High-Performance
Text Miner node offers simplified selections and combines tasks performed by several individual
Text Mining nodes. Text parsing, filtering, and topic creation all are accomplished in the one node.
The results can be combined with additional data for supervised predictive modeling applications.
References
Allan, E., M. Horvath, C. Kopek, B. Lamb, T. Whaples, and M. Berry. 2008. “Anomaly Detection Using Nonnegative Matrix Factorization.” In Survey of Text Mining II: Clustering, Classification, and Retrieval, ed. M. Berry and M. Castellanos, 203–218. London: Springer-Verlag.
4.5 Solutions
Solutions to Practices
1. Adding a Decision Tree Node to the Flow to Compare to the Previous Four Regression
Models
The tree should be added to the previous flow by attaching the Decision Tree node to the
Text Topic node corresponding to entropy.
The Model Comparison node shows the following results for the Decision Tree node.
2. Adding a Text Rule Builder Node to the Flow to Compare to the Previous Five Models
The results highlighting the Text Rule Builder ROC index are given below.
Recall that the Text Rule Builder cannot use any non-text inputs. This limitation did not affect
the ASRS project because no other inputs were available. The above results for Target05 are
consistent with the results that you obtained earlier for Target02 in that the Text Rule Builder
is not superior to conventional predictive modeling techniques, but it is competitive with such
techniques while adding an element of interpretability.
3. Creating Custom Topics Using Text Profiles
a. Using the diagram from the previous demonstration, attach a Text Profile node to the Hockey process flow for the target variable hockey. Set Maximum Number of Terms to 20.
b. Attach a Text Topic node to the Text Filter node for the Hockey process flow. Create a
custom topic table using the terms identified by the Text Profile node using a maximum
of 20 terms.
If you copy and paste the original Text Topic node and SAS Code node and attach the copied nodes to the Text Filter node, you can append to the original custom topic table. The final table follows.
c. How does your custom Hockey topic compare to the one derived in the demonstration?
In the SAS Code node, change the variable TextTopic_1 to TextTopic2_1. The results are
given below.