LWDMTX5 001
Text Miner
Course Notes
Text Analytics Using SAS® Text Miner Course Notes was developed by Terry Woodfield, with a
significant contribution from Rich Perline, based on a previous edition of the course.
Additional contributions were made by Peter Christie and George Fernandez. Instructional design,
editing, and production support was provided by the Learning Design and Development team.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Text Analytics Using SAS® Text Miner Course Notes
Copyright © 2019 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States
of America. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise,
without the prior written permission of the publisher, SAS Institute Inc.
Book code E71427, course code LWDMTX5/DMTX5, prepared date 24May2019. LWDMTX5_001
ISBN 978-1-64295-272-8
Table of Contents
Lesson 1 Introduction to SAS® Enterprise Miner™ and SAS® Text Miner.............. 1-1
Demonstration: Text Analytics Illustrated with a Simple Data Set ....................... 1-8
Practice............................................................................................................ 1-120
Practice.............................................................................................................. 2-46
Practice.............................................................................................................. 2-71
Practice.............................................................................................................. 3-31
Practice.............................................................................................................. 4-28
Practice.............................................................................................................. 4-42
To learn more…
For information about other courses in the curriculum, contact the
SAS Education Division at 1-800-333-7660, or send e-mail to
[email protected]. You can also find this information on the web at
http://support.sas.com/training/ as well as in the Training Course
Catalog.
For a list of SAS books (including e-books) that relate to the topics
covered in these course notes, visit https://www.sas.com/sas/books.html or
call 1-800-727-0025. US customers receive free shipping to US
addresses.
Lesson 1 Introduction to SAS®
Enterprise Miner™ and SAS® Text
Miner
1.1 Data Mining and Text Mining
Course Data
Text analytics is a vigorous field of research with many applications. The purpose of this course is
to teach you how to solve analytic problems that include relevant textual data. This is done using
SAS Enterprise Miner (EM), a very general data mining product that incorporates text analytic tools
among many other statistical and machine-learning tools.
Access to real business data is always problematic. Because text fields often contain confidential
information, access to business data that include text is even more difficult. Most data sets used in
this course are publicly available. All data used in this course are either artificially created or
modified in some way. Modifications include the following:
• deletion of sensitive entries
• deletion of potentially embarrassing or misleading entries
• editing or deletion of entries with named individuals or business organizations
• editing of text fields having obscure or confusing references
• resolution of ambiguities that might be incorrect
• modification or deletion of entries to promote educational goals
Because of these modifications, the data should not be used for any purpose other than education.
All publicly available data sets are introduced with a reference to the source of the actual data. You
should acquire data directly from the source if you want to use the data for business or scientific
purposes.
SAS Enterprise Miner works with the hierarchy Project ⇒ Diagram ⇒ Process Flow ⇒ Node. A typical
organization will have multiple projects, even if the same data are used. For example, in an
insurance company, a medical cost containment project and a fraud detection project might share
data that contain features of an insurance accident claim. Many demonstrations in this course would
be classified as projects, yet there is only one Enterprise Miner project for the entire course. The
project has many diagrams. Think of each diagram as a separate project. This “best practice” for
teaching economizes class time by eliminating transition periods between opening and closing
projects. Whereas it is an education best practice to use a single Enterprise Miner project to
accommodate many actual projects, a better practice for a typical business setting would be to use
a new Enterprise Miner project for each new business project.
Objectives
• Describe what text analytics is.
• Describe how SAS Text Miner is used with SAS Enterprise Miner.
• Briefly describe concepts related to document analysis.
• Identify some of the main items in SAS Enterprise Miner, including
the SEMMA tools palette.
• Illustrate text analytics with a simple document collection.
Text Analytics
• The terms text analytics, text data mining, and text mining are used
synonymously in this course.
• Text analytics uses natural language processing techniques and other
algorithms for turning free-form text into data that can then be analyzed
by applying statistical and machine learning methods.
• Text analytics encompasses many subareas, including stylometry, entity
extraction, sentiment analysis, content filtering, and content
categorization.
• Text analytics spans many fields of study, including information retrieval,
web mining, search engine technology, and document classification.
This course focuses on the use of SAS Text Miner, a fully integrated component of SAS Enterprise
Miner. SAS has a rich set of other text analytic products. Among these, SAS Text Miner is the
most focused on discovery and prediction.
Visit http://support.sas.com for information about the latest text analytic offerings. Other courses
present topics related to products such as SAS Enterprise Content Categorization and SAS
Contextual Analysis.
The two major components of data mining are pattern discovery (or exploratory analysis) and
predictive modeling. Text analytics generally covers the two broad areas of information retrieval
and text categorization, and these two areas often map into one of the two data mining
components. Because the abundance of text mining application areas can be overwhelming, some
clarification can be achieved by looking at what all text mining projects have in common. Reference
material on text analytics often includes different sub-categories of text mining. For example, Miner
et al. (2012) list 20 specialty areas ranging from document matching to web content mining. You can
also think about the two process-based categories: (1) projects that are almost exclusively text
analytics projects, such as automatic classification of tech support calls, and (2) projects with a more
general purpose, such as predicting insurance fraud, where text analytics supplies some of the
component parts of the solution.
Text Mining
Text mining as presented here has the following characteristics:
• operates with respect to a corpus of documents
• uses one or more dictionaries or vocabularies to identify relevant terms
• accommodates a variety of algorithms and metrics to quantify the
contents of a document relative to the corpus
• derives a structured vector of measurements for each document relative
to the corpus
• uses analytical methods that are applied to the structured vector of
measurements based on the goals of the analysis (for example, groups
documents into segments)
If you are not familiar with text analytics, one of the first books you should read is by Berry and
Browne (2005), who describe the mathematics behind search engine technology. Chakraborty et al.
(2013) provide a practical guide to text analytics using SAS software.
Discussion
What corpus will be used for your first
text mining project?
A simple example using 47 artificial documents shows how SAS Text Miner turns documents into
numbers. Using your experience with search engines as an information retrieval tool will help you
understand how the process works.
This demonstration illustrates some text analytics concepts using a simple document collection.
Note: In this class, the project that you use and the diagrams are already set up, at least partially.
However, for each demonstration, you can rebuild your own version of each diagram. In
some cases, you can make additions to an existing diagram. Complete instructions are
provided even if the process flow has already been created for you.
If you have ever used a search engine, you have probably employed algorithms used by SAS Text
Miner. One of the more popular algorithms is called Latent Semantic Indexing. SAS Text Miner uses
a Latent Semantic Analysis algorithm, an extension of Latent Semantic Indexing developed to permit
applications beyond information retrieval.
During the development of this course, a search was initiated using the simple search keyword lions.
A specific commercial search engine was used, and the top 25 recommendations were considered.
This search engine provided the following top five recommendations.
1. A Wikipedia link on the word lions
2. The Twitter link for the Detroit Lions of the National Football League (NFL)
3. A link to the Detroit Lions website
4. A link to an ESPN web page dedicated to news coverage of the Detroit Lions
5. A link to a news article about a game played the day of the query featuring the Detroit Lions
Recommendations 10, 11, and 12 linked to articles related to African lions, as did most of the
remaining recommendations in the top 25. Other recommendations included links for the Lions Club,
a non-profit civic organization. Using a different commercial search engine, and examining the top 25
recommendations as before, the third recommendation was a link to an organization that promotes
humane treatment of captured African lions housed by zoos or animal preserves, and the fourth
recommendation was a link to the Lions Club. Recommendation 1 linked to Wikipedia, and
recommendations 2 and 5 linked to information about the Detroit Lions NFL team. The different
results can be attributed to many factors. Search engines are dynamic, and they dynamically react
to a user’s personal search history as well as to the dynamically changing Internet. One reason for
different results is that the browser used to initiate the searches had no history of any searches using
the second search engine, so the second search engine had no data to learn user preferences.
A general problem in text mining is word sense disambiguation. A search engine cannot
disambiguate lions into the proper category of African lions or any animal classified as a lion, the
professional sports team Detroit Lions, or the civic organization the Lions Club. The use of a search
history can help the search engine guess what information the user is seeking. If a high percentage
of recent searches were for NFL scores, American football scores, sports teams, or the sport of
American football, then results like those encountered above are likely to be satisfactory to the
searcher. In fact, on the day of the query, the browser had been used to make numerous searches
using the first search engine to get updates on scores and to find game schedules for the NFL.
To see how search engines dynamically learn from individual users, you can perform an experiment
with friends or colleagues. Search on a simple term like lions. Have others perform the same search
using the same search engine on their individual computers. The top 25 results should be different,
at least in ranking. Consider the top five results listed above. A search by the same person on the
same computer using the same browser, but on a different day produced different results. The
results also differed from those obtained by a colleague using a different computer. Langville and
Meyer (2006) provide some insight into search technology for the popular Google search engine.
As mentioned before, Berry and Browne (2005) introduce search engine mathematics, and their
book is highly recommended.
In the simple 47-document corpus described above, the term lion or lions appears in seven
documents. All seven documents are classified as animals documents.
Zooming in on the text, you can verify that these are lions documents.
The above results were obtained using the Filter Viewer in the Text Filter node. The Filter Viewer
accommodates queries and is explained later. SAS Text Miner provides several strategies for
retrieving information from a document collection. The simple example was chosen to illustrate basic
concepts and avoid the complexities of word sense disambiguation. Lion always refers to the animal
lion in the simple collection.
If you want to find information on the Internet, you might use Google or Bing or some other search
software. This demonstration shows you ways to find information in a document collection using
SAS Text Miner.
Your instructor will explain how to start SAS Enterprise Miner for your computing environment.
On Microsoft Windows clients, you can use Start ⇒ All Programs ⇒ SAS ⇒ SAS Enterprise Miner
Client. If you are using the workstation version, then select SAS Enterprise Miner Workstation.
The version number is included in the selection (for example, SAS Enterprise Miner Client 15.1).
Alternatively, there might be a desktop icon for Enterprise Miner. If so, simply double-click the icon,
and the SAS Enterprise Miner login window appears. There is no login window for the workstation
version of Enterprise Miner.
There might be differences in the SAS Enterprise Miner main menu depending on version number.
The version number is given in the menu.
SAS Enterprise Miner provides an analytical laboratory for solving business and scientific problems.
Project panel
Properties panel
Help panel
Diagram workspace
Process flow
Node
The available SAS Enterprise Miner tools are contained in the tools palette. The most commonly
used tools are arranged according to a process for data mining referred to as SEMMA. This
is an acronym for the following:
Sample You sample the data by creating one or more data tables. The samples should be large
enough so that you have confidence in the reliability of the results.
Explore You explore data to better understand relationships, anomalies, and problems.
Modify You modify the data by cleaning, selecting, and transforming the variables considered for
modeling.
Model You model the data using the available analytical tools.
Assess You assess and compare alternative models to find the best results that you can obtain.
Additional tools (nodes) are available on the Utility tab. Additional nodes can be licensed, such
as the Credit Scoring nodes and the Text Mining nodes.
The Help panel provides a terse description of a particular property. You can access the full
documentation for SAS Enterprise Miner and SAS Text Miner by selecting Help ⇒ Contents.
Details about features and algorithms, such as singular value decomposition, help you understand
how to use the software.
Having covered the basics of the SAS Enterprise Miner interface and SEMMA methodology, you
need to understand how to construct a diagram and a process flow. To create a new diagram, select
File ⇒ New ⇒ Diagram, or right-click the Diagrams entry in the project panel and select Create
Diagram. Enter the name of the diagram in the window that appears.
A process flow usually begins with one of the following: Input Data Source node, File Import node,
or Text Import node. These three nodes bring the data into a process flow.
A process flow is created by dragging nodes into the diagram workspace and then releasing the
node at the desired location. To connect two nodes, move the cursor to the right of the first node until
you see a pencil icon. Then click and drag the arrow until it touches the second node.
When the arrow touches the second node, you can release the mouse button.
Your instructor will provide additional information, such as how to move or delete groups of nodes.
Information about data sources in SAS Enterprise Miner will be provided after the conclusion of the
demonstration.
Consider the exported data property of the Text Cluster node. You can see the choices for browsing
or exploring the exported data.
In the above window, only the TRAIN data set exists. If you select the TRAIN data, you get access to
the three buttons at the bottom.
The browse button opens a scrollable view of the data. The explore button provides a common
interface for exploring data.
The guide Getting Started with SAS® Enterprise Miner™ is available from
http://support.sas.com. The guide provides examples of exploring data with SAS Enterprise Miner.
Because an exploration is dynamic, you do not want to copy a large amount of data over the network
for exploration. Consequently, the explore window shows only explorations based on a sample of the
data. Use Options ⇒ Preferences to control how explorations are performed. In particular, you
probably want to change Sample Method to Random and Fetch Size to Max. Following is an
example of a plot obtained using the Plot Wizard and the 3D Scatter Plot option for the TRAIN data
exported by the Text Cluster node.
(In the plot, the three visible groups of points correspond to the Sports, Animals, and Weather
clusters.)
Dictionary Any list of words, with an optional role or weight attribute (or both) for each word.
Term A token (character string) or group of tokens (multi-word terms) having a specific
meaning in a given language.
Start List A dictionary of relevant terms to be used in the analysis.
Stop List A dictionary of irrelevant terms to be ignored for the analysis.
Stemming Mapping variations of a term to a single parent term. Variations can arise from
verb tense (present/past), singular/plural forms of nouns and verbs, or
grammatical gender (in the so-called Romance languages).
Synonyms More general than the usual definition of synonyms used in a language arts class.
In text mining, a synonym is any child term mapped to a parent term, where the
mapping can result from true synonymy, from mapping misspelled words to
correctly spelled words, or from stemming.
For any corpus, you must specify either a start list or a stop list. A default stop list is provided
in the language that you are using for the analysis.
Some additional details, such as why certain properties are selected, are helpful but not required
to complete the project. For example, the Text Cluster node Max SVD Dimensions property
defaults to 100, but you will change this to 3 in the demonstration. Detailed explanations of the
properties are provided in subsequent chapters.
Weather-Animals-Sports Project
1. Insert a File Import node in the diagram. The File Import node is on the Sample tab. This is the
first node on the left that you see in the diagram above. In this demonstration, you use a data set
that is completely stored in a single Microsoft Excel spreadsheet. Importing a file is one way of
getting relatively small text mining data sets into SAS Enterprise Miner. On the properties panel
for the File Import node, specify the import file as the data set
D:\workshop\winsas\DMTX51\WeatherAnimalsSports.xls. Run the File Import node.
2. To see the data set after the File Import node is run, go to the Exported Data line of the
properties panel. Click the ellipsis button (…). Then select the Train data and click Browse near
the bottom of the window. You see the rows of the data set. The first seven rows are shown
below.
The data set has two fields: Target_Subject (with values A, S, W) and TextField, which consists
of short sentences. As indicated earlier, the sentences are about one of three subjects: animals
(A), sports (S), and weather (W).
Note: It is important to understand that the Target_Subject field was created by a person
interpreting the content of each TextField. It was not created automatically by the Text
Miner nodes. Consequently, the labeling of each document is subject to human error.
3. Read through a few of the rows and make sure that you understand the nature of the data set
and how it is structured. The variable TextField is what is referred to as a document. All the rows
of TextField together (47 rows of data) are referred to as the corpus.
4. You can save the SAS table derived from the Excel file by attaching a Save Data node from the
Utility tab to the File Import node. Use the properties of the Save Data node to name the table
and specify the library where the table is to be stored.
5. Attach a Text Parsing node to the File Import node. This node has the language processing
algorithms and has many different options that can be set by the user. For this demonstration,
use the default settings. Run the Text Parsing node.
6. Attach a Text Filter node to the Text Parsing node. Change the Minimum Number of
Documents value in the properties panel to 2. This option filters out terms that are not used
in at least two documents in the corpus collection. Because you use a very small data set,
the default value 4 is too large and eliminates too many terms. Run the Text Filter node.
7. Open the Filter Viewer in the properties panel. This is also called the Interactive Filter Viewer.
8. Look at the two main windows that open in the Filter Viewer. You see what is shown in the
display below.
The first window, labeled Documents, simply lists each document and any other variables
in the training data set (in this case, only the variable Target_Subject). The second window,
labeled Terms, gives information about each of the terms that came out of the Text Parsing
node. A term does not have to be a single word. The term table contains the corpus dictionary—
that is, it contains every term in the document collection defined by the training data set, after
certain parts of speech have been eliminated. To be technically accurate, the term table is
related to the corpus dictionary, but it actually represents a mapping of the corpus dictionary into
a set of terms influenced by Text Parsing and Text Filter properties. For example, you can force
certain terms, such as addresses, compound words, or dates, into the term table.
For this example, the entire 47-document collection is used as a training data set. SAS
Enterprise Miner accommodates the use of three data sets for analysis: train, validate, and test.
You can use the Data Partition node on the Sample tab to partition a raw data table into any or
all of the three analysis tables. The purpose of the three data tables is explained later. The
validate and test data sets are used to verify that scoring models or algorithms generalize to new
data. Because this is an exploratory analysis, only a training data set is used.
The Terms window contains the following information:
FREQ = number of times the term appears in the entire corpus.
#DOCS = total number of documents in which the term appears.
KEEP = whether the term is kept for calculations. The keep status reflects whether a
term is in the start list (KEEP=Y) or in the stop list (KEEP=N).
WEIGHT = a term weight. Term weights are explained later. The default term weight is
mutual information when a categorical target variable is present in the data.
ROLE = part of speech of the term (if the Different Parts of Speech property is set
to Yes).
ATTRIBUTE = one of abbr, alpha, mixed, num, or punct. Attributes num (number) and punct
(punctuation) are ignored by default.
Go to the Terms window and confirm that the word the is not listed. (If the TERM column
is not already in alphabetical order, you can sort a column by clicking on the heading.)
Why does the most common word in the English language not appear in the list? To understand
why, click the Text Parsing node so that the properties panel for that node is visible. Look
at the properties near the bottom. You can see that there is an Ignore Parts of Speech property.
By default, this excludes certain terms that are very common.
In particular, 'Det' represents Determiner, which is a class of common words and phrases such
as the, that, and an. These are eliminated unless you modify this property. Eliminating a word
because of the Ignore Parts of Speech property is different from adding a word to the stop list.
Words in the stop list appear in the term table, but are assigned a weight of zero. Words that are
ignored are excluded from the term table. The distinction is that words in the stop list can be
moved to the start list dynamically without re-parsing the document collection, whereas words
excluded from the term table can be added only by modifying properties in the Text Parsing node
and re-running the node. Because parsing typically consumes 80% to 90% of the processing
time for a Text Miner process flow, you want to avoid re-parsing, especially for very large
document collections.
Go back to the Text Filter node. Why are some of the terms kept (KEEP=Y, the check box is
selected) but others not kept (KEEP=N, the check box is cleared)? There are several reasons why a word
is not kept, and these can depend on settings in both the Text Parsing node and the Text Filter
node. One reason is that the word does not appear in enough documents, such as what
happens for the word antelope. You previously set the Minimum Number of Documents
property to 2 for the Text Filter node. Because antelope occurs in only one document, it is not
kept.
Another reason a term is not kept is if it appears on a stop list specified in the Text Parsing node.
The default stop list for the English language is SASHELP.ENGSTOP. If you open the stop list
by clicking the ellipsis icon, you see a list of many terms that are excluded from further
computations.
If you open SASHELP.ENGSTOP from the Text Parsing node, you see that all is listed as a term
not to be used, as in the display below. Therefore, all is selected as KEEP=N in the Text Filter
node.
The interface for accessing the stop list is dynamic. You can edit the table directly using the two
buttons on the upper left. Use the starburst button to add terms, and use the delete button
to delete selected terms. You can append to the table using the Add Table button, or you can
replace the table with a different table using the Replace Table button. The table interface
is common to the following properties.
Node Property
The variables included in a table might be different, but the interface is the same.
If you select Replace Table, you have the option to specify that no table is to be used.
9. Return to the Text Filter Viewer. In the query window, type lions. Click the Apply button.
Five documents exhibit the word lions. The TEXTFILTER_RELEVANCE score is a function of the
word frequency normalized by the highest frequency encountered. For example, the word lions
appears twice in the first document and once in each of the remaining documents. Thus, the last
four documents have a relevance score of 1/2 = 0.5. The calculation is more complicated for
compound queries.
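The exact formula is not documented, but the frequency-normalization idea can be sketched in a
short SAS program. The document names and counts below are hypothetical and simply mirror the
single-term query results described above.

data qfreq;
   /* hypothetical counts of the query term "lions" in the
      five documents that contain it */
   input doc $ freq;
   datalines;
docA 2
docB 1
docC 1
docD 1
docE 1
;
run;

proc sql;
   /* normalize each document's term frequency by the
      maximum frequency observed for the query term */
   select doc, freq, freq / max(freq) as relevance
   from qfreq;
quit;

The query produces a relevance of 1.0 for the first document and 0.5 for the rest, matching the
single-term behavior described above.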
Now change the query to the singular term lion and click Apply. Only two documents are returned.
This verifies that the query is based on searching for words, not for sequences of characters.
If the search were for any occurrence of the letters l-i-o-n, then seven documents would be found.
The search feature applies what could be called a token-based search, as contrasted with a
character-string-based search.
Note: A token is a string of characters separated from other tokens by a separator, where a
separator is usually a blank (space) or mark of punctuation. Words in a document are
tokens.
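As a rough illustration, the SCAN function in a SAS DATA step splits a string into tokens at blanks
and punctuation. This is only a sketch of the idea; the parser used by SAS Text Miner is
considerably more sophisticated.

data tokens;
   length token $ 32;
   text = "The lions roared, and the rain fell.";
   i = 1;
   token = scan(text, i, " ,.!?");   /* blanks and punctuation act as separators */
   do while (token ne "");
      output;                        /* one output row per token */
      i + 1;
      token = scan(text, i, " ,.!?");
   end;
run;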
The operator ># can be used to find documents containing the word or any synonym or stemmed
version of the word. Thus, >#lion returns all documents that contain lion or lions as words.
Seven documents contain one or both of lion or lions. The query is a compound query, looking
for multiple words. Note that all seven documents are classified as animals documents.
You can use the TEXTFILTER_RELEVANCE score to rank order documents with respect
to a query. Consequently, the Filter Viewer provides search capabilities similar to those found
in commercial Internet-based search engines.
Note: The TEXTFILTER_RELEVANCE score 0.987 for the third and fourth selected documents
shows that compound queries use a more complex calculation than simply comparing
word frequencies. Otherwise, a simple frequency-based calculation would produce
a score of 0.5. As mentioned above, the actual formula used is not documented.
The Filter Viewer supports fewer query operators than a typical Internet search engine.
Although the Filter Viewer provides a useful mechanism for information retrieval, because
it is based on searching for tokens, it is not as powerful or as efficient as a linear-algebra-based
search, such as Latent Semantic Indexing. The Text Topic node facilitates linear algebra type
queries.
Note: The linear algebra approach to information retrieval derives a numeric vector for each
document and translates the query into a numeric vector. Calculations, like vector inner
products, are used to evaluate how well a document satisfies a query. The highest
scoring documents as scored by the vector inner product are returned by the search
engine. If a commercial search engine uses a linear algebra approach, the actual search
software probably incorporates many other tools and features in addition to linear
algebra calculations. Translating the query into a numeric vector might use proprietary
algorithms that use recent search history information. Initial results can be reweighted
based on current search trends influenced by all users of the search engine. In a general
sense, many search engines “learn” by recording which links a user clicked and which
links were ignored. There might be success/fail measures related to how long it took a
user to abandon a link and return to the original search results.
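To make the inner-product idea concrete, the following PROC IML sketch scores two documents
against a query using made-up three-dimensional vectors. In practice, the vectors come from a
decomposition of the term-document matrix, such as the singular value decomposition used by
SAS Text Miner.

proc iml;
   /* hypothetical reduced-dimension vectors for a query and two documents */
   q  = {0.9, 0.1, 0.2};
   d1 = {0.8, 0.2, 0.1};
   d2 = {0.1, 0.9, 0.7};
   /* cosine similarity: inner product scaled by the vector lengths */
   cos1 = (q` * d1) / (sqrt(ssq(q)) * sqrt(ssq(d1)));
   cos2 = (q` * d2) / (sqrt(ssq(q)) * sqrt(ssq(d2)));
   print cos1 cos2;   /* the higher-scoring document is ranked first */
quit;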
Many text-based software products support a FIND function, and SAS Text Miner is no
exception. You can select Edit ⇒ Find and enter a search term. If you enter lion, the Find
feature steps through all seven documents found above. Find jumps to the document in the
Documents or Terms window, but it does not subset the document collection like the Search
window. The Find feature is a character-string-based find, as compared to a token-based find.
Thus, if you enter lion in the Find Text window, you will identify documents with the words lion,
lions, lioness, ganglion, and so on.
SAS programmers might suspect that the query is equivalent to using a SAS function such
as INDEX or FIND. See “Text Mining Basics for SAS Programmers” at the end of this chapter
to learn more about how the Text Filter Viewer finds documents and terms.
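For example, the following DATA step contrasts the character-string-based INDEX function with
the word-oriented FINDW function. Only FINDW behaves like the Filter Viewer query, because it
matches lion only when it occurs as a separate word.

data _null_;
   text = "A ganglion is not a lion.";
   pos_index = index(text, "lion");        /* nonzero: matches inside "ganglion" */
   pos_findw = findw(text, "lion", " .");  /* nonzero only for the whole word "lion" */
   put pos_index= pos_findw=;
run;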
11. The next few steps introduce the two main analytic text mining tools, the Text Cluster node
and the Text Topic node. Attach a Text Cluster node to the Text Filter node. The Text
Cluster node takes the 47 documents in the example data set and separates them into
mutually exclusive and exhaustive groups (that is, clusters). The number of clusters
to be used is under user control. You modify four of the default settings.
Use the indicated settings for the Text Cluster node.
Regarding the Text Cluster properties, remember that you are using a very small and simple data
set. You know that there are basically three types of documents (animals, sports, weather). It is
reasonable to think in terms of creating a small number of clusters (for example, 3 to 5). Use 3.
In practice, with real and complex text data, you want to experiment with these parameters. Run
the node.
12. Open the Text Cluster node results and examine the left side of the Clusters window as shown.
Exactly three clusters were created, as requested in the properties panel. The Descriptive
Terms column shows up to 15 terms that are given to help the user interpret the types of
documents that are put into each cluster. (The number can be changed.) These terms are
selected by the underlying algorithm as being the most important for characterizing the
documents placed into a given cluster. Reading these, you can see that Cluster 1, which has
16 documents, has terms such as favorite zoo, big cat, and so on. These documents are likely
about animals. The + indicates that a term has multiple versions either from stemming or from
having synonyms. Cluster 2 has 14 documents that are likely related to sports. Cluster 3 has
17 documents that likely deal with weather.
Note: Stemming is the process of mapping a collection of terms into a single term based on
verb tense or noun/verb singular/plural considerations. For example, you have already
seen that lions is a stemmed version of lion. Similarly, cooks, cooked, and cooking are
all stemmed versions of cook. The Text Parsing node distinguishes between all forms
of a word, but treats the term and all stemmed versions of the term as a single term
when counting words and exploring word associations.
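If you eventually maintain your own synonym list, it is an ordinary SAS data set. The sketch below
assumes the commonly used columns TERM, PARENT, and TERMROLE; check the Text Filter node
documentation for the exact variables that your version expects.

data work.mysynonyms;
   length term parent termrole $ 32;   /* assumed column names */
   input term $ parent $ termrole $;
   datalines;
lions    lion   Noun
cooked   cook   Verb
cooking  cook   Verb
;
run;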
13. To see the new variables that were generated by the Text Cluster node, close out of the results.
Select Exported Data from the properties panel.
The upper right window (Sample Statistics) shows a list of variables that were exported from
the Text Cluster node.
Several new variables have been added to the original variables Target_Subject and TextField:
14. SAS Enterprise Miner provides many ways to do further explorations. The StatExplore node
provides basic statistics and crosstabulations for input variables. On the Utility tab, drag
a Metadata node into the diagram and attach it to the Text Cluster node. Change the role
of Target_Subject and TextCluster_cluster_ to Input.
15. Run the Metadata node. On the Explore tab, drag a StatExplore node into the diagram and
attach it to the Metadata node. Select the Cross-Tabulation property and add the
Target_Subject*TextCluster_cluster_ crosstabulation. To do so, select each input, select
the right arrow, and then select Save when both inputs are selected.
This crosstabulation shows that cluster 1 (which was seen previously to have descriptive terms
such as favorite zoo, big cat, and so on) consists of 16 documents, all labeled animals (A) by
the human reader. Cluster 3 (hot weather, winter day, and so on) consists of 17 documents: 16
were labeled weather-related, and one was labeled animal-related. Cluster 2 (basketball team,
play, and so on) consists of 14
documents with a target value always equal to S. The three clusters line up almost perfectly
with the labels given to the documents. There is a single misclassified document. It would
be wonderful if real data worked out this well, but do not expect that!
Because the data set is so small, you can simply examine the exported data to find the single
misclassified document. In the Text Cluster properties panel, select Exported Data. Select the
TRAIN data set and then click Browse. Scroll to the bottom. Notice that document 44 is the
misclassified document (_TextCluster_cluster_=3, Target_Subject=A).
17. The Text Topic node is used to identify topics in a document collection. Although a cluster
is a mutually exclusive category (that is, each document can belong to one and only one cluster),
a document can have more than one topic or it can have none of the derived or user-specified
topics. Attach a Text Topic node directly to the Text Cluster node. Make one change to the
default properties by specifying 3 as the number of multi-term topics to create. Just as the
number of clusters created is a parameter with which you want to experiment when you use the
Text Cluster node, this parameter for the number of topics to create is typically something that
you might try with different values. In this example, the artificial data set was purposely created
with three different topics, so a reasonable value to start with would be 3 to 5 and not the default
value of 25. You use 3.
18. Run the node. Then click the ellipsis for Topic Viewer on the properties panel. The Topic Viewer
is an interactive group of three windows. The Topics window shows the topics created by the
node.
The three topics created by the algorithm also have key descriptive terms to guide interpretation.
The five most descriptive terms for each topic are shown. By default, the first topic is selected
when you open the viewer. In this example, the first topic has descriptive terms starting with
snow, hot, …, and seems to relate to weather. The second topic has descriptive terms lion, tiger,
…. This is evidently a topic related to animals. The descriptive terms for the third topic (baseball,
basketball, …) are interpretable as having to do with sports. With this simple data set, the
algorithm did very well in identifying what are known to be the three underlying topics in the
documents. However, the # Docs column indicates that the node did a poor job of classifying
documents. At most, 25 documents have been associated with one or more of the three topics,
leaving at least 22 documents with no topic assignment.
The Text Topic node adds a number of variables to the exported data set.
TextTopic_raw1 - TextTopic_raw3 – These are numeric variables that indicate the strength that
a particular topic has within a given document. Three topics were generated because this was
specified on the properties panel. These variables are the same as the topic weight values for
the documents given in the Documents window of the interactive Topic Viewer. Each of these
variables (topics) has a label (the five most descriptive terms) to identify it and help the user
interpret the topic.
TextTopic_1 - TextTopic_3 – These are binary variables defined for each document and
constructed from the TextTopic_raw1 - TextTopic_raw3 values based on the document cutoff
values given in the Topic table. For example, TextTopic_1 is set to 1 if a document has a
TextTopic_raw1 value greater than the cutoff value for this particular topic, which in the table
above is given as Cutoff=0.411. Otherwise, TextTopic_1 is set to 0. The labels for the TextTopic
variables are the same as for the TextTopic_raw variables, except that they have _1_0_ as
prefixes. This indicates that they are binary variables. Each label shows the five most descriptive
terms that are identified with that topic.
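The construction of the binary flags is a simple thresholding rule. A minimal DATA step sketch,
using the 0.411 cutoff reported above (the input table name is hypothetical):

data topic_flags;
   set exported;   /* hypothetical name for the Text Topic node's exported table */
   /* flag the document when its raw topic weight exceeds the document cutoff */
   TextTopic_1 = (TextTopic_raw1 > 0.411);
run;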
In the Topics window, there is a column labeled Term Cutoff. For each created topic, the
algorithm computes a topic weight for every term in the corpus. This measures how strongly the
term represents or is associated with the given topic. Terms that have topic weights above the
term cutoff appear in yellow in the Terms window shown below, and terms with topic weight
below the cutoff appear in gray.
The term cutoff for the first topic is 0.178. You can manually change this value, and you might
be compelled to do so when you see that several terms below the cutoff seem to be associated with
weather.
One choice is to select a term cutoff between the topic weights for winter day and animal.
For example, you could change the term cutoff from 0.178 to 0.080.
Another choice is to modify the topic weight for animal, perhaps changing it to zero because
the word is not associated with weather. After “zeroing out” unrelated terms, you could then
select a smaller cutoff. You could select a term cutoff value of 0.050, and use this value to
replace the original cutoff value of 0.178. The following screen capture reflects the latter choice.
When you click Recalculate, the weights and counts are updated.
If you make any change to a topic, the Topic Viewer changes the category from Multiple
(computer-software-generated multi-term topic) to User (custom user-defined topic). If you want
to be technically accurate, there are three categories: (1) software-derived topics; (2) user-
defined topics; and (3) user-influenced topics. The third category is not identified by SAS Text
Miner, but it is useful conceptually to differentiate between pure domain knowledge topics and
computer-generated topics that have been modified based on domain knowledge.
Pure domain knowledge topic dictionaries maintained by an organization are often independent
of any specific corpus, at least in the early stages of text mining integration. You should
recognize that language and knowledge are dynamic, so dictionaries must be viewed as
dynamic resources that should be routinely updated based on the latest data and domain
knowledge. When data contradict domain knowledge, you need to modify domain knowledge,
or improve the data, or both.
Note: Text analytics dictionaries, including synonym and topic dictionaries, should be treated
like software in a development environment. You should incorporate something like a
source code control system that permits changing dictionaries while keeping track of the
changes through something like software version control. You can use software metrics
for concepts such as “maturity” to see whether dictionaries have stabilized. Just like what
you often witness with new software projects, you are likely to see many changes in the
early stages of text analytics implementation. When text mining becomes a routine
activity, the number of changes to text analytic dictionaries should drop off substantially.
SAS does not provide version control software.
19. The manually entered changes add seven terms to the topic definition, but only increase
the document count from seven to nine. You know that there are 16 weather documents,
so additional changes are required to improve the topic definitions. The Document Cutoff for
the first topic is 0.411. Examine the document table. The 14 terms and their corresponding
weights do a good job of rank ordering the documents with respect to weather.
20. The last document selected to be associated with topic 1 has a topic weight of 0.414, and is
shaded in yellow. Documents below the cutoff are shaded in gray. If you changed the cutoff from
0.411 to 0.060, you would identify 17 documents as exhibiting topic 1, and all 17 would be in
cluster 3 derived by the Text Cluster node, including the ambiguous document that has been
labeled A. Change the cutoff to 0.060 and click Recalculate.
21. Because you are using domain knowledge to exploit the successful rank ordering of documents
into topic 1, the weather topic, you might as well edit the topic description. Click the cell for topic
1, and replace snow,+hot,weather,+cold,winter with Weather.
The following topic table reflects changes to Term Cutoff and Document Cutoff for the
remaining two topics. These changes improve the identification of documents related to the
sports and animals topics.
22. You can make one more change to get seemingly perfect results. Select the animals topic.
Change the topic weight for monkey to 0.150. You change a topic weight by clicking the topic
weight cell that you want to change, and then use the edit keys (Backspace and Delete)
if necessary to type the replacement value. When you recalculate, you will see that 17
documents are classified as exhibiting the animals topic. There is one document that is flagged
as both a weather topic and an animals topic. This ambiguous document is the same one that
was identified by the StatExplore node. Close the Topic Viewer, and save the changes that you
made.
23. To exploit the custom topic capabilities of the Text Topic Viewer, copy and paste the Text Topic
node and attach the copied node to the Text Cluster node. Rename the new Text Topic node
“Custom Topics.” Change the number of multi-term topics to zero. Run the new node, and then
open the Topic Viewer. The following table shows that the ambiguous document, the one
beginning with “If you like hot weather,” is just above the document cutoff.
24. Examine the weather topic. You can see that the ambiguous document receives a relatively high
topic weight.
25. Close the Topic Viewer. Select User Topics and observe the custom topic table created by your
original edits in the Topic Viewer.
You can see the name of the table: EMWS2.TextTopic2_INITTOPICS. The table is managed
by Enterprise Miner, and you have no easy way to export the table. However, because you know
the table name, you can use a SAS Code node to make a permanent copy of the table. If you
do so, then the table can be used for other projects. You specify the table with the User Table
interface. You can replace an existing table, or add (append) to an existing table.
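For example, code like the following in a SAS Code node would make a permanent copy. The
library path is illustrative; use a location appropriate for your site.

libname topiclib "D:\workshop\winsas\DMTX51";   /* illustrative location */

data topiclib.custom_topics;
   /* permanent copy of the table managed by Enterprise Miner */
   set EMWS2.TextTopic2_INITTOPICS;
run;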
Note: If you created the process flow exactly as described, the custom topics table name would
actually be EMWS2.TextTopic_INITTOPICS because the custom table is derived from
the original Text Topic node that was placed in the diagram.
26. The final part of this demonstration is to use the Score node to score a new data set. Following
the top part of the diagram shown at the beginning of this demonstration, bring in a new File
Import node. Rename it File Import Score Data. The import file for scoring is
D:\workshop\winsas\DMTX51\Score_WeatherAnimalsSports.xls. (Pathname variations are
possible as stated above.) In the properties panel, change the role of the data set to Score.
27. Run the node and look at the Exported Data window. This SCORE data set has 16 documents.
They are related to the three subjects (animals, sports, or weather). (As is usually the case with
a data set to be scored, there is no target field on this data set.) The object now is to classify
these documents using the previous analysis. To do that, bring in a Score node and connect
it to the output of the File Import Score Data node and also to the output of the last Text Topic
node (Custom Topics).
28. Run the Score node. Then go to the Exported Data window through the properties panel. Select
the SCORE data to view and click Browse.
29. When the Browse window appears, move the column headings so that TextField is the first
column heading and the other scored segment values are to the right of the text field. Recall that
the clusters lined up as 1=animals, 2=sports, 3=weather. The custom topic segment variables
are clearly labeled. Read through the 16 rows and check to see whether any of the
classifications looks incorrect to you.
Do any of the topic indicators disagree with the cluster segments? Which observations appear
to be misclassified? In particular, document 1 is classified as cluster 1 (animals), but the three
binary text topic variables are zero, indicating that the first document exhibits none of the three
topics. Document 7 is misclassified by the cluster ID and by the topic flags. Document 16 is
ambiguous and could be both weather and animals. It is classified as a weather document by the
cluster ID and the weather binary variable, but it could also be classified as an animals
document.
Because of language challenges such as word sense disambiguation, the underlying text mining
and modeling algorithms make mistakes. In this case, the very small number of training
examples that were used likely influenced the quality of the results. Overall, the results look very
promising.
The description of text mining provided earlier mentions dictionaries or vocabularies. The terms
dictionary and vocabulary are interchangeable in the literature on text mining. One author might
describe how “a person’s vocabulary is like DNA in uniquely identifying an individual.” Another author
might provide a dictionary of action verbs to help score documents to achieve some analytical
objective. The document collection has a dictionary or vocabulary that is the union of all the terms
contained in each document. Consequently, text mining references use dictionary or vocabulary
to refer to the collection of terms that are used in the analysis. For convenience, this course will tend
to use dictionary rather than vocabulary to refer to terms used in an analysis.
The demonstration provided some details about start lists, stop lists, and synonym tables.
These tables have various names in the text mining literature.
The corpus dictionary contains every word used in the corpus. A subset of the corpus dictionary
contains relevant terms—that is, terms that will aid in achieving the analytical objective of the text
mining project. This dictionary of relevant terms is called a start list. Terms not in the start list are
ignored, except possibly for use in determining relative frequencies or other calculations that require
a count of all words in a document.
Zipf’s Law, discussed in a later chapter, helps identify terms in a dictionary that should be included
in an analysis. In particular, Zipf’s Law suggests that terms appearing with low frequency and terms
appearing with high frequency are irrelevant. Identifying irrelevant terms with the aid of Zipf’s Law
and domain knowledge is often easier than constructing a dictionary of relevant terms. The
dictionary of irrelevant terms is called a stop list. By default, the Text Parsing node specifies a stop
list in the selected language. You decide whether to use the default stop list, create your own stop
list, or create a start list. You can also use existing stop or start lists created for your organization.
You specify a stop list or a start list, but not both, because the corpus dictionary will uniquely define
one of these lists when the other is supplied.
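To make this concrete, the following is a minimal sketch of a custom stop list built with a DATA step. The single term column is an assumption that mirrors the shape of the shipped English stop list (Sashelp.Engstop, described later in this chapter); check that table for the exact column names that your release expects before supplying the list to the Text Parsing node.
data work.mystoplist;
length term $32;
input term $;
datalines;
the
a
an
and
of
to
;
run;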
Text mining works with a collection of documents. The collection can be dynamic—that is,
documents can be added to the collection. You can use the collection to train a model, and you can
apply the model to new documents coming into the collection. New documents are scored relative
to how they compare to the original documents in the collection. If a new document contains a new
term, then text mining is ignorant of this new term until that document is used in a new training step.
Text mining training can be performed using only nodes within the SAS Text Miner group of tools.
However, SAS Text Miner nodes export data, and these data can be imported into pattern discovery
and predictive modeling nodes of SAS Enterprise Miner. Thus, a trained model can be obtained by
using a combination of SAS Text Miner nodes and SAS Enterprise Miner nodes. Although many
commercial text mining products have strong text-analytics capabilities, many of these lack data
mining capabilities beyond text analytics. The ability to score new documents using a decision tree
or a neural network presents new opportunities to improve text mining outcomes (for example,
making it possible to use variables derived from text analytics in predictive models).
Text Scores
The value or values associated with a document can be
• segment identifiers related to text categorization or more general
predictive modeling
• cluster identifiers related to grouping documents based on similarity
of content
• probabilities of membership in segments or clusters
• numeric values representing document content based on weighted
averages of transformed word frequencies.
As new documents appear, they can be scored using the model trained on the original corpus.
Eventually, the model can be updated by being retrained on the corpus with the new documents
added.
Some data mining references imply that scores are associated with predicting a target variable
in a supervised learning setting. The above slide makes it clear that any numeric or class variable
added to a data set by a Text Miner node can be treated as a score. This interpretation makes
it clear that you can add a Score node after a Text Cluster, Text Profile, or Text Rule Builder node.
The Score node will add the numeric or class variables to a score data set. This is appropriate for
supervised and unsupervised learning problems.
There are unique challenges in scoring new documents. A score data set in SAS Enterprise Miner
contains all of the features necessary for scoring, including the document. For supervised learning,
a score data set does not need to have a target variable, because the goal of supervised learning
is to predict a target variable when it is not known. To produce scores for a single observation, the
document must be parsed. Because score data is “new” data, it is possible that the score document
contains words that are not in the corpus dictionary. All new words must be treated as exclusion
words. Zipf’s Law suggests that any new words encountered will fall into the large collection of low
frequency terms. As mentioned above, low frequency words are usually added to the exclusion
dictionary (stop list).
SAS Text Miner uses documents in the training data only to modify the dictionaries supplied by the
user. Unlike some predictive methodologies in data mining, text mining does not use validation data
for tuning except in the Text Rule Builder node. The validation data have no impact on the scores
that are produced by the Text Cluster and Text Topic nodes. For comparison, decision trees are allowed to use the validation data for pruning; because pruning modifies the tree, the validation data affect the derived scoring mechanism and hence the scoring calculations. The Text Rule Builder node uses the validation data to influence rule derivation.
Hence, the Text Rule Builder node was not used in the first demonstration. Opinions vary, but you
probably want at least 1,000 documents with at least 100 documents in the smallest target category
before using the Text Rule Builder node.
Experts differ in how they characterize applications of text mining. For example, Miner et al. (2012)
describe the following three areas: information retrieval, information summary, and information
extraction, whereas the slide above lists only information retrieval.
If you have N documents, and M documents belong to the desired category, the possible outcomes from a query or classification request are given in the following table:

                            Retrieved   Not Retrieved   Total
In Selected Category        TP          FN              M
Not in Selected Category    FP          TN              N-M

Precision is given by

Precision = 100*TP/(TP+FP)

Recall is given by

Recall = 100*TP/(TP+FN) = 100*TP/M

The misclassification rate is

Misc = 100*(FP+FN)/N

The F1 score is the harmonic mean of precision and recall:

F1 = 1/(0.5*(1/Precision)+0.5*(1/Recall))
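As a worked example, the following DATA step computes these measures from the four cell counts; the counts here are illustrative values, not the results of an actual query.
data work.querymetrics;
TP=40; FP=10; FN=20; TN=130;   /* illustrative cell counts */
N=TP+FP+FN+TN;                 /* all documents */
M=TP+FN;                       /* documents in the desired category */
Precision=100*TP/(TP+FP);
Recall=100*TP/M;
Misc=100*(FP+FN)/N;            /* misclassification rate */
F1=1/(0.5*(1/Precision)+0.5*(1/Recall));
run;
proc print data=work.querymetrics;
run;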
The Text Cluster, Text Topic, and Text Rule Builder nodes add columns to the imported data sets that
are related to scoring. These columns will also be added to score data sets if scoring is requested
through the Score node. Scoring usually creates one or more columns with the role of segment or
prediction, but some of the columns that are added have roles of Input and Rejected. The
Enterprise Miner Metadata node can be used to change variable roles.
Note: Data mining analysts know how a predictive model scores new data. However, some
analysts might be unaware that unsupervised learning models (that is, data without a known,
available target) can also generate scores, and new data can be scored using the model.
For example, the Text Cluster node divides a document collection into mutually exclusive
clusters. For the expectation-maximization algorithm, a new document is scored by
calculating the probability of membership in each cluster, and then it is assigned to the
cluster associated with the highest probability.
Data mining is often described with respect to two general application areas: pattern discovery
(unsupervised learning – no target variable) and predictive modeling (supervised learning – target
variable). (Specific examples of these two application areas are presented in this course.)
For text mining, pattern discovery encompasses the area of information retrieval, and prediction
encompasses the area of text categorization. If you are using text mining to suggest groupings
of documents that do not have pre-assigned labels, you would be categorizing documents in an
unsupervised learning setting.
A text mining project usually falls into one of two areas: information retrieval or text categorization.
If you are working on a project that has text mining components, such as predicting insurance fraud,
then you will be using text mining in a support role so that the general problem is prediction
(supervised learning), and prediction can be enhanced by incorporating text mining input variables.
Anomaly detection can sometimes be a first step toward creating a target variable if none exists.
Users who are new to the world of analytics often have a naïve notion about noise. Science fiction
movies include computers and robots that speak and understand human languages. Television
police dramas have detectives who make perfect predictions about where crimes will occur. The
reality is that noise permeates existence. You might have an expectation that, after you master text
mining, you can perfectly predict customer behavior based on responses to an online survey. This
expectation is unrealistic.
Psychologists know that human beings might react differently to the same stimulus if sufficient time
elapses between exposures. On Monday, when you are hungry at lunchtime, you eat a sandwich.
Yet, on Tuesday when you are hungry, you opt for a salad. This tendency for different outcomes to
occur with similar inputs is attributed to noise, which is unpredictable. You can predict with almost
certainty that you will eat lunch next Thursday, but you cannot predict what you will eat with the
same certainty. (Of course, if you bet someone a million dollars that you will eat a spinach salad next
Thursday, then you will almost certainly eat a spinach salad!) Analytic experts expect errors in
prediction related to noise, so methods are developed to minimize errors in the presence of noise.
The incremental value that text mining can provide to your predictive models should be assessed by
comparing the quality of a model (accuracy, ROC index, and so on) without incorporating text mining
to that achieved after text mining is added.
(Slide 38 graphic: a scatter plot of two inputs in which the primary and secondary outcomes are perfectly separated by a straight line.)
The above graphic illustrates the pure signal situation. In this case, the training data can be perfectly
separated into primary or secondary outcomes using a linear decision boundary. You rarely expect
to see this in practice. Unfortunately, some people who are new to text mining are disappointed
when the methods do not perfectly categorize documents with this type of accuracy.
(Slide 39 graphic: a scatter plot of INPUT1 versus INPUT2 in which the primary and secondary outcomes are completely intermixed.)
At the other extreme is the pure noise situation. In this case, the training data appears to have no
patterns upon which to base a model that can separate the primary outcomes from the secondary
outcomes. This situation is more common than you might like. Although pure signal is very rare, pure
noise can actually occur in practice.
(Slide 40 graphic: a scatter plot of INPUT1 versus INPUT2 showing a mixture of signal and noise, with partial separation of the outcomes.)
The most common situation in practice is a mixture of signal and noise. You can predict more
accurately than randomly guessing. How well you predict depends on whether data is dominated
by systematic variation or random variation.
(Slide graphic: a scatter plot of X1 versus X2 in which one group of cases is well separated from the rest.)
When no target variable is available, you can still investigate whether a natural separation occurs
in the data with respect to the analytic objective. For example, fraud cases are often unusual in higher-dimensional space because the people who commit fraud have difficulty making the outcomes look normal in many dimensions at once. This example could represent insurance claims data for automobile accidents.
(Slide graphic: the same X1 versus X2 plot showing separation with noise; a few cases fall on the wrong side of the separation.)
Even with good separation, noise is usually present. For the fraud example, most of the claims with
a long distance from claimant to physician and a high ratio of bodily injury to property damage costs
are fraudulent (dark circle), but a few are legitimate cases (light circle). Other fraud cases are not
so separated, perhaps because the fraudulent physician had a practice near the claimant’s home.
In this example, BI is some quantitative measure of bodily injury and PD is a quantitative measure
of property damage.
(Slide graphic: documents plotted on the text mining variables SVD1 versus SVD2, labeled by terms such as physician, chiropractor, dentist, and podiatrist. The fraud documents separate from the no-fraud documents.)
Note: SVD1 and SVD2 are variables that are created in text mining.
A string of fraud rings operating in Southern California in the 1990s had the common elements of a
lawyer, a chiropractor, and a recruiter. The recruiter approached people who received unemployment
benefits and told them that they could obtain worker’s compensation benefits from their previous
employers. The recruiter referred a candidate to an unscrupulous lawyer, who scheduled treatments
with a chiropractor, a partner in the fraud ring. After a few weeks of treatments, the lawyer filed a
claim for three to five times the chiropractor bills (a fairly common practice in insurance litigation).
Claims adjusters often receive training in fraud prevention. When information about the fraud rings
was disseminated, a claims adjuster would often add a comment to the adjuster notes when unusual
activity involving claimant representation by a lawyer and incoming chiropractor bills became known.
Document   National Words   International Words   Subject
1          3                0                     National
2          5                0                     National
3          7                0                     National
4          8                0                     National
5          0                4                     International
6          0                5                     International
7          0                3                     International
8          0                7                     International

Perfect Separation: No Mixing of Subjects
Some document collections are well separated for analytic purposes. The hypothetical example
above shows eight documents, with four that describe national news items exclusively, and the
remaining four describing international news items exclusively. Suppose that you could identify a set
of terms that are associated with national news and another set of terms associated with
international news. These terms could then be used to classify the documents in the corpus.
Document   National Words   International Words   Subject
11         3                1                     National
12         8                2                     National
13         7                6                     Mixed
14         8                1                     National
15         1                4                     International
16         2                5                     International
17         3                3                     Mixed
18         1                7                     International

Good Separation: Little Mixing of Subjects
45
Copy ri ght © S AS Insti tute Inc. Al l ri ghts reserved.
With the same topic and analytic objective, another document collection has documents that might mention a heterogeneous set of news articles. You still get good separation, but noise creeps in because a document can include multiple subjects.
Document   National Words   International Words   Subject
21         3                4                     Mixed
22         8                2                     National
23         7                6                     Mixed
24         8                1                     National
25         4                4                     Mixed
26         6                5                     Mixed
27         3                3                     Mixed
28         1                7                     International

Poor Separation: Substantial Mixing of Subjects
Finally, the above example shows that if you have a collection of documents that mention many topics and mix subjects, then trying to classify documents into clean categories is difficult. However, if you can accommodate a classification system that assigns multiple categories to a document, such as the Text Topic node and topic identification, then you can still successfully apply text mining techniques.
1.2 Working with Data Sources
Objectives
• Describe SAS Enterprise Miner metadata and detail the types of roles
and measurement levels that are supported.
• Explain how to create data sources that can be used by SAS Enterprise
Miner projects.
• Provide examples of data sources that are relevant for text mining.
Workspaces
SAS Enterprise Miner organizes projects by placing components of the project in separate folders
or directories. The Datasources folder contains metadata for each data source. The Workspaces
folder holds all of the details about each diagram, including property settings of nodes used in each
process flow.
SAS Enterprise Miner can import data from many sources, including common PC file formats such
as Microsoft Excel and common commercial relational databases (for example, Sybase, Teradata,
and Oracle), as well as from SAS data sets. The functionality of SAS Enterprise Miner comes from
the assignment of roles and levels to variables in a data set. Initially assigning metadata roles makes
the building of process flows much easier. Data properties do not need to be repeated or copied
and pasted for each new task.
One of the first tasks in any project is to identify one or more relevant data sources. Although you
can merge tables inside SAS Enterprise Miner, a best practice is to use the query optimization
features of the native database to build the analysis table and then import this table into SAS
Enterprise Miner.
• Select a table.
• Define variable roles.
• Define measurement levels.
• Define the table role.
(Slide graphic: a data source is defined by selecting a table from the SAS Foundation server libraries.)
Variable Roles
• Assessment
• Censor
• Classification
• Cost
• Cross ID
• Decision
• Frequency
• ID
• Input
• Label
• Prediction
• Referrer
• Rejected
• Residual
• Segment
• Sequence
• Target
• Text
• Text Location
• Time ID
• Web Address
Additional variables in the data set usually have roles of ID, Input, Target, or Rejected. An ID
variable identifies the document uniquely. An input variable can be used for segmentation or
predictive modeling. Only input variables are used to derive segments or clusters. SAS Text Miner
converts each document into a collection of inputs. For predictive modeling, the goal is to predict
the value of a target variable. Only input variables are used to predict the target.
Any other variable in the data that has no purpose for the analysis has a role of Rejected.
Measurement Levels
• Categorical (Class, Qualitative)
  – Unary
  – Binary
  – Nominal
  – Ordinal
• Numeric (Quantitative)
  – Interval
  – Ratio*
* All methods that accommodate an interval measurement scale
in SAS Enterprise Miner also support a ratio scale.
Elementary statistics textbooks for social science majors usually describe four measurement levels:
• nominal
• ordinal
• interval
• ratio
Other statistics textbooks might speak only of categorical and numeric data.
A variable with a nominal measurement scale is purely categorical in nature. There is no numeric
interpretation, and there is no natural ordering. Examples include eye color, political party affiliation,
and country of origin. An ordinal variable is a categorical variable that has an inherent ordering.
Thus, ordinal variables are also called ordered categorical variables. Examples include course letter
grade, response on a Likert scale, or items on a top-10 ranking list.
Note: Nominal data can be ranked by frequency of occurrence, price, personal preference, and
so on. If the ranking is meaningful and exploited by the analysis, the nominal variable
becomes an ordinal variable.
A binary scale implies a nominal scale with only two distinct values.
A variable with an interval measurement scale has a numeric interpretation so that the difference
between two numeric values is meaningful. A variable with a ratio measurement scale is valid
as an interval-scaled variable, but in addition, the ratio of two numeric values is meaningful.
Temperature in degrees Celsius is on an interval scale, but not a ratio scale: 20 degrees divided by 10 degrees equals 2, yet 20 degrees is not twice as hot as 10 degrees.
Most numeric data are on a ratio scale, but most analytic methods require only that the data be on an interval scale. All of the methods in SAS Enterprise Miner that work for numeric data work for interval- and ratio-scaled values alike. Consequently, a separate ratio measurement level is not supported.
Different nodes expect specific table roles. The Score node scores raw, training, validation, test,
and score data sets. The Association node acts on transaction data sets.
Table Roles
• Raw
• Training
• Validation
• Test
• Score
• Transaction
Using the Data Partition node, you can split raw data into training, validation, and test data sets. This
is an important step in predictive modeling. You want to achieve good generalizability of the model
by avoiding the problem of overfitting—that is, creating a model that looks good on the training data
but does not generalize well to a holdout sample.
(Slide graphic: the Data Partition node splits the raw analysis data into training, validation, and test data sets.)
Score data can be scored by the Score node if all of the required data elements are present.
The role of the score produced by the Score node is Prediction or Segment, depending on how
the score is produced. Consequently, you need to be familiar with variable roles even if they are not
assigned by you.
(Slide graphic: the Score node converts a score data set into predictions.)
SAS Enterprise Miner and SAS Text Miner anticipate the need to create and modify data before
an analysis. For text mining, the Text Import node can be deployed in a SAS Enterprise Miner
process flow to process a document collection. (The Text Import node is discussed in the next
chapter.)
The next slide describes how text data is treated by SAS Enterprise Miner.
Text parsing is always required as the first step in a text mining flow. This step accepts data sets with
the role of Train, Validate, Test, or Score data. At least one data source must be a data set with the
role of Train or Raw.
The input data source must have at least one variable with a role of Text or Text Location. As stated
above, the Text variable can contain an entire document or a truncated piece of an entire document.
The Text variable is a character variable, and SAS can accommodate only character variables with
lengths up to 32K (32,767 bytes). If a document exceeds 32K in length, then SAS must read the
entire document from a location specified in the input data. If no location is specified, then the Text
Miner nodes process only the truncated documents.
To process documents that exceed 32K, a variable with the Text Location role must be included in
the input data. The text location must be the full pathname of the document folder with respect to the
Text Miner server. For example, a document might be visible on your Windows computer at this location:
S:\MyProject\MyDocuments\Doc1.txt
However, the SAS Text Miner server might resolve the same file through a different pathname (for example, a network path to the same shared folder). The server-resolvable form of the document location is the form that must be used in the input data.
The Text Filter node can access documents through the Interactive Filter Viewer. By default, the
Interactive Filter Viewer displays only the portion of the document stored in the Text variable. If you
want to see the entire document in the Interactive Filter Viewer, then you can include a variable with
the role of Text Location that provides the pathname of the file that contains the full document.
If the input data source contains two or more variables with a role of Text, and the Use status is Yes
for these variables, then the Text Parsing node chooses the variable with the largest length. If the
lengths are the same, then the variable that appears first in column order is selected. If your data
has two or more text variables, you should set the Use status to No for all Text variables except the
one to be included in the analysis.
If you want to include two or more Text variables in your text mining project, then you must connect
Text Parsing nodes in parallel and change the Use status of the variables as needed.
In many cases, you need to preprocess textual data before you can import it into a data source.
The Text Import node is designed for this purpose. The Text Import node can be used in file
preprocessing to extract text from various document formats or to retrieve text from websites by
crawling the web. The node creates a SAS data set that you can use to create a data source to
use as input for the Text Parsing node. Depending on which of the two structures described above you use, you must adjust the roles of the variables accordingly in the Data Source Wizard.
The software distribution of SAS Enterprise Miner includes the following sample data sets in the
Sashelp library:
Data Set    Description                               Used By
Engstop     Stop list for the English language        Text Parsing node
Engsynms    Synonym list for the English language     Text Parsing node
The keyword language is chosen to correspond to one of these supported languages: Arabic,
Chinese (simplified and traditional), Czech, Danish, Dutch, English, Finnish, French, German,
Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese,
Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, and Vietnamese. (However, only
English, French, German, Italian, Portuguese, and Spanish have built-in stop lists and multi-term
lists.)
Note: You can specify multiple languages in the Text Parsing node. One choice for a multi-
language stop list is to take the union of the individual language stop lists.
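For example, the following is a minimal sketch of building that union with PROC SQL. Sashelp.Engstop is the English stop list listed above; the French list name (Sashelp.Frchstop) is an assumption, so check the Sashelp library in your installation for the actual member names, and confirm that the tables share the same columns before taking the union.
proc sql;
create table work.multistop as
select * from sashelp.engstop
union
select * from sashelp.frchstop;   /* assumed member name */
quit;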
1.3 Using SAS Text Miner
Objectives
• Describe the SAS Text Miner nodes.
• Explain the SAS Text Miner node properties.
• Text Cluster
• Text Filter
• Text Import
• Text Parsing
• Text Profile
• Text Rule Builder
• Text Topic
Note: Custom entities are not discussed in this course. You can refer to SAS Text Miner 15.1
Reference Help for information about how to bring in results to the Text Parsing node from
SAS Concept Creation for SAS Text Miner in SAS Content Categorization Studio.
The stop list is typically used to remove low-information terms that add only noise to the analysis. Noisy data has no descriptive or predictive value.
The term table contains all terms parsed from the document collection. If a table with the role Raw
or Train is imported into the node, then the entire document collection is used. If more than one table
is imported (for example, train, validate, and test data), then the term table contains all terms found
in the train data. If you select the term table and then select File → Save As, you can save
a permanent copy of the table. This table is dynamic and can change based on properties specified
in successor nodes.
The three plots in the Results window are similar to plots found in the Text Filter Results window.
These plots are explained later.
If you do not specify a spelling dictionary, the spelling checker takes the terms in the bottom 5% with
respect to frequency as candidates for misspelling, and uses the terms in the top 95% as the
spelling dictionary. Although this can be successful, it can also lead to many incorrectly identified
misspellings. For example, in the Weather-Animals-Sports data set, this approach identifies baseball
as a misspelled version of basketball.
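The frequency heuristic itself is easy to emulate. The following is a minimal sketch that flags the bottom 5% of terms by frequency as misspelling candidates; the input table work.terms and its Freq column are assumptions, so substitute a term table exported from your own flow.
proc univariate data=work.terms noprint;
var Freq;
output out=work.cutoff pctlpts=5 pctlpre=P;   /* creates P5, the 5th percentile */
run;
data work.candidates;
if _n_=1 then set work.cutoff;   /* brings P5 into every observation */
set work.terms;
if Freq <= P5;                   /* bottom 5% by frequency: candidates */
run;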
Although it is true that most document collections of reasonable size can be approximated by Zipf's Law, there are exceptions. The rapidly decaying rank-frequency plot is typical for the English language, but it might be less typical for Chinese languages or languages such as Swahili. Lu, Zhang, and Zhou (2013) examine deviations from Zipf's Law for modern languages.
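You can check the pattern for your own collection. The following is a minimal sketch of a Zipf plot, log frequency against log rank; the input table work.terms and its Freq column are assumptions, so substitute the term table exported by your Text Parsing or Text Filter node. Under Zipf's Law, the points fall near a straight line with slope close to -1.
proc sort data=work.terms out=work.zipf;
by descending Freq;
run;
data work.zipf;
set work.zipf;
Rank=_n_;                 /* 1 = most frequent term */
LogRank=log(Rank);
LogFreq=log(Freq);
run;
proc sgplot data=work.zipf;
scatter x=LogRank y=LogFreq;
xaxis label="log(rank)";
yaxis label="log(frequency)";
run;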
The Text Filter node Results window facilitates investigation of data quality. Deviations from expected
relationships in the Zipf plot and in the Number of Documents by Frequency plot usually indicate
data problems.
Some insight can be obtained by understanding a document collection. For example, guidelines for
documents that might be used in legal proceedings suggest that certain adjectives and adverbs be
restricted to “objective” or “verifiable,” with recommendations to avoid terms that cannot be backed
by evidence. For example, you might be cautioned to avoid the adjective reckless or the adverb
recklessly. This can lead to different frequency distributions for parts of speech. Thus, there is no
overall expected frequency distribution for the Role by Freq table. Domain knowledge can suggest
expected distributions. However, the dominance of nouns, verbs, adjectives, and adverbs is a
universal characteristic of English documents.
The properties of the Text Parsing node also affect the Role by Freq distribution. If you clear all of
the choices from the Ignore Parts of Speech property, you might see prepositions or determiners
begin to dominate. The following plot was obtained by permitting the detection of all parts of speech
for the Medline data, which is described in a later chapter.
Even when all parts of speech are allowed, some might never appear in a document collection.
Three parts of speech are not detected in the Medline data.
The use of term weights enables shorter documents to have the same influence as longer documents in providing understanding of the document collection.
Query Limitations
• You cannot combine operators. For example, +dog ->#term is not supported
because it combines the - operator with the ># operator.
• Query length is limited to 100 characters.
In the Text Filter Viewer, you can select a term in the term table, and if you right-click the term, one
choice is View Concept Links. In the above slide, the analyst is investigating the term price. The
selected term is called the parent term, and the derived concept linked terms are called child terms.
Up to nine different terms will be displayed. The displayed terms are the terms with the strongest
association to the parent term.
The child term walmart is contained in 238 documents, and 39 of these documents contain the
parent term price.
If you right-click the child term walmart and select Expand Links, five new terms are identified.
These five terms are said to have a second-order association with the parent term price. These are
the so-called “friends of friends.” The identification of associated terms through concept linking helps
construct queries or custom topics to enhance information retrieval. For example, if you are looking
for documents that discuss price, adding the word discount to the query can identify more
documents that might be of interest.
Spell Checking
This brief demonstration shows how to use the spell-checking feature of the Text Filter node.
A table of correctly spelled English words is provided as DMTX51.Englishdictionary. You can easily
obtain a spelling dictionary by using a search engine and searching for “spelling dictionary table.”
SAS does not supply a spelling dictionary.
1. Create a diagram named Spell Checking.
2. Drag a File Import node from the sample tab into the diagram.
4. Attach a Text Parsing node to the File Import node. Use default settings.
5. Attach a Text Filter node to the Text Parsing node. Set Check Spelling to Yes, and specify
DMTX51.Englishdictionary as the dictionary. Run the Text Filter node.
6. Open the Spell-Checking Results table. It will be stored in the workspace library for the diagram.
Assume that the library is EMWS3. Then the spelling results table is
EMWS3.TextFilter_spellDS.
The table needs to be edited, but this can be accomplished through Text Parsing properties.
Some of the terms do not need to be in the table. You can use the Delete button to eliminate
them.
The last term, teem, was manually added using the starburst button.
The Text Cluster node divides a document collection into mutually exclusive clusters. By default,
15 terms are displayed that are most strongly associated with each of the clusters. These descriptive
terms help the analyst understand the types of documents that are in a given cluster. The Weather-
Animals-Sports demonstration produced a cluster with descriptive terms such as cold, rain, snow,
and winter. This highlights the fact that this cluster consists mostly of documents about the weather.
(It is possible that a descriptive term displayed for one cluster can also be important for describing
other clusters.)
The interpretive value of the descriptive terms becomes clear as you work through some hands-on
examples with the Text Cluster node.
From Help Contents:
“The Text Cluster node uses a descriptive terms algorithm to describe the contents of both EM
clusters and hierarchical clusters. If you specify to display m descriptive terms for each cluster, then
the top 2*m most frequently occurring terms in each cluster are used to compute the descriptive
terms.”
“For each of the 2*m terms, a binomial probability for each cluster is computed. The probability
of assigning a term to cluster j is prob=F(k|N, p). Here, F is the binomial cumulative distribution
function, k is the number of times that the term appears in cluster j, N is the number of documents
in cluster j, p is equal to (sum-k)/(total-N), sum is the total number of times that the term appears
in all the clusters, and total is the total number of documents. The m descriptive terms are those that
have the highest binomial probabilities.”
“Descriptive terms must have a keep status of Y and must occur at least twice (by default) in a
cluster.”
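A minimal sketch of this calculation in a DATA step follows; the four counts are illustrative values, not output from a real clustering run.
data work.descterm;
k=12;                          /* times the term appears in cluster j */
N=40;                          /* documents in cluster j */
sum=20;                        /* times the term appears in all clusters */
total=200;                     /* total number of documents */
p=(sum-k)/(total-N);           /* rate of the term outside cluster j */
prob=cdf('BINOMIAL',k,p,N);    /* F(k | N, p) */
run;
proc print data=work.descterm;
run;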
• Basically, documents that have similar term usage tend to be put in the same cluster. In this case, Document 2 and Document 3 look somewhat alike.
The use of singular value decomposition (SVD) is called the linear algebra approach to text mining,
information retrieval, web analytics, and so on. As mentioned in a previous section, this algebraic
operation is the foundation of an approach that has many names:
• Latent Semantic Indexing (LSI)
• Latent Semantic Analysis (LSA)
• Vector Space Model (VSM)
There are competing methodologies to LSA/SVD: Latent Dirichlet Allocation (LDA), Correlated Topic
Modeling (CTM), and Non-Negative Matrix Factorization (NNMF). NNMF is available in PROC
IMSTAT, a procedure in the SAS In-Memory Statistics suite of software products. Competing
methodologies exist because there is no universally optimal way to quantify text. LSA is popular
in commercial software and academic publications, but the competing techniques have been shown
to be useful for specific document collections. A demonstration in a later chapter shows how LSA
provides superior results to published findings for NNMF, but no technique can be expected to be
superior for all data sets. In general, the competing methodologies do not tend to give dramatically
different results. Arguments for a particular methodology often include considerations of computing
efficiency, dimensionality reduction, and ease of interpretation.
Weights can be any numeric value, positive or negative. Negative weights imply that the term
supports the negative, or opposite, of the concept. A 0-1 system is the easiest to use.
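As an illustration, the following is a minimal sketch of a 0-1 custom topic table built in a DATA step. The column names used here (_topic_, _term_, _role_, _weight_) are assumptions for illustration; consult the SAS Text Miner reference for the exact schema that your release expects (the role values follow the part-of-speech abbreviations listed in the Technical Details section later in this chapter).
data work.customtopics;
length _topic_ $40 _term_ $32 _role_ $16;
input _topic_ $ _term_ $ _role_ $ _weight_;
datalines;
Weather rain Noun 1
Weather snow Noun 1
Animals lion Noun 1
Animals tiger Noun 1
Sports baseball Noun 1
;
run;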
Active Learning
Requirements
• Target values are “fuzzy” or subject to error.
• Historic data and expert opinion might help discover observations where
the target value is likely to be incorrect.
• A team of domain experts is available to identify likely miscoded target
values. Experts can assign a “degree of belief” value to suspicious cases.
A cutoff can be derived to decide whether suspicious cases should
be recoded.
Relevance
• Changing target values that are likely to have been miscoded can lead
to discovery of better scoring rules.
Active learning can be explained using an actual text mining project for illustration. An insurance
company wants to automatically classify a claim as fraudulent or not. Data are collected, a model
is selected, and the model is used to score training, validation, and test data sets. Some cases with
high scores have a fraud flag set to zero. While investigating these anomalous cases, the researcher
notices that several of the cases have very suspicious attributes. She forwards the cases to trained
insurance fraud investigators and asks them to provide a value between zero and 100 indicating the probability that a claim experienced fraud. After deliberation, a cutoff of 75 is selected, so any reviewed case with a probability score above 75 is recoded as a fraud case. The model is then refit with the recoded data.
When using a methodology like that supported by the Text Rule Builder node, this form of active
learning can lead to the discovery of additional rules.
The scenario makes it clear that trained experts are required to participate in the active learning
process. If active learning is not done properly, it becomes an exercise in boosting model diagnostics
without really improving the model.
Technical Details
The following material is extracted from the Reference Help for SAS Enterprise Miner.
SAS Text Miner can identify the part of speech for each term in a document based on the
context of that term. Terms are identified as one of the following parts of speech:
• Abbr (abbreviation)
• Adj (adjective)
• Adv (adverb)
• Aux (auxiliary or modal)
• Conj (conjunction)
• Det (determiner)
• Interj (interjection)
• Noun (noun)
• Num (number or numeric expression)
• Part (infinitive marker, negative participle, or possessive marker)
• Pref (prefix)
• Prep (preposition)
• Pron (pronoun)
• Prop (proper noun)
• Punct (punctuation)
• Verb (verb)
• VerbAdj (verb adjective)
When you create a stop list table, start list table, synonym table, or custom topic table, the role
variable takes the value of one of the above abbreviations, or it will use a noun group or entity role.
Following is an example of a custom topic table that uses part of speech roles for the variable
_role_. The label for the variable _role_ is Role, which is displayed as the column heading.
SAS Text Miner can identify noun groups, such as clinical trial and data set, in a document
collection. Noun groups are identified based on linguistic relationships that exist within
sentences. Syntactically, these noun groups act as single units. Therefore, you can choose
to parse them as single terms.
• If stemming is on, noun groups are stemmed. For example, the text amount of defects is parsed
as amount of defect.
• Frequently, shorter noun groups are contained within larger noun groups; both the shorter
and larger noun groups appear in parsing results.
An entity is any of several types of information that SAS Text Miner can distinguish from general
text. If you enable SAS Text Miner to identify them, entities are analyzed as a unit, and they are
sometimes normalized. When SAS Text Miner extracts entities that consist of two or more
words, the individual words of the entity are also used in the analysis.
Out of the box, SAS Text Miner identifies the following standard entities:
• ADDRESS (postal address or number and street name)
• COMPANY (company name)
• CURRENCY (currency or currency expression)
• DATE (date, day, month, or year)
• INTERNET (email address or URL)
• LOCATION (city, country, state, geographical place/region, political place/region)
• MEASURE (measurement or measurement expression)
• ORGANIZATION (government, legal, or service agency)
• PERCENT (percentage or percentage expression)
• PERSON (person’s name)
• PHONE (phone number)
• PROP_MISC (proper noun with an ambiguous classification)
• SSN (Social Security number)
• TIME (time or time expression)
• TIME_PERIOD (measure of time expressions)
• TITLE (person’s title or position)
• VEHICLE (motor vehicle including color, year, make, and model)
You can also use SAS Content Categorization with Teragram Contextual Extraction to define
custom entities and import these for use in a Text Parsing node. When you create compiled
custom entity files, ensure that you specify September 14, 2009 as the compatibility date.
(Valid files have the extension .li.) Otherwise, the files cannot be used in SAS Text Miner.
When a document collection is parsed, SAS Text Miner categorizes each term as one
of the following attributes, which gives an indication of the characters that compose that term:
• Alpha, if characters are all letters
• Num, if term characters include a number
• Punct, if the term is a punctuation character
• Mixed, if term characters include a mix of letters, punctuation, and white space
• Entity, if the term is an entity
As you try to emulate SAS Text Miner with SAS code, you quickly realize that there are many
challenges related to stemming, part-of-speech tagging, and identifying multi-word terms. The code
presented below is purely educational to help illustrate some of the issues related to string-based
versus token-based querying. The goal is to engage in information retrieval using a simple SAS
program, and to compare how different SAS functions support querying documents to find specific
words.
The SAS language supports many character functions. The following program illustrates some
of the text mining features of the SAS language. It is stored in the course folder and has the name
SAS_Language_IR.sas. The comments describe the SAS functions that are used for information
retrieval. For example, illustration 1 shows that the INDEX function is a string-based search function.
libname DMTX51 "D:\workshop\winsas\DMTX51";
To illustrate that the INDEX function is string based, you can alter the query slightly by using
the singular form of lion.
title2 "Search on LION using INDEX";
data work.templion;
set DMTX51.WeatherAnimalsSports_train;
if (index(lowcase(TextField),"lion")>0) then output;
run;
proc print data=work.templion;
run;
The results follow.
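(The output listing and the intervening word-based search from SAS_Language_IR.sas are not reproduced here. A sketch of that intervening step, consistent with the surrounding code, follows: searching for lions with INDEXW and its default space delimiter.)
title2 "Search on LIONS using INDEXW";
data work.templions;
set DMTX51.WeatherAnimalsSports_train;
if (indexw(lowcase(TextField),"lions")>0) then output;
run;
proc print data=work.templions;
run;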
Why was the “House cats…” document not retrieved? The INDEXW function accepts an optional
third argument. Without this third argument, the default delimiter to identify words is a space.
title2 "Search on LIONS using INDEXW and Delimiters";
data work.templions;
set DMTX51.WeatherAnimalsSports_train;
if (indexw(lowcase(TextField),"lions",' ,')>0) then output;
run;
proc print data=work.templions;
run;
With the comma delimiter added to the space delimiter, the additional document is identified.
When you specify the singular lion query, documents with the word lions are not identified using
the INDEXW function.
title2 "Search on LION using INDEXW and Delimiters";
data work.templion;
set DMTX51.WeatherAnimalsSports_train;
if (indexw(lowcase(TextField),"lion",' ,')>0) then output;
run;
proc print data=work.templion;
run;
The following table is produced:
You can parse a document collection and calculate term statistics using the SAS language.
The following code produces a simple text relevance score to accompany a query, similar
to the results of the Text Filter Viewer:
title2 "COUNTW/SCAN Parsing for LIONS";
data work.templions;
set DMTX51.WeatherAnimalsSports_train;
attrib Word length=$32;
keep Target_Subject TextField NumWords WordFreq;
NumWords=countw(TextField);
FoundFlag=0;
WordFreq=0;
do WordNum=1 to NumWords;
Word=lowcase(scan(TextField,WordNum,
' ,.;:@*&|\/+=<>?!$%^()[]{}'));
if (Word="lions") then do;
WordFreq+1;
FoundFlag=1;
end;
end;
if (FoundFlag=1) then output;
run;
/* Get the normalizing constant: the largest term frequency found */
proc sql noprint;
select max(WordFreq) into :MaxFreq
from work.templions;
quit;
data work.templions;
set work.templions;
TermRelevance=WordFreq/(&MaxFreq);
run;
proc print data=work.templions;
run;
The SCAN function steps through a text string and identifies tokens based on the separators that are
provided. A later chapter briefly discusses elements of document parsing. PROC SQL and the MAX
function are used to get the normalizing constant for computing relevance. The term relevance score
is just the normalized term frequency for the selected term. The results are given below.
The program SAS_Language_IR.sas contains additional code illustrating queries for lion and for lion or
lions.
Practice
1.4 Chapter Summary
1.5 Solutions
Solutions to Practices
1. Changing Text Miner Properties to See How This Affects Results
If you use SAS Enterprise Miner 15.1 and you ran the exercise with the specified property settings,
the following Text Cluster results are obtained:
With the request for four dimensions and four clusters, it appears that clusters 1 and 3 are two
separate animal clusters.
You should have discovered that copying and pasting the Metadata node and the StatExplore node
requires some changes, because the Text Cluster results now come from the new TextCluster2 node
added by the copy-and-paste operation. Consequently, you must set the role of the new
TextCluster2_cluster_ variable to Input.
Similarly, with four topics, there now are two (Topic 2 and Topic 4) that involve animals.
The decision tree resulting from these settings is shown below and indicates one misclassification
on the training data.
Did you try some other property settings that give good results?
Lesson 2 Overview of Text Analytics
2.1 Using the Text Import Node
Objectives
• Describe how the Text Import node is used for processing document
collections and creating a single SAS data set for text mining.
• Show how the SAS data set created from Text Import can then be merged
with another SAS data set containing target information and other
non-text variables.
• Show how to compare two models, one using only conventional input
variables and another using the conventional inputs and some text mining
variables.
Often the most challenging part of the data mining process is obtaining and preprocessing the data.
SAS provides a rich set of tools for data preparation.
Perl regular expressions enable you to use terse scripts for complex data operations on text files.
For example, to preserve confidentiality, you might want to convert all postal codes to a generic
phrase.
%macro PrivateUSAPostalCode(TextVar);
   &TextVar = prxchange('s/\d{5}/_PRIVATE_USA_POSTAL_CODE_/', -1, &TextVar);
%mend PrivateUSAPostalCode;
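For example, the macro might be invoked inside a DATA step. This is a sketch only: the data set work.notes and the variable Note are hypothetical names, not part of the course data.

data work.masked;
   set work.notes;                /* assumed input with character variable Note */
   %PrivateUSAPostalCode(Note)    /* expands to the PRXCHANGE assignment above  */
run;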
SAS programs are used sparingly in this course. Preference is given to the use of the SAS Code
node for running SAS programs.
Because data preparation can be the most arduous task in text mining, SAS Text Miner includes
the Text Import node that facilitates reading all popular commercial document formats.
Previous versions of SAS Text Miner included the %TMFILTER SAS macro for reading document
collections. The %TMFILTER macro is available in the current release of SAS Text Miner. The Text
Import node provides the functionality of the %TMFILTER macro without requiring the user to create
and execute a SAS program.
A SAS macro is a script that can be compiled and executed to perform complex tasks.
At the simplest level, a SAS macro is a script that generates SAS code to be executed
by the SAS supervisor. These scripts are often stored in SAS catalogs that can be accessed
and viewed by users. Proprietary scripts are stored in compiled form and cannot be read by users.
Some of the SAS macros included with SAS Enterprise Miner and SAS Text Miner are compiled.
Sample SAS macros are stored in the catalog SASHELP.EMUTIL. For example, the SAS source file
SASHELP.EMUTIL.EXTDEMO.SOURCE provides examples of SAS Enterprise Miner functionality
that can be exploited using a SAS Code node. The catalog SASHELP.EMTEXT contains macros
related to text mining. Your ability to access and view macros in SASHELP catalogs depends
on the version of SAS that you are running and unique factors related to how your organization has
chosen to install SAS software. For example, your Information Technology (IT) or Management
Information Systems (MIS) Department might have restricted Read access to SAS installation files.
For more information about SAS programming and Enterprise Miner, see SAS® Enterprise Miner™
15.1 Extension Nodes: Developer’s Guide, which is the current programming guide for the SAS
version used for this course. The following link was active when this course was developed.
http://supportprod.unx.sas.com/documentation/cdl/en/emxndg/67980/PDF/default/emxndg.pdf
If the link is inactive, search for Enterprise Miner Extension Node Developer’s Guide at
http://support.sas.com.
You can use a SAS Code node to modify the SAS table produced by the Text Import node.
For example, you can choose to drop variables such as LANGUAGE, TRUNCATED, OMITTED,
and EXTENSION, because these variables are rarely used beyond the data preparation stage. Many
document collections use a naming convention so that the path given in the URI field or the filename
given in the NAME field can be used to derive ID or index variables.
data &EM_EXPORT_TRAIN;
   attrib ClaimNo length=$12 label="Claim Number"
          AdjusterNotes length=$256 label="Adjuster Notes";
   set &EM_IMPORT_DATA;
   AdjusterNotes=Text;           /* copy the converted document text       */
   ClaimNo=substr(Name,1,12);    /* derive the claim ID from the file name */
   keep ClaimNo AdjusterNotes Size;
run;
In the above example, the SAS Code node modifies the data produced by the Text Import node
so that it can be merged with claims data indexed by the variable ClaimNo. This code is used
in the demonstration below.
This demonstration starts with using the Text Import node to read in insurance adjuster notes for
an insurance subrogation modeling example. The Text Import node is set up and run differently from
the File Import node used in Chapter 1.
Note: The Text Import node uses the SAS Document Conversion Server. This server must
be running as a service under Microsoft Windows. Your instructor might provide additional
information about starting the SAS Document Conversion Server.
After the document collection is processed by the Text Import node, the exported data must be
merged with a claims features data set that includes the target variable and several input features.
You use a SAS Code node to pre-process the data for merging, and then you use a Merge node
to merge the two data sets. After the two data sets are merged into a single raw data set, you follow
that up with a typical text mining flow. This shows many of the steps that you might follow when you
work with your own data.
Note: Some background and definitions are helpful here. The term subrogation refers to a legal
right that an insurance company has to sue a third party in order to recover any
compensation payouts. For example, if you have a car accident that is caused by some other
person who was at fault for hitting you, your insurance company compensates you directly.
However, it can also try to recover money from the insurance company of the person who hit
you. This is called subrogation. Typically, it costs time and money for your company to use
a lawyer to initiate subrogation proceedings, so the company does not pursue this unless
it thinks that there is a good chance of winning a lawsuit or reaching a settlement. Most
subrogation cases are settled amicably when evidence clearly supports a subrogation claim.
In the demonstration that follows, you work with a data set in which the target variable
(SubroFlag) is defined by whether insurance claims were successfully subrogated.
2. Drag a Text Import node into the diagram. For the Import File Directory property, navigate to
the pathname for the insurance adjuster notes (D:\workshop\winsas\dmtx51\InsClaimsNotes).
In general, you also need to create a destination directory for the output of the Text Import node.
In this course, use the D:\workshop\winsas\dmtx51\InsClaimsNotesDestination folder.
(If it does not exist, create it in Windows.) Navigate to this destination directory so that it is
selected in the properties panel. Modify the Text Size property from the default 100 to the maximum
32000. An example of the completed properties panel appears below.
4. Open one of these documents to see what it contains. For example, the file 001924817308.txt
contains the sentence shown below.
The pie charts for Omitted/Truncated Documents and Document Languages are mostly solid
blue. There are few omitted or truncated files, and almost all documents are in English. The pie
chart for Document Types is not blue because there are multiple document types. However,
the frequency of the files that are not TXT files is small, so the three documents show up only
as a thin line in the pie chart.
Of particular importance are omitted and truncated documents, which are flagged by the binary
variables OMITTED and TRUNCATED created by the Text Import node. A file is omitted
if it cannot be converted. Following are reasons for a file being omitted:
• The file format is not supported by the Text Import node, or the file format has been specifically
excluded by the Extensions property.
• The file has security features that prevent it from being read or processed.
• The file is corrupted.
A file can be truncated if it exceeds the value given by the Text Size property, which is bounded
by the size of a SAS character variable. A SAS character variable cannot exceed 32,767
characters.
Note: It is important to be aware that truncation affects only how much of the document is
visible in certain windows. The full document is still analyzed downstream by all the
Enterprise Miner (including Text Miner) nodes. Whether a file is truncated for the analysis
depends on whether you properly use the text location role.
6. Click the ellipsis in the Exported Data line in the properties panel for the Text Import node.
Select Explore and look at the Sample Statistics window. All the variables that were created
by the Text Import node are shown.
Name – the name of the input file containing the particular adjuster note for that observation.
Filtered – the pathname for the converted file. This variable is given the role of text location.
URI – the pathname for the original file before conversion. When the Text Import node is used
as a web crawler, the pathname defines the location of the file extracted from the Internet and
copied to the Import File directory. The copied file is usually an HTML file, and if so, the role
of web address that is assigned signifies that HTML properties, such as font styles, should be
used by the Filter Viewer when displaying the document.
Here is a quick snapshot of the 19 documents and the first three variables. The Name variable
is important for match-merging with the claim features data set.
7. Use the Plot Wizard to construct a bar chart for the file extension.
8. Use the Plot Wizard to produce a scatter plot of the size variables. The unconverted files vary
widely in size because formats such as docx, PDF, and xlsx include formatting information.
The PDF file is 83,694 bytes before conversion, but only 31 bytes after being converted
to a TXT file.
10. Because you selected the square corresponding to the file, you can use the dynamic linking
feature of Enterprise Miner to find the actual file. This could be time-consuming for larger
document collections, so you might want to change the tip text to ClaimNo. Right-click in the plot
and select Data Options. Change the role of Name to Tip.
The claim is highlighted in a lighter color, making it easy to find. However, the Enterprise Miner
explorer is not set up to read documents, so you must wait until you can use the Filter Viewer
in the Text Filter node to read the entire document.
Examining document size can indicate problems in data processing. It can also provide insight
into choice of term weight as discussed in a later chapter.
14. Bring in a SAS Code node and attach it to the Text Import node.
15. Select the Code Editor ellipsis in the properties panel for the SAS Code node. In the Training
Code window, right-click and select Open. Then navigate to the SAS program
D:\workshop\winsas\dmtx51\sassrc\SCN_SubrogationText.sas and bring it in.
(If you have any problem finding the program, you can enter the seven lines of code manually.)
This code uses &EM_IMPORT_DATA, a macro variable that refers to the SAS data set created
by the Text Import node. The variable AdjusterNotes is defined from the Text field. (This renaming
is not strictly necessary because you could keep the name Text.) The tricky part is to define the variable
ClaimNo, which is obtained using the SUBSTR function by extracting the first 12 characters from
the variable Name. This is why it is essential to give each text file a name that is the same
as the claim number. Run the SAS Code node.
Note: The slide show provided source code for basic processing of exported data coming from
the Text Import node. One of the documents in the subrogation collection is stored in xlsx
format, and the converter adds the sheet number to the beginning of the document. The
code shown in the slide show has been modified to remove the unnecessary xlsx sheet
information. This could have been accomplished in many different ways. If you are going
to process a document collection having many different Microsoft Excel formats, then
you might want to use the Extension variable to decide how to modify the converted
document. If you are not a SAS programmer, you might need help with accomplishing
document post-processing.
16. Bring the SAS data set SUBROGATION_TARGET into the diagram. The data set was created
and is shown in the project panel under Data Sources. Look at the variables for this data set.
SubroFlag is the binary target variable (1=successful subrogation, 0=unsuccessful). ClaimNo
is the variable to use for matching against the adjuster notes data that were created with the Text
Import node. The remaining variables, together with the variables that the text mining nodes create,
are potential inputs for predicting the target.
17. Bring in the Merge node and connect it to the two data sets. They are matched by ClaimNo.
This node is found on the Sample tab at the top of the window.
18. The data set SUBROGATION_TARGET is already ordered by ClaimNo. It can be matched one-
to-one with the SAS data set created by the Text Import node and the program in the SAS Code
node. The default setting is to perform a one-to-one merge. However, you can also perform
a match merge using the BY variable ClaimNo. Set the Merging property to Match in the properties
panel, and select the Variables property to select the appropriate BY variable. Change Merge
Role for ClaimNo to By.
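Behind the scenes, a match merge is equivalent to the following DATA step logic. This is a sketch only: the Merge node generates its own code, and the table names here are illustrative.

proc sort data=work.adjuster_notes; by ClaimNo; run;
proc sort data=work.subro_target;   by ClaimNo; run;

data work.subro;
   merge work.adjuster_notes work.subro_target;
   by ClaimNo;                       /* match rows on the claim number */
run;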
19. Select the By Ordering property to verify that ClaimNo has been selected. The By Ordering
property is necessary when you have two or more BY variables.
Note: While match merging is less efficient than one-to-one merging, it avoids the problem
of mismatches when one of the source data sets is sorted in place by something other
than ClaimNo.
21. Connect a Save Data node to the Merge node. Specify DMTX51 for the SAS Library Name
property, and specify Subro for the Filename Prefix property. The Save Data node can be used
to create a permanent copy of a SAS data set created by SAS Enterprise Miner.
22. Run the Save Data node. A copy of the training data set exported from the Merge node will
be saved as DMTX51.Subro_train. If you archive or delete the Text Import diagram, you can still
access the subrogation data without rerunning the Text Import node and remerging the data.
23. Connect a Text Parsing node and a Text Filter node to the output of the Merge node. Run both
nodes with the default settings. For the Text Filter node, the default weightings on the properties
panel are Log for Frequency Weighting and Mutual Information for Term Weight if a target
variable is present. (There is one in this case.) It is usually a good idea to set these choices
explicitly rather than keep Default showing. Run the Text Filter node.
24. Open the Results window for the Text Filter node.
The three primary diagnostic plots do not uncover anything unusual about the subrogation data.
The Zipf plot shows the rapid decay in term frequency that is consistent with the power rule
formulation provided in a later chapter. The Number of Documents by Frequency plot is approximately linear. There
are more nouns, verbs, and adjectives than other parts of speech. The number of adverbs is low,
which is consistent with how objective reports are constructed. Adverbs tend to reflect opinions
rather than fact. For example, “The claimant recklessly operated the fork lift which caused the
collision with the stack of pallets” versus “The claimant struck a stack of pallets while operating
the fork lift.” Close the Results window.
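In symbols (a reminder only; the power rule is formalized in a later chapter), Zipf's law says that the frequency of the rth most frequent term decays roughly as a power of its rank:

f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1,

so plotting log frequency against log rank gives an approximately straight line, which is what the Zipf plot displays.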
25. Open the Filter Viewer. For an actual project, you can explore the quality of the data with
the Filter Viewer. Two explorations are illustrated, but many more are possible.
26. The first exploration compares the adjuster notes to other attributes of a claim. Sort the
document window by the variable Body (label=Body Part). Scroll down to head injuries.
Note: You can use the Edit ⇒ Find feature to quickly scroll down to the desired point. If you
select head, you will encounter a few documents that mention the head with respect
to other injuries (such as eye injuries), but this strategy might be faster than simply using
a scroll bar, especially for very large document collections.
One of the documents seems to be in disagreement with the assigned body part.
This actually appears to be a multiple body part injury. Although this technique of selectively
scanning the document collection can uncover data problems, it is very labor-intensive.
The second exploration is more systematic. Previous exploration reveals that back, finger, arm,
hand, and head injuries seem to dominate. The following plot was obtained from a variable
property window using the Explore feature.
27. The first exploration revealed a problem with a single claim. Because there are 233 cases
classified as head injuries, you can exploit the Subset Documents feature of the Text Filter node.
Drag a new Text Filter node into the diagram. Select the property Subset Documents. For
Column name, select Body (Body Part).
28. Under Value, click the ellipsis icon, and select Head.
29. Click OK, and observe how the query populates the Subset Documents property.
31. Open the Filter Viewer. Verify that 233 documents were retrieved. Move the mouse pointer over
a column heading in the Documents window.
You see that 233 documents are contained in the documents table.
32. In the Terms table, sort by # DOCS.
Of the 233 documents retrieved, 73 have the noun head, 34 have the verb head, and 22 have
the adjective head. It might be surprising that most adjuster notes do not contain the word head
for head injuries.
If you sort by TERM and scroll down to the first entry for head, you get the following view.
33. Because headache was a possible source of body part misclassification revealed above, extract
the eight headache cases. Select the headache row, right-click, and then select Add Term to
Search Expression. Select Apply.
Other explorations are warranted. For example, why would head used as a verb suggest a head
injury? Unfortunately, you cannot include the part-of-speech role in the query.
The subrogation data are revisited later to illustrate text mining and predictive modeling.
2.2 A Forensic Linguistics Application
Objectives
• Define stylometry and explain how it relates to text analytics.
• Illustrate how text mining can be used to support forensic linguistics
using stylometry techniques.
SAS users come from many different business and government organizations. Students in this class
are sometimes involved with various security and intelligence problems. The forensic linguistics
demonstration is intended to show how SAS Text Miner can be used for these types of problems.
Consider some background information for this example. Between 1978 and 1995, a person called
the “Unabomber” (now known to be Theodore Kaczynski) mailed bombs to selected individuals
associated with technology research. His bombs killed three people and injured 23. In 1995, he sent
a long article entitled Industrial Society and Its Future to the FBI and demanded that it be published
in a major newspaper or he would strike again. This long article was eventually published in the
New York Times and The Washington Post. The style and content of the writing was recognized
by Theodore Kaczynski’s brother, and this ultimately led to Kaczynski’s arrest.
Note: In 1995, text analytics software was a rarity. However, several universities had faculty
members who were actively involved in text mining research. The FBI actually received help
from several researchers, but due to the lack of good control data, a text mining solution was
not forthcoming. In particular, scores generated for persons of interest were not sufficiently
high for the FBI to have good cause for further investigation. (Personal communication,
M2006 Data Mining conference.)
This demonstration uses 232 paragraphs extracted from Kaczynski’s long article, and 1726
paragraphs extracted from the writing of five other authors. The latter are used as comparison
documents. There is a total of 1958 documents (paragraphs). You run both the Text Cluster and
Text Topic nodes on this training data and then create a decision tree model in order to attempt
to accurately classify the documents by their authors. Classification such as this is really a form
of prediction modeling. In addition, 11 documents are used as unknowns. You use the two models
to classify these 11 unknown paragraphs with regard to their likely authors. (Spoiler alert: In this
setup, all 11 of the unknown cases were written by Kaczynski.)
Stylometry
Stylometry is defined as the use of linguistic style to characterize written
language.
Applications:
• attributing authorship of anonymous or disputed literary works
• detecting plagiarism
• forensic linguistics (for example, identifying Theodore Kaczynski
as the Unabomber based on his writing style)
Forensic Linguistics
Special case: stylometry applied to forensics
Forensic linguistics typically uses predictive modeling to score a document of unknown, but
suspected, authorship. The score represents an estimate of the probability that the document was
written by a suspect. The value of text mining applied to forensic linguistics is that suspects can
be identified for investigation. The text mining results are rarely, if ever, used as evidence in
prosecuting a suspect, although testimony might include a discussion of techniques in describing
how the suspect was identified.
Forensic Linguistics
TK is a suspect.
Corpus: 1,958 paragraphs from six authors taken from written works and interviews
The data for this study are real, but the situation is hypothetical. Separation of documents was
enhanced for educational purposes. In actual forensic linguistic studies, there are rarely such pure
results as those achieved here.
The six authors in the training data are coded as AM, CD, DM, DO, FE, and TK. The initials were
changed for the first five authors. TK is Theodore Kaczynski, the so-called Unabomber. The TK
documents are paragraphs from the manifesto written by Kaczynski and published in The New York
Times and Washington Post. Obviously, when the manifesto was published, the author was not
known to be TK. The 11 unknown documents are excerpts from interviews with Kaczynski after
he was convicted of murder. Thus, although based on real data, this example is artificial.
Forensic Linguistics
Score Data Set: Eleven documents from the same unknown author
Problem: Build classification models on the known documents with six
different authors. Apply these models to the 11 unknown documents
to determine the likely author of each one.
This demonstration illustrates how to use text mining nodes and other Enterprise Miner nodes
to build classification (prediction) models in a forensic setting. You analyze writing samples from six
authors. For five authors, the writing samples in the training data have to do with technical material
about statistics and SAS courses. For one of the authors (TK), the writing samples are paragraphs
from his published manifesto.
7. Attach a Text Cluster node to the Text Filter node and run it with the default settings. Open
the results. Cluster 5 has descriptive terms, such as power and people, that are clearly
associated with the Unabomber’s long published manifesto. Close the window.
8. Attach a default Text Topic node to the Text Cluster node and run it. Open the Interactive Topic
Viewer and look for topics that are likely to be from the Unabomber’s writing. For example, select
the third topic shown below. Look at the Terms and Documents windows associated with this
topic.
11. Open the Forensics_score data set and designate it as a Score data set. This data set contains
the 11 paragraphs that were drawn from TK’s interview after he was captured. You want to see how
accurately the tree model classifies these paragraphs. To do this, open a Score node and attach it to
both the Forensics_score data set and the Decision Tree node. Your complete diagram should look
as shown below.
12. Run the Score node and open the exported data from the properties panel. Select the Test data
and click Explore. Use the Plot Wizard to construct a bar chart. Use the following properties:
The dominance of a single color for each author illustrates how accurate the decision tree model
is for scoring the test data. You can verify this by positioning the cursor over the dominant color
for any of the authors.
14. Scroll to the far right in the browsing window and look at the last two columns.
The last column gives the model’s predicted author category (Prediction for Author) and the
second-to-last column gives the model probability for this category (Probability of Classification).
All 11 paragraph extracts are correctly classified as written by TK. (Remember, the data for this
demonstration was enhanced to ensure such a clear-cut result!)
2.3 Information Retrieval
Objectives
• Describe information retrieval and explain how it is done in the interactive
Text Filter Viewer.
• Use the Medline medical abstracts data to illustrate an application
of information retrieval.
Information Retrieval
“Information retrieval (IR) is finding material (usually documents)
of an unstructured nature (usually text) that satisfies an information
need from within large collections (usually stored on computers).”
– Manning, Raghavan, and Schütze (2008)
One of the more publicized success stories in information retrieval concerns the discovery by
Don Swanson (1988, 1991) that magnesium deficiency could be a source of migraine headaches.
Swanson queried medical reports for articles about migraines and nutrition.
For a given corpus of documents, information retrieval (IR) groups documents based on the
similarity of contents. An IR query can be a Boolean query, a query based on latent semantic
indexing, or a query based on some other method of quantifying document content. The Text Filter
node uses a weighted cosine similarity measure to compute the similarity between a document
and the query. Documents that are most similar to the query are returned.
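To make the idea concrete, here is a minimal DATA step sketch of cosine similarity between a query vector and each document's term-weight vector. This is not the Text Filter node's internal code: the table work.doc_weights, the three weighted terms w1-w3, and the query vector are all hypothetical.

data work.similarity;
   set work.doc_weights;              /* assumed: one row per document     */
   array w{3} w1-w3;                  /* document term weights             */
   array q{3} _temporary_ (1 0 1);    /* query vector: terms 1 and 3       */
   dot=0; dlen=0; qlen=0;
   do i=1 to 3;
      dot  = dot  + w{i}*q{i};        /* accumulate the dot product        */
      dlen = dlen + w{i}**2;          /* squared length of document vector */
      qlen = qlen + q{i}**2;          /* squared length of query vector    */
   end;
   if dlen>0 then CosineSim=dot/(sqrt(dlen)*sqrt(qlen));
   drop i dot dlen qlen;
run;

Documents would then be ranked by descending CosineSim, with the most similar documents returned first.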
The Interactive Filter Viewer does not recognize ># operators mixed with the + operator.
Query:
+diabetes
You often want to explore a document collection by searching on various terms of interest. This does
not require a target variable and is efficiently done with the Text Filter node. As always, you first run
the Text Parsing node. This demonstration illustrates how to do this using medical information from
Medline data.
The MEDLINES data source contains a sample of 4,000 abstracts from medical research papers
that are stored in the MEDLINE data repository.
1. Create a new diagram and name it Medline Information Retrieval. Drag the MEDLINES data
source into the diagram. Look at the variables.
There is more than one variable with the role of Text. In cases like this, the Text variable with
the longest length is the one that is selected for analysis by the Text Parsing node. If two or more
Text variables have the same length, the one appearing first in alphabetical order is selected.
In this example, ABSTRACT (2730 bytes in length) is the longest of the Text variables and
is the one that is analyzed.
2. Attach a default Text Parsing node to the Input Data Source node. Notice that the default Text
Parsing node populates the properties panel with certain tables. For example, there is a default
Synonyms table named SASHELP.ENGSYNMS. (This actually contains only one row (one
synonym) and is present only as a template). There is also a default stop list named
SASHELP.ENGSTOP. (The use of such tables and others is discussed in a later chapter.)
3. Attach a Text Filter node to the Text Parsing node. The default frequency weighting is Log.
When there is no target variable, the default term weight is Entropy. It is a good idea to make
this explicit, as shown.
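For reference, one standard formulation of the entropy weight for term i (the details appear in a later chapter) is

w_i = 1 + \sum_{j=1}^{n} \frac{p_{ij}\,\log_2(p_{ij})}{\log_2(n)}, \qquad p_{ij} = \frac{f_{ij}}{f_i},

where f_ij is the frequency of term i in document j, f_i is the total frequency of term i, and n is the number of documents. Terms concentrated in a few documents receive weights near 1; terms spread evenly across the collection receive weights near 0.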
5. Select Filter Viewer from the properties panel. This accesses the Interactive Filter Viewer where
searching on terms in the documents is performed.
6. In the Terms window, right-click any term in the table. Select Find.
The table jumps to the portion of the table that contains the term glucose.
8. Expand to see the stemmed versions of glucose. The term occurred 263 times in its singular
form and one time as a plural.
9. Right-click on the first row of glucose and select Add Term to Search Expression.
The term glucose is added to the Search window. The preceding symbols (>#) indicate that
all stemmed versions (or synonyms if any were defined) of the term are searched for.
10. Click Apply. The following results appear in the Documents window:
The abstract is shown on the left. Stretch the column labeled TEXTFILTER_SNIPPET so that
you can see the term glucose in every row. This indicates the part of the abstract where glucose
appears. (This is the first occurrence if there are multiple occurrences.)
11. Place your mouse pointer above the TEXTFILTER_SNIPPET label. You see the following
message: “Left-click on column header to sort 93 rows of the table.” This indicates that 93
documents were selected because glucose (or glucoses, or both) is found at least once in each
document.
12. The TEXTFILTER_RELEVANCE column returns a measure of how strongly each document
is associated with the search term. This is a relative measure. The most relevant document
is given the highest value of 1. The calculation of this metric considers factors such as the
number of times a term (or its stemmed versions and synonyms) appears in a document. To get
an idea of this, click twice on the column heading for TEXTFILTER_RELEVANCE until you see
the most relevant document in the first row (the one with TEXTFILTER_RELEVANCE=1.0).
Then select that row.
13. Select Edit ⇒ Toggle Show Full Text to see the complete document with the highest relevance
score.
Reading through the full document, it is obvious that glucose is used many times. This explains
why this document has the highest relevance measure for a query based on this term. Select
Edit ⇒ Toggle Show Full Text to go back to the original way of viewing the documents.
14. Ninety-three documents were retrieved by the query. It is also useful to be able to retrieve
documents that contain one term, but do not contain another term. For example, in order to take
these 93 documents and eliminate any that contain the term diabetes, in the Search window,
enter -diabetes. (That is, precede the term with a minus sign as shown below.) Click Apply.
Practice
b. Connect the default Text Parsing and Text Filter nodes to this data set.
c. Identify the name of an actor or actress of interest to you. Using the Interactive Filter Viewer
of the Text Filter node, find all of the movies in the data set that have a synopsis that
mentions the name that you selected.
Frequency filtering is a methodology to create or add to a stop list. You can run the Text Parsing
node with the default stop list and then use frequency filtering to add terms to this list. Frequency
filtering specifies a cutoff frequency. Terms with a frequency below the cutoff are added to the
stop list. You can also specify a cutoff frequency at the high end so that terms with a frequency
above the cutoff are added to the stop list. For creating a start list, keep terms with frequencies
between the high and low cutoff values. The data set DMTX51.SASCOURSESTART contains
a start list that was obtained using domain knowledge and frequency filtering.
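A minimal sketch of the cutoff idea in SAS code, assuming a hypothetical table work.term_freqs with one row per term (columns Term and Freq) and illustrative cutoff values:

%let low=3;        /* low-frequency cutoff  */
%let high=5000;    /* high-frequency cutoff */

data work.extra_stop;
   set work.term_freqs;
   where Freq < &low or Freq > &high;   /* candidates to add to the stop list */
   keep Term;
run;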
a. Create a diagram named SAS Course Outlines. If you need to, create a data source for
the DMTX51.SASCOURSES data set. (The metadata is presented above.) Drag this data
source into the diagram.
b. Attach a Text Parsing node to the Input Data Source node. Change the Synonyms
property so that there is no synonyms table and add DMTX51.SASCOURSESTART
as a start list.
c. Attach a Text Filter node to the Text Parsing node. Change frequency weighting from
Default to Log. Change term weighting from Default to Inverse Document Frequency.
Log is the default frequency weight, but Entropy is the default term weight. Inverse
Document Frequency is recommended for documents larger than a paragraph. Run
the Text Filter node.
d. Open the Filter Viewer (also called the Interactive Filter Viewer). Determine how many
documents contain the term neural network by doing a search on this term. How many
documents did your search return? Why is this number not 22?
e. Select the document corresponding to the course with the code BDMT61. Select Edit ⇒
Toggle Show Full Text. You can read the course outline for BDMT61.
f. Select Clear and then Apply to return all of the documents in the collection. Navigate back to the
neural network row in the Terms table. Right-click the neural network TERM cell and select
View Concept Links. The concept link plot appears. What are some of the terms strongly
associated with neural network?
g. Close the Filter Viewer. Attach a Text Topic node to the Text Filter node. For User Topics
in the properties panel, select the data set DMTX51.SASTOPICS. Keep all of the other
default settings. Run the Text Topic node.
h. Access the Results window. Which topic contains the most documents?
i. Close the Results window for the Text Topic node. Look at the exported data and determine
what variables were created by this node.
j. Open the Topic Viewer from the properties panel and explore the results.
Note: A custom topic is similar to a predefined query. The topic weight shown in the
documents window determines whether the topic is present. (That is, the query
is satisfied.) If the topic weight exceeds the document cutoff, then the document
is classified as having the topic.
k. Close the Topic Viewer. Attach a Text Cluster node to the Text Filter node. Most users
attach the Text Topic node and Text Cluster node directly to each other, but they work
independently. Neither requires any results from the other.
l. Use the default setting and run the Text Cluster node. Open the Results window. How many
clusters were created? Can you interpret some of the clusters from the displayed descriptive
terms? How many SVD variables were created?
2.4 Text Categorization
Objectives
• Describe text categorization and explain how it can be accomplished
using the Text Rule Builder node.
• Use the Aviation Safety Reporting System (ASRS) to illustrate text
categorization.
Text Categorization
• Text categorization can be supervised or unsupervised.
• Supervised text categorization requires a nominal target variable.
• A training data set contains documents and one or more categorical target
variables, also referred to as labels.
• Labels are usually assigned by human judges, but they can be automated
labels assigned using a computer scoring method.
• If a human judge is used, one or more judges read and assess a document
to assign a label.
• The goal of text mining is to derive score code to accurately reproduce the
assigned label, and be able to score new documents that have not been
labeled.
Text Categorization
• Many text categorization problems are solved as complete self-contained
text analytics projects.
• The text provides all of the information used for scoring.
• Scoring documents is the primary purpose of the project.
• Some predictive modeling projects benefit from text analytics.
• There are several sources of information, including text.
• Scoring is in a more general area, such as fraud or direct marketing.
The predicted text categories provide inputs for the general area of interest.
This section focuses on text categorization as a self-contained text analytics project. A later chapter
addresses general predictive modeling with text analytics inputs.
This demonstration illustrates how to categorize documents with pre-assigned labels using SAS Text
Miner. In most text categorization problems, labels are assigned by human judges, so the labels are
often subject to error due to the usual problems of fatigue, environment, and so on. The labels are
classified as target variables.
The Aviation Safety Reporting System (ASRS) data set can be accessed from the following link:
http://asrs.arc.nasa.gov/
From the website:
“ASRS captures confidential reports, analyzes the resulting aviation safety data, and disseminates
vital information to the aviation community.
“More than 850,000 reports have been submitted (through October 2009) and no reporter’s identity
has ever been breached by the ASRS. ASRS de-identifies reports before entering them into the
incident database. All personal and organizational names are removed. Dates, times, and related
information, which could be used to infer an identity, are either generalized or eliminated.”
As with other data sets used in this course, data sets derived from ASRS have been modified.
The original data for this demonstration was extracted from the ASRS, pre-processed, and provided
to competitors in a text mining competition sponsored by SIAM and the NASA Ames Research
Center. The competition results were presented at the Seventh SIAM International Conference
on Data Mining held in 2007 in Minneapolis, Minnesota. Participants were prohibited from using the
R language, SAS software, and most commercial software. A link that provides access to the original
data follows.
https://c3.nasa.gov/dashlink/resources/138/
A single report in the ASRS database can be a composite derivation of two or more reports filed for
the same incident. For example, one runway incursion incident can result in three reports: one from
the pilot, one from the copilot, and one from an air traffic controller. An incident involving two or more
aircraft can have reports filed from pilots of all aircraft involved, as well as from air traffic controllers.
In both examples, there will be only one ASRS report, but that report will be prepared by NASA
professionals based on all reports submitted.
Reports can be submitted by aviation professionals, such as pilots, flight attendants, and mechanics.
Reports can also be submitted by non-professionals, such as private pilots.
A report in the ASRS database has many fields, with one field representing a primary narrative
describing the incident. This primary narrative is stored in the Text variable. All of the other fields
have been omitted to simplify the text mining component of the analysis. In practice, an automated
labeling system would attempt to use all fields.
If you examine an individual report, you will see rather unusual terms (for example,
instrumentlandingsystem as one word, rather than instrument landing system). The ASRS was
introduced in 1976, and the technology of that period was limited to the type of searches similar
to find and search features of the Text Filter Viewer. You can have pattern searches, looking for
specific patterns of characters regardless of the identification of terms, and you can have term
searches that require an exact match to a parsed term (token). To facilitate matching known
systems, factors, or events, NASA constructed a dictionary of keywords to facilitate rapid search
and retrieval. Reports were edited with the keyword dictionary in mind. Thus, instrument landing
system and ILS were replaced with instrumentlandingsystem. With modern tools like Latent
Semantic Indexing, the use of the keyword dictionary has less value, and the labor involved
in editing reports to match the keyword dictionary could be difficult to justify.
NASA manually assigns to each report 1 or more of 54 anomalies, 1 or more of 32 results, 1 or more
of 16 contributing factors, and 1 or more of 17 primary problems. For example, the report might
describe an event that was a “runway ground incursion” anomaly, with a “took evasive action” result,
that was a “human factor” contributing factor, and a “human factor” primary problem. These fields are
not available in the contest data. Instead, the contest data has 22 labels, with a value of 1 “if
document i has label j.” Otherwise, the label has a value of -1. Labels correspond to the topics
identified by NASA to aid in the analysis of the reports. The labels are not defined in the competition.
For the course data, the 22 labels are named Target01 through Target22, and the original coding
of (-1,1) has been changed to (0,1), with a code of 1 indicating the presence of the label in the
document. A document can be associated with one or more labels.
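A minimal DATA step sketch of that recoding (the input table name work.asrs_raw is hypothetical):

data work.asrs_recoded;
   set work.asrs_raw;             /* assumed table with (-1,1) labels */
   array t{22} Target01-Target22;
   do i=1 to 22;
      t{i}=(t{i}=1);              /* -1 becomes 0; 1 stays 1 */
   end;
   drop i;
run;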
The 22 target events vary considerably with respect to the difficulty of modeling them. An analysis
of the ASRS data was provided in the E.G. Allan et al. article “Anomaly Detection Using Nonnegative
Matrix Factorization” (2008). Descriptions of several of these target events, along with published
ROC index values for models that Allan et al. obtained using an analytic method known
as Nonnegative Matrix Factorization, are shown in the table below.
You will create a diagram flow that tests out the effect of trying to predict Target02 and Target05
using the Text Rule Builder node and various predictive modeling nodes from SAS Enterprise Miner.
Target02 will be modeled in a demonstration, and Target05 will be left as an exercise.
The ASRS training data contains columns that indicate which of the 22 manually assigned labels
relates to a given report. The goal is to develop a system to automatically detect incidents to avoid
the time, cost, and error associated with manually labeling the reports. In other words, you will be
building a model on a data set where experts have already read the reports and made evaluations.
(This is the sort of process that many people will use for sentiment analysis: create a data set of
labeled cases and then build an automatic classification/prediction system based on these known
cases.) In an actual operational system for this example, you would build 22 models to evaluate
whether each of these 22 types of events occurred. This collection of models would provide 22
predicted values, one for each target.
Note: Running the diagram setup for this demonstration will take several minutes.
Approximately 51% of the ASRS data exhibits a value of Target02=1. The balanced nature of the
data will help prevent problems often associated with a rare target. For binary targets with values in
(0,1), SAS Enterprise Miner chooses the value 1 as the primary event and 0 as the secondary event.
1. Create a new diagram and name it Aviation Safety Reporting System. (As we have stated
throughout, in virtually every case this diagram should have been already created for you
and the relevant nodes run.) The diagram will look like this when completed:
Enlargements of the process flow make it easier to identify the nodes that are used. Note that
the single Text Cluster node is in both flows.
The last Text Rule Builder node at the bottom has the name Text Rule Builder — Aggressive
to indicate that additional computing time is allowed to try to find more complex rules. The full
name is not displayed because of space limitations.
2. Create a data source for the ASRS training data using DMTX51.ASRS_TRAINING.
Use the following metadata:
The data set contains the 22 variables with the names Target01 through Target22 that were
defined above. We will use Target02 for this demonstration, so all the others should be rejected.
From a table in E.G. Allan et al. (2008, p. 215), the incident for Target02 has to do with
noncompliance with policy procedures. The variable Size is just the length of the report in bytes
and will not be used. Only the report itself (Text) is needed, but ID can be left as an ID variable.
3. Drag the ASRS data source onto the diagram.
4. To investigate the robustness of the automated assignment, add a Data Partition node
to partition the data set DMTX51.ASRS_TRAINING. Use a 50/30/20 partition.
5. Attach a Text Parsing node to the Data Partition node. Leave all the defaults as is and run
the node.
6. Attach a Text Filter node to the Text Parsing node. Change the default weightings to Log and
Mutual Information. (In this case, these are the default properties, so they are set to specific
values to highlight the properties used for default settings.) Leave all else in default mode
and run the node.
7. Attach a Text Cluster node to the Text Parsing node. Change the Transform settings as follows:
A preliminary exploration was performed to decide what properties to use. There might
be a diagram on the virtual lab server named ASRS – Preliminary that shows some of the
explorations that were performed. Exactly 22 clusters are requested to see whether the derived
clusters might be highly correlated with the 22 target values. This would be unlikely given that
clusters are mutually exclusive and the 22 target values are not mutually exclusive—that is,
some documents contain more than one target variable having a value of 1.
8. SVD Resolution set to High and Max SVD Dimensions set to 50 generates a 50-dimensional
SVD solution. This means that 50 SVD variables will be added to the data. Run the node and
go to Exported Data in the properties panel. Select the TRAIN data set and then click Explore.
Verify that you have a 50-dimensional solution by looking at the number of TextCluster_SVD
variables.
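One way to check is to list the exported columns. This is a sketch only; the exported table name varies by diagram, so work.textcluster_train is illustrative.

proc contents data=work.textcluster_train out=work._cols(keep=name) noprint;
run;

proc print data=work._cols;
   where upcase(name) like 'TEXTCLUSTER_SVD%';   /* the 50 SVD columns */
run;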
9. Attach a Decision Tree node to the Text Cluster node. Rename the node using the name
DT – Entropy – 50. This reminds you that you are using Entropy term weights and 50 SVD
dimensions.
10. Attach a Memory Based Reasoning (MBR) node to the Text Cluster node. Change
the Number of Neighbors property to 30. MBR implements a nearest neighbor algorithm,
so predictions are average target values for the 30 nearest neighbors to the observation being
predicted. Nearest neighbors are selected from the training data only.
The MBR node requires numeric inputs, and it assumes that the numeric inputs are orthogonal—
that is, the Pearson product moment correlation between all pairs of inputs is zero. The SVD
inputs are derived to be orthogonal, so this condition is satisfied. For a training data set with
more than 10,000 observations that have 50 input variables, larger values than the default 16
are recommended. Unfortunately, larger values can dramatically increase the execution time for
the node. It is common to try to find an optimal neighborhood size, but this can be very time
consuming.
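In symbols, the MBR prediction for an observation x is the average target value over its k nearest training neighbors N_k(x):

\hat{p}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i, \qquad k = 30.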
11. Attach a Regression node to the Text Cluster node. Change Selection Model to Stepwise,
and change Selection Criterion to Validation Error. Change Use Selection Defaults to No.
The use of validation error as a criterion causes the Regression node to apply the stepwise
selection algorithm until a stopping rule is satisfied and then select the particular step where
the minimum validation error is achieved.
13. Attach a Neural Network node to the Regression node. This causes the neural network model
to use only the subset of the 50 SVD values that were selected by the stepwise selection
algorithm in the Regression node. Change Model Selection Criterion to Average Error.
14. Attach a Text Rule Builder node to the Text Filter node. The Text Rule Builder does not use
the SVD inputs from the Text Cluster node. Use the default settings.
15. Attach a second Text Rule Builder node to the Text Filter node. Use the following properties:
Rename the node Text Rule Builder – Aggressive. The term aggressive reflects that you are
willing to sacrifice processing time to derive more complex and less pure rules. The Very High
generalization error attempts to offset the potential overfitting problem inherent in using a very
low setting for Purity of Rules and a very high setting for Exhaustiveness.
16. Now connect all six prediction nodes to the Model Comparison node. In the properties panel
for the Model Comparison node, set the Selection Statistic property to ROC and the Selection
Table property to Test. As a consequence of these changes, the ROC index for the test data will
be shown at the very beginning of the Model Comparison node Results window. Run the Model
Comparison node.
The Fit Statistics table shows that the neural network and regression scores achieve the highest
ROC index for the test data set. The aggressive Text Rule Builder results are slightly better than
those produced using the default settings.
The worst ROC index in the table above is better than the value reported by Allan (2008), but
to properly compare the techniques presented here and those described in the Allan paper,
the same evaluation data set must be used. The ROC curves follow.
In color, you can see that there is little difference between the MBR model, the neural network
model, and the regression model. Although the overall ROC index is similar for all models, the
decision tree and the default Text Rule Builder model appear to be uniformly inferior, whereas
the aggressive Text Rule Builder model seems to consistently beat the bottom two models.
Although the ROC index might be preferred, the misclassification rate is easy to interpret. Here
is the Fit Statistics table with misclassification rate displayed:
Note: The ROC index also has a rather easy interpretation, although not quite as obvious
as the misclassification rate. Suppose that you randomly draw one Target02=1
observation and one Target02=0 observation and score both observations with the
model used to construct the ROC curve. The ROC index (the area under the ROC
curve) is the probability that the score for the Target02=1 case will be higher than
the score for the Target02=0 case. Thus, the ROC index reflects the quality of the
model in discriminating between the two cases. If the ROC index is 50%,
then you could just as easily flip a coin to decide how to classify an observation.
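The following DATA step sketch (not part of the course software) estimates the ROC index directly from this pairwise definition. The data set name SCORED and the variables Target02 and P_Target02 are assumptions chosen to mirror the discussion; substitute the names from your own flow.

proc sql;
   /* all (event, nonevent) score pairs */
   create table pairs as
   select a.p_target02 as score1, b.p_target02 as score0
   from scored a, scored b
   where a.target02=1 and b.target02=0;
quit;

data _null_;
   set pairs end=last;
   concordant + (score1 > score0);   /* event scored higher */
   tied       + (score1 = score0);   /* ties count half     */
   n + 1;
   if last then do;
      rocIndex = (concordant + 0.5*tied) / n;
      put "Estimated ROC index: " rocIndex 6.4;
   end;
run;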
There might be compelling reasons to use a set of text-based rules for scoring. The SVD
variables are difficult to interpret, whereas the Text Rule Builder rules are easy to explain. With
a misclassification rate of about 32%, the aggressive Text Rule Builder model does not seem
to be very good, but we will explore the model to see why it provides a powerful alternative
to conventional predictive modeling nodes.
18. Open the Results window for the Text Rule Builder – Aggressive node.
The Rule Success window provides a color-coded graphical display to help visualize how well
the rules classify documents.
In color, the blue (target=1) and red (target=0) bar segments show whether a particular
rule tends to favor a particular outcome. The most accurate rules tend
to be dominated by one color. Unfortunately, when many rules are generated, which happens
when you use aggressive property settings, the chart is truncated to fit into the Results window
and to have sufficient resolution for reading. The above chart displays only rules 114 through
187. A total of 187 rules were obtained. For comparison, the default setting produced 162 rules.
19. Examine the Rules Obtained window for the aggressive settings.
The first rule is temporaryflightrestriction. (Recall, NASA edits safety reports to match certain
predefined keywords, which can be concatenated collections of words.) A temporary flight
restriction is exactly what you might think: it is a mandatory restriction on flight that is temporary
in nature, such as prohibiting flight over a sports arena when a game is in progress. Whereas
many temporary restrictions are expected, such as restricted flight over sporting events, others
might be so random as to be unexpected, such as visits by important government officials. When
a government leader (such as the President of the United States) or an important public figure
(such as the Pope of the Catholic Church) makes a public appearance, flight over the event is
usually restricted. Even flights over the transition route to the event might be restricted. Flights
might also be temporarily restricted because required navigational or communication equipment
is temporarily out of service. Even if a temporary flight restriction is unexpected, it is published in
a timely fashion, and all pilots are expected to be aware of all restrictions relevant for a planned
flight. The following two images show temporary flight restrictions (TFRs) being used in Northern
California related to three events.
A temporary flight restriction is precisely specified by a textual description of the geographic
boundaries and altitudes that are restricted. The above TFR for OAK (OAK is the three-letter
airport code for Metropolitan Oakland International Airport) and surrounding areas, represented
by the shaded circle on the map, restricts aircraft from the surface to 3,000 feet AGL
(above ground level). If a pilot flies into the airspace defined by the TFR without having been
given clearance by air traffic control (ATC), the pilot is violating a Federal Aviation Regulation
and is subject to penalties, including loss of pilot’s license. Such violations can be caused by
poor communication with ATC, misreading a chart, or inadequate preflight planning. Whatever
the cause, filing an ASRS report helps NASA and the FAA improve safety, and also immunizes
the pilot from civil penalties imposed by the FAA, assuming there was no injury or physical
damage related to the flight in the TFR airspace.
The two shaded circles over San Francisco are for the Fleet Week Air Show and a Major League
Baseball (MLB) playoff game between the Chicago Cubs and the San Francisco Giants. Date
and time values are given for both events in the TFR information box, as well as altitude
restrictions. These two TFR areas are more confusing because they overlap, and a pilot might
fail to notice the lighter shaded area indicated in the above graphic.
For the ASRS training data, 94 documents exhibited the term temporaryflightrestriction, and 93
of these documents were flagged as an operation noncompliance event (Target02=1). This
produces a precision value for this rule of 93/94=98.94%. On the other hand, the training data
has 6,437 training observations with Target02=1, so recall for this rule is 93/6437=1.44%. When
the term is present in a document, it is almost certainly a noncompliance event, but there are
many noncompliance events that do not use the term. The F1 statistic is the harmonic mean
of precision and recall. SAS code to calculate this quantity would look like the following.
F1=1/(0.5*(1/Precision)+0.5*(1/Recall));
The F1 value measures the trade-off between precision and recall, and gets larger as precision
and recall get closer to each other in value. Because recall is so small, the F1 value for the first
rule is small: F1=2.85%.
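As a quick check of this formula, the following DATA step (my own, not from the course files) reproduces the precision, recall, and F1 values quoted above for the first rule.

data _null_;
   precision = 93/94;     /* flagged documents that contain the term */
   recall    = 93/6437;   /* all Target02=1 documents in training    */
   F1 = 1/(0.5*(1/precision) + 0.5*(1/recall));
   put precision= percent8.2 recall= percent8.2 F1= percent8.2;
run;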
To score a document, the rules are applied in order. If a rule is satisfied, then the target value
associated with the rule is assigned, and the variable w_Target02 (Why Into: Target02) is
assigned the rule number. If no rule is satisfied, the secondary event target value is assigned.
As stated above, Enterprise Miner defines the secondary event value to be 0 for (0,1) binary
targets. If no rule applies to a document, then w_Target02 is assigned a missing value,
and I_Target02 is assigned a value of 0.
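The scoring logic is easy to sketch as an ordered chain of IF/ELSE IF statements. The following DATA step is a hypothetical illustration only: the input data set DOCS and the term flags hasTFR and hasRunwayIncursion are invented names, and the actual score code is generated by the node.

data scored;
   set docs;                              /* hypothetical input          */
   if hasTFR then do;                     /* rule 1 fires                */
      I_Target02 = 1; w_Target02 = 1;
   end;
   else if hasRunwayIncursion then do;    /* rule 2 fires                */
      I_Target02 = 1; w_Target02 = 2;
   end;
   /* ... remaining rules, applied in order ... */
   else do;                               /* no rule fires               */
      I_Target02 = 0;                     /* secondary event value       */
      w_Target02 = .;
   end;
run;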
20. Close the Results window and select Exported Data in the properties panel for the Text Rule
Builder – Aggressive node. Click the training data set, and select Explore. Select the Plot
Wizard, and create a bar chart. Use the following properties:
The frequency 182 is for the random sample of documents selected for exploration. Recall that
for dynamic exploration, SAS Enterprise Miner pulls a subset of the data from the server to the
client.
Rule 107 is shown in the table that follows with columns rearranged.
Of the remaining documents, a total of 388 documents do not satisfy the first 106 rules but
do satisfy rule 107. Of these 388 documents, 338 have a correctly identified target value of 0.
These 388 documents are then removed from the data, and the remaining rules are then created.
2.4 Text Categorization 2-71
Practice
2.6 Solutions
Solutions to Practices
1. Finding a Text String in a Movies Data Set
Set up the flow diagram in the standard way.
The data set was created with the following variable definitions:
Synopsis is the Text variable to analyze. Run the nodes. (This takes a few minutes to do.)
Go to the Interactive Filter Viewer. Suppose you are interested in finding movies that star
Sandra Bullock. There are at least two ways to search on her name. One way is simply to enter
“Sandra Bullock” (with quotation marks, but uppercase or lowercase makes no difference)
in the Search window. This retrieves 15 documents.
Attach a Text Topic node to the Text Filter node. Specify the following user topics. Note that
the topic BPV stands for “Brad Pitt Vampire.”
There are 46 movies that satisfy this linear algebra query. Only one document is relevant,
and fortunately, it is the top-scoring document. The movie Seven mentions Interview with
the Vampire, but it is not a vampire movie. The following movies get a high score for brad
and pitt but not for vampire: Legends of the Fall, Fight Club, Troy, and Seven Years in Tibet.
If you want to emphasize the term vampire more than the proper nouns brad and pitt, you can
modify the topic weight for the term. Suppose you double the weight for vampire.
You eliminate a number of the non-vampire movies listed above, and in doing so you also
eliminate Brad Pitt movies. Legends of the Fall still finishes in the top 10. Because your goal
is information retrieval, and you have found the information that you were looking for, the false
positives do not pose a problem. The linear algebra approach to information retrieval can be
superior to a simple keyword search because you can use different weightings for different
words.
2. Text Mining SAS Course Descriptions (Optional Exercise and Self-Guided Demonstration)
Below are the major steps of this exercise and demonstration. You are also introduced to some
features that were not previously mentioned.
After you set up the diagram and run the Text Filter node as specified earlier, the Interactive
Filter Viewer should resemble the following:
If you sort the TERM column in the Terms table by clicking the heading cell that contains the
word TERM, then you can use a quick-find feature. Select any term in the TERM column. Then
enter the first letter of the term that you want to find. The window moves to the first term starting
with that letter. You can also select Edit ⇒ Find to go directly to a desired term.
In the Filter Viewer, select Edit ⇒ Find, and enter the two-word phrase neural network. After
you click OK, you are taken to the first cell in the TERM column that contains the phrase neural
network. Notice that there are 13 documents containing this phrase. Right-click in the cell
containing the phrase neural network, and select Add Term to Search Expression. The Search
window contains "neural network". The quotation marks are required if the search expression
has more than one word. If you look at the filter rules, you see that this expression searches for
documents containing neural network and any of the synonyms of neural network.
Stretch the TEXTFILTER_SNIPPET column so that you can see the term neural networks for
all the listed documents. The TEXTFILTER_RELEVANCE column contains the result of the inner
product calculation described in the previous discussion of a Boolean query. The cutoff is not
displayed, but all documents that produced a result above the cutoff are returned. Typically,
the courses with lower relevance scores include neural network material, but the courses include
other material as well. The neural network portion is a small section of the course. You can
determine that 17 documents were returned by placing your mouse pointer over any of the
column headings.
On the other hand, when you look at the Terms window, you see that neural network appears
sometimes as a noun group and sometimes as a simple noun. These are treated as two
separate types of terms. Consequently, the number of documents in which these two types
of terms appear (13+9=22) does not have to agree with 17 from above. A single document could
contain neural network at least once as a noun group and then neural network appears
elsewhere in the document as a noun.
You can also look at a full document. Select the document corresponding to the course with
the code BDMT61. Select Edit ⇒ Toggle Show Full Text. You can read the course outline for
BDMT61.
The relevance score for this document (.25) is in the low end of the relevance values for returned
documents. It is not an extremely high value because most of the course outline discusses topics
that are not related to neural networks. You can select Edit ⇒ Toggle Show Full Text again to
return to one row per document.
Of the 17 returned documents, most appear to be legitimately related to neural networks,
so the precision of this query (percent of the documents returned that are relevant to the search
query) is approximately 100%.
Select Clear and then Apply to retrieve all of the documents in the collection. Navigate again to the
neural network row in the Terms table. Right-click the neural network cell. Select View Concept
Links. The concept link plot appears.
The Concept Linking window appears. You can see the terms most strongly associated with
neural network. You can also right-click on any of these terms and select Expand Links to look
at indirectly associated terms.
After you attached a Text Topic node to the Text Filter node, you were asked to go to the
properties panel. Under User Topics, open the customized topic list, DMTX51.SASTOPICS.
After you run the Text Topic node with the other settings retained as defaults, open the Results
window.
The three bars to the far left in both bar chart windows relate to the three custom topics: data,
programming, and statistics, which were specified in the DMTX51.SASTOPICS data set. The
remaining bars relate to (automatically) derived topics. The Number of Terms by Topics bar chart
reveals that only a few terms were used to define the custom topics. Perhaps more terms should
be used. The Number of Documents by Topics bar chart reveals that the custom topics are more
prevalent than the derived topics. The smallest custom topic, programming, appears in 127
documents. The most popular automatically derived topic appears in 96 documents. (See the
arrow pointing to the bar.) You can get the topic frequencies by positioning the cursor over the
bar related to a topic. The plots are dynamic.
Close the Results window. You were asked to determine what variables were created by the Text
Topic node. Opening the exported data for this node shows that the variables TextTopic_1 to
TextTopic_28 and TextTopic_raw1 to TextTopic_raw28 were created. Twenty-eight topics were
generated (3 user topics + 25 derived topics).
Open the Interactive Topic Viewer through the properties panel. The following window appears:
A custom topic is similar to a predefined query. The first three topics are always the user-defined
ones (in this case: data, programming, and statistics). The topic weight shown in the Documents
window determines whether the topic is present. (That is, the query is satisfied.) If the topic
weight exceeds the document cutoff, then the document is classified as having the topic. Close
the Topic Viewer.
Running the Text Cluster node with default settings leads to a 25-cluster solution. Maximize
the cluster table and examine the descriptive terms for each cluster.
The descriptive terms help identify the courses that appear in each cluster. It would also be
useful to read some of the documents in each cluster to better understand what types of
documents belong to a cluster. Because courses often contain material from several subjects,
clustering into mutually exclusive categories might be less useful than the topics created from
the Text Topic node.
Looking at the exported data for the Text Cluster node shows that 36 SVD variables were
generated.
b. Use a Metadata node to change the role of Comedy to Target. Even though you will only
be using the synopsis to create input variables, you should set the other genre binary flags
to Rejected.
c. Set up a process flow for predictive modeling, including a Data Partition node, appropriate
predictive modeling nodes, and a Model Comparison node. The solution will use a Text Rule
Builder node and a Decision Tree node.
1) Attach a Data Partition node to the Metadata node. Set the partition to 75/25. (You might
have chosen a different partition, which is acceptable, but results will not match if you do
not use a 75/25 split.)
2) Attach a Text Parsing node to the Data Partition node. Use default settings.
3) Attach a Text Filter node to the Text Parsing node. Use default settings.
4) Attach a Text Cluster node to the Text Filter node. Use default settings, except to speed
execution, choose exactly 10 clusters.
5) Attach a Text Topic node to the Text Cluster node. Use default settings.
6) Attach a Decision Tree node to the Text Topic node. Change Leaf Size to 25,
and change Assessment Measure to Average Square Error.
7) Attach a Text Rule Builder node to the Text Filter node. Attach the Decision Tree node
and the Text Rule Builder node to a Model Comparison node.
8) Run the Model Comparison node.
The decision tree and the rules produced by the Text Rule Builder node provide very
similar results based on the ROC curve. The decision tree has a slight edge with respect
to misclassification rate.
For further study: Can you use the Text Topic node to obtain a custom comedy topic that
competes with the results above?
Lesson 3 Algorithmic and
Methodological Considerations in
Text Mining
Objectives
• Explain tokenization and describe the transition from tokens to words
in a language.
• Define frequency (local) weights and term (global) weights and describe
how they are used.
• Provide guidelines for choosing weights.
• Explain the basic vector (metric) space model for representing documents
and terms.
• Explain how singular value decomposition projects documents and terms
into a smaller dimensional metric space.
The approach for parsing and quantifying your text data varies based on the task that you want
to perform. In this chapter, we discuss what is available in SAS Text Miner. In other text analytic
products, other techniques are used. For example, SAS Sentiment Analysis provides capabilities
for atomic fact extraction. However, many atomic fact extraction exercises must be customized.
For example, you could train a predictive model to mimic how a domain expert assigns categories.
Supervised classification often requires problem-specific tasks related to data preparation and model
building.
Characteristics of a Document
A document consists of the following elements:
• letters
• words
• sentences
• paragraphs
• punctuation
• possible structural items (chapters, sections)
Various weighting strategies are introduced later to modify simple counts for the terms.
Zipf’s Law
Let t1,t2,…,tn be the terms in a document collection arranged in order from
most frequent to least frequent.
Let f1,f2,…,fn be the corresponding frequencies of the terms. The frequency
fk for term tk is proportional to 1/k.
Zipf’s law and its variants help quantify the importance of terms
in a document collection. (Konchady 2006)
“The product of the frequency of words (f) and their rank (r)
is approximately constant.”
In practice, Zipf's Law is derived as a Power Law, with free parameters that can be estimated based
on the document collection. The general formula is shown here:

f_k = C / (α + k)^β

where C is a constant such that, for given α and β, Σ_{k=1}^{n} f_k = T, the total number of words in the
document collection. The parameters α and β are estimated for a given document collection.
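To see the power law in action, the following DATA step (an illustration only; the parameter values alpha, beta, and C are arbitrary assumptions, not estimates from any collection) generates frequencies from the formula and shows that the product of frequency and rank settles toward a constant as rank grows, in line with the quotation below.

data zipf;
   alpha = 0.5;  beta = 1;  C = 1000;    /* assumed parameter values */
   do rank = 1 to 20;
      f = C / (alpha + rank)**beta;      /* power-law frequency      */
      product = f * rank;                /* approximately constant   */
      output;
   end;
run;

proc print data=zipf noobs;
   var rank f product;
run;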
Konchady (2006) relates Zipf’s Law to quantifying the importance of a term: “…the number of
meanings of a word is inversely proportional to its rank.” (Konchady 2006, page 87)
Application of Zipf’s Law permits identification of important terms for purposes such as describing
concepts or topics. You will not encounter Zipf’s Law (or similar theoretical laws) directly, but you can
see the results of Zipf’s Law in text mining applications (for example, in the list of terms used to
define a topic). Along with methods such as Hidden Markov Models (HMM), the implementation is
often hidden from the user. Only the results of the methodology are visible.
Note: General methodologies like HMMs have specific applications, like part-of-speech tagging.
An HMM views a document as a collection of states defined by words. There might be many
paths that could take a document from one word to another, but a specific document has
only a single realized path. If you can calculate probabilities associated with the states
(terms) and paths (intermediate terms) for a document collection, you can start answering
questions like, “Given a term and the words that come before and after the term, what
is the probability that the term is a noun?” General HMM software gives the user the ability
to define, for example, the maximum length of a path with respect to the number of states
contained in the path. For part-of-speech tagging and multi-word term identification, the
software contains “hardcoded” values, such as how many words to examine before the term
of interest, and how many words after the term can be used to calculate probabilities.
Quantification Steps
The basic strategy for the quantification of free-form text with the Text
Miner nodes involves the following:
• obtaining the corpus of terms that will be used after applying stemming,
synonym creation, filtering, and so on
• representing each document and each term in a vector space via the
document by term (or term by document) matrix
• projecting the documents and terms into a lower dimensional vector
space
• conducting clustering and topic generation for the documents in this lower
dimensional vector space
• For the table in this slide, each document is represented by a row vector of 5000 frequencies.
• Doc 1 has the row vector (1, 1, 2, 2, 0, 1, …, 0, 0).
• Notice that Doc 1 and Doc N have somewhat similar vector values, as do Doc 2 and Doc 3.
Obtaining document by term frequencies shows how documents can be represented in a vector
space whose elements are the frequencies of each term. However, this is likely to be a high-
dimensional space with many 0 values. The dimensionality can be reduced by language processing
steps such as stemming, synonym creation, and filtering out low-frequency terms.
The basic data table of document by term frequencies can of course be transposed into the term
by document frequency matrix.
Transposing: Term by Document Matrix after Stemming, Synonyms, and so on
However, there are various problems with this type of data. Even after stemming and filtering, there
are often a large number of terms remaining, so there is still the difficulty of a high-dimensional
vector space. Also, the data matrix is very sparse: typically, 90% or more of the document-term
frequencies are 0. Furthermore, by Zipf's law, the frequency counts of terms are very long tailed.
That is, there is a small number of very common terms that are used over and over again in most
of the documents.
The dimensionality and sparseness problems will be addressed by projecting the document and term
vector spaces into a lower-dimensional space by means of a key theorem from linear algebra
referred to as the singular value decomposition (SVD). Before applying SVD, however, it has been
found that weighting the raw document-term cell counts usually produces better text mining results.
Weighting also helps alleviate the problem of the skewness of the higher frequency terms by making
them less influential.
Frequency weights, which are often called local weights in the text mining and information retrieval
literature, are the first step in transforming the raw cell counts. (Actually, frequency weights are
a function of the raw cell counts, and the following three functions can be chosen by the user.)
Term weights, often called global weights in the literature, modify frequency weights to adjust for
document size and term distribution.
A brief discussion of the formulas behind the weights begins below. Although you might gain some
insight by looking at the mathematics, experimentation rather than intuition is often the best strategy
for choosing weights. Experience with similar text analytic problems can help you develop your own
guidelines.
The entropy term weight is slightly misnamed here. The usual definition of entropy from Shannon's
(1948) information theory is the expression −Σ_{j=1}^{d_i} p_ij log2(p_ij), so a better way to describe the term
weight used here would be 1 − normalized entropy.
Because the logarithm of zero is undefined, the product in the numerator is taken to be zero
if the proportion pij is zero. Two simple cases illustrating the calculation of this term weight are shown
below.
Simple Case 1: Term i occurs one time in exactly one of the total n documents.

G_i = 1 + [(1/1) log2(1/1)] / log2(n) = 1 + [(1)(0)] / log2(n) = 1

Simple Case 2: Term i occurs one time in each of the total n documents.

G_i = 1 + [n (1/n) log2(1/n)] / log2(n) = 1 + [−log2(n)] / log2(n) = 1 − 1 = 0
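A one-line DATA step (my own sketch; n=16 is an arbitrary collection size) verifies the two simple cases, using the form G_i = 1 + Σ_j p_ij log2(p_ij) / log2(n) implied by the calculations above.

data _null_;
   n = 16;                                     /* assumed collection size */
   /* Case 1: the term occurs once, in exactly one document */
   G1 = 1 + ( (1/1)*log2(1/1) ) / log2(n);     /* equals 1 */
   /* Case 2: the term occurs once in each of the n documents */
   G2 = 1 + ( n*(1/n)*log2(1/n) ) / log2(n);   /* equals 0 */
   put G1= G2=;
run;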
If a term appears in every document, then the IDF weight is 1 because then d_i = n. The maximum
weight for a fixed document collection occurs when the term appears in exactly one document,
and the weight becomes 1 + log2(n). No upper limit exists because the number of documents n
in a collection can be arbitrarily large.
Entropy and IDF weights achieve a maximum when a term occurs exactly one time in exactly one
document. This implies a very discriminating term, but not a very useful one, because it occurs
in only one document. In fact, by default, the Text Filter node removes terms that do not occur
in at least four documents, although this is under user control. Both weights are at minimum or near
minimum if a term appears exactly one time in every document. In this case, the term is not very
discriminating because it occurs everywhere throughout the collection of documents.
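The endpoints just described are consistent with an IDF weight of the form G_i = 1 + log2(n/d_i). The following sketch (the names are mine, and the node's exact implementation might differ in details) evaluates that form for several document frequencies.

data idf;
   n = 100;                      /* assumed number of documents      */
   do d = 1, 4, 25, 100;         /* document frequency d_i           */
      G = 1 + log2(n/d);         /* 1+log2(n) at d=1; 1 at d=n       */
      output;
   end;
run;

proc print data=idf noobs; run;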
The mutual information term weight is defined as

G_i = max over k of log10[ P(t_i, C_k) / (P(t_i) P(C_k)) ]

where

C_1, C_2, …, C_k are the k levels of a categorical target variable.
P(t_i) is the proportion of documents containing term i.
P(C_k) is the proportion of documents having target level C_k.
P(t_i, C_k) is the proportion of documents where term i is present and the target is C_k.

(Note that 0 ≤ G_i < ∞ and the log is base 10.)
Although G_i is theoretically unbounded, in practice it is usually less than 1. Here is a simple example
showing how it is calculated for the case of a binary target, where k=2:

                                        Target level C1   Target level C2   Total
Documents where term t_i is present           10                50            60
All documents                                110                75           185

From this crosstabulation, we get P(t_i) = 60/185, P(C1) = 110/185, P(C2) = 75/185,
P(t_i, C1) = 10/185, and P(t_i, C2) = 50/185. The two log ratios are
log10[(10/185)/((60/185)(110/185))] ≈ −0.55 and log10[(50/185)/((60/185)(75/185))] ≈ 0.31,
so G_i = max(−0.55, 0.31) ≈ 0.31.
Generalizing this to the case of a categorical target with k>2 merely requires extending
the crosstabulation to a 2 by k table and then computing the individual factors in the same way
as above.
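The same calculation is easy to script. The following DATA step (my own check, not course code) reproduces the numbers from the crosstabulation above.

data _null_;
   n = 185;
   p_t   = 60/n;                     /* term present            */
   p_c1  = 110/n;   p_c2  = 75/n;    /* target level marginals  */
   p_tc1 = 10/n;    p_tc2 = 50/n;    /* joint proportions       */
   g1 = log10(p_tc1 / (p_t*p_c1));   /* about -0.55 */
   g2 = log10(p_tc2 / (p_t*p_c2));   /* about  0.31 */
   G  = max(g1, g2);
   put g1= g2= G=;
run;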
Multiplying the local and global weights produces an adjusted count that is often superior to using
raw counts alone.
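As a sketch of that multiplication (the input data set CELLS and its variables are assumptions for illustration, and the Log frequency weight shown is the usual log2(count+1) form), the adjusted value is simply the product of the local and global weights:

data weighted;
   set cells;                        /* assumed: doc, term, count, G  */
   localWeight = log2(count + 1);    /* Log frequency (local) weight  */
   adjusted    = localWeight * G;    /* local weight x term weight    */
run;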
The term-document frequency matrix, weighted or unweighted, is the foundation of the linear algebra
approach to text mining.
Term Weight Guidelines
• When a target is present, Mutual Information is the default. It is a good
choice when it can be used.
• Entropy and IDF weights give higher weights to rare or low-frequency
terms.
• Entropy and IDF weights give moderate to high weights for terms that
appear with moderate to high frequency but in a small number of
documents.
• Entropy and IDF weights vary inversely to the number of documents
in which a term appears.
• Entropy is often superior for distinguishing between small documents that
contain only a few sentences.
• Entropy is the only term weight that depends on the distribution of terms
across documents.
A simulation study artificially creates a document collection and distributes terms across the
documents using various strategies (for example, creating rare terms and creating terms with
frequency counts that follow a certain distribution). Even though the data set is completely artificial
and simple, it is informative to examine these results.
You can verify the IDF calculations using the Doc Freq column and noting that there are 100
documents in the simulation. For example, for the term armadillo, the IDF term weight is as follows:
The gray scale version is difficult to interpret, but the color version of the table highlights low
information terms with a red background and high information terms with a green background.
The target variable emulates a human judge assigning a label to a document that is perceived
to be about marine mammals. For the judge to assign a value of 1, there must be sufficient
information about marine mammals to warrant the document being labeled as a marine mammal
document. The presence of a marine mammal term is not sufficient, as evidenced by otter having
a mutual information weight of zero. The term otter only appears one time in a document that
mentions no other marine mammals. The term raccoon gets the third highest mutual information
score because it happens to appear with high frequency in some of the marine mammal documents.
The results show that entropy and IDF weights tend to produce similar results. IDF is recommended
for larger documents, whereas entropy might be more appropriate for smaller documents. Of course,
these results cannot be extrapolated to all document collections. In particular, a typical document
in the simulated collection is small, so the results would be more useful for document collections
such as the Medline data, but less useful for multi-page reports.
Objectives
• Sketch how singular value decomposition (SVD) is used to project the
high-dimensional document and term spaces into a lower-dimension
space.
• Illustrate what is happening with a simple example.
• Discuss Text Topic and Text Cluster results in light of the SVD.
SVD Example
• Let's look at Russ Albright's example consisting of three documents:
  Doc 1: Error: invalid message file format
  Doc 2: Error: unable to open message file using message path
  Doc 3: Error: unable to format variable
• These three documents generate the following 11 x 3 term-document matrix A:

                       doc 1   doc 2   doc 3
  Term 1   error         1       1       1
  Term 2   invalid       1       0       0
  Term 3   message       1       2       0
  Term 4   file          1       1       0
  Term 5   format        1       0       1
  Term 6   unable        0       1       1
  Term 7   to            0       1       1
  Term 8   open          0       1       0
  Term 9   using         0       1       0
  Term 10  path          0       1       0
  Term 11  variable      0       0       1
SVD Example
• With the right software (for example, PROC IML), it is very easy to
compute the SVD decomposition for this little example and obtain the
separate matrices U, Σ, and V.
• The product U^T A produces the SVD projections of the original document
vectors. These are the document SVD input values that you have seen,
which are produced by the Text Cluster node (except that they are
normalized for each document as explained on a later slide).
• This amounts to forming linear combinations of the original (possibly
weighted) term frequencies for each document.
SVD Example
• First project the first document vector d_1 into a three-dimensional SVD
space by the matrix multiplication U^T d_1.
(U^T was obtained using the SVD matrix function in PROC IML applied to the A matrix.)
SVD Example
• The SVD dimensions are ordered by the size of their singular values
(“their importance”). Therefore, the document vector can simply
be truncated to obtain a lower-dimensional projection.
• The 2-D representation for doc 1 is (1.63, .49).
• As a final step, the Text Cluster node then normalizes the coordinate
values so that the sums of squares for each document are 1.0.
• Using this document’s 2-D representation, 1.63² + .49² = 2.897 and √2.897 = 1.70.
• Therefore, the final 2-D representation for doc 1 would be
(1.63/1.70, .49/1.70) ≈ (.96, .29).
• These are the SVD1 and SVD2 values that you would see for this document
by looking at the exported data coming out of the Text Cluster node.
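If you want to reproduce these numbers yourself, the following PROC IML sketch applies CALL SVD to the A matrix above and carries doc 1 through the projection, truncation, and normalization steps. (The signs of singular vectors are arbitrary, so your coordinates might differ from the values above by a sign.)

proc iml;
   /* the 11 x 3 term-document matrix A from the example */
   A = {1 1 1, 1 0 0, 1 2 0, 1 1 0, 1 0 1,
        0 1 1, 0 1 1, 0 1 0, 0 1 0, 0 1 0, 0 0 1};
   call svd(U, Q, V, A);          /* A = U*diag(Q)*V`              */
   d1    = A[,1];                 /* document 1 as a column vector */
   proj  = U`*d1;                 /* 3-D SVD projection            */
   p2    = proj[1:2];             /* truncate to 2-D               */
   svd12 = p2 / sqrt(ssq(p2));    /* normalize to unit length      */
   print proj p2 svd12;
quit;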
Dimensionality Reduction
• The tiny example given here has a term-document matrix of rank r=3.
(The rank is always less than or equal to the minimum of the number
of documents and the number of terms.)
• In actual practice, the rank of the term-document matrix is usually
in the thousands, so the SVD algorithm is used to dramatically reduce
the dimensionality of the data.
• The SVD algorithm derives SVD dimensions in order of “importance”
(based on the singular values σ_i).
• The number of SVD dimensions to keep is based on looking at these
singular values and establishing a cutoff value k.
Dimensionality Reduction
• The user specifies a maximum dimension M (default=100 and highest
allowed value=500) for the number of SVD dimensions to keep.
• The SVD algorithm produces the M singular values in decreasing order.
• The sum of the M singular values (squared) acts as a metric for the
amount of information in the document collection. Treating the sum
of the top M squared values as the “total information” is useful for arriving
at a reasonable cutoff.
Dimensionality Reduction
• The user also specifies an SVD Resolution value:
• High=100%
• Medium=5/6=83.3%
• Low=2/3=66.67% (the default)
• Based on these two settings, the Text Cluster node uses a simple algorithm
to decide on the final number of SVD dimensions to use.
Dimensionality Reduction
• To illustrate the logic for deciding the number of dimensions to use:
• Suppose the user sets Max SVD Dimensions=100 and SVD Resolution=Low
(66.7%).
• Assume that you are working with a big document collection so that the rank
of the term-document matrix is much larger than 100.
• Let the sum of the first 100 squared singular values be given as Σ_{i=1}^{100} σ_i² = C.
• The algorithm determines the minimum dimension k ≤ 100 such that Σ_{i=1}^{k} σ_i² / C ≥ .667.
The problem of dimensionality reduction is challenging in a text mining setting. The default settings
of the Text Cluster node often work very well, but you should be prepared to experiment with the
maximum number of SVD dimensions kept, as with many other parameter settings.
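The selection logic itself is a short cumulative-sum calculation. The following PROC IML sketch (made-up singular values, chosen only to show the mechanics) finds the smallest k whose cumulative squared singular values reach the resolution percentage of the top-M total.

proc iml;
   sv = {9.5, 6.2, 4.1, 3.0, 2.2, 1.5, 1.1, 0.8};  /* assumed, largest first */
   resolution = 0.667;                    /* Low                           */
   C   = ssq(sv);                         /* sum of the squared values     */
   cum = cusum(sv##2) / C;                /* cumulative proportion         */
   k   = min(loc(cum >= resolution));     /* smallest qualifying dimension */
   print cum k;
quit;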
1. Create a new diagram and name it SVD Dimensions. (As we have stated throughout, in virtually
every case this diagram should have been already created for you and the relevant nodes run.)
The diagram will look like this when completed:
2. Create a data source for the ASRS training data using DMTX51.ASRS_TRAINING. Use the
following metadata:
The data set contains 22 variables with the names Target01 through Target22. We use Target02
for this demonstration, so all the others should be rejected. From a table in the E.G. Allan et al.
article “Anomaly Detection Using Nonnegative Matrix Factorization” (2008, p. 215), the incident
for Target02 has to do with noncompliance with policy procedures. The variable Size is just the
length of the report in bytes and will not be used. Only the report itself (Text) is needed, but ID
can be left as an ID variable.
5. Attach a Text Parsing node to the Data Partition node. Leave all the defaults as is and run.
6. Attach a Text Filter node to the Text Parsing node. Change the default weightings to Log
and Mutual Information. (Although in this case, these are the defaults.) Leave all else in default
mode and run.
7. Attach a Text Cluster node to the Text Filter node and rename it Text Cluster – 5. Change the
Transform settings as follows:
8. This generates a 5-dimensional SVD solution. Run the node and go to Exported Data
in the properties panel. Select the TRAIN data set and then click Explore. Verify that you have
a 5-dimensional solution by looking at the number of TextCluster_SVD variables.
9. Attach a Decision Tree node to the Text Cluster–5 node. Rename it to DT 5 Clus. Change
the Assessment Measure property to Average Square Error and Leaf Size to 25. (These are
fairly routine changes that are often found to produce better results with trees. Obviously 25
is not some magic number, but the default of 5 for Leaf Size is considered by many analysts
to be too small.) Run the node, but it is not necessary to look at the results yet.
10. Attach another Text Cluster node to the Text Filter node and rename it Text Cluster 10.
Change the Transform settings as shown below in order to get a 10-dimensional SVD solution.
Run this and verify that you see a set of 10 TextCluster_SVD variables.
11. Then copy down your previous DT 5 Clus node, rename it DT 10 Clus, and connect it to Text
Cluster 10. Run this tree, but there is no need to look at the results yet.
12. Repeat the previous step with a new Text Cluster node renamed Text Cluster - Default. Run
this default Text Cluster node and determine that 34 TextCluster_SVD variables were
produced. Rename this third decision tree to DT default Clus and run it.
13. Now connect all three decision trees to the Model Comparison node. In the properties panel for
the Model Comparison node, set the Selection Statistic property to ROC and Selection Table
to Validation. As a consequence of these changes, the ROC index for the validation data will
be shown at the very beginning of the Model Comparison node Results window. Run the Model
Comparison node and view the results.
14. Open the results and look first at the ROC charts.
In color, it is clear that DT 5 Clus is the inferior decision tree model. Examining the Fit Statistics
window confirms this. The DT 5 Clus tree has a validation ROC index of just .65 compared
to .71 and .70 for the other two models.
Tree2 and Tree3 have similar validation ROC index values. Tree2 (DT 10 Clus) was generated
from a Text Cluster node specifying a 10-dimension solution, whereas Tree3 (DT default Clus)
was calculated from the default Text Cluster node that generated 34 dimensions.
Remember that the decision tree algorithm itself incorporates variable selection logic. So even
though Tree2 was working with 10 SVD variables as candidate variables generated by the Text
Cluster node, it actually found only 9 of them to be useful in the tree. Tree3 had 34 candidate
SVD variables to work with, but only kept 17 for the final tree. (This information can be found
by going to the tree results, selecting View ⇒ Model, and then viewing the Variable Importance
windows.)
The lesson here is that when you have a target variable and are using a modeling method with
variable selection logic, the decision about how many SVD variables to keep is less critical. It is
better to err on the high side and then let the variable selection algorithm from your model make
the choice. Typically, the default settings of SVD Resolution=Low and Max SVD
Dimensions=100 will work well in this situation. (You do not want to choose too few dimensions,
as we did with Tree1 (DT 5 Clus), and wind up with an inferior model.)
Practice
1. Comparing the Effect of the Number of SVD Dimensions Using Regression Models
For this exercise, repeat all the steps done in the previous demonstration, but now use three
forward selection regression models. The easiest way to do this without having to rerun any
of the previous nodes is as follows:
a. Bring down a regression model and connect it to the Text Cluster-5 cluster node. Rename
the regression model to Regr 5 Clus. Set it up to do forward selection with assessment on
the validation error. Then connect the Regr 5 Clus node to a new Model Comparison node.
(Just copy down the previous Model Comparison node.) The reason for using a second
Model Comparison node is because with more than three models, the graphs become a little
too cluttered to interpret easily.
b. Copy the Regr 5 Clus node down and connect it to the Text Cluster-10 cluster node.
Rename this regression model Regr 10 Clus and connect it to the second Model
Comparison node in order to compare it to the first regression model.
c. Copy down the first regression node again. Connect this copy to the Text Cluster-Default
cluster node. Rename the regression node to Regr Default Clus. Again, connect it to the
second Model Comparison node as with the other two regression models. The right part
of your diagram should now look like this:
d. After running everything through the second Model Comparison node, answer these
questions:
1) How many SVD variables were selected by each of the three different regression
models?
2) What was the validation ROC index for each of these three models? How do these index
values compare with values from the previous three decision tree models?
2. (Optional) Details of the SVD Calculations Performed by the Text Cluster Node
This optional exercise is for those students who are interested in the details of how the document
SVD variables are calculated. Although not all users of the SAS Text Analytics software want to
know this much, those who do can work through this exercise. In the first step, a very simple text
mining project is run to produce the SVD variables computed in the Text Cluster node. In the
second step, you use a PROC IML (Interactive Matrix Language) program
to explicitly see how the term-document frequency matrix is analyzed using the SVD algorithm
from linear algebra. In the end, you will be able to compare the SVD values from the Text Cluster
node to those computed by the PROC IML code and see that they are the same.
You are not expected to write the PROC IML program that is used. It is supplied for you.
If you are familiar with matrix algebra and can follow programming logic, you might be able to
understand the major parts of the internally documented PROC IML program.
a. Create a diagram named Optional Exercise SVD Calculations. (This diagram has already
been created for you, but you are encouraged to set it up on your own as you have been
doing in class.)
The data consist of 18 records (documents). Each record contains the names of some cat-
like (feline) animals, some dog-like (canine) animals, or both. If you look carefully, you will
see that the first 16 documents contain either pure feline or pure canine animals, but not
both. However, in documents 17 and 18, there is a mix of both types of animals.
c. Attach a Text Parsing node and change the defaults so that all the language algorithms are
turned off. The properties panel will look like this:
The purpose of turning off all the language algorithms in Text Parsing here (such as Different
Parts of Speech, Noun Groups, and Stop List) is to make this example as simple as possible
to follow.
d. Attach a Text Filter node to the Text Parsing node. Change the defaults on the properties
panel to those shown below:
Note that we are setting the Frequency Weighting and Term Weight properties to None.
Again, the reason for this is to construct a simplified example. This produces a term-
document frequency matrix with raw counts rather than weighted counts. Also, remember
to change the Minimum Number of Documents property to 2.
e. Attach a Text Cluster node and set up the properties panel in this way:
The settings SVD Resolution=High and Max SVD Dimensions=2 will make Text Cluster
produce exactly two SVD variables.
f. Run this flow from the Text Cluster node and look at the exported data. You can move the two
TextCluster_SVD columns by dragging them to the left as shown below.
Although in most realistic settings, the TextCluster_SVD variables are not very interpretable
(the document clusters are used for interpretation), in this simple example, you can interpret
the results quite easily this way:
• High TextCluster_SVD1 values are associated with documents containing the names
of canine animals.
• TextCluster_SVD2 values that are large in absolute value (here, very negative
values) are associated with documents containing the names of feline animals.
g. Now that you have the TextCluster_SVD values generated from the Text Cluster node,
you will also obtain them by doing some calculations with PROC IML on the term-document
frequency matrix for this example.
With some special SAS data set programming, the 15 (terms) x 18 (documents) term-
document frequency matrix for this example could be obtained from the Enterprise Miner
project flow. However, in the interest of simplicity, for this exercise you can easily obtain the
relevant numbers by hand and then verify the entries in the matrix below. These are the raw
term-document frequencies (unweighted because the weighting parameters in the properties
panel of the Text Filter node were turned off). This is what was called the A matrix in an
earlier slide.
h. Bring in a SAS Code node. There is no need to connect it to any of the other nodes because
you will be running only a self-contained PROC IML program. Your diagram should now look
like this:
i. Go into the Code Editor on the properties panel for the SAS Code node. In the Training
Code window, right-click and select Open. Then navigate to the directory
D:\workshop\winsas\DMTX51\sassrc and then select the program
Proc_IML_Optional_Exercise_Chapter3.sas. Here is a listing of this program, in two parts
to fit across pages:
j. Run this program in the SAS Code node and look at the results. The final normalized two
SVD variables for the 18 documents are at the bottom of the output:
These SVD values are the same as the values shown coming out of the Text Cluster node
except for an arbitrary and unimportant sign change for SVD2. That is, the SVD2 values
from the Text Cluster node are -1 times the values in this listing.
3.3 Chapter Summary
SAS Text Miner nodes provide numerous strategies for completing the above steps.
The linear algebra approach to text mining, using the singular value decomposition, creates
variables (the SVD document vectors) that are used for clustering the documents and for predictive
modeling. This approach also reduces the dimensionality of the space of documents. This same
approach (modified slightly) generates topics within documents.
References
Konchady, Manu. 2006. Text Mining Application Programming. Boston: Charles River Media.
Manning, Christopher D., and Hinrich Schütze. 2002. Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: The MIT Press.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York: Cambridge University Press.
Shannon, C.E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal.
Vol. 27, pp. 379–423 and 623–656.
Strang, Gilbert. 1993. “The Fundamental Theorem of Linear Algebra.” The American Mathematical Monthly. Vol. 100, No. 9, pp. 848–855.
Thisted, Ronald A. 1988. Elements of Statistical Computing. New York: Chapman and Hall.
Wakefield, Todd. 2004. “A Perfect Storm is Brewing: Better Answers are Possible by Incorporating
Unstructured Data Analysis Techniques.” DM Direct, August 2004.
Wicklin, Rick. 2010. Statistical Programming with SAS/IML Software. Cary, NC: SAS Institute Inc.
Solutions to Practices
1. Comparing the Effect of the Number of SVD Dimensions Using Regression Models
The number of SVD variables actually used by the three regression models:
The easiest way to get these numbers is to go to the Results window for each regression
and look at the Effects Plot window. You have to remember to subtract the intercept term
to arrive at the values shown above.
Looking at the ROC charts from the Model Comparison tool used to assess the three regression
models, it is clear that the three models can easily be rank ordered by their predictive strength.
From the Fit Statistics window, the validation ROC index values are .651, .721, and .745.
This confirms that the Regr default Clus model outperformed the other two models and that
the default choice of selecting candidate SVD variables led to better results than underspecifying
the number of candidate SVD variables as either 5 or 10.
Referring to the earlier Model Comparison results for the trees and comparing them to their
equivalent regression models in terms of their validation ROC index values gives the following:
The best results are for the Regr default Clus model, and this holds up when other fit statistics are examined as well.
a. IDF
b. mutual information
c. entropy
d. None (G=1)
Lesson 4 Additional Ideas and Nodes
4.1 Some Predictive Modeling Details
Objectives
• Describe predictive modeling data sets.
• Explain predictive modeling projects and features of SAS Enterprise Miner
related to predictive modeling.
• Explain the trade-off between predictive power and interpretability.
• Discuss how the Text Cluster and Text Topic nodes can be set up to affect
this trade-off.
• Emphasize the need for experimenting with different predictive modeling
and text miner settings—that is, the “workbench” idea for Enterprise
Miner.
SAS Enterprise Miner has many predictive model nodes. Some nodes are general purpose, such
as the Decision Tree, Neural Network, and Regression nodes. Some nodes are specialized, such
as the MBR, Rule Induction, and Partial Least Squares nodes. This course has illustrated the use
of the Decision Tree and Regression nodes. The availability of many different modeling techniques
makes it easy to try out different approaches to find the best results for your data. Think of the
Enterprise Miner and its many different nodes as a workbench for analytic experimentation.
Note: On the High-Performance Data Mining (HPDM) tab, you also have access to random forests
and support vector machines. For specialized predictive models, the Applications tab
supports the Survival Data Mining node and the Incremental Response node.
The minimum requirement for data mining predictive modeling is at least one target variable and
at least one input variable. A predictive model is constructed using a training data set. The model
attempts to predict the value of the target variable using only the values of a set of input variables.
For example, input variables can measure customer attributes such as gender, age, income, location
of primary residence, and average purchases to try to estimate the probability that a customer will
respond to a particular promotion, such as a 20% off discount on purchases of $100 or more.
Predictive Model
A predictive model is a concise representation of the association between the inputs and the target in the training data.
After a model is constructed using training data, the performance of the model can be assessed
using a holdout data set. When a final model is selected, it can be used to score new data to
determine, for example, which customers should be selected for a promotional offer. The term score
is synonymous with predict.
Predictive Model
Predictions are the output of the predictive model given a set of input measurements. The model is built on training data and can then be applied to validation, test, and score data.
To choose from a variety of models, a holdout data set called a validation data set is used
to determine how well models will extrapolate to new data. This helps overcome the problem
of overfitting, which occurs when a model is constructed to fit the training data set so well that
it does not fit any other data well. For a model that has been selected for deployment, a second
holdout data set, called a test data set, is used to get an unbiased estimate of the accuracy of the
model in the live environment. A predictive model can score any data set that has the inputs used
by the model. It is important to ensure that models are applied to data commensurate with how
a model was constructed. For example, a model constructed using only customers who reside
in California might not be appropriate for scoring customers in Florida.
On the slide above, the table has nine input variables that will be used as candidate predictor
variables, and two target variables that can be used to derive predictive models. The two target
variables were automatically recognized because they contain the prefix Target. Some predictive
modeling nodes can accommodate only one target variable, so you must reject a target variable
when using one of those nodes.
When you create a data source in SAS Enterprise Miner, the Data Source Wizard displays all the variable roles so that you can check that the target and input variables are specified the way you need them to build a predictive model.
Decision Tree
A predictive model will provide a predicted value. In the case of a binary target variable, the
predicted value is the posterior probability of the primary event given the inputs. You can use this
probability to derive a decision rule. For example, if the probability exceeds a (selected or derived)
cutoff value of 0.37, send the promotion to the customer. Otherwise, do nothing. SAS Enterprise
Miner model nodes will add a variety of columns to the imported data when creating the exported
data. The nature of the added columns depends on the model used. The above table is displayed
by selecting the Variables property in a SAS Code node attached to a Decision Tree node. The
predicted value in this case is called P_SubroFlag1. This is the probability that the target variable
(SubroFlag) has the value 1. There is also a variable named P_SubroFlag0, which is the
complementary value, 1-P_SubroFlag1.
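As a small illustration, such a decision rule could be applied to the node's exported data with a short DATA step. The data set names here are hypothetical:

/* Apply a 0.37 probability cutoff to the tree's exported data */
data work.promo_decisions;
   set work.tree_exported;          /* hypothetical exported data set */
   length decision $10;
   if P_SubroFlag1 > 0.37 then decision = 'SEND';
   else decision = 'DO NOTHING';
run;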
Regression
The above slide summarizes the prediction variables that are usually of interest. A Decision Tree
node would produce, for example, P_Target1, which would be the posterior probability derived from
the tree after correcting for oversampling. Note that if the binary target variable has the name
WIDGET, then the posterior probability that WIDGET=1 is given by the variable P_WIDGET1.
Although a lack of input variables can be a problem, more often the problem is that you have a very large number of candidate input variables to select from. Input selection, or variable selection,
is an important topic in predictive modeling. SAS Enterprise Miner facilitates input selection
in a number of different ways.
A Decision Tree node performs input selection by deciding which subset of input variables is used to partition the data into separate leaves. If a variable is not useful in separating the data into more
pure leaf nodes, then the variable is discarded. In the above plot, a tree with 40 leaves is derived.
A 59-leaf tree was pruned to remove leaves that did not improve overall model accuracy. The pruned
subtree is used to choose the input variables that are useful for prediction. These variables will
be passed to successor nodes, whereas the other input variables will have their roles changed
to Rejected.
The Regression node also has options for performing variable selection. The above iteration plot
reveals that the Regression node tried a series of models, culminating in an eight-variable model.
But for a final model, it chose one that has only three variables.
Model Assessment
• Fit Statistics
• Average squared error
• Misclassification rate
• Information criteria
• Others
• Charts and Plots
• ROC chart
• Gains chart
• Lift chart
• Others
Model assessment is performed using results from the model nodes and the Model Comparison
node.
Fit statistics, such as average squared error (ASE), are defined in the SAS Enterprise Miner documentation, which you can also access by selecting Help → Contents.
The Model Comparison node produces additional fit statistics and charts and plots that allow the
direct comparison of the performance of different models. Some plots are not available directly
through the GUI, but SAS Enterprise Miner provides extensive functionality through the SAS Code
node.
In this class, we have focused on using text mining tools to derive new input variables from free-form text data. Therefore, it is valuable to make users of these tools aware of some of the strategies and “tricks” that can enhance your predictive models.
An important point to understand is that there is usually a trade-off between predictive power and the
interpretability of a model. That is, if you try to obtain the most powerful predictive model that you
can, you often end up with a more complicated and less interpretable model. On the other hand, for
many purposes, it is often important to obtain models that can be explained and understood by
others, including senior management and clients who will be using the model.
A practical strategy for handling this trade-off is to create both types of models: one that has the
strongest predictive power that you can obtain and another one that is more explainable. Then you
can present both of these models to your audience, and they can participate in the decision of which
one to use. This requires a fair amount of experimenting with settings and also understanding what
choices you have when you use the Text Cluster and Text Topics nodes. These choices can directly
make your model more likely to have greater predictive power or more easily interpretable.
As we saw earlier in Chapter 1, there are several variables that are produced by the Text Cluster
and Text Topic nodes. Text Cluster generates the document SVD vectors, which are then used to
cluster the documents. The SVD variables are automatically set to the role of Input, so they are
ready to be used for a prediction model. However, the actual clusters that are produced by these
SVD variables are set to the role of Segment and therefore are not immediately available for
prediction purposes. Nevertheless, all that is required to make this change is to convert the
TextCluster_cluster_ variable from the role of Segment to the role of Input. Because the
TextCluster_cluster_ variable is intended to help the analyst in interpreting results, many analysts
will consider using the TextCluster_cluster_ variable for modeling purposes instead of the SVD
variables. This is a deliberate trade-off that might mean sacrificing some predictive power
for greater interpretability.
Another “trick” that analysts sometimes use in prediction modeling is to use the Text Cluster node
segment probabilities as input variables. These are the TextCluster_probN variables that are
created when the Expectation-Maximization clustering algorithm is used. Using this algorithm,
if there are three clusters created (so TextCluster_cluster=1, 2, or 3), then a document will have
a probability TextCluster_prob1, TextCluster_prob2, and TextCluster_prob3 associated with
each of the clusters. The document will be assigned to the cluster for which it has the highest
probability, but these probabilities can be directly used as input variables if you want.
In either case, if you want to swap in the TextCluster_cluster_ or TextCluster_prob variables,
you have to use a Metadata node to do this. You would then reject the TextCluster_SVD variables
so that they are not used in the analysis.
An analyst can also manage the Text Topic node in a way that trades off between better predictive power and clearer interpretability.
Experiment!
• Regardless of whether you are aiming for better predictive power
or greater interpretability, you should adopt an experimental approach
to modeling.
• Enterprise Miner is like a workbench where you can easily try out different
approaches and compare their results.
• The default settings are meant to work well across a wide variety
of situations, but your particular analytic problem can very often
be improved by testing out different parameter settings.
• Do not take the defaults for granted as always producing the best results.
This demonstration further shows how to approach text mining analytics and predictive modeling
experimentally by trying out different parameter settings or different modeling techniques. In this
case, we experiment with different term weight (global weight) settings and use the Model
Comparison node to compare the models. The comparison produces somewhat surprising results regarding the weights and provides motivation for trying out different approaches.
We will again use the ASRS data set. This time, though, change the target from Target02 (which has
to do with noncompliance events and was used in the earlier demonstration of Chapter 3)
to Target05 (which has to do with the occurrence of a collision hazard event).
The 22 target events vary considerably with respect to the difficulty of modeling them. Descriptions
of several of these target events (as given in E.G. Allan et al. 2008), along with published ROC index
values for models that Allan et al. obtained using an analytic method known as Nonnegative Matrix
Factorization, are shown in the table below. Note that we were able to improve on their ROC index
for Target02 in the last chapter, where we obtained .711 with a decision tree using the default SVD
variables generated from the Text Cluster node. In this demonstration, using Target05, we are also
able to improve on the reported Allan et al. results.
Some of the 22 ASRS target events with their ROC index values from published model results:
We will create a diagram flow that tests out the effect of trying the four different global weight (term
weight) options available in the Text Filter node. The flow will look like this when finished:
1. Create a diagram named Global Weight Experiment. Drag in the ASRS training data. This time
make Target05 the target variable, with all the other targets rejected.
2. Bring in the Data Partition node and leave it at the default settings
of Training/Validation/Test=40/30/30.
Remember that Mutual Information would be the default here because a target variable
(Target05) has been defined.
Rename each of the four Text Filter nodes to identify which global weight has been used,
such as Text Filter - Mutual Info and Text Filter - IDF.
5. Connect each Text Filter node in the series to its own default Text Cluster node and then
its own Text Topic node.
6. The output of each Text Topic node is then connected to a Regression node and the following
settings are selected:
7. Rename each of the Regression nodes to indicate the term weight that was used earlier
by its Text Filter node—that is, Regression-Mutual Info, Regression-Entropy, and so on.
(The reason for renaming each of these Regression nodes is that it will make comparing results
easier when we look at the Model Comparison node.)
8. Connect each individual Regression node to a single Model Comparison node. Change
the Model Comparison node to make the ROC index on the test data the selection statistic:
9. Check to see that you have set things up as shown in the display capture at the beginning
of this demonstration and then run the entire flow from the Model Comparison node.
Note: If you are running this flow from scratch, it will take about 12 or 13 minutes using
the Virtual Machine image provided for this class.
10. Open the Model Comparison results and look at the Fit Statistics window to compare the
performance of the four models using different term weights:
First, note that all the models are producing very high ROC index values on the test data, as well
as on the training and validation data sets. This is also obvious from the dramatically high curves
in the ROC chart for Target05. Other statistics tell the same story. So in general, we have been
quite successful at classifying ASRS reports as either indicating a collision hazard event or not.
However, what is very surprising is that the Mutual Information term weight in this case
produced the worst results (Index=.929), even worse than using no weight (Index=.954).
The lesson here is to be experimental and try out different parameter settings! The default settings
generally work well, but the analysis of each data set should be explored in a number of ways for
best results.
a. None
b. Entropy
c. Inverse Document Frequency
d. Mutual Information
This demonstration illustrates how to use the Metadata node to change the roles of variables
produced by the Text Cluster and Text Topic nodes to create interpretable inputs for predictive
modeling.
All of the models derived in the previous demonstration are of high quality, but all suffer from a lack
of interpretability. For example, for the entropy model, what does the fact that TextCluster2_svd1
has a coefficient of 2.6003 tell you about the predictions? (Some analysts would be concerned that
the standard errors for four inputs are larger than 2, but that is beyond the scope of this course.)
Using the insight provided earlier about interpretable text mining inputs, you can craft a predictive
model that uses only inputs that have a relatively clear interpretation.
1. Attach a Text Cluster node to the Text Filter – Entropy node.
10. Set the new role for all of the TextCluster5_prob variables to Input.
11. Set the new role for all of the TextTopic5_n variables to Input.
12. Set the new role for all of the TextTopic5_raw variables to Rejected.
16. Connect the Regression node to the Model Comparison node and run the Model Comparison
node. The following table is produced:
The interpretable model is competitive with the other models but, as expected, it is not as good as the best model. This reflects the trade-off that will usually be required in going from a “black box” model to an interpretable model.
17. Is the model interpretable? Open the Results window for the Regression – Interpretable node.
In the Output window, scroll down to the Analysis of Maximum Likelihood Estimates table.
The results from the Text Cluster node show the descriptive terms for each cluster.
Cluster 7 uses terms that describe activity on the ground (runway, taxiway, ground), which
is where incursions often occur. Thus, when a report falls into cluster number 7 with a high
unambiguous probability, it has a higher probability of having a Target05=1 label. In fact,
TextCluster5_prob7 has the largest regression coefficient. Not all clusters and topics have
such obvious interpretations, but overall, the model is clearly easier to understand.
Because this demonstration is optional, you do not need to use the derived model when
completing the exercises related to this diagram.
Practice
1. Adding a Decision Tree Node to the Flow to Compare to the Previous Four Regression
Models
a. Add a Decision Tree node to the flow used in the demonstration. Connect this to the
Text Topic node that is part of the flow using the Entropy term weight (because we
previously found this term weight to produce good results).
2. Adding a Text Rule Builder Node to the Flow to Compare to the Previous Five Models
a. Add a Text Rule Builder node to the flow used in the demonstration. Connect this to the
Text Filter node that is part of the flow using the Entropy term weight (because we
previously found this term weight to produce good results).
c. Use the default setting for the Text Rule Builder node.
d. Connect the Text Rule Builder node to the Model Comparison node and run it from there.
e. How does the Text Rule Builder compare to the other five models in terms of the ROC index
on the test data?
4.2 Text Profile Node
Objectives
• Provide introductory details about the Text Profile node.
• Identify the property settings of the Text Profile node.
• Illustrate how to use the Text Profile node with a news data set.
The documentation for the Text Profile node describes the methodology as follows:
“The Text Profile node enables you to profile a target variable using terms found in the documents.
For each level of a target variable, the node outputs a list of terms from the collection that
characterize or describe that level.
“The approach uses a hierarchical Bayesian model using PROC TMBELIEF to predict which terms
are the most likely to describe the level. In order to avoid merely selecting the most common terms,
prior probabilities are used to down-weight terms that are common in more than one level of the
target variable. For binary target variables, a two-way comparison is used to enhance the selection
of terms. For nominal variables, an n-way comparison is used. Finally, for ordinal and time variables
(which are converted to ordinal internally), a sequential, two-way comparison is done. This means
that the reported terms for level n are compared to those at level n-1. The exception for this is the
first level, which is compared to level 2 since there is no preceding level to compare it to.”
There are only two properties that you specify, and if you do not have a numeric variable with the
role Time ID, then the Date Binning Interval property is irrelevant. The SAS Enterprise Miner 15.1
Reference Help provides an example of profiling with a Time ID variable.
The Maximum Number of Terms property defines how many terms you want to use to profile each
category.
For nominal target variables with more than two categories, a Beliefs by Value plot is produced.
If you request M terms per category, and if there are K categories, K>2, you will get an (M*K) by K
crosstabulated color map chart. If M or K (or both) is too large, SAS Enterprise Miner graphing
algorithms will reduce the dimensions to produce a viewable plot. If categories share high-belief
terms, then you will get fewer than M*K vertical axis cells.
The eight (default) terms along with the term role are
given in the Profiled Variables table.
The terms with the highest belief scores are given in the Profiled Variables table.
Each term has a belief score for each category. The top eight (default) scoring terms for each category are kept. A crosstabulation table for terms by category is color coded based on belief score to show separation between categories.
The above three categories are well separated, as evidenced by the red diagonal elements
corresponding to the eight terms for the category, and the blue off-diagonal elements, corresponding
to the terms for the other categories. The above 24 by 3 table would have fewer than 24 vertical cells
if categories shared high-belief terms. This is unlikely because terms are down-weighted if they are
common to more than one category. However, if categories are independent of terms, then the color
map appears as a somewhat random mixing of various shades of red and blue. The diagonal will
always be a darker red because diagonals correspond to the highest belief terms, even if the
difference in beliefs is small.
For a target variable that has categories assigned randomly, independent of terms, off-diagonal
elements are more washed out, with only a few cells having a distinct blue color.
A projection of the eight (default) term beliefs into a two-dimensional space provides an idea of how
well the target values are separated. The terms in the document collection allow for better separation
between medical news stories and hockey news stories than between medical news stories and
graphics news stories. Furthermore, graphics news stories and hockey news stories exhibit the best
separation based on the terms in the collection.
This demonstration illustrates the capabilities and applications of the Text Profile node applied
to a document collection of news articles.
Note: This demonstration is based on a demonstration given in the SAS Text Miner 15.1 Reference
Help.
The SAMPSIO.NEWS data set contains 600 brief news articles. For convenience, and to avoid
problems with user access or data modifications to the SAMPSIO library, the data has been copied
to the course data folder. It can be accessed using the name DMTX51.News. The data set contains the following variables:
TEXT        a nominal variable that contains the text of the news article
graphics    a binary variable that indicates whether the document belongs to the computer graphics category (1=yes, 0=no)
hockey      a binary variable that indicates whether the document belongs to the hockey category (1=yes, 0=no)
medical     a binary variable that indicates whether the document belongs to the medical category (1=yes, 0=no)
newsgroup   a nominal variable that contains the group that a news article fits into
Follow these steps to create a diagram and process flow for profiling a document collection.
1. Create a diagram named News Profiling.
2. Create a data source for DMTX51.News. Set the measurement level for graphics, hockey,
and medical to Binary. Set the role of TEXT to text, and set the role of hockey to Target.
3. Drag the News data source into the News Profiling diagram. Edit the metadata to correspond
to the following table:
4. Attach the following nodes, in order, to the Input Data Source node: Text Parsing, Text Filter,
and Text Profile. Here is what the process flow should look like.
The pie chart reveals that there are 200 documents classified as hockey news articles.
The number of documents for each category is also given in the Profiled Variables table.
The Profiled Variables table reveals eight key terms that help identify whether an article
is a hockey article.
An appealing feature of the Text Profile node is that it captures part-of-speech/entity categories.
These are presented in the term table in the Text Filter Results window as a role characteristic,
but in clustering, cluster terms do not provide a role, and in the Filter Viewer, you cannot query
by role.
The default setting for the Text Profile node is Maximum Number of Terms=8. You can use
the eight derived terms to create a custom Hockey topic.
7. Attach a Text Topic node to the Text Filter node. Create a custom Hockey topic. Use the
following table:
The custom table specifies the eight terms derived by the Text Profile node.
Note: You can use the belief score derived by the Text Profile node for each term, but if
separation is good, belief scores will be near 1, which is the weight used for all terms
in the above table.
8. Run the Text Topic node.
The Hockey topic appears first because it is the only custom topic. Examine the Documents
window. The custom topic seems to do a good job of rank ordering documents with respect
to the hockey target variable. To verify this, a SAS Code node is attached to crosstabulate
hockey by TextTopic_1, where TextTopic_1 is the binary variable for the Hockey custom topic.
The program is called PROC_FREQ_News_Profiling_1.sas and is in
D:\Workshop\Winsas\DMTX51\sassrc. The results follow.
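The program itself is short. A minimal sketch of what it might contain follows (&EM_IMPORT_DATA is the standard macro variable that resolves to the data set imported by a SAS Code node; the actual course program may differ):

/* Crosstabulate the hockey target against the custom Hockey topic flag */
proc freq data=&em_import_data;
   tables hockey*TextTopic_1 / nopercent norow nocol;
run;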
Precision is given by
Precision=157/(157+36)=81.3%
Recall is given by
Recall=157/(157+43)=78.5%
The misclassification rate is
Misc=(43+36)/600=13.2%
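In general, with TP true positives, FP false positives, FN false negatives, and N total documents, precision = TP/(TP+FP), recall = TP/(TP+FN), and the misclassification rate = (FP+FN)/N. Here TP=157, FP=36, FN=43, and N=600.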
The custom topic derived using the Text Profile results does a pretty good job of classifying news
articles with respect to hockey.
10. The Text Profile node contains additional diagnostic results when a nominal target variable has
more than two categories. In the same diagram, drag a News data source into the diagram.
Reject all target variables except newsgroup.
11. Add Text Parsing, Text Filter, and Text Profile nodes to the data source having target variable
newsgroup. Run the Text Profile node.
12. Open the results for the Text Profile node.
With more than two categories, the results contain a Beliefs by Value matrix. If the categories are
well separated, you will see a red diagonal with blue off-diagonal cells. If the categories are not
well separated, there will be a less dramatic color change for off-diagonal elements, resulting
in more neutral colored cells, and few cells having a blue color.
The plot shows that the eight terms associated with each category have a high belief score,
whereas terms identified for other categories have a low belief score for the given category.
If you see belief scores near zero, indicated by a darker blue color, in the off-diagonal cells, then
the categories are well separated based on terms used in the documents. If cell belief values
in the off-diagonal are closer to 0.5, then a more washed-out or neutral color will indicate a lack
of separation.
Here is a Beliefs by Value table derived using a target value that is independent of document
contents.
Note that the eight high-belief terms for the newsgroup=hockey category in the previous plot are the same terms found for the hockey=1 category in the first process flow.
Practice
4.3 High-Performance (HP) Text Miner Node (Optional)
Objectives
• Identify the benefits of the HP Text Miner node.
• Run a process flow using HP components.
• State the capabilities within the HPTMINE procedure.
You must use a high-performance configuration to effectively take advantage of this capability.
Single-Machine Mode
• With local data
• Does not use the MPP (massively parallel processing) environment
The book: Base SAS® 9.4 Procedures Guide: High-Performance Procedures, Second Edition
The course: SAS® Enterprise Miner™ High-Performance Data Mining Nodes
Performance Results: SAS Global Forum Paper 400-2013
(A bar chart compares elapsed time in seconds for traditional versus HP text mining, broken out by total time, text mining, and predictive modeling.)
The results above were presented in the SAS Global Forum 2013 paper 400-2013.
http://support.sas.com/resources/papers/proceedings13/400-2013.pdf
The data consisted of more than 680,000 paragraphs from the Consumer Complaints data set.
The traditional elapsed times were run on a Windows server with two CPUs and 128 GB of memory.
The HP nodes were run on a cluster containing 16 computing nodes, each with two CPUs and 64
GB of memory. The high-performance text mining procedures can reduce a 30-minute task to less
than a minute in a grid computing environment according to findings presented in this paper.
High-performance results can vary and depend on the configuration of your SAS environment
and your specific parallel processing machine configuration.
High-Performance Nodes
The HP Text Miner node (HPTM) is one of many tools available on the HPDM tab of the SEMMA palette shown here. HP Text Miner executes two phases:
• text parsing
• transformation
HPTM Properties
The HP Text Miner node property selections include Detect, Filter, and Transform. Text parsing,
natural language processing, and entity detection are all supported.
Your own pre-existing customized tables can be specified for multi-word terms, synonyms, and stop
lists just as in the regular Text Mining nodes. The SVD (Singular Value Decomposition) resolution
can be any number from 2 to 500. A higher number generates a better data summary but takes more
computing power to finish.
The Max SVD Dimensions property (maxdim) accounts for p% of the total variance. High resolution
always generates the maximum number of SVD dimensions (maxdim). For medium resolution, the
recommended number of SVD dimensions accounts for 5/6*(p% of the total variance). For low
resolution, the recommended number of SVD dimensions accounts for 2/3*(p% of the total variance).
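For example, if p% corresponds to capturing 90% of the total variance, medium resolution recommends enough dimensions to account for 75% (5/6 × 90%) and low resolution enough to account for 60% (2/3 × 90%).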
The property shown here is from the regular Text Filter node. This option was not made available
in the High-Performance Text Miner node.
Note: The role Key must be set when the data source is defined to Enterprise Miner. It is not
available in the Metadata node. There are seven variables in this data set, including four
inputs, a text, a key, and a target. A target variable is not necessary for unsupervised text
mining.
Process Flow
A data source is connected to the HP Text Miner node in this diagram.
It has been run, and the results are on the next slide.
All properties in this example were left as their defaults.
The Topics and Terms table results are contained in the node results. The maximum number
of terms shown in the table is 20,000 by default. The other plots in this window provide the same
insight into the document collection as described in previous chapters.
Output Window
The Output window lists the tasks that were run. The HPTMINE procedure consolidates the actions that would be performed by several non-high-performance nodes.
Topics
The SVD process was used to derive the list of topics from the document
collection. This list used the default low-resolution setting.
This demonstration illustrates the capabilities and results of the High-Performance Text Miner node
in a single-machine environment.
3. From the HPDM SEMMA tab, pull an HP Text Miner node into the diagram and connect the data
source to it.
4. Run the node with all the default settings, and open the results.
5. Maximize the Output window. The results indicate that the HPTMINE procedure ran in single-
machine mode. The mode depends on the type of Enterprise Miner implementation. The Output
window shows that the document collection was parsed, terms were analyzed and filtered,
and singular value decomposition was done.
6. Examine the fields of the Terms and Topics tables that were created. Note how similar
(and familiar) they are compared to the Text Mining nodes that we ran in previous chapters.
Declare the data set, ID (or key) variable, and the name of the text variable. The data set that you
use in batch mode will likely not have the Data Mining metadata attributes, so you have to identify
these variables in the procedure as specified.
These lines state that we want to parse the document collection. Stemming, tagging, and noun grouping operations can be performed or suppressed by including or excluding the corresponding options.
Term and cell weights are applied to the compressed term-by-document matrix. Term weight can
be Entropy, MI, or None. Cell weight can be Log or None. Terms appearing in fewer than the
reduced number of documents will be excluded from analysis. Entities can either be identified
or ignored.
Term number, document, and count variables will be produced in these output data sets.
The OUTCONFIG= data set is used if a subsequent HPTMSCORE procedure will be run against
the document collection results.
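Putting the pieces together, a minimal sketch of such a PROC HPTMINE call might look like the following. The data set and variable names are hypothetical, and the option values are illustrative:

proc hptmine data=mylib.docs;         /* document collection */
   doc_id key;                        /* ID (key) variable */
   var text;                          /* text variable to parse */
   parse notagging nonoungroups      /* suppress tagging and noun grouping */
         termwgt=entropy              /* term weight: ENTROPY, MI, or NONE */
         cellwgt=log                  /* cell weight: LOG or NONE */
         reducef=4                    /* drop terms in fewer than 4 documents */
         entities=std                 /* identify standard entities */
         stop=sashelp.engstop         /* a customized stop list could go here */
         outterms=work.terms          /* term data set */
         outparent=work.parent        /* compressed term-by-document matrix */
         outconfig=work.config;       /* settings reused by PROC HPTMSCORE */
   svd max_k=50 res=low
       outdocpro=work.docpro;         /* SVD document projections */
   performance details;               /* report task timing in the output */
run;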
3. Submit the code by pressing F3 or clicking the Run icon. Check the log.
5. Observe the contents of the Output window to see the task timing and the OUTTERMS= data set
results.
6. (Optional) Open SAS Explorer, select the Show Project Data box, and select the Work library.
You will be able to see the output data sets created by the HPTMINE procedure. The
OUTPARENT= data set is shown below.
Note: If you are running in single-machine mode as in the classroom environment, the original Data
Partition node could be used and will not generate any errors. You could use either Data
Partition node in this specific case.
However, it is a best practice to use the HP Data Partition node with the HP nodes because
this is necessary in distributed mode.
HP Partition Properties
These are the selections available in the HP partition node:
The HP Data Partition node supports two types of partitioning: simple random and stratified. With
a class target variable, the default method is stratified partitioning. Otherwise, simple random
partitioning is used. The node supports up to two stratification variables.
HP Tree
There is a high-performance decision tree node that runs in a high-
performance environment. Find it on the HPDM tab of the SEMMA palette.
Although additional high-performance modeling nodes are available, we will look at only the HP Tree
node. Explore the other modeling nodes and compare the results to select your champion model.
HP Tree Properties
A few new properties are in the HP Tree node:
• Nominal Target Criterion: Fast CHAID
• Interval Bins: for interval variables
• Minimum KS distance: for Fast CHAID trees
There is also a property to create a validation sample from the training data set. If you have
a validation data set available, it could be used in lieu of a sample.
Notable properties that are unavailable with this node include the Interactive Tree Viewer, Decisions
and Priors, and Cross Validation.
Process Flow
In this demonstration, copy the previous process flow and add the HP
Partition and HP Tree nodes with the default properties. The copied and
modified flow is shown below. The data source has a binary target variable:
SubroFlag.
Results
The fit statistics from the results of running the HP Tree node with default
settings are below. This run of the node used the Validation partition from
the HP Part node because we did not request that it create its own
validation set.
This demonstration illustrates how to perform predictive modeling using high-performance nodes.
1. Open the High-Performance Text Mining diagram from the previous demonstration.
2. Copy the two nodes in the diagram (HPDMINE and HP Text Miner), and paste them in the
diagram below the originals.
3. Drag an HP Data Partition node and an HP Tree node into the diagram from the HPDM tab.
4. Connect the nodes in the order shown below. Run the process flow from the end with default
settings.
6. Examine the properties and the results of the HPPart node and verify that 70% of the data was allocated to the Train role. (The remaining 341 of the 1135 observations were allocated to Validation: 341 / 1135 = .30.)
7. Look at the results of the HP Text Miner node. Notice that more topics were derived compared
to the previous run.
8. Open the results of the HP Tree node. Are there any property settings that you would consider
testing to possibly create an even better predictive model? Look at the Leaf Statistics plot for a hint.
4.4 Chapter Summary
The Text Rule Builder node provides a stand-alone predictive modeling solution for data having
a text variable and a categorical target variable. This node creates Boolean rules from small subsets
of terms to predict a categorical target variable. The node must be preceded by Text Parsing and
Text Filter nodes.
The Text Profile node derives belief scores based on word association with the levels of a categorical
target variable. The terms with the top belief scores help profile the levels of a categorical variable.
The derived terms can subsequently be used to enhance querying or to facilitate custom topic
identification.
High-performance text mining and predictive modeling procedures are designed to take advantage
of specially configured computing environments and technology. Distributed configurations can
enhance the speed of analysis by running multi-threaded parallel processing and I/O operations.
The analysis path is shortened even further when analytical processes run either in the database or alongside the database. Reducing the number of passes through the data results in less total read time, and keeping data in memory provides fast analysis when it is needed.
SAS Enterprise Miner includes high-performance nodes on the HPDM tab. The High-Performance
Text Miner node offers simplified selections and combines tasks performed by several individual
Text Mining nodes. Text parsing, filtering, and topic creation all are accomplished in the one node.
The results can be combined with additional data for supervised predictive modeling applications.
References
Allan, E., M. Horvath, C. Kopek, B. Lamb, T. Whaples, and M. Berry. 2008. “Anomaly Detection Using Nonnegative Matrix Factorization.” In Survey of Text Mining II: Clustering, Classification, and Retrieval, ed. M. Berry and M. Castellanos, 203–218. London: Springer-Verlag.
4.5 Solutions
Solutions to Practices
1. Adding a Decision Tree Node to the Flow to Compare to the Previous Four Regression
Models
The tree should be added to the previous flow by attaching the Decision Tree node to the
Text Topic node corresponding to entropy.
The Model Comparison node shows the following results for the Decision Tree node.
2. Adding a Text Rule Builder Node to the Flow to Compare to the Previous Five Models
The results highlighting the Text Rule Builder ROC index are given below.
Recall that the Text Rule Builder cannot use any non-text inputs. This limitation did not affect
the ASRS project because no other inputs were available. The above results for Target05 are
consistent with the results that you obtained earlier for Target02 in that the Text Rule Builder
is not superior to conventional predictive modeling techniques, but it is competitive with such
techniques while adding an element of interpretability.
3. Creating Custom Topics Using Text Profiles
a. Using the diagram from the previous demonstration, attach a Text Profile node to the Hockey process flow for the target variable hockey. Set Maximum Number of Terms to 20.
b. Attach a Text Topic node to the Text Filter node for the Hockey process flow. Create a
custom topic table using the terms identified by the Text Profile node using a maximum
of 20 terms.
If you copy and paste the original Text Topic node and SAS Code node and attach the copied nodes to the Text Filter node, you can append to the original custom topic table. The final table follows.
c. How does your custom Hockey topic compare to the one derived in the demonstration?
In the SAS Code node, change the variable TextTopic_1 to TextTopic2_1. The results are
given below.