NLP Lab manual

The document outlines a procedure for text preprocessing in R, including steps for installing necessary libraries, cleaning text, tokenization, stop word removal, and stemming. It provides a sample program demonstrating these techniques using the tm and SnowballC packages. The output showcases the cleaned text, tokens, tokens without stop words, and the stemmed tokens.

Ex. No: 2: Perform Preprocessing (Tokenization, Script Validation, Stop Word Removal and Stemming) of Text in R Programming

Algorithm:

STEP 1: Install and load the necessary libraries (tm, SnowballC, stringr).
STEP 2: Create a sample text.
STEP 3: Script validation: ensure that the text contains only valid characters. Here, a
regular expression removes every character that is not a letter or whitespace.
STEP 4: Tokenization breaks the text into individual words (tokens). The program
lowercases the text and splits it on whitespace with base R's strsplit.
STEP 5: Stop words are common words (like "the", "is", "in") that are often removed during
text preprocessing. The tm package provides a list of common English stop words
through stopwords("en").
STEP 6: Stemming reduces words to their root form. For example, "running" becomes "run".
The stem need not be a dictionary word: "simple" becomes "simpl". We use the wordStem
function from the SnowballC package.
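As a quick check of the stemming behaviour described in Step 6, the following minimal sketch stems a few inflected forms with SnowballC's wordStem. The word list is an illustrative assumption and is not part of the lab exercise.

```r
# Minimal sketch: stem a few inflected forms with the default Porter stemmer.
# The words below are illustrative only.
library(SnowballC)

words <- c("running", "runs", "learning", "mining")
stems <- wordStem(words)  # default language is "porter"

# Show each word next to its stem; "running" is expected to map to "run"
print(data.frame(word = words, stem = stems))
```

Stemming is rule-based suffix stripping, so irregular forms such as "ran" are left unchanged rather than mapped to "run".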

Program:

# Install (first run only) and load the required packages
install.packages(c("tm", "SnowballC", "stringr"))
library(tm)         # stopwords()
library(SnowballC)  # wordStem()
library(stringr)    # str_replace_all()

# Sample text
text <- "This is a simple text. I am learning text mining in R! It's very interesting."

# Script validation: drop every character that is not a letter or whitespace,
# then lowercase so that stop word matching is case-insensitive
text_clean <- tolower(str_replace_all(text, "[^[:alpha:]\\s]", ""))
cat("Cleaned Text:", text_clean, "\n")

# Tokenization: split on runs of whitespace
tokens <- unlist(strsplit(text_clean, "\\s+"))
cat("Tokens:", paste(tokens, collapse = ", "), "\n")

# Stop word removal using tm's English stop word list
stopwords_list <- stopwords("en")
tokens_no_stopwords <- tokens[!tokens %in% stopwords_list]
cat("Tokens without stopwords:", paste(tokens_no_stopwords, collapse = ", "), "\n")

# Stemming (Porter stemmer, the wordStem default)
stemmed_tokens <- wordStem(tokens_no_stopwords)
cat("Stemmed Tokens:", paste(stemmed_tokens, collapse = ", "), "\n")

Output:

Cleaned Text: this is a simple text i am learning text mining in r its very interesting
Tokens: this, is, a, simple, text, i, am, learning, text, mining, in, r, its, very, interesting
Tokens without stopwords: simple, text, learning, text, mining, r, interesting
Stemmed Tokens: simpl, text, learn, text, mine, r, interest
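The program above works on a plain character vector. The tm package also offers a corpus-based route, in which each step is a transformation applied with tm_map. The sketch below is one possible arrangement of those transformations, not the manual's own method. The order of the steps is an assumption, and the exact whitespace in the result may differ from the token-based output.

```r
# Minimal sketch: the same preprocessing expressed as tm corpus transformations.
library(tm)
library(SnowballC)  # used by stemDocument()

text <- "This is a simple text. I am learning text mining in R! It's very interesting."

# Wrap the text in a one-document corpus
corpus <- VCorpus(VectorSource(text))

# Apply each preprocessing step as a transformation
corpus <- tm_map(corpus, content_transformer(tolower))   # lowercase
corpus <- tm_map(corpus, removePunctuation)              # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                  # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # drop stop words
corpus <- tm_map(corpus, stemDocument)                   # stem remaining words
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra spaces

# Extract the processed text of the single document
result <- content(corpus[[1]])
print(result)
```

This form is convenient when many documents must be processed the same way, since every tm_map call applies to all documents in the corpus.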
