
Extracting Philosophical Topics from Reddit Posts via Topic Modeling

Evan M. Gertis

September 30, 2021

Abstract

The purpose of this study was to develop a system for extracting philosophical arguments
from social media content. In this paper we examine a Reddit dataset that contains the top 1,000
all-time posts from each of the top 2,500 subreddits, 2.5 million posts in total. From this dataset we
examine a sample of 100 posts. We aim to address a specific question: what topics does Latent
Dirichlet Allocation generate after the 1,000 most common English words are removed from our
dataset? Our hypothesis is that the extracted topics will contain philosophical themes.

1 Literature review

What is the history of topic modeling? An early topic model was described by Papadimitriou, Raghavan, Tamaki, and Vempala in 1998.[Steyvers]
Three works that describe topic modeling are highly relevant to this paper:

1. Exploring the Space of Topic Coherence Measures by Michael Röder and others.

2. Reading Tea Leaves: How Humans Interpret Topic Models by Jonathan Chang and others.

3. Surveying a suite of algorithms that offer a solution to managing large document archives by
David M. Blei

Probabilistic Latent Semantic Indexing (pLSI) was created by Thomas Hofmann in 1999.[David
M. Blei] Latent Dirichlet Allocation (LDA) is perhaps the most common topic model currently in use.
It is a generalization of pLSI developed by David Blei.[Blei] The inspiration for this work was to find
a way to apply topic modeling to a dataset. Chang describes topic modeling as a way to organize
and summarize archives at a scale that would not be achievable by human annotation. The work of
Röder and others provides an in-depth explanation of coherence evaluation for LDA. These three papers inspired the
creation of one of the most well-known libraries used in topic modeling, gensim.[gensim]

2 Reading Tea Leaves: How Humans Interpret Topic Models by Jonathan Chang and others

2.1 Authors

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei

2.2 Overview

The authors of this paper discuss probabilistic models for the unsupervised analysis of text.
They focus on models that provide both a predictive model of future text and a latent topic representation of the
corpus. Their studies capture aspects of the models that are typically undetected by
previous evaluation measures based on held-out likelihood.

3 Exploring the Space of Topic Coherence Measures

3.1 Authors

Michael Röder, Andreas Both, Alexander Hinneburg

3.2 Overview

The authors of this paper present a framework that constructs existing word-based coherence measures
as well as new ones by combining elementary components. They present combinations of
components that outperform existing measures with respect to correlation with human ratings. The
authors suggest that their results can be transferred to further applications in the context of text mining
and information retrieval.

4 Surveying a suite of algorithms that offer a solution to managing large document archives

4.1 Authors

David M. Blei

4.2 Overview

The author of this paper explains how topic modeling enables us to organize and summarize electronic
archives at a scale that would be impossible by human annotation. The example shown in the
paper took 17,000 articles from the journal Science and identified 100 topics. This paper describes
how topic modeling provides an algorithmic solution to managing, organizing, and annotating large
archives of texts.

5 Introduction

What is the history of topic modeling? An early topic model was described by Papadimitriou, Ragha-
van, Tamaki and Vempala in 1998. [Steyvers] Another one, called Probabilistic latent semantic in-
dexing (PLSI), was created by Thomas Hofmann in 1999.[David M. Blei] Latent Dirichlet allocation
(LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed
by David Blei, Andrew Ng, and Michael I. Jordan in 2002, allowing documents to have a mixture of
topics.[Blei]
LDA is a generative probabilistic model that assumes that each topic is a distribution over an underlying
set of words, and each document is a mixture over a set of topics. For example, if we take M
documents consisting of N words each and K topics, then the model uses these parameters to learn the
topic structure.

Figure 1: LDA in plate notation

Using Figure 1 above, we can describe LDA with the following parameters.

• K, the number of topics.

• N, the number of words in a document.

• α, the Dirichlet-prior concentration parameter of the per-document topic distribution.

• β, the Dirichlet-prior concentration parameter of the per-topic word distribution.

• φ(k), the word distribution for topic k.

• θ(i), the topic distribution for document i.

• z(i, j), the topic assignment for word w(i, j).

• w(i, j), the jth word in the ith document.

In this list φ and θ are Dirichlet distributions, while z and w are multinomials.
The α parameter is known as the Dirichlet-prior concentration parameter for the document-topic
density. With a high α, documents are assumed to consist of more topics, resulting in a
larger topic distribution per document.
The β parameter is the prior concentration parameter for the topic-word density. With
a high β, topics are assumed to consist of more words, resulting in a larger word distribution per
topic. The tuning algorithm is written in Python as follows:

import pandas as pd
from gensim.models.ldamodel import LdaModel

# Grid-search results are collected in this dictionary before being written to disk.
model_results = {'Validation_Set': [], 'Topics': [], 'Alpha': [], 'Beta': [], 'Coherence': []}

for i in range(len(corpus_sets)):
    for k in topics_range:
        for a in alpha:
            for b in beta:
                # compute_coherence_values is our helper (sketched in Section 6.6);
                # it trains an LDA model and returns its coherence score.
                cv = compute_coherence_values(corpus=corpus_sets[i],
                                              dictionary=id2word,
                                              k=k, a=a, b=b)
                model_results['Validation_Set'].append(corpus_title[i])
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)

pd.DataFrame(model_results).to_csv('./results/lda_tuning_results.csv', index=False)

num_topics = 8

# The constructor call was partially lost in extraction; gensim's LdaModel
# matches the parameters shown here.
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=num_topics,
                     random_state=100,
                     chunksize=100,
                     passes=10,
                     alpha=0.01,
                     eta=0.9)

In the first phase of the algorithm we iterate through the validation corpora. Then we iterate over
each of the K topic counts. For each value of K we iterate through each α value, and for each α
value we iterate through each β value. We use this algorithm to compute the coherence score cv for
each combination of K, α, and β.
In this paper we present a practical application of topic modeling. Our justification is that there is a
need for papers that describe practical implementations of a topic modeling system. In our research
we combine the fields of philosophy and text mining in an attempt to extract meaningful philosophical
topics from conversations on Reddit.

6 Theory

In this paper, we ask: what types of topics are extracted from top Reddit posts after the 1,000 most
common English words are removed from the text? Our hypothesis is that the extracted topics
will reflect philosophical relationships.
Our justification is that online communities like Reddit are used for sharing knowledge and experience.
W. Russ Payne of Bellevue College asks, "Is Truth Relative to Meaning?" There is a further
potential source of confusion about truth that might be worth addressing at this point. Words and
sentences can be used in lots of different ways. Even if we are not being inventive with language,
there is lots of vagueness and ambiguity built into natural language.[Payne] Most words are ambiguous:
a single word form can refer to more than one concept. For example, the word
form "bark" can refer either to the noise made by a dog, or to the outer covering of a tree. This form
of ambiguity is often referred to as 'lexical ambiguity'.[Rodd, 2021]

6.1 Designing An Effective Topic Modeling System

The system that we’ve designed in this study follows the recommended steps as described by Jiawei
Han.[Jiawei Han] Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD.[Jiawei Han] We applied the KDD pattern described below
to mine text from the top Reddit posts dataset.

1. Data Cleaning (remove noise and inconsistent data)

2. Data Integration (reading data from multiple data sources)

3. Data Selection (selecting data that is relevant for analysis)

4. Data Transformation (data are transformed and consolidated into forms appropriate for mining
by performing summary or aggregation operations)

5. Data Mining (an essential process where intelligent methods are applied to extract data patterns)

6. Pattern Evaluation (identifying truly interesting patterns representing knowledge based on interestingness)

7. Knowledge Presentation (visualization and knowledge representation)

6.2 Data Cleaning

Data cleaning can be applied to remove noise and correct inconsistencies in data.[Jiawei Han] The
initial Reddit dataset contained 21 columns. For our purposes we are only interested in the contents of
the title and selftext columns; the selftext column contains the body of the Reddit post. We removed the
rest of the metadata columns and all punctuation. Then we converted titles to lower case.
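A minimal sketch of this cleaning step is shown below. It assumes the dataset has already been loaded into a pandas DataFrame named df (see Section 6.3) and that the post body is stored in a selftext column; the exact column names in the raw export may differ.

import string

# Keep only the two columns we analyze; the remaining metadata columns are dropped.
df = df[["title", "selftext"]]

# Remove punctuation from the titles and normalize them to lower case.
df["title"] = (
    df["title"]
    .astype(str)
    .str.translate(str.maketrans("", "", string.punctuation))
    .str.lower()
)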

Figure 2: Compact display of the Reddit dataset

6.3 Data Integration

The dataset used in this study came from the all-time top 1,000 posts from each of the top 2,500 subreddits
by subscribers, pulled from Reddit between August 15–20, 2013.[umbrae, 2017] We used a third-party
library, pandas, to load the data into our system. Once the data is loaded into the system, the
resulting DataFrame object can be used throughout the program for data manipulation.
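A minimal loading sketch is given below; the CSV path is illustrative and assumes the per-subreddit files from the dataset have been combined into a single file.

import pandas as pd

# Load the Reddit top-posts dataset into a DataFrame (the path is hypothetical).
df = pd.read_csv("./data/reddit_top_posts.csv")

# Inspect the shape and column names before selecting the fields we need.
print(df.shape)
print(df.columns.tolist())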

6.4 Data Selection

Our topic modeling system was built with Python. We used a third-party library, pandas, to load our
dataset from a CSV file. A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous
tabular data structure.[pandas]

6.5 Data Transformation

The stop words used in our research came from a list of the 1,000 most common words used in the
English language.[deekayen, 2021] Stop words are any words in a stop list (or stoplist or negative
dictionary) that are filtered out (i.e. stopped) before or after processing of natural language
data.[XPO6, 2009] We used gensim to remove stop words from our data and to automatically detect
common phrases. Common phrases can be described as n-grams. An n-gram is a contiguous sequence
of n items from a given sample of text or speech. We use bigram (two words frequently occurring together in
the document) and trigram (three words frequently occurring together in the document) models to construct
a lemmatized set of data.
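A sketch of this transformation step is shown below. The phrase-detection thresholds and the use of spaCy for lemmatization are assumptions; the pipeline used in our system may differ in those details.

import gensim
from gensim.models.phrases import Phrases, Phraser
import spacy

def remove_stopwords(texts, stop_words):
    # Tokenize each document and drop any token found in the 1,000-word stop list.
    return [[w for w in gensim.utils.simple_preprocess(str(doc)) if w not in stop_words]
            for doc in texts]

def build_ngrams(texts):
    # Merge tokens that frequently occur together into bigrams, then trigrams.
    bigram = Phraser(Phrases(texts, min_count=5, threshold=100))
    trigram = Phraser(Phrases(bigram[texts], threshold=100))
    return [trigram[bigram[doc]] for doc in texts]

def lemmatize(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    # Reduce each token to its lemma, keeping only content words.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    return [[tok.lemma_ for tok in nlp(" ".join(doc)) if tok.pos_ in allowed_postags]
            for doc in texts]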

6.6 Pattern Evaluation

We evaluate topic coherence via LDA analysis. To accomplish this we used the gensim library to
develop an LDA model. We then constructed a CoherenceModel, which takes an LDA model built
from the lemmatized data, a corpus dictionary developed from the lemmatized data, and a coherence
measure. The topic coherence model used in gensim follows the implementation of the four-stage
topic coherence pipeline described by [Michael Röder, 2015].
We evaluated our coherence score using a range of 2 to 11 topics with a step size of 1. Based on
trial and error we determined appropriate ranges of α = [0.01, 0.3] and β = [0.01, 0.3].
In our analysis we iterate through our validation corpora. Then for each topic count k we evaluate our
model for each value of α and each value of β. This gives us a diverse set of results that covers
a range of values for α and β.
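A sketch of the compute_coherence_values helper called in the tuning loop of Section 5 is given below. The c_v coherence measure and the exact argument names are assumptions consistent with the gensim API; lemmatized_texts refers to the output of the transformation step in Section 6.5 and is assumed to be in scope.

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

def compute_coherence_values(corpus, dictionary, k, a, b):
    # Train an LDA model with k topics and Dirichlet priors a (alpha) and b (eta).
    lda_model = LdaModel(corpus=corpus,
                         id2word=dictionary,
                         num_topics=k,
                         random_state=100,
                         chunksize=100,
                         passes=10,
                         alpha=a,
                         eta=b)
    # Score the model with the four-stage coherence pipeline (c_v measure).
    coherence_model = CoherenceModel(model=lda_model,
                                     texts=lemmatized_texts,
                                     dictionary=dictionary,
                                     coherence='c_v')
    return coherence_model.get_coherence()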

6.7 Knowledge Presentation

The goal of this study was to determine what topics were left after removing the 1,000 most
common English words.[deekayen, 2021] As stated by Blei, we can use topic models to organize,
summarize, and help users explore large corpora. In our study we used a third-party library for data
visualization, pyLDAvis.

Figure 3: Data Visualization

This library uses the data generated by our model to create an HTML file, which we can then use
to examine our topics (a short usage sketch follows the list below). The visualization shows:

• Selected topics

• The intertopic distance map

• The top 30 most relevant terms for each topic

• The top 30 most salient terms
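A minimal sketch of producing this visualization with pyLDAvis is shown below. The module name pyLDAvis.gensim_models applies to recent releases (older releases expose pyLDAvis.gensim instead), and the output path is illustrative.

import pyLDAvis
import pyLDAvis.gensim_models

# Build the interactive topic visualization from the trained model, corpus, and dictionary.
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)

# Write a standalone HTML file that can be opened in a browser.
pyLDAvis.save_html(vis, "./results/lda_visualization.html")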

6.8 Conclusion

The contribution of this paper is the implementation of the text mining system that we developed
and the data visualization produced by our system. The most frequent term that appeared in all 8 of
our topics was philosophy. We showed that topic modeling can be used to analyze a dataset at a scale
that would not be achievable by human annotation. Our implementation demonstrates how probabilistic
models can be used for the unsupervised analysis of text. Further applications in the context of text
mining and information retrieval can be built on the system that we have created. Our code is available
on GitHub and a live demonstration of the data visualization can be seen here.

References

D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1, 2007.

David M. Blei and John D. Lafferty. Dynamic topic models. In Proceedings of the International Conference on Machine Learning, 2006.

David M. Blei. Probabilistic topic models: surveying a suite of algorithms that offer a solution to managing large document archives. Communications of the ACM, 2012.

deekayen. List of the 1,000 most common English words. [Link], 2021.

gensim. [Link]

Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann.

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. Reading tea leaves: how humans interpret topic models. In Advances in Neural Information Processing Systems, 2009.

Michael Röder, Andreas Both, and Alexander Hinneburg. Exploring the space of topic coherence measures. ACM, 2015.

pandas. [Link]

W. Russ Payne. An Introduction to Philosophy. Bellevue College.

Praveen Suthaharan, Erin J. Reed, Pantelis Leptourgos, Joshua G. Kenney, Stefan Uddenberg, Christoph D. Mathys, Leib Litman, Jonathan Robinson, Aaron J. Moss, Jane R. Taylor, Stephanie M. Groman, and Philip R. Corlett. Paranoia and belief updating during the COVID-19 crisis. Nature Human Behaviour, 2021.

Rania Albalawi, Tet Hin Yeap, and Morad Benyoucef. Using topic modeling methods for short-text data: a comparative analysis.

Jennifer Rodd. Department of Experimental Psychology, University College London, 2021.

Adam Sennet. Ambiguity. The Stanford Encyclopedia of Philosophy, 2021.

Mark Steyvers and Tom Griffiths. Probabilistic topic models. In Handbook of Latent Semantic Analysis. Psychology Press.

umbrae. Reddit top 2.5 million. [Link], 2017.

XPO6. List of English stop words. [Link], 2009.
