
Extracting Philosophical Topics from Reddit Posts via Topic Modeling

Evan M. Gertis

September 30, 2021

Abstract

The purpose of this study was to develop a system for extracting philosophical arguments
from social media content. In this paper we examine a Reddit dataset that contains the top 1,000
all-time posts from each of the top 2,500 subreddits, 2.5 million posts in total. From this dataset we
examine a sample of 100 posts. We aim to address a specific question: what topics does Latent
Dirichlet Allocation generate after the 1,000 most common English words are removed from our
dataset? Our hypothesis is that the extracted topics will contain philosophical themes.

1 Literature review

What is the history of topic modeling? An early topic model was described by Papadimitriou, Raghavan, Tamaki, and Vempala in 1998.[Steyvers]
Three works that describe topic modeling are highly relevant to this paper:

1. Exploring the Space of Topic Coherence Measures by Michael Röder and others.

2. Reading Tea Leaves: How Humans Interpret Topic Models by Jonathan Chang and others.

3. Surveying a suite of algorithms that offer a solution to managing large document archives by
David M. Blei

Probabilistic Latent Semantic Indexing (pLSI) was created by Thomas Hofmann in 1999.[David
M. Blei] Latent Dirichlet Allocation (LDA) is perhaps the most common topic model currently in use.
It is a generalization of pLSI developed by David Blei.[Blei] The inspiration for this work was to find
a way to apply topic modeling to a dataset. Chang describes topic modeling as a way to organize
and summarize archives at a scale that would not be achievable by human annotation. The work of
Röder and others provides an in-depth explanation of coherence evaluation for LDA. These three papers inspired the
creation of one of the most well-known libraries used in topic modeling, gensim.[gensim]

2 Reading Tea Leaves: How Humans Interpret Topic Models by Jonathan Chang and others

2.1 Authors

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei

2.2 Overview

The authors of this paper discuss probabilistic models for the unsupervised analysis of text.
They focus on models that provide both a predictive model of future text and a latent topic representation of the
corpus. Their studies capture aspects of the models that are typically undetected by
previous evaluation measures based on held-out likelihood.

3 Exploring the Space of Topic Coherence Measures

3.1 Authors

Michael Röder, Andreas Both, Alexander Hinneburg

3.2 Overview

The authors of this paper present a framework that constructs existing word-based coherence measures
as well as new ones by combining elementary components. They present combinations of
components that outperform existing measures with respect to correlation with human ratings. The
authors suggest that their results can be transferred to further applications in the context of text mining
and information retrieval.

4 Surveying a suite of algorithms that offer a solution to managing large document archives

4.1 Authors

David M. Blei

4.2 Overview

The author of this paper explains how topic modeling enables us to organize and summarize electronic
archives at a scale that would be impossible by human annotation. The example shown in the
paper took 17,000 articles from the journal Science and identified 100 topics. This paper describes
how topic modeling provides an algorithmic solution to managing, organizing, and annotating large
archives of texts.

5 Introduction

What is the history of topic modeling? An early topic model was described by Papadimitriou, Ragha-
van, Tamaki and Vempala in 1998. [Steyvers] Another one, called Probabilistic latent semantic in-
dexing (PLSI), was created by Thomas Hofmann in 1999.[David M. Blei] Latent Dirichlet allocation
(LDA), perhaps the most common topic model currently in use, is a generalization of PLSI developed
by David Blei, Andrew Ng, and Michael I. Jordan in 2002, allowing documents to have a mixture of
topics.[Blei]
LDA is a generative probabilistic model that assumes that each topic is a distribution over an underlying
set of words, and each document is a mixture over a set of topics. For example, if we take M
documents consisting of N words each and K topics, then the model uses these parameters to learn the
topic structure.

Figure 1: LDA in plate notation

Using Figure 1 above, we can describe LDA with the following parameters.

• K, the number of topics.

• N, the number of words in a document.

• α, the Dirichlet-prior concentration parameter of the per-document topic distribution.

• β, the Dirichlet-prior concentration parameter of the per-topic word distribution.

• φ(k), the word distribution for topic k.

• θ(i), the topic distribution for document i.

• z(i, j), the topic assignment for word w(i, j).

• w(i, j), the jth word in the ith document.

In this list φ and θ are Dirichlet distributions, while z and w are multinomials.
The α parameter is known as the Dirichlet-prior concentration parameter for the document-topic
density. With a high α, documents are assumed to consist of more topics, resulting in a
larger topic distribution per document.
The β parameter is the prior concentration parameter for the topic-word density. With
a high β, topics are assumed to consist of more words, resulting in a larger word distribution per
topic. The tuning algorithm is written in Python as follows:

import pandas as pd
from gensim.models.ldamodel import LdaModel

# Grid-search results are collected in this dictionary before being written to disk.
model_results = {'Validation_Set': [], 'Topics': [], 'Alpha': [], 'Beta': [], 'Coherence': []}

for i in range(len(corpus_sets)):
    for k in topics_range:
        for a in alpha:
            for b in beta:
                # compute_coherence_values is our helper (sketched in Section 6.6);
                # it trains an LDA model and returns its coherence score.
                cv = compute_coherence_values(corpus=corpus_sets[i],
                                              dictionary=id2word,
                                              k=k, a=a, b=b)
                model_results['Validation_Set'].append(corpus_title[i])
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)

pd.DataFrame(model_results).to_csv('./results/lda_tuning_results.csv', index=False)

num_topics = 8

# The constructor call was partially lost in extraction; gensim's LdaModel
# matches the parameters shown here.
lda_model = LdaModel(corpus=corpus,
                     id2word=id2word,
                     num_topics=num_topics,
                     random_state=100,
                     chunksize=100,
                     passes=10,
                     alpha=0.01,
                     eta=0.9)

In the first phase of the algorithm we iterate through the validation corpora. Then we iterate over
each of the K topic counts. For each value of K we iterate through each α value, and for each α
value we iterate through each β value. We use this algorithm to compute the coherence score cv for
each combination of K, α, and β.
In this paper we present a practical application of topic modeling. Our justification is that there is a
need for papers that describe practical implementations of a topic modeling system. In our research
we combine the fields of philosophy and text mining in an attempt to extract meaningful philosophical
topics from conversations on Reddit.

6 Theory

In this paper, we ask: what types of topics are extracted from top Reddit posts after the 1,000 most
common English words are removed from the text? Our hypothesis is that the extracted topics
will reflect philosophical relationships.
Our justification is that online communities like Reddit are used for sharing knowledge and experience.
W. Russ Payne of Bellevue College asks, "Is Truth Relative to Meaning?" There is a further
potential source of confusion about truth that might be worth addressing at this point. Words and
sentences can be used in lots of different ways. Even if we are not being inventive with language,
there is lots of vagueness and ambiguity built into natural language.[Payne] Most words are ambiguous:
a single word form can refer to more than one concept. For example, the word
form "bark" can refer either to the noise made by a dog, or to the outer covering of a tree. This form
of ambiguity is often referred to as 'lexical ambiguity'.[Rodd, 2021]

6.1 Designing An Effective Topic Modeling System

The system that we’ve designed in this study follows the recommended steps as described by Jiawei
Han.[Jiawei Han] Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD.[Jiawei Han] We applied the KDD pattern described below
to mine text from the top Reddit posts dataset.

1. Data Cleaning (remove noise and inconsistent data)

2. Data Integration (reading data from multiple data sources)

3. Data Selection (selecting data that is relevant for analysis)

4. Data Transformation (data are transformed and consolidated into forms appropriate for mining
by performing summary or aggregation operations)

5. Data Mining (an essential process where intelligent methods are applied to extract data patterns)

6. Pattern Evaluation (identifying truly interesting patterns representing knowledge based on interestingness)

7. Knowledge Presentation (visualization and knowledge representation)

6.2 Data Cleaning

Data cleaning can be applied to remove noise and correct inconsistencies in data.[Jiawei Han] The
initial Reddit dataset contained 21 columns. For our purposes we are only interested in the contents of
the title and selftext columns; the selftext column contains the body of the Reddit post. We removed the
rest of the metadata columns and all punctuation. Then we converted titles to lower case.
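A minimal sketch of this cleaning step is shown below. It assumes the dataset has already been loaded into a pandas DataFrame named df (see Section 6.3) and that the post body is stored in a selftext column; the exact column names in the raw export may differ.

import string

# Keep only the two columns we analyze; the remaining metadata columns are dropped.
df = df[["title", "selftext"]]

# Remove punctuation from the titles and normalize them to lower case.
df["title"] = (
    df["title"]
    .astype(str)
    .str.translate(str.maketrans("", "", string.punctuation))
    .str.lower()
)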

Figure 2: Compact display of the Reddit dataset

6.3 Data Integration

The dataset used in this study came from the all-time top 1,000 posts from each of the top 2,500 subreddits
by subscribers, pulled from Reddit between August 15–20, 2013.[umbrae, 2017] We used a third-party
library, pandas, to load the data into our system. Once the data is loaded into the system, the
resulting DataFrame object can be used throughout the program for data manipulation.
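A minimal loading sketch is given below; the CSV path is illustrative and assumes the per-subreddit files from the dataset have been combined into a single file.

import pandas as pd

# Load the Reddit top-posts dataset into a DataFrame (the path is hypothetical).
df = pd.read_csv("./data/reddit_top_posts.csv")

# Inspect the shape and column names before selecting the fields we need.
print(df.shape)
print(df.columns.tolist())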

6.4 Data Selection

Our topic modeling system was built with Python. We used a third-party library, pandas, to load our
dataset from a CSV file. A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous
tabular data structure.[pandas]

6.5 Data Transformation

The stop words used in our research came from a list of the 1,000 most common words used in the
English language.[deekayen, 2021] Stop words are any words in a stop list (or stoplist or negative
dictionary) that are filtered out (i.e. stopped) before or after processing of natural language
data.[XPO6, 2009] We used gensim to remove stop words from our data and to automatically detect
common phrases. Common phrases can be described as n-grams. An n-gram is a contiguous sequence
of n items from a given sample of text or speech. We use bigram (two words frequently occurring together in
the document) and trigram (three words frequently occurring together in the document) models to construct
a lemmatized set of data.
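A sketch of this transformation step is shown below. The phrase-detection thresholds and the use of spaCy for lemmatization are assumptions; the pipeline used in our system may differ in those details.

import gensim
from gensim.models.phrases import Phrases, Phraser
import spacy

def remove_stopwords(texts, stop_words):
    # Tokenize each document and drop any token found in the 1,000-word stop list.
    return [[w for w in gensim.utils.simple_preprocess(str(doc)) if w not in stop_words]
            for doc in texts]

def build_ngrams(texts):
    # Merge tokens that frequently occur together into bigrams, then trigrams.
    bigram = Phraser(Phrases(texts, min_count=5, threshold=100))
    trigram = Phraser(Phrases(bigram[texts], threshold=100))
    return [trigram[bigram[doc]] for doc in texts]

def lemmatize(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    # Reduce each token to its lemma, keeping only content words.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    return [[tok.lemma_ for tok in nlp(" ".join(doc)) if tok.pos_ in allowed_postags]
            for doc in texts]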

6.6 Pattern Evaluation

We evaluate topic coherence via LDA analysis. To accomplish this we used the gensim library to
develop an LDA model. We then constructed a CoherenceModel, which takes an LDA model built
from the lemmatized data, a corpus dictionary developed from the lemmatized data, and a coherence
measure. The topic coherence model used in gensim follows the implementation of the four-stage
topic coherence pipeline described by [Michael Röder, 2015].
We evaluated our coherence score using a range of 2 to 11 topics with a step size of 1. Based on
trial and error we determined appropriate ranges of α = [0.01, 0.3] and β = [0.01, 0.3].
In our analysis we iterate through our validation corpora. Then for each topic count k we evaluate our
model for each value of α and each value of β. This gives us a diverse set of results that covers
a range of values for α and β.
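A sketch of the compute_coherence_values helper called in the tuning loop of Section 5 is given below. The c_v coherence measure and the exact argument names are assumptions consistent with the gensim API; lemmatized_texts refers to the output of the transformation step in Section 6.5 and is assumed to be in scope.

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

def compute_coherence_values(corpus, dictionary, k, a, b):
    # Train an LDA model with k topics and Dirichlet priors a (alpha) and b (eta).
    lda_model = LdaModel(corpus=corpus,
                         id2word=dictionary,
                         num_topics=k,
                         random_state=100,
                         chunksize=100,
                         passes=10,
                         alpha=a,
                         eta=b)
    # Score the model with the four-stage coherence pipeline (c_v measure).
    coherence_model = CoherenceModel(model=lda_model,
                                     texts=lemmatized_texts,
                                     dictionary=dictionary,
                                     coherence='c_v')
    return coherence_model.get_coherence()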

6.7 Knowledge Presentation

The goal of this study was to determine what topics were left after removing the 1,000 most
common English words.[deekayen, 2021] As stated by Blei, we can use topic models to organize,
summarize, and help users explore large corpora. In our study we used a third-party library for data
visualization, pyLDAvis.

Figure 3: Data Visualization

This library uses the data generated by our model to create an HTML file, which we can then use
to examine our topics (a short usage sketch follows the list below). The visualization shows:

• Selected topics

• The intertopic distance map

• The top 30 most relevant terms for each topic

• The top 30 most salient terms
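A minimal sketch of producing this visualization with pyLDAvis is shown below. The module name pyLDAvis.gensim_models applies to recent releases (older releases expose pyLDAvis.gensim instead), and the output path is illustrative.

import pyLDAvis
import pyLDAvis.gensim_models

# Build the interactive topic visualization from the trained model, corpus, and dictionary.
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)

# Write a standalone HTML file that can be opened in a browser.
pyLDAvis.save_html(vis, "./results/lda_visualization.html")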

6.8 Conclusion

The contribution of this paper is the implementation of the text mining system that we developed
and the data visualization produced by our system. The most frequent term that appeared in all 8 of
our topics was philosophy. We showed that topic modeling can be used to analyze a dataset at a scale
that would not be achievable by human annotation. Our implementation demonstrates how probabilistic
models can be used for the unsupervised analysis of text. Further applications in the context of text
mining and information retrieval can be built on the system that we have created. Our code is available
on GitHub and a live demonstration of the data visualization can be seen here.

References

D. Blei and J. Lafferty. A correlated topic model of Science. Annals of Applied Statistics, 1, 2007.

David M. Blei and John D. Lafferty. Dynamic topic models. In Proceedings of the International Conference on Machine Learning, 2006.

David M. Blei. Probabilistic topic models: surveying a suite of algorithms that offer a solution to managing large document archives. Communications of the ACM, 2012.

deekayen. List of the 1,000 most common English words. [Link], 2021.

gensim. [Link]

Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann.

Jonathan Chang, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. Reading tea leaves: how humans interpret topic models. In Advances in Neural Information Processing Systems, 2009.

Michael Röder, Andreas Both, and Alexander Hinneburg. Exploring the space of topic coherence measures. ACM, 2015.

pandas. [Link]

W. Russ Payne. An Introduction to Philosophy. Bellevue College.

Praveen Suthaharan, Erin J. Reed, Pantelis Leptourgos, Joshua G. Kenney, Stefan Uddenberg, Christoph D. Mathys, Leib Litman, Jonathan Robinson, Aaron J. Moss, Jane R. Taylor, Stephanie M. Groman, and Philip R. Corlett. Paranoia and belief updating during the COVID-19 crisis. Nature Human Behaviour, 2021.

Rania Albalawi, Tet Hin Yeap, and Morad Benyoucef. Using topic modeling methods for short-text data: a comparative analysis.

Jennifer Rodd. Department of Experimental Psychology, University College London, 2021.

Adam Sennet. Ambiguity. The Stanford Encyclopedia of Philosophy, 2021.

Mark Steyvers and Tom Griffiths. Probabilistic topic models. In Handbook of Latent Semantic Analysis. Psychology Press.

umbrae. Reddit top 2.5 million. [Link], 2017.

XPO6. List of English stop words. [Link], 2009.
