CS109/Stat121/AC209/E-109
Data Science
Bayesian Methods Continued, Text Data
Hanspeter Pfister, Joe Blitzstein, Verena Kaynig
[Figure: Topics | Documents | Topic proportions and assignments]
Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of “topics,” which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic.
Blei, https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
This Week
• Project team info is due tonight at 11:59 pm via
the Google form:
http://goo.gl/forms/CzVRluCZk6
• HW4 is due this Thursday (Nov 5) at 11:59 pm
• Before this Thursday’s lecture on interactive
visualizations:
• Download/install Tableau Public at
https://public.tableau.com/
• Download data file (.zip) from
http://bit.ly/cs109data
MCMC as mountain exploration
vs.
http://healthyalgorithms.com/2010/03/12/a-useful-metaphor-for-explaining-mcmc/
Bayesian Hierarchical Models: Radon Example
Example from Gelman: http://www.eecs.berkeley.edu/~russell/classes/cs294/f05/papers/gelman-2005.pdf
Python-based exposition at
http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/
Complete Pooling vs. No Pooling
complete pooling: $\text{radon}_{i,c} = \alpha + \beta \cdot \text{floor}_{i,c} + \epsilon$
no pooling: $\text{radon}_{i,c} = \alpha_c + \beta_c \cdot \text{floor}_{i,c} + \epsilon_c$
Partial Pooling
[Diagrams: no pooling vs. partial pooling / hierarchical model]
Partial Pooling
$\text{radon}_{i,c} = \alpha_c + \beta_c \cdot \text{floor}_{i,c} + \epsilon_c$
$\alpha_c \sim N(\mu_\alpha, \sigma_\alpha^2)$
$\beta_c \sim N(\mu_\beta, \sigma_\beta^2)$
http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/
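A partial-pooling model along these lines can be written in a few lines of PyMC3, in the spirit of the linked post. This is only a rough sketch: the tiny synthetic data, the prior choices, and the variable names are placeholders (and it uses PyMC3-era keywords such as sd; newer versions use sigma), not the post's exact code.

import numpy as np
import pymc3 as pm

# Tiny synthetic stand-in for the radon data: log radon level per house,
# floor indicator (0 = basement, 1 = first floor), and the county of each house.
n_counties = 3
county_idx = np.array([0, 0, 1, 1, 2, 2])
floor = np.array([0, 1, 0, 1, 0, 1])
log_radon = np.array([1.1, 0.8, 1.5, 0.9, 0.7, 0.3])

with pm.Model() as hierarchical_model:
    # Hyperpriors: county intercepts and slopes share common means and spreads
    mu_alpha = pm.Normal('mu_alpha', mu=0., sd=10.)
    sigma_alpha = pm.HalfCauchy('sigma_alpha', beta=5.)
    mu_beta = pm.Normal('mu_beta', mu=0., sd=10.)
    sigma_beta = pm.HalfCauchy('sigma_beta', beta=5.)

    # County-level parameters, partially pooled toward the shared means
    alpha = pm.Normal('alpha', mu=mu_alpha, sd=sigma_alpha, shape=n_counties)
    beta = pm.Normal('beta', mu=mu_beta, sd=sigma_beta, shape=n_counties)

    # Measurement noise and likelihood
    eps = pm.HalfCauchy('eps', beta=5.)
    radon_est = alpha[county_idx] + beta[county_idx] * floor
    pm.Normal('radon_like', mu=radon_est, sd=eps, observed=log_radon)

    trace = pm.sample(2000)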
Hierarchical Models Provide:
• a compromise between no pooling and complete pooling
• regularization and shrinkage
• sensible estimates even for small groups
• an interpretable organization of the parameters
• a way to incorporate information at different levels of the hierarchy
(e.g., individual level, county level, state level)
• predictions at various levels of the hierarchy (e.g., for a
new house or for a new county)
Gibbs Sampler
Explore space by updating one coordinate at a time.
2D parameter space version:
Draw new $\theta_1$ from the conditional distribution of $\theta_1 \mid \theta_2$
Draw new $\theta_2$ from the conditional distribution of $\theta_2 \mid \theta_1$
Repeat
http://zoonek.free.fr/blosxom//R/2006-06-22_useR2006_rbiNormGiggs.png
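For concreteness, here is a minimal sketch of that two-coordinate scheme for a standard bivariate normal with correlation rho, where both conditionals are known normals. The target distribution, starting point, and number of steps are illustrative choices, not part of any particular example above.

import numpy as np

def gibbs_bivariate_normal(rho, n_steps=5000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each conditional theta1 | theta2 (and vice versa) is N(rho * other, 1 - rho**2),
    so we alternate exact draws from the two conditionals."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0                 # arbitrary starting point
    samples = np.empty((n_steps, 2))
    for t in range(n_steps):
        theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho**2))  # draw theta1 | theta2
        theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho**2))  # draw theta2 | theta1
        samples[t] = theta1, theta2
    return samples

draws = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(draws[1000:].T))            # should be close to 0.8 after burn-in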
Gibbs sampler animation
http://twiecki.github.io/blog/2014/01/02/visualizing-mcmc/
Metropolis-Hastings Algorithm
Modify a Markov chain on a state space of interest to obtain
a new chain with any desired stationary distribution!
1. If $X_n = i$, propose a new state $j$ using the transition probabilities $p_{ij}$ of the original Markov chain.
2. Compute the acceptance probability
$a_{ij} = \min\left(\dfrac{s_j\, p_{ji}}{s_i\, p_{ij}},\ 1\right)$.
3. Flip a coin that lands Heads with probability $a_{ij}$, independently of the Markov chain.
4. If the coin lands Heads, accept the proposal and set $X_{n+1} = j$. Otherwise, stay in state $i$; set $X_{n+1} = i$.
In other words, the modified Markov chain uses the original transition probabilities $p_{ij}$ to propose where to go next, then accepts the proposal with probability $a_{ij}$, staying in its current state otherwise.
https://www.siam.org/pdf/news/637.pdf
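As a toy illustration of steps 1–4, here is a sketch of a random-walk Metropolis sampler targeting an unnormalized density s. The target, starting point, and proposal scale are arbitrary choices; because the normal proposal is symmetric, the $p_{ji}/p_{ij}$ factor cancels in the acceptance ratio.

import numpy as np

def metropolis(log_s, x0=0.0, proposal_sd=1.0, n_steps=10000, seed=0):
    """Random-walk Metropolis: propose x' ~ N(x, proposal_sd^2) and accept
    with probability min(s(x')/s(x), 1), working on the log scale."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        proposal = x + rng.normal(0.0, proposal_sd)   # step 1: propose a move
        log_accept = log_s(proposal) - log_s(x)       # step 2: log acceptance ratio
        if np.log(rng.random()) < log_accept:         # steps 3-4: accept or stay put
            x = proposal
        samples[t] = x
    return samples

# Example target: an unnormalized standard normal, log s(x) = -x^2 / 2.
draws = metropolis(lambda x: -0.5 * x**2)
print(draws[2000:].mean(), draws[2000:].std())        # roughly 0 and 1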
Metropolis-Hastings animation
http://twiecki.github.io/blog/2014/01/02/visualizing-mcmc/
MCMC in Python
• Stan: http://mc-stan.org
• PyMC: https://pymc-devs.github.io/pymc/
Mosteller-Wallace, Federalist Papers Authorship
Mosteller-Wallace, Federalist Papers Authorship
https://www.stat.cmu.edu/Exams/mosteller.pdf
Use of “upon” by Hamilton vs. Madison
In combination with a similar treatment of other “non-contextual” words in these writings, this approach provided strong evidence that Madison was the author of all twelve of the disputed papers, essentially settling the authorship debate.
Rate/1000 words | Authored by Hamilton | Authored by Madison | 12 Disputed Papers
Exactly 0       | 0                    | 41                  | 11
(0.0, 0.4)      | 0                    | 2                   | 0
[0.4, 0.8)      | 0                    | 4                   | 0
[0.8, 1.2)      | 2                    | 1                   | 1
[1.2, 1.6)      | 3                    | 2                   | 0
[1.6, 2.0)      | 6                    | 0                   | 0
[2.0, 3.0)      | 11                   | 0                   | 0
[3.0, 4.0)      | 11                   | 0                   | 0
[4.0, 5.0)      | 10                   | 0                   | 0
[5.0, 6.0)      | 3                    | 0                   | 0
[6.0, 7.0)      | 1                    | 0                   | 0
[7.0, 8.0)      | 1                    | 0                   | 0
Totals          | 48                   | 50                  | 12
Table 1.2.1: Frequency distribution of the word “upon” in 110 essays.
Table from Samaniego, Stochastic Modeling and Mathematical Statistics
But what is the probability that Madison authored a particular disputed document, and how confident should we be about our answer?
Poisson Model
$f(y \mid \lambda) = \dfrac{e^{-\lambda}\,\lambda^{y}}{y!}$
y is the number of occurrences of a specific word
$\lambda$ is the rate parameter
Gamma prior is conjugate: $p(\lambda) \propto \lambda^{a-1} e^{-b\lambda}$
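Concretely, with a Gamma(a, b) prior and counts y_1, ..., y_n of the word in n blocks of equal length, conjugacy gives a Gamma(a + sum(y), b + n) posterior. A small sketch of that update follows; the prior values and counts below are made up for illustration, not the Federalist data.

import numpy as np
from scipy import stats

# Made-up example: counts of a word in n blocks of 1,000 words each,
# each modeled as Poisson(lambda).
y = np.array([1, 0, 2, 1, 3, 0, 1, 2])
a, b = 1.0, 1.0                              # Gamma(a, b) prior on the rate lambda

# Conjugacy: posterior is Gamma(a + sum(y), b + n)
a_post, b_post = a + y.sum(), b + len(y)
posterior = stats.gamma(a_post, scale=1.0 / b_post)

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))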
Likelihood and Posterior for Madison’s use of “from”
[Figure (Fig. 12.2): the posterior and the likelihood function for the rate of Madison’s use of the word “from” (per 1000 words, horizontal axis roughly 0.9 to 1.6); the plotted posterior density is dgamma(x, shape = 331.6, rate = 270.3).]
n-grams
Data science is fun.
Unigrams: look at individual words.
“data”, “science”, “is”, “fun”
Bigrams: look at word pairs.
“data science”, “science is”, “is fun”
Trigrams: look at word triplets.
“data science is”, “science is fun”
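All three can be read off with zip over shifted copies of the token list; a minimal sketch (for simplicity the period is dropped and the words lowercased):

tokens = "Data science is fun".lower().split()

unigrams = tokens
bigrams = list(zip(tokens, tokens[1:]))                # consecutive pairs
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))   # consecutive triples

print(unigrams)   # ['data', 'science', 'is', 'fun']
print(bigrams)    # [('data', 'science'), ('science', 'is'), ('is', 'fun')]
print(trigrams)   # [('data', 'science', 'is'), ('science', 'is', 'fun')]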
n-grams: Randomized Hobbit
into trees, and then bore to the Mountain to go
through?” groaned the hobbit. “Well, are you
doing, And where are you doing, And where
are you?” it squeaked, as it was no answer.
They were surly and angry and puzzled at
finding them here in their holes
Karl Broman, Randomized Hobbit
http://www.r-bloggers.com/randomized-hobbit/
n-grams: Hobbit/Cat in the Hat Mixture
“I am Gandalf,” said the fish. This is no way at all!
already off his horse and among the goblin and the dragon,
who had remained behind to guard the door. “Something is
outside!” Bilbo’s heart jumped into his boat on to sharp
rocks below; but there was a good game, Said our fish No!
No! Those Things should not fly.
Karl Broman, Randomized Hobbit
http://www.r-bloggers.com/randomized-hobbit/
if current == ".": return " ".join(result) # if "." we're done
n-grams
The sentences it produces are gibberish, but they’re the kind of gibberish you might
put on your website if you were trying to sound data-sciencey. For example:
If you may know which are you want to data sort the data feeds web friend someone on
trending topics as the data in Hadoop is the data science requires a book demonstrates
why visualizations are but we do massive correlations across many commercial disk
drives in Python language and creates more tractable form making connections then
use and uses it to solve a data.
—Bigram Model
We can make the sentences less gibberishy by looking at trigrams, triplets of consecutive words. (More generally, you might look at n-grams consisting of n consecutive words, but three will be plenty for us.) Now the transitions will depend on the previous two words:

trigrams = zip(document, document[1:], document[2:])
trigram_transitions = defaultdict(list)
starts = []

for prev, current, next in trigrams:
    if prev == ".":              # if the previous "word" was a period
        starts.append(current)   # then this is a start word
    trigram_transitions[(prev, current)].append(next)

In hindsight MapReduce seems like an epidemic and if so does that give us new insights into how economies work That’s not a question we could even have asked a few years there has been instrumented.
—Trigram Model

Of course, they sound better because at each step the generation process has fewer choices, and at many steps only a single choice. This means that you frequently generate sentences (or at least long phrases) that were seen verbatim in the original data. Having more data would help; it would also work better if you collected n-grams from multiple essays about data science.

Joel Grus, Data Science from Scratch
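To actually generate text from these transitions, a sampler in the same style picks a start word, then repeatedly draws the next word given the previous two until it emits a period (the stopping line shown earlier, if current == "."). This is a sketch following the book's approach rather than its verbatim code; it assumes document is a list of tokens and uses the starts and trigram_transitions built above.

import random

def generate_using_trigrams():
    current = random.choice(starts)   # a word that followed a period in the data
    prev = "."
    result = [current]
    while True:
        next_word_candidates = trigram_transitions[(prev, current)]
        next_word = random.choice(next_word_candidates)
        prev, current = current, next_word
        result.append(current)
        if current == ".":            # if "." we're done
            return " ".join(result)

print(generate_using_trigrams())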
Topic Modeling
[Figure 1 from Blei (2011): Topics | Documents | Topic proportions and assignments]
Blei, https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
Topic Modeling
[Figure 1 from Blei (2011), annotated: the example topics labeled “genetics,” “evolution,” “brain,” and “computing.”]
Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of “topics,” which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative—they are not fit from real data. See Figure 2 for topics fit from data.
Blei, https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
17,000 articles from Science, 100 topics
[Left panel: bar chart of inferred topic proportions (Probability, 0.0–0.4) over Topics 1–100 for the example article.]

“Genetics”  | “Evolution”  | “Disease”    | “Computers”
human       | evolution    | disease      | computer
genome      | evolutionary | host         | models
dna         | species      | bacteria     | information
genetic     | organisms    | diseases     | data
genes       | life         | resistance   | computers
sequence    | origin       | bacterial    | system
gene        | biology      | new          | network
molecular   | groups       | strains      | systems
sequencing  | phylogenetic | control      | model
map         | living       | infectious   | parallel
information | diversity    | malaria      | methods
genetics    | group        | parasite     | networks
mapping     | new          | parasites    | software
project     | two          | united       | new
sequences   | common       | tuberculosis | simulations
Figure 2: Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article.
Blei, https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
Each word in each document is drawn from one of the topics (step 2b of the generative process), where the selected topic is chosen from the per-document distribution over topics (step 2a).
Latent Dirichlet Allocation (LDA):
Generation and Estimation
http://mcburton.net/blog/joy-of-tm/
Dirichlet Distribution
http://blog.bogatron.net/blog/2014/02/02/visualizing-dirichlet-distributions/
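A quick way to build intuition for the Dirichlet, besides the visualizations in the post above, is to draw from it and look at the samples; a small sketch with illustrative concentration parameters:

import numpy as np

rng = np.random.default_rng(0)

# Each draw is a probability vector over 3 topics (nonnegative, sums to 1).
print(rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=3))     # roughly uniform over the simplex
print(rng.dirichlet(alpha=[0.1, 0.1, 0.1], size=3))     # sparse: mass piles onto one topic
print(rng.dirichlet(alpha=[10.0, 10.0, 10.0], size=3))  # concentrated near (1/3, 1/3, 1/3)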
Latent Dirichlet Allocation (LDA):
Generative Model
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Latent Dirichlet Allocation (LDA):
Generative Model Example
• Pick 5 to be the number of words in D.
• Decide that D will be 1/2 about food and 1/2 about cute animals.
• Pick the first word to come from the food topic, which then gives you the word “broccoli”.
• Pick the second word to come from the cute animals topic, which gives you “panda”.
• Pick the third word to come from the cute animals topic, giving you “adorable”.
• Pick the fourth word to come from the food topic, giving you “cherries”.
• Pick the fifth word to come from the food topic, giving you “eating”.
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
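The same story can be written as a few lines of simulation. The two topics, their word probabilities, and the Dirichlet parameter below are made-up numbers for illustration, not fit to any corpus.

import numpy as np

rng = np.random.default_rng(42)

vocab = ["broccoli", "cherries", "eating", "panda", "adorable", "kitten"]
topics = {
    "food":         np.array([0.4, 0.3, 0.3, 0.0, 0.0, 0.0]),  # word probabilities
    "cute animals": np.array([0.0, 0.0, 0.0, 0.4, 0.3, 0.3]),
}
topic_names = list(topics)

# Step 1: choose the document's topic proportions from a Dirichlet prior.
theta = rng.dirichlet([1.0, 1.0])            # e.g. roughly 1/2 food, 1/2 cute animals

# Step 2: for each of the 5 words, pick a topic, then pick a word from that topic.
doc = []
for _ in range(5):
    z = rng.choice(len(topic_names), p=theta)             # topic assignment
    word = rng.choice(vocab, p=topics[topic_names[z]])    # word from that topic
    doc.append(word)

print(theta, doc)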
Latent Dirichlet Allocation (LDA):
Generative Model
http://mcburton.net/blog/joy-of-tm/
Recommendation Systems and LDA in the NY Times
http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/?_r=2
Recommendation Systems and LDA in the NY Times
http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/?_r=2
LDA Visualization
http://cpsievert.github.io/LDAvis/reviews/reviews.html
pyLDAvis: https://pypi.python.org/pypi/pyLDAvis
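For fitting and then visualizing a model in Python, a typical pipeline is gensim for LDA plus pyLDAvis for the interactive view. A rough sketch with a toy corpus follows; the module path pyLDAvis.gensim reflects older releases, and newer ones moved it to pyLDAvis.gensim_models.

from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis versions: pyLDAvis.gensim_models

# Toy corpus: each document is already tokenized.
texts = [
    ["gene", "dna", "genetic", "genome"],
    ["brain", "neuron", "nerve", "brain"],
    ["data", "number", "computer", "data"],
    ["dna", "gene", "sequence"],
]

dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())

# Interactive topic/term visualization (display in a notebook or save to HTML).
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_vis.html")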