Supervised Machine Learning For Text Analysis in R 1st Edition Emil Hvitfeldt Updated 2025
Supervised Machine
Learning for Text
Analysis in R
Emil Hvitfeldt
Julia Silge
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA
01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermis-
[email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are
used only for identification and explanation without intent to infringe.
DOI: 10.1201/9781003093459
For Grace, Violet, and Lewis, who (thanks to the pandemic and remote
school) had a front row seat to most of my work on this book —J.S.
Contents
Preface
2 Tokenization
2.1 What is a token?
2.2 Types of tokens
2.2.1 Character tokens
2.2.2 Word tokens
2.2.3 Tokenizing by n-grams
2.2.4 Lines, sentence, and paragraph tokens
2.3 Where does tokenization break down?
2.4 Building your own tokenizer
2.4.1 Tokenize to characters, only keeping letters
2.4.2 Allow for hyphenated words
2.4.3 Wrapping it in a function
2.5 Tokenization for non-Latin alphabets
2.6 Tokenization benchmark
2.7 Summary
2.7.1 In this chapter, you learned:
3 Stop words
3.1 Using premade stop word lists
3.1.1 Stop word removal in R
3.2 Creating your own stop words list
3.3 All stop word lists are context-specific
3.4 What happens when you remove stop words
3.5 Stop words in languages other than English
3.6 Summary
4 Stemming
4.1 How to stem text in R
4.2 Should you use stemming at all?
4.3 Understand a stemming algorithm
4.4 Handling punctuation when stemming
4.5 Compare some stemming options
4.6 Lemmatization and stemming
4.7 Stemming and stop words
4.8 Summary
4.8.1 In this chapter, you learned:
5 Word Embeddings
5.1 Motivating embeddings for sparse, high-dimensional data
5.2 Understand word embeddings by finding them yourself
5.3 Exploring CFPB word embeddings
5.4 Use pre-trained word embeddings
5.5 Fairness and word embeddings
5.6 Using word embeddings in the real world
5.7 Summary
5.7.1 In this chapter, you learned:
6 Regression
6.1 A first regression model
6.1.1 Building our first regression model
6.1.2 Evaluation
6.2 Compare to the null model
6.3 Compare to a random forest model
6.4 Case study: removing stop words
6.5 Case study: varying n-grams
6.6 Case study: lemmatization
6.7 Case study: feature hashing
6.7.1 Text normalization
6.8 What evaluation metrics are appropriate?
6.9 The full game: regression
6.9.1 Preprocess the data
6.9.2 Specify the model
6.9.3 Tune the model
6.9.4 Evaluate the modeling
6.10 Summary
6.10.1 In this chapter, you learned:
7 Classification
7.1 A first classification model
7.1.1 Building our first classification model
7.1.2 Evaluation
7.2 Compare to the null model
7.3 Compare to a lasso classification model
7.4 Tuning lasso hyperparameters
7.5 Case study: sparse encoding
7.6 Two-class or multiclass?
7.7 Case study: including non-text data
7.8 Case study: data censoring
7.9 Case study: custom features
7.9.1 Detect credit cards
7.9.2 Calculate percentage censoring
7.9.3 Detect monetary amounts
7.10 What evaluation metrics are appropriate?
7.11 The full game: classification
7.11.1 Feature selection
7.11.2 Specify the model
7.11.3 Evaluate the modeling
7.12 Summary
7.12.1 In this chapter, you learned:
IV Conclusion
Text models in the real world
Appendix
B Data
B.1 Hans Christian Andersen fairy tales
References
Index
Preface
1 https://www.tidymodels.org/