Welcome back.
No matter what NLP method you use,
you will one day find yourself faced with an error, for example,
a misclassified sentence. In this video,
I'll show you how to analyze such errors. Let's consider three issues that
can cause errors in the model's predictions. One, semantic
meaning lost
in the pre-processing step. Two, word order, which affects
the meaning of a sentence. And three, some quirks of language that
come naturally to humans but confuse naive Bayes models. So let's start. One of
your main considerations when
analyzing errors in NLP systems is what the processed version
of the text actually looks like. Let's look at this tweet. My beloved grandmother,
with some
punctuation indicating a sad face. The sad face punctuation in this case
is very important to the sentiment of the tweet because it
tells you what's happening. But if you remove punctuation, then
the processed tweet will contain only beloved grandmother,
which looks like a very positive tweet. My beloved grandmother, exclamation mark
would be a very different sentiment. So remember, always check what
the actual text looks like. It's not just about punctuation either. Check out this
tweet. This is not good, because your attitude
is not even close to being nice. If you remove neutral words like not and this,
what you're left
with is the following. Good, attitude, close, nice. From this set of words, any
classifier will infer that
this is something very positive. I'll talk later on about handling negation and
word order. But remember, double check what
your processed text looks like to make sure your model will be
able to get an accurate read. The input pipeline isn't the only
potential source of trouble. Look at these tweets. I'm happy because I did not go.
This is a purely positive tweet. Compare it with, I'm not happy because I did not go,
which has a negative sentiment. In this case,
the not is important to the sentiment but gets missed by your
naive Bayes classifier. So word order can be as
important as spelling. There are many other factors to consider
as well, and you will see more and more ways to build systems that
handle them in the weeks to come. Another problem of naive Bayes is
something called an adversarial attack. Here, the term adversarial attack describes
some common language phenomena, like sarcasm, irony, and euphemism. Humans pick
these up quickly but
machines are terrible at them. This tweet,
this is a ridiculously powerful movie. The plot was gripping and I cried right
through until the ending, contains a somewhat positive movie review, but
pre-processing might suggest otherwise. If you pre-process this tweet, you'll
get a list of mostly negative words, but as you can see, they were actually used to
describe a movie that the author enjoyed. If you use naive Bayes
on this list of words, it will end up giving a very
negative score regardless. Now you know how to apply the naive
Bayes method to text classification. It assumes that words are independent of one another,
which can lead to errors, but you know how to analyze them. It's still a very
powerful baseline. As you know,
it relies on word frequency counts. Next week, we'll learn how to use a different
form of word vectors that can give better results.
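
To make the advice about checking your processed text concrete, here is a minimal sketch in Python. It is not the course's actual pipeline: the preprocess function and the STOP_WORDS list are hypothetical, written only to reproduce the examples from this video, and the sketch assumes a pipeline that lowercases the text, strips ASCII punctuation, and drops stop words, with not and this treated as removable.

```python
import string

# Hypothetical stop-word list, for illustration only. It deliberately treats
# "not" and "this" as removable, as in the examples from the video.
STOP_WORDS = {
    "my", "this", "is", "not", "because", "your", "even", "to", "being",
    "i", "im", "did",
}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(_STRIP_PUNCT)
    return [word for word in cleaned.split() if word not in STOP_WORDS]


tweets = [
    "My beloved grandmother :(",
    "This is not good, because your attitude is not even close to being nice.",
    "I'm happy because I did not go.",
    "I'm not happy because I did not go.",
]

# Print each raw tweet next to what the classifier would actually see.
for tweet in tweets:
    print(f"{tweet!r} -> {preprocess(tweet)}")
```

Under these assumptions, the sad face disappears from the first tweet, the second is reduced to good, attitude, close, nice, and the last two tweets become the same list of words, so a classifier that only counts words cannot tell them apart. Printing the processed text like this before looking at the model is a cheap first step in error analysis.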