Introduction to Statistics Lecture 3 - 9/25/2017

Conditional probability and independence


Lecturer: Arian Maleki Scribe: Arian Maleki

1 Definition of conditional probability


Before I start this topic, let me mention that this is probably one of the most important topics that you
learn in this course. Make sure that you learn this topic now. If certain things confuse you, make sure to
discuss them with me and/or the TA. If you are not aware of the issues that can arise in conditioning and
you do not pay enough attention, you may easily make big mistakes in your conclusions.
Let me start this topic with an example that most of you know and will probably never forget! It is
about the 2016 US election. There is a website, called 538, that estimates the probability of a candidate winning
the election. On Nov. 7, the day before the election, this website estimated the chance of Clinton winning
the election to be around 0.66. Then, on election night, once the polling stations were closed and the
votes were being counted, we started seeing the chance of Clinton winning slowly go down, and after a while,
once the results of Michigan, Florida and Ohio were announced, it became zero!! If the chance of Clinton
winning was 0.66, then what were the other numbers that 538 was reporting?

Let’s look at a few key states in the US election, such as Ohio and Michigan. Let’s also say that 538
estimates the chance of Clinton winning in Ohio, Iowa, Pennsylvania, and Michigan to be 0.55,
0.65, 0.55, and 0.7, respectively.¹ Let’s define these events: A, which is the event of Clinton winning the
general election, and $B_1$, $B_2$, $B_3$, $B_4$, the events of Clinton winning Ohio, Iowa, Pennsylvania, and Michigan, respectively.
Originally 538 believed that
P (A) = 0.66.
Now after counting the votes in the four key states, I realize that Clinton has lost all four. We already know
that the chance of Clinton winning the election now is zero. Right? The way we write this is that

$$P(A \mid B_1^c, B_2^c, B_3^c, B_4^c) = 0.$$

In other words, conditioned on Clinton losing the four key states, the chance of her winning the general
election is zero. What if we have only observed that Clinton has lost Ohio? Then, we write the chance of
Clinton winning the general election conditioned on her losing Ohio as

$$P(A \mid B_1^c).$$

We expect this chance to be lower than $P(A)$. Does this make sense? This is an example of conditioning.
We know the probability of the event A. But if we have some new observation (evidence), e.g. Clinton
losing Ohio, then $P(A)$ is no longer the relevant quantity. What really matters is $P(A \mid B_1^c)$. Let me now formally introduce
the conditional probability and then we should try to understand the definition and see why this is consistent
with our intuition of the conditional probability.
Definition 1. Let A and B be two events with $P(B) > 0$. Then the conditional probability of A given B,
denoted by $P(A \mid B)$, is defined as
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Here, $P(A)$ is called the prior probability of event A, and $P(A \mid B)$ is called the posterior probability of A
given evidence B.
¹ I am now just making up these numbers.

Figure 1: Note that once we know that B has happened, the only elements of the sample space that matter are the
big blue dots that are inside B. The other elements of the sample space become irrelevant. Also, by assumption, we
know that all these 6 elements in B are equiprobable. Then, we can ask: what is the chance of A also happening? If
A is to happen too, then one of the two points in the intersection of A and B should happen. Hence, the chance
of A happening given that we know B has happened is $|A \cap B| / |B|$. Can you explain why? Can you connect this ratio to
$P_{\text{card}}(A \cap B) / P_{\text{card}}(B)$?

The best way to understand why this definition makes sense is to consider a finite sample space with
equiprobable elements and use Cardano’s definition of probability. Consider the sample space S shown in
Figure 1. Please read the caption of the figure to understand where the definition of conditional probability
comes from.
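
The counting argument in the caption of Figure 1 can be checked numerically. The following is only a minimal sketch in Python: the sample space of 12 labelled outcomes and the exact contents of A are made up for illustration, chosen so that B has 6 elements and A ∩ B has 2, as in the figure. The helper names `p_card` and `conditional` are hypothetical.

```python
from fractions import Fraction

# Hypothetical sample space with 12 equiprobable outcomes (labels are made up).
S = set(range(12))
A = {0, 1, 2, 3}          # assumed event A; it shares 2 outcomes with B
B = {2, 3, 4, 5, 6, 7}    # assumed event B with 6 outcomes, as in Figure 1


def p_card(event, sample_space):
    """Cardano's definition: number of favorable outcomes over all outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


def conditional(a, b, sample_space):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    p_b = p_card(b, sample_space)
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return p_card(a & b, sample_space) / p_b


p_a_given_b = conditional(A, B, S)              # from the definition
counting_ratio = Fraction(len(A & B), len(B))   # |A ∩ B| / |B| from the caption
print(p_a_given_b, counting_ratio)              # both are 1/3
```

The two numbers agree because the size of the sample space cancels when the two Cardano probabilities are divided.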

Example 1 Suppose that A ∩ B = ∅. What is P (A | B)? Does the answer make sense? Let’s say A is
the event of Clinton winning. Can you construct a B based on the key states such that A ∩ B = ∅?

Example 2 A standard deck of cards is shuffled well. Two cards are drawn randomly, one at a time
without replacement. Let A be the event of the first card being a heart and B be the event that the second
card is a heart too. Find P (B | A).
From the definition of conditional probability we conclude that in order to calculate $P(B \mid A)$ we have
to calculate $P(A \cap B)$ and $P(A)$. We have
$$P(A \cap B) = \frac{13 \cdot 12}{52 \cdot 51} = \frac{6}{102}.$$
Can you explain why? Also,
$$P(A) = \frac{13}{52} = \frac{1}{4}.$$
Hence,
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{12}{51}.$$
In the class I mentioned another approach to calculate the conditional probability P (B | A). Do you
remember that approach?
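
If you want to double-check the numbers in Example 2, one option is to enumerate every ordered pair of cards and count. The sketch below does this in Python under the assumption that all ordered draws of two distinct cards are equally likely; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations

# Build a 52-card deck as (rank, suit) pairs.
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]

# All ordered draws of two distinct cards; each is assumed equally likely.
draws = list(permutations(deck, 2))

first_heart = [d for d in draws if d[0][1] == "hearts"]         # event A
both_hearts = [d for d in first_heart if d[1][1] == "hearts"]   # event A ∩ B

p_a = Fraction(len(first_heart), len(draws))
p_a_and_b = Fraction(len(both_hearts), len(draws))
p_b_given_a = p_a_and_b / p_a

print(p_a, p_a_and_b, p_b_given_a)   # 1/4 1/17 4/17
```

Note that 1/17 = 6/102 and 4/17 = 12/51, so the enumeration agrees with the calculation above.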

Now, let us do some examples that may challenge the intuitions of some of us.

Example 3 A family has two children, and it is known that at least one is a girl. What is the probability
that both are girls?
Again we use the definition. Let A denote the event of both kids being girls, and B be the event that at
least one of them is a girl. Then, in order to calculate $P(A \mid B)$ we have to first calculate $P(A \cap B)$ and
$P(B)$. Note that
$$P(A \cap B) = P(A) = \frac{1}{4}.$$
Can you explain where the two equalities come from? Now, let’s calculate $P(B)$. Again we have the “at least
one” phrase. So, it is easier to calculate $P(B^c)$:
$$P(B) = 1 - P(B^c) = 1 - \frac{1}{4} = \frac{3}{4}.$$
Can you explain why $P(B^c) = 1/4$? Hence, $P(A \mid B) = \frac{1/4}{3/4} = \frac{1}{3}$.
Before I ask you to think more about this example let me mention another example.
Before I ask you to think more about this example let me mention another example.

Example 4 A family has two kids. The elder one is a girl. What is the chance that both are girls?
Again, let A denote the event of both kids being girls, and B be the event that the elder kid is a girl. Let’s
calculate $P(A \cap B)$ and $P(B)$. As before
$$P(A \cap B) = P(A) = \frac{1}{4}.$$
Can you explain why? It is also straightforward to calculate $P(B)$:

$$P(B) = 0.5.$$

Can you explain why? By dividing these two numbers we obtain $P(A \mid B) = 0.5$.

Now forget about the calculations we did. Didn’t you expect to get the same numbers from these two
examples? Are you convinced that these two problems should give different answers?
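
One way to see that Examples 3 and 4 really condition on different events is to list the sample space explicitly. The sketch below assumes, as the calculations above implicitly do, that each child is a girl or a boy with probability 1/2 independently of the other, so the four (elder, younger) outcomes are equiprobable. The helper names `prob` and `cond` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space of (elder, younger); all four outcomes are assumed equiprobable.
S = list(product("GB", repeat=2))


def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for s in S if event(s)), len(S))


def cond(a, b):
    """P(A | B) = P(A ∩ B) / P(B)."""
    return prob(lambda s: a(s) and b(s)) / prob(b)


def both_girls(s):
    return s == ("G", "G")


def at_least_one_girl(s):
    return "G" in s


def elder_is_girl(s):
    return s[0] == "G"


p_example_3 = cond(both_girls, at_least_one_girl)
p_example_4 = cond(both_girls, elder_is_girl)
print(p_example_3, p_example_4)   # 1/3 1/2
```

The event "at least one is a girl" contains three outcomes, while "the elder one is a girl" contains only two, which is where the two different denominators come from.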

2 Bayes rule and law of total probability


As we will discuss throughout the course, conditional probability is a very important concept in probability
and statistics. Hence, we have to develop some tools that can help us calculate conditional probabilities
more easily. In particular, as we will see later in this lecture, in some cases it is not trivial to calculate
these conditional probabilities directly. The two simple rules that we introduce and prove below help us
calculate the conditional probabilities. The proofs are so easy that many of you may dismiss the importance
of these two theorems. But these are probably two of the few formulas that you have to memorize. Please
pay enough attention to the formulas and their applications. The first rule is called Bayes rule.
Theorem 2 (Bayes rule). Let A and B be two events. If $P(A) > 0$ and $P(B) > 0$, then

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$

Proof Note that $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. Also note that according to the definition of conditional probability
we have $P(A \cap B) = P(B \mid A) P(A)$. By combining these two equations we get the Bayes rule. Is it clear
why?

Before we introduce the next tool that can help you in calculating conditional probabilities, let me remind
you of the definition of a partition of the sample space. $A_1, A_2, \dots, A_n$ is called a partition of S if it has the
following two properties:
1. $A_1, A_2, \dots, A_n$ are disjoint, i.e., for every $i \neq j$, $A_i \cap A_j = \emptyset$.
2. $A_1 \cup A_2 \cup \dots \cup A_n = S$.

With this definition we state the law of total probability.


Theorem 3 (Law of total probability). Let $A_1, A_2, \dots, A_n$ be a partition of the sample space S with $P(A_i) > 0$ for every i. Then,
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) P(A_i).$$

Proof Define $C_i = A_i \cap B$. Can you prove the following statements?

$$C_1 \cup C_2 \cup \dots \cup C_n = B, \qquad C_i \cap C_j = \emptyset \quad \forall i \neq j. \tag{1}$$

Hence, from the second axiom of probability we can conclude that

$$P(B) = P(C_1 \cup C_2 \cup \dots \cup C_n) = \sum_{i=1}^{n} P(C_i). \tag{2}$$

Can you explain why? On the other hand,

$$P(C_i) = P(A_i \cap B) = P(B \mid A_i) P(A_i), \tag{3}$$

where the last equality is due to the definition of conditional probability. Combining (2) and (3) finishes the
proof.

Please note that these types of proofs are easy, and I may ask you in the final exam to prove, for instance, this
theorem. Let’s look at an important application of the Bayes rule and the law of total probability which is
counter-intuitive for most people.

Example 5 Only 1 out of 1000 adults is affected by a rare type of cancer. We have a test to diagnose
this cancer. Let’s say this is a really good test: if the person has the cancer, then the chance of the
test detecting the cancer is 99 percent. Also, if a person does not have the disease, then the chance of the
test giving a positive result is only 1 percent (this type of error is often called a false positive). Let’s
say you are a surgeon. A person comes to you and says that the result of the test is positive. What will be
your decision? Will you suggest that the person start chemotherapy? Let’s think about this question for
a moment. When would you suggest chemotherapy? I think it is reasonable to say that if the chance of
the person having cancer is high, then I suggest chemotherapy. Right? So, let’s calculate the chance of this
person having cancer. Let A denote the event of the person being affected by the cancer and B the event of
the person receiving a positive result from the test. Since we know that B has already happened, the chance
of being affected by the cancer is
$$P(A \mid B).$$
As you can confirm, this quantity is not given to us directly. Instead we have access to $P(B \mid A) = 0.99$ and
$P(B \mid A^c) = 0.01$. Is it clear why? However, if you remember, the Bayes rule connects $P(A \mid B)$ to
$P(B \mid A)$. Hence, let’s use the Bayes rule.

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} = \frac{0.99 \cdot 0.001}{P(B)}.$$

Why did I say that $P(A) = 0.001$? The only remaining step is to calculate $P(B)$. By the law of total
probability,
$$P(B) = P(B \mid A) P(A) + P(B \mid A^c) P(A^c) = 0.99 \cdot 0.001 + 0.01 \cdot 0.999 = 0.01098.$$
Hence, $P(A \mid B) = 0.00099 / 0.01098 \approx 0.09$. In other words, at this point we still believe that the person most likely does not have cancer.
What is next? We have to do another test. I will give you a problem in HW 3 so that you can see how you
can improve the chance of diagnosing cancer.
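
The calculation in Example 5 is short enough to wrap in a small function, which makes it easy to see how the answer changes with the prevalence of the disease. This is only a sketch: the function name `posterior` and the parameter names `sensitivity` and `false_positive_rate` are my own labels for $P(B \mid A)$ and $P(B \mid A^c)$, not notation used in the lecture.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(A | B) by Bayes rule, with P(B) from the law of total probability.

    prior               -- P(A), the chance of having the cancer before the test
    sensitivity         -- P(B | A), chance of a positive test if the cancer is present
    false_positive_rate -- P(B | A^c), chance of a positive test if it is absent
    """
    p_b = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_b


# Numbers from Example 5.
p = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.01)
print(round(p, 4))

# Hypothetical comparison: the same test applied when the disease is common.
p_common = posterior(prior=0.5, sensitivity=0.99, false_positive_rate=0.01)
print(round(p_common, 4))
```

With the numbers of Example 5 the posterior is about 0.09, whereas with a made-up prior of 0.5 the same test gives a posterior of 0.99, which shows how strongly the prior drives the result.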

Please think about this problem for a few minutes. Even though the test is good and the result of the
test is positive, we still think the person does not have cancer. How can this be possible? Note that here we
have two pieces of information about the person and the disease: (i) the result of the test is positive, (ii) very
few people are affected by this cancer. These two pieces of information act in opposite directions. The first
one tends to increase the chance of cancer, but the second one lowers the chance of cancer. Hence, it is very
difficult to argue intuitively which of these two pieces of evidence is more important, and we have to do conditional
probability calculations for that purpose. Please make sure you understand this problem.

The next example is also very important. Please read it very carefully. If you are still confused ask at
the beginning of the next lecture.

Example 6 (Simpson’s paradox) Suppose that there are two types of kidney stones. One is a “large
stone” and the other one is a “small stone”. Let’s assume that removing larger stones is more challenging for
surgeons. Now we want to compare two surgeons. Suppose that both surgeons have performed 100 surgeries
so far. Here are the data we have about the two surgeons. Surgeon A: out of 85 large stone surgeries, she
has been successful in 65 and failed in 20. Out of 15 small stone surgeries she has been successful in all
of them. Surgeon B: he has performed 10 large stone surgeries and failed in 8. Out of 90 small stone surgeries
he failed in 10. Overall, Surgeon A has had 80 successful surgeries, while Surgeon B has had 82 successful
surgeries. Which surgeon will you pick?
I can give you two arguments. We will discuss which one is more meaningful.
1. Surgeon B is better: Because his overall success rate is higher, 82/100 compared to 80/100.
2. Surgeon A is better: If I have a large kidney stone, I should go to A because for large stones she has a
success rate of 65/85, while B has a success rate of 2/10. If I have a small stone, again I should go to A,
because she has had a 100 percent success rate, while B’s success rate is 80/90.
Which argument will you accept? The second argument is correct. Here is the reason. The second surgeon
is not as good as the first surgeon. So he has mostly accepted easier-to-operate patients (90 of his patients
have small stones). The first surgeon is stronger and she has picked more challenging cases. If she had
operated on the group of patients B picked, then she would have almost a 100 percent success rate. Does
this make sense? Now let’s discuss what is happening here mathematically. Note that
$$P(\text{success of A} \mid \text{stone is large}) = 65/85 > 2/10 = P(\text{success of B} \mid \text{stone is large}),$$
$$P(\text{success of A} \mid \text{stone is small}) = 1 > 80/90 = P(\text{success of B} \mid \text{stone is small}). \tag{4}$$
These two conditions are enough to say that A is better than B. Let’s call the group of people A accepted $G_A$
and the group of people B accepted $G_B$. By the law of total probability, when we calculate $P(\text{success of A})$
we have
$$P(\text{success of A}) = P(\text{success of A} \mid \text{stone is large}) P_{G_A}(\text{stone is large}) + P(\text{success of A} \mid \text{stone is small}) P_{G_A}(\text{stone is small}).$$
Note that $P_{G_A}(\text{stone is large})$ denotes the fraction of the people in the first group who suffer from large
stones. Similarly, the success probability of B is
$$P(\text{success of B}) = P(\text{success of B} \mid \text{stone is large}) P_{G_B}(\text{stone is large}) + P(\text{success of B} \mid \text{stone is small}) P_{G_B}(\text{stone is small}).$$
By comparing the above two equations we see that, in the total success probability, in addition to the conditional
probabilities, the population on which the study is performed also matters, e.g. through $P_{G_B}(\text{stone is large})$. By
playing with the population, we can make a worse Dr. look better!! As we discussed in the class, in many
studies, ratings, ... this subtle issue is still overlooked. I think this is an issue that our society should revisit. It
seems to me that ignoring this issue has had some major (and in my opinion negative) effects on our society.
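
To see the role of the population concretely, the sketch below recomputes the overall success rates from the counts given in Example 6 and then asks a hypothetical question: what would Surgeon A's overall rate be if she had operated on Surgeon B's mix of patients? It assumes that her conditional success rates would stay the same on that group. The function names are illustrative only.

```python
from fractions import Fraction

# (successes, surgeries) per surgeon and stone size, taken from Example 6.
data = {
    "A": {"large": (65, 85), "small": (15, 15)},
    "B": {"large": (2, 10), "small": (80, 90)},
}


def conditional_rate(surgeon, size):
    """P(success of surgeon | stone size)."""
    successes, surgeries = data[surgeon][size]
    return Fraction(successes, surgeries)


def case_mix(surgeon):
    """Fraction of large and small stones in the surgeon's own patient group."""
    total = sum(n for _, n in data[surgeon].values())
    return {size: Fraction(n, total) for size, (_, n) in data[surgeon].items()}


def overall_rate(surgeon, mix):
    """Law of total probability: weight each conditional rate by the case mix."""
    return sum(conditional_rate(surgeon, size) * share for size, share in mix.items())


rate_a = overall_rate("A", case_mix("A"))            # A on her own patients
rate_b = overall_rate("B", case_mix("B"))            # B on his own patients
rate_a_on_b_mix = overall_rate("A", case_mix("B"))   # hypothetical: A on B's patients

print(rate_a, rate_b, rate_a_on_b_mix)   # 4/5 41/50 83/85
```

On B's patient mix, A's rate would be 83/85, roughly 0.98, which is higher than B's 0.82 on the same mix.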

3 Independence
Independence is another important concept that we should be careful about. Since independence
makes the calculation of conditional probabilities (which are important in all applications) easy, people
often assume events are independent even though they are not. In this section, I’d like to make sure you do
not repeat the usual mistakes people make.
Let me start with the definition of independence for two events.
Definition 4. Two events A and B with P (A) > 0 and P (B) > 0 are called independent if and only if
P (A ∩ B) = P (A)P (B).
Before we explain what this definition means, let me mention a simple lemma about it. This lemma clarifies
the definition of independence.
Lemma 5. If A and B are independent, then
$$P(A \mid B) = P(A).$$
Proof
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) P(B)}{P(B)} = P(A).$$

What is the intuition behind independence? According to Lemma 5, it means that even if you know B
has happened, this piece of information does not have any effect on the probability of event A. Let me give
you some intuitive examples to help you understand the topic.

Example 7 Let A be the event of Clinton being elected as the president of the US. Consider the following
events and explain if you think A is independent of them or not.
1. $B_1$ is the event that on the day of the election it is raining in Paris.
2. $B_2$ is the event that on the day of the election no one votes in the Bay area (area around San Jose and
San Francisco).
3. $B_3$ is the event that on the day of the election it is raining in the Bay area (area around San Jose and
San Francisco).
4. $B_4$ is the event that Clinton loses in Florida.
It seems quite unreasonable that rain in Paris would have a major effect on the chance of Clinton winning the
election. Hence, A and $B_1$ are independent.
If no one votes in the Bay area, then the chance of Clinton winning the election is very low. Hence, A and $B_2$ are
quite dependent.
The Bay area is more or less highly populated and Democratic. If it rains there, the number of people who will
vote in the area may go down. Hence, the chance of Clinton winning the election in CA goes down. Hence,
A and $B_3$ seem dependent. Note that we often say that the dependency is weak. The reason is that the
effect of rain on voter turnout is small. Hence, we do not expect the rain to have a major effect on
the result of the election and it is often ignored in the predictions.
A and $B_4$ are very dependent, as you can imagine. Why?
How about more than two events?
Definition 6. Three events A, B, and C are called independent if and only if all the following conditions
hold:
$$P(A \cap B \cap C) = P(A) P(B) P(C),$$
$$P(A \cap B) = P(A) P(B), \quad P(A \cap C) = P(A) P(C), \quad P(B \cap C) = P(B) P(C).$$
Can you guess how independence should be extended to more than 3 events?
Definition 7. $A_1, A_2, \dots, A_n$ are called independent if and only if for every $2 \le k \le n$ and every choice of distinct indices $i_1, i_2, \dots, i_k$
we have
$$P(A_{i_1} \cap A_{i_2} \cap \dots \cap A_{i_k}) = P(A_{i_1}) P(A_{i_2}) \cdots P(A_{i_k}).$$
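
To see why the definition asks for the product rule on every sub-collection and not only on pairs, here is a small sketch. The example is my own and is not from the lecture: two fair coin flips, with A = "first flip is heads", B = "second flip is heads", and C = "both flips agree". The helper `independent` is a hypothetical function that checks the condition of Definition 7 by brute force.

```python
from fractions import Fraction
from itertools import combinations, product

# Two fair coin flips; all four outcomes are assumed equiprobable.
S = list(product("HT", repeat=2))


def prob(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(sum(1 for s in S if s in event), len(S))


def independent(events):
    """Check the product rule for every sub-collection of size k >= 2."""
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            intersection = set(S)
            product_of_probs = Fraction(1)
            for e in subset:
                intersection &= e
                product_of_probs *= prob(e)
            if prob(intersection) != product_of_probs:
                return False
    return True


A = {s for s in S if s[0] == "H"}     # first flip is heads
B = {s for s in S if s[1] == "H"}     # second flip is heads
C = {s for s in S if s[0] == s[1]}    # both flips agree

pairwise = all(independent([x, y]) for x, y in combinations([A, B, C], 2))
mutual = independent([A, B, C])
print(pairwise, mutual)   # True False
```

Every pair satisfies the product rule, but $P(A \cap B \cap C) = 1/4$ while $P(A) P(B) P(C) = 1/8$, so the three events are not independent in the sense of Definition 6.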
I believe the following lemma clarifies the definition of independence for you. Before that, let me introduce
some new notation. Often instead of writing $P(A \mid B \cap C)$, we write $P(A \mid B, C)$. The only reason is that it
looks nicer. Make sure that the new notation does not confuse you. Hence,
$$P(A \mid B, C) = \frac{P(A \cap B \cap C)}{P(B \cap C)}.$$
Lemma 8. If A, B and C are independent, then

P (A | B, C) = P (A).

Proof First we should mention that $P(A \mid B, C)$ is a simpler notation for $P(A \mid B \cap C)$. Hence,

$$P(A \mid B, C) = \frac{P(A \cap B \cap C)}{P(B \cap C)} = \frac{P(A) P(B) P(C)}{P(B) P(C)} = P(A).$$

As is clear, again this matches our intuition for independence. The fact that we know B and C have happened
does not have any impact on the chance of A happening.

Now let’s see how independence can help in the following example:
Example 8 Consider the rare cancer example that we discussed above. Now we have two different tests
for the same cancer. We do them on this person and the results are independent of each other. The accuracy
of both tests is 0.99 (both false positive and false negative probabilities are 0.01). Both results are positive. Would you recommend
that the person start chemotherapy? How important is the assumption of independence of the tests in your
conclusion?

Let A denote the event of the person being affected by the cancer and B1 and B2 the event of the person
receiving a positive result from the first test and second test respectively. Since we know that B1 and B2
have already happened, the chance of being affected by the cancer is

P (A | B1 , B2 ) = P (A | B1 ∩ B2 ).

As before we will use the Bayes rule. We have

$$P(A \mid B_1 \cap B_2) = \frac{P(B_1 \cap B_2 \mid A) P(A)}{P(B_1 \cap B_2)}.$$
We said that if we run the two tests on the person, the results are independent. Here this means independence
conditioned on whether or not the person has the cancer, so that

$$P(B_1 \cap B_2 \mid A) = P(B_1 \mid A) P(B_2 \mid A) = 0.99 \times 0.99 \approx 0.98.$$

As before $P(A) = 0.001$. The only term that we should calculate is $P(B_1 \cap B_2)$. By the law of total probability
we have

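$$
\begin{aligned}
P(B_1 \cap B_2) &= P(B_1 \cap B_2 \mid A) P(A) + P(B_1 \cap B_2 \mid A^c) P(A^c) \\
&= P(B_1 \mid A) P(B_2 \mid A) P(A) + P(B_1 \mid A^c) P(B_2 \mid A^c) P(A^c) \\
&= 0.99 \cdot 0.99 \cdot 0.001 + 0.01 \cdot 0.01 \cdot 0.999 \approx 0.00108.
\end{aligned}
\tag{5}
$$

Hence,
$$P(A \mid B_1 \cap B_2) = \frac{P(B_1 \cap B_2 \mid A) P(A)}{P(B_1 \cap B_2)} \approx \frac{0.98 \cdot 0.001}{0.00108} \approx 0.91.$$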
Note that by using only two independent tests, we have increased the chance from about 0.09 to about 0.91. The
independence of the tests is very important. Can you explain why? For that reason, it is often better not
to suggest the same type of test to the patient. If there were some issues that led to a wrong result in the
first test, the same issues may lead to a wrong result if we repeat that test. In general, doctors try to use
a test that is as independent as possible from the first one. Extensive studies are often performed to figure
out whether the results of different tests are independent or not.
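
The effect of adding independent tests can be sketched with a few lines of Python. The function below is an illustration under one explicit assumption: the test results are independent conditioned on whether the person has the cancer, and every test has the same error rates. The function and parameter names are hypothetical and not part of the lecture.

```python
def posterior_after_positives(prior, sensitivity, false_positive_rate, n_tests):
    """P(A | all n_tests are positive).

    Assumes the test results are conditionally independent given A and given A^c,
    so the probability of n positive results is a product of single-test terms.
    """
    like_sick = sensitivity ** n_tests                  # P(all positive | A)
    like_healthy = false_positive_rate ** n_tests       # P(all positive | A^c)
    p_all_positive = like_sick * prior + like_healthy * (1 - prior)
    return like_sick * prior / p_all_positive


one_test = posterior_after_positives(0.001, 0.99, 0.01, n_tests=1)
two_tests = posterior_after_positives(0.001, 0.99, 0.01, n_tests=2)
print(one_test, two_tests)
```

With one positive test the posterior is about 0.09, and with two it is about 0.91, matching Examples 5 and 8. If the second test simply repeated the errors of the first, the product formula would not apply and the second result would add little information.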

Example 9 (Sally Clark’s trial) This is an example based on a true event and you can read about it
online. Sally Clark, a British woman, was accused in 1998 of having killed her first child at 11 weeks of age
and then her second child at 8 weeks of age. The prosecution had an expert witness testify that the chance of a
child dying from sudden infant death syndrome was 1/8500 and hence the chance of two children dying from
sudden infant death syndrome is 1/8500 ∗ 1/8500, which is about 1 in 73000000. Hence, he argued that
the chance of Sally Clark being innocent is 1/73000000. This argument led to the conviction of Sally Clark.
Can you figure out the mistakes in the expert witness’s argument?
The first issue that you should notice here is the misuse of independence. If $B_1$ and $B_2$ are the events
that the first baby and the second baby have died from sudden infant death syndrome, respectively, then it
is hard to believe that $B_1$ and $B_2$ are independent. For instance, one of the parents may have a genetic issue
that causes this type of death in the babies. You can think of many other reasons for dependence of the two
events. But in addition to this, the argument has another issue. Do you see what that is? We will discuss that in more
detail in HW3.

Example 10 The following problem is known as the Monty Hall problem, after the host of the game show
“Let’s Make a Deal” (read about the history of the problem on Wikipedia after you do this problem). There
are three curtains. Behind one curtain is a car and behind the other two are goats. The contestant picks one
curtain. Monty then opens one of the other curtains to show one of the goats. Then, the contestant has the
option to stay with the curtain she originally chose, or switch to the other unopened curtain. What would
you do if you were the contestant and why? I will discuss the solution in the class. You may then write
that down here.
