
Module 1: Introduction

What is Data Science? Visualizing Data, matplotlib, Bar Charts, Line Charts,
Scatterplots, Linear Algebra, Vectors, Matrices, Statistics, Describing a Single Set of
Data, Correlation, Simpson's Paradox, Some Other Correlational Caveats,
Correlation and Causation, Probability, Dependence and Independence, Conditional
Probability, Bayes's Theorem, Random Variables, Continuous Distributions, The
Normal Distribution, The Central Limit Theorem.
The Ascendance of Data

• We live in a world that’s drowning in data. Websites track every user’s every click. Your
smartphone is building up a record of your location and speed every second of every day.
• “Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates,
movement habits, diet, and sleep patterns.
• Smart cars collect driving habits, smart homes collect living habits, and smart marketers
collect purchasing habits.
• The Internet itself represents a huge graph of knowledge that contains (among other things)
an enormous cross-referenced encyclopedia; domain-specific databases about movies,
music, sports results, pinball machines, memes, and cocktails, etc.
What is Data Science?

• Data science is an interdisciplinary field that focuses on extracting knowledge and insights from structured and unstructured data using various techniques, algorithms, and systems.

• It combines elements from several disciplines, including statistics, computer science, mathematics, and domain-specific knowledge, to analyze and interpret complex data.

• Data science plays a crucial role in decision-making across various industries, leveraging the power of data to drive innovation and improve outcomes.

• A data scientist is someone who extracts insights from messy data.


• For instance, the dating site OkCupid asks its members to answer thousands of questions
in order to find the most appropriate matches for them.

• Facebook asks you to list your hometown and your current location, ostensibly to make it
easier for your friends to find and connect with you. But it also analyzes these locations to
identify global migration patterns

• As a large retailer, Target tracks your purchases and interactions, both online and in-store.
And it uses the data to predictively model which of its customers are pregnant, to better
market baby-related purchases to them.

• In 2012, the Obama campaign employed dozens of data scientists who data-mined and
experimented their way to identifying voters who needed extra attention, choosing optimal
donor-specific fundraising appeals and programs, and focusing get-out-the-vote efforts
where they were most likely to be useful.
Key Components of Data Science:

1. Data Collection: Gathering data from various sources, which could be structured (like databases) or unstructured (like text, images, or videos).

2. Data Cleaning: Preparing data for analysis by handling missing values, correcting errors, and ensuring consistency.

3. Data Exploration and Visualization: Analyzing data to understand its structure, patterns, and relationships. Visualization tools like graphs, charts, and dashboards are often used to make sense of the data.

4. Statistical Analysis: Applying statistical methods to summarize, interpret, and infer insights from the data.

5. Machine Learning: Using algorithms and models to make predictions or decisions based on data. This can involve supervised learning, unsupervised learning, or reinforcement learning techniques.

6. Data Engineering: Designing and implementing the infrastructure that allows data to be stored, processed, and accessed efficiently.

7. Data Interpretation and Communication: Translating data-driven insights into actionable recommendations, often by creating reports or visualizations that are understandable to non-technical stakeholders.
Visualizing Data
• A fundamental part of the data scientist’s toolkit is data visualization. Although it is
very easy to create visualizations, it’s much harder to produce good ones.

• There are two primary uses for data visualization:


1. To explore data
2. To communicate data

matplotlib
• A wide variety of tools exist for visualizing data. We will be using the matplotlib
library, which is widely used.
• matplotlib is not part of the core Python library. With your virtual environment
activated (to set one up, go back to “Virtual Environments” and follow the
instructions), install it using this command:
python -m pip install matplotlib
from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

# add a title
plt.title("Nominal GDP")

# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
Bar Chart
A bar chart is a good choice when you want to show how some quantity
varies among some discrete set of items. For instance, Figure 3-2 shows how
many Academy Awards were won by each of a variety of movies:

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side


Story"] num_oscars = [5, 11, 3, 8, 10]

# plot bars with left x-coordinates [0, 1, 2, 3, 4], heights [num_oscars]


plt.bar(range(len(movies)), num_oscars)

plt.title("My Favorite Movies") # add a title


plt.ylabel("# of Academy Awards") # label the y-axis

# label x-axis with movie names at bar centers


plt.xticks(range(len(movies)), movies) plt.show()
A bar chart can also be a good choice for plotting histograms of
bucketed numeric values, as in Figure 3-3, in order to visually
explore how the values are distributed:
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

# Bucket grades by decile, but put 100 in with the 90s
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

plt.bar([x + 5 for x in histogram.keys()],  # Shift bars right by 5
        histogram.values(),                 # Give each bar its correct height
        10,                                 # Give each bar a width of 10
        edgecolor=(0, 0, 0))                # Black edges for each bar

plt.axis([-5, 105, 0, 5])                   # x-axis from -5 to 105, y-axis from 0 to 5

plt.xticks([10 * i for i in range(11)])     # x-axis labels at 0, 10, ..., 100
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()
Line Charts
As we saw already, we can make line charts using plt.plot(). These are a good choice for showing trends.
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]
# we can make multiple calls to plt.plot
# to show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance') # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2') # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error') # blue dotted line
# because we've assigned labels to each series
# we can get a legend for free
# loc=9 means "top center"
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.title("The Bias-Variance Tradeoff")
plt.show()
Scatterplots
• A scatterplot is the right choice for visualizing the relationship between two
paired sets of data.
• For example, the figure illustrates the relationship between the number of friends your users have and the number of minutes they spend on the site every day:
friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)
# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
plt.annotate(label,
xy=(friend_count, minute_count), # put the label with its point
xytext=(5, -5), # but slightly offset
textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
•Correlation: Scatter plots are useful for identifying correlations. A positive correlation shows an upward trend, and a negative correlation shows a downward trend.
•Outliers: Scatter plots make it easy to spot outliers, which are points that don't follow
the general pattern.
•Data Distribution: Scatter plots help visualize how data is distributed across two
dimensions.
Linear Algebra
• Linear algebra is the branch of mathematics that deals with vector spaces.
• A vector is a mathematical object that has both magnitude (or length)
and direction.
• A vector is typically represented as a directed line segment in space.
• Abstractly, vectors are objects that can be added together (to form new
vectors) and that can be multiplied by scalars (i.e., numbers), also to
form new vectors.
• Concretely (for us), vectors are points in some finite-dimensional space.
Although you might not think of your data as vectors, they are a good
way to represent numeric data.
• For example, if you have the heights, weights, and ages of a large
number of people, you can treat your data as three-dimensional vectors
(height, weight, age). If you’re teaching a class with four exams, you can
treat student grades as four-dimensional vectors (exam1, exam2,
exam3, exam4).
height_weight_age = [70, # inches,
170, # pounds,
40 ] # years

grades = [95, # exam1


80, # exam2
75, # exam3
62 ] # exam4

To begin with, we’ll frequently need to add two vectors. Vectors add componentwise.

If two vectors v and w are the same length, their sum is just the vector
whose first element is v[0] + w[0], whose second element is v[1] + w[1],
and so on.
We can easily implement this by zip-ing the vectors together and using a list
comprehension to add the corresponding elements:

def vector_add(v, w):
    """adds corresponding elements"""
    return [v_i + w_i
            for v_i, w_i in zip(v, w)]

Similarly, to subtract two vectors we just subtract corresponding elements:

def vector_subtract(v, w):
    """subtracts corresponding elements"""
    return [v_i - w_i
            for v_i, w_i in zip(v, w)]
• Sometimes we want to componentwise sum a list of vectors. That is, create a new vector whose first element is the sum of all the first elements, whose second element is the sum of all the second elements, and so on.
• The easiest way to do this is by adding one vector at a time, as in the sketch below:
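A minimal sketch of that idea (reusing the vector_add helper defined above; the example call is illustrative):

def vector_sum(vectors):
    """componentwise sum of a list of vectors, added one at a time"""
    result = vectors[0]                       # start with the first vector
    for vector in vectors[1:]:                # then loop over the rest
        result = vector_add(result, vector)   # and keep adding them in
    return result

vector_sum([[1, 2], [3, 4], [5, 6]])  # [9, 12]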

• The distance between two vectors v and w is defined as:

    distance(v, w) = sqrt((v_1 - w_1)^2 + (v_2 - w_2)^2 + ... + (v_n - w_n)^2)
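A sketch of that computation in code (reusing vector_subtract from above; the helper names are illustrative):

import math

def squared_distance(v, w):
    """(v_1 - w_1)^2 + ... + (v_n - w_n)^2"""
    return sum(d ** 2 for d in vector_subtract(v, w))

def distance(v, w):
    return math.sqrt(squared_distance(v, w))

distance([0, 0], [3, 4])  # 5.0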


Matrices
A matrix is a two-dimensional collection of numbers. We will represent matrices as
lists of lists, with each inner list having the same size and representing a row of the
matrix.
A = [[1, 2, 3], # A has 2 rows and 3 columns
[4, 5, 6]]
B = [[1, 2], # B has 3 rows and 2 columns
[3, 4],
[5, 6]]

Matrices will be important to us for several reasons.


• First, we can use a matrix to represent a data set consisting of multiple
vectors, simply by considering each vector as a row of the matrix.
• Second, as we'll see later, we can use an n x k matrix to represent a linear function that maps k-dimensional vectors to n-dimensional vectors. Several of our techniques and concepts will involve such functions.
• Third, matrices can be used to represent binary relationships.
#          user 0  1  2  3  4  5  6  7  8  9
#
friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0
[1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1
[1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2
[0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6
[0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7
[0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9

friendships[0][2] == 1  # True, 0 and 2 are friends
friendships[0][8] == 1  # False, 0 and 8 are not friends
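One payoff of this representation is that finding a user's connections only requires scanning that user's row. A small sketch using the friendships matrix above:

# friends of user 5 are the indices of the nonzero entries in row 5
friends_of_five = [i for i, is_friend in enumerate(friendships[5]) if is_friend]
friends_of_five  # [4, 6, 7]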
Statistics
Statistics refers to the mathematics and techniques with which we understand data. It is a
rich, enormous field, more suited to a shelf (or room) in a library rather than a chapter in a
book, and so our discussion will necessarily not be a deep one.

Describing a single set of data involves summarizing its key characteristics and understanding its distribution. Here's how you can describe a single set of data:

1. Central Tendency
Central tendency measures indicate where the center of the data lies.
Mean: The average of the data values.
Median: The middle value when the data is sorted.
Mode: The most frequent value in the data
def mean(x):
return sum(x) / len(x)
mean(num_friends)

def median(v):
"""finds the 'middle-most' value of v"""
n = len(v)
sorted_v = sorted(v)
midpoint = n // 2
if n % 2 == 1:
# if odd, return the middle value
return sorted_v[midpoint]
else:
# if even, return the average of the middle values
lo = midpoint - 1
hi = midpoint
return (sorted_v[lo] + sorted_v[hi]) / 2
median(num_friends) # 6.0
• Clearly, the mean is simpler to compute, and it varies smoothly as our data changes.
• If we have n data points and one of them increases by some small amount e, then necessarily
the mean will increase by e / n. (This makes the mean amenable to all sorts of calculus tricks.)
• Whereas in order to find the median, we have to sort our data. And changing one of our
data points by a small amount e might increase the median by e, by some number less than
e, or not at all (depending on the rest of the data).
• At the same time, the mean is very sensitive to outliers in our data.
• If outliers are likely to be bad data (or otherwise unrepresentative of whatever phenomenon
we’re trying to understand), then the mean can sometimes give us a misleading picture.
A generalization of the median is the quantile, which represents the value less than which
a certain percentile of the data lies. (The median represents the value less than which 50%
of the data lies.)

def quantile(x, p):
    """returns the pth-percentile value in x"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

quantile(num_friends, 0.10)  # 1
quantile(num_friends, 0.25)  # 3
quantile(num_friends, 0.75)  # 9
quantile(num_friends, 0.90)  # 13
Less commonly you might want to look at the mode, or most-common value[s]:
def mode(x):
    """returns a list, might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    return [x_i for x_i, count in counts.items()
            if count == max_count]

mode(num_friends)  # 1 and 6
2. Dispersion (Spread)
Dispersion measures describe the spread of data values around the central
tendency.
•Range: The difference between the maximum and minimum values.
def data_range(x):
return max(x) - min(x)
data_range(num_friends) # 99

•Variance: The average of the squared differences from the mean.

A more complex measure of dispersion is the variance, which is computed as:

def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    return sum(d ** 2 for d in deviations) / (n - 1)
Standard Deviation: The square root of the variance, representing average distance from the
mean.
import math

def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(num_friends)  # 9.03

Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1), indicating the range of the middle 50% of the data.
A more robust alternative computes the difference between the 75th percentile value and the 25th percentile value:

def interquartile_range(x):
return quantile(x, 0.75) - quantile(x, 0.25)
interquartile_range(num_friends) # 6
Correlation
• Correlation is a statistical measure that describes the degree and
direction of the linear relationship between two variables. In simpler
terms, it tells us how changes in one variable are associated with
changes in another.

• The correlation is unitless and always lies between -1 (perfect anti-correlation) and 1 (perfect correlation).

• We'll first look at covariance, the paired analogue of variance. Whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means:

def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0  # if no variation, correlation is zero

correlation(num_friends, daily_minutes)  # 0.25
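The covariance function above relies on a dot helper that is not shown on this slide. A minimal sketch, assuming the usual sum-of-products definition:

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))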
Correlation Coefficient (r):
The correlation coefficient, typically denoted as r, ranges from -1 to 1.
r = 1: Perfect positive correlation, meaning as one variable increases, the other also increases
perfectly.
r = -1: Perfect negative correlation, meaning as one variable increases, the other decreases
perfectly.
r = 0: No correlation, meaning there is no linear relationship between the variables.

Types of Correlation:
•Positive Correlation: Both variables move in the same direction.
•Negative Correlation: As one variable increases, the other decreases.
•No Correlation: No predictable relationship between the variables.
Methods to Calculate Correlation:
•Pearson Correlation: Measures the linear relationship between two
continuous variables.
•Spearman Rank Correlation: Measures the relationship between two
variables based on the rank of the data rather than the actual values. Useful
when the data is not normally distributed or when dealing with ordinal data.
•Kendall Tau Correlation: Another rank-based correlation method, useful for
small sample sizes and for ordinal data.
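If SciPy is available, all three coefficients can be computed directly. A small sketch (the data lists here are purely illustrative):

from scipy.stats import pearsonr, spearmanr, kendalltau

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

pearsonr(x, y)    # linear relationship on the raw values
spearmanr(x, y)   # relationship on the ranks of the values
kendalltau(x, y)  # rank-based, often preferred for small samples or ordinal data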
Simpson’s Paradox
• Simpson's Paradox is a phenomenon in probability and statistics
where a trend that appears in several different groups of data reverses
or disappears when the groups are combined.
• This paradox highlights how misleading aggregated data can be and
the importance of analyzing data in its proper context.
• Simpson's Paradox occurs when the relationship between two variables
reverses upon considering a third variable, often due to lurking
variables or confounding factors.
• Correlations can be misleading when confounding variables are ignored.
For example, imagine that you can identify all of your members as either East
Coast data scientists or West Coast data scientists. You decide to examine which
coast’s data scientists are friendlier:

Coast         # of members    avg. # of friends
West Coast    101             8.2
East Coast    103             6.5

It certainly looks like the West Coast data scientists are friendlier than the East
Coast data scientists.
When playing with the data you discover something very strange. If you only look at people
with PhDs, the East Coast data scientists have more friends on average. And if you only look at
people without PhDs, the East Coast data scientists also have more friends on average!

Coast         degree    # of members    avg. # of friends
West Coast    PhD       35              3.1
East Coast    PhD       70              3.2
West Coast    no PhD    66              10.9
East Coast    no PhD    33              13.4

Once you account for the users’ degrees, the correlation goes in the
opposite direction!
Bucketing the data as East Coast/West Coast disguised the fact that the
East Coast data scientists skew much more heavily toward PhD types.
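A quick sanity check of the numbers in the tables above, as a small sketch, shows how the aggregated averages flip the comparison:

# (members, average friends) per degree group, taken from the tables above
west = {"PhD": (35, 3.1), "no PhD": (66, 10.9)}
east = {"PhD": (70, 3.2), "no PhD": (33, 13.4)}

def overall_average(groups):
    """member-weighted average of friend counts across the groups"""
    total_members = sum(n for n, _ in groups.values())
    total_friends = sum(n * avg for n, avg in groups.values())
    return total_friends / total_members

overall_average(west)  # ~8.2: higher overall,
overall_average(east)  # ~6.5: even though East is higher within each group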
Simpson's Paradox can occur due to several reasons:
1.Confounding Variables: These are variables that affect both the independent and dependent
variables, potentially misleading the results.
2.Data Aggregation: Aggregating data without considering underlying groups can obscure the
true relationship between variables.
3.Unequal Group Sizes: When the sizes of the groups are very different, the aggregated data
might give more weight to the larger group, skewing the results.
Some Other Correlational Caveats
A correlation of zero indicates that there is no linear relationship between the two
variables. However, there may be other sorts of relationships. For example, if:
x = [-2, -1, 0, 1, 2]
y = [ 2, 1, 0, 1, 2]
then x and y have zero correlation. But they certainly have a relationship — each
element of y equals the absolute value of the corresponding element of x.
What they don’t have is a relationship in which knowing how x_i compares to mean(x)
gives us information about
how y_i compares to mean(y). That is the sort of relationship that correlation looks for.
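Reusing the correlation function defined earlier, a quick check of this example (a sketch, not from the slides):

x = [-2, -1, 0, 1, 2]
y = [ 2,  1, 0, 1, 2]

correlation(x, y)  # 0: no linear relationship, even though y equals |x| elementwise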
In addition, correlation tells you nothing about how large the relationship is. The
variables:
x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]
are perfectly correlated, but (depending on what you’re measuring) it’s quite
possible that this relationship isn’t all that interesting.
A caveat is essentially a "heads-up" or caution, suggesting that there are additional factors to
consider before proceeding or making a decision.

Correlation and causation


• Correlation and causation are two fundamental concepts in statistics and
data analysis, but they describe very different relationships between
variables.
• Understanding the difference between them is crucial for correctly
interpreting data.

• Definition: Correlation refers to a statistical relationship between two variables, where changes in one variable are associated with changes in another. This relationship can be positive (both variables increase or decrease together), negative (one variable increases while the other decreases), or zero (no apparent relationship).
Causation
• Causation indicates that one event or variable directly causes another event or variable to
happen. In other words, changes in one variable are responsible for changes in another.

• Cause and Effect: In a causal relationship, there is a clear cause-and-effect link. For example,
smoking causes an increased risk of lung cancer.
Key Differences
Directionality:
Correlation: Indicates that two variables are related, but doesn't specify whether one causes
the other.
Causation: Implies that changes in one variable directly lead to changes in another.
Confounding Variables:
Correlation: Might be influenced by a third variable that affects both of the correlated
variables, leading to a spurious or misleading correlation.
Causation: Accounts for all variables and establishes that the relationship is not due to any
external factors.
Examples:
Correlation Example: There is a correlation between ice cream sales and drowning
incidents. However, this doesn’t mean ice cream sales cause drowning. A third variable (hot
weather) increases both.
Causation Example: Exposure to sunlight causes sunburn. Here, the relationship is direct
and causal.
Probability
• The concept of probability has a long history that goes back thousands of years when words
like “probably”, “likely”, “maybe”, “perhaps” and “possibly” were introduced into spoken
languages
• However, the mathematical theory of probability was formulated only in the 17th century
• Probability can also be defined as a scientific measure of chance
• Probability can be expressed mathematically as a numerical index with a range between zero (an absolute impossibility) and unity (an absolute certainty)
• Most events have a probability index strictly between 0 and 1, which means that each event has
at least two possible outcomes: favorable outcome or success, and unfavorable outcome or
failure
• For our purposes you should think of probability as a way of quantifying the uncertainty associated with events chosen from some universe of events.
• Ex: Tossing a coin
• Notationally, we write P(E) to mean “the probability of the event E.”
• We'll use probability theory to build models, to evaluate models, and all over the place.
Dependence and Independence
• Roughly speaking, we say that two events E and F are dependent if knowing something
about whether E happens gives us information about whether F happens (and vice versa).
Otherwise they are independent.
• Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:

    P(E, F) = P(E) P(F)

• For example, flip a fair coin twice. The event "the first flip is heads" is independent of the event "the second flip is heads". But the probability of "first flip heads" is 1/2 and the probability of "both flips tails" is 1/4, while the probability of "first flip heads and both flips tails" is 0 rather than 1/2 × 1/4, so those two events are dependent.
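A tiny enumeration of the two-flip sample space makes this arithmetic concrete (a sketch, not part of the original slides):

from itertools import product

sample_space = list(product(["H", "T"], repeat=2))  # HH, HT, TH, TT

def prob(event):
    """fraction of the equally likely outcomes for which the event holds"""
    return sum(event(outcome) for outcome in sample_space) / len(sample_space)

first_heads = lambda o: o[0] == "H"
both_tails = lambda o: o == ("T", "T")

prob(first_heads)                                  # 0.5
prob(both_tails)                                   # 0.25
prob(lambda o: first_heads(o) and both_tails(o))   # 0.0, not 0.5 * 0.25, so dependent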
Conditional Probability
• Let A be an event in the world and B be another event. Suppose that events A and B are not
mutually exclusive, but occur conditionally on the occurrence of the other
• The probability that event A will occur if event B occurs is called the Conditional
Probability
• Conditional probability is denoted mathematically as p(A|B) in which the vertical bar
represents “given” and the complete probability expression is interpreted as “Conditional
probability of event A occurring given that event B has occurred”
• The number of times A and B can occur, or the probability that both A and B will occur, is
called the joint probability of A and B. It is represented mathematically as p(AB)
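A quick numeric check of this definition with the two-coin example (illustrative values):

p_b = 0.5                  # P(B): the first flip is heads
p_ab = 0.25                # P(AB): both flips are heads
p_a_given_b = p_ab / p_b   # P(A|B) = P(AB)/P(B) = 0.5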
Bayesian Theorem
• The concept of the Bayesian theorem is based on conditional probability.
• Conditional Probability: Conditional probability is defined as the likelihood of an event or outcome occurring, based on the occurrence of a previous event or outcome.
• Sample Space: a collection or set of possible outcomes of a random event.
• A and B are events; the overlapping portion, which belongs to both A and B, is represented as AB.
(Venn diagram of two overlapping events A and B omitted.)
Bayesian Theorem
• The conditional probability of the events A and B can be written as follows:

    P(A|B) = P(AB)/P(B)    ... (1), assuming that the last event to occur was B
    P(B|A) = P(AB)/P(A)    ... (2), assuming that the last event to occur was A

• Slightly rearranging equations (1) and (2):

    P(AB) = P(A|B)*P(B) = P(B|A)*P(A)
    P(A|B)*P(B) = P(B|A)*P(A)
    P(A|B) = P(B|A)*P(A)/P(B)    ... (3), Bayes's theorem

Bayesian Theorem
• A1, A2, and A3 are events that are:
  - Mutually exclusive: the occurrence of one event rules out the occurrence of the other events.
  - Collectively exhaustive: the probabilities of all the events together make up the whole sample space.
• A new event B is common to all of them, and P(B) can be decomposed as follows:

    P(B) = P(A1B) + P(A2B) + P(A3B)
    P(B) = P(B|A1)*P(A1) + P(B|A2)*P(A2) + P(B|A3)*P(A3)

(Venn diagram of A1, A2, A3 partitioning the sample space, with B overlapping each, omitted.)
Bayesian Theorem: Example
For example, a doctor knows that the disease meningitis causes the
patient to have a stiff neck, say, 70% of the time. The doctor also
knows some unconditional facts: the prior probability that a patient has
meningitis is 1/50,000, and the prior probability that any patient has a
stiff neck is 1%. Letting s be the proposition that the patient has a stiff
neck and m be the proposition that the patient has meningitis, we
have

P(s|m) = 0.7
P(m) = 1/50000
P(s) = 0.01
P (m|s) = P (s|m)P(m)/p(s)
P(m|s) = 0.7 × (1/50000) / 0.01 = 0.0014
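The same arithmetic as a short sketch in Python:

p_stiff_given_men = 0.7    # P(s|m)
p_men = 1 / 50000          # P(m)
p_stiff = 0.01             # P(s)

p_men_given_stiff = p_stiff_given_men * p_men / p_stiff
p_men_given_stiff          # 0.0014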
• What is the chance of rain during the day?
We will use Rain to mean rain during the day, and Cloud to mean a cloudy morning. The chance of Rain given Cloud is written P(Rain | Cloud).
So let's put that in the formula:
P(Rain | Cloud) = P(Rain)*P(Cloud | Rain)/P(Cloud)
P(Rain) is Probability of Rain = 10%
P(Cloud | Rain) is Probability of Cloud, given that Rain happens =
50%
P(Cloud) is Probability of Cloud = 40%
P(Rain | Cloud) = 0.1 × 0.5 / 0.4 = 0.125, or a 12.5% chance of rain.
Bayes’s Theorem
One of the data scientist’s best friends is Bayes’s theorem, which is a way of “reversing”
conditional probabilities.
Let's say we need to know the probability of some event E conditional on some other event F occurring. But we only have information about the probability of F conditional on E occurring. Using the definition of conditional probability twice tells us that:

    P(E|F) = P(E, F) / P(F) = P(F|E) P(E) / P(F)

The event F can be split into the two mutually exclusive events "F and E" and "F and not E." If we write ¬E for "not E" (i.e., "E doesn't happen"), then:

    P(F) = P(F, E) + P(F, ¬E)

so that:

    P(E|F) = P(F|E) P(E) / [P(F|E) P(E) + P(F|¬E) P(¬E)]
Random Variables
A random variable is a variable whose possible values have an associated probability
distribution.
A random variable is a function that assigns a real number to each outcome in a sample space
of a random experiment.
A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up
tails.
A more complicated one might measure the number of heads observed when flipping a coin 10
times or a value picked from range(10) where each number is equally likely.
There are two main types of random variables:
1.Discrete Random Variable:
1. Takes on a countable number of distinct values.
2. Examples include the number of heads in 10 coin flips, the number of students
present in a class, or the roll of a die.
3. The probability distribution of a discrete random variable is described by a
probability mass function (PMF).
2.Continuous Random Variable:
1. Takes on an infinite number of possible values, typically within a given range.
2. Examples include the time it takes to run a mile, the exact height of students in a
classroom, or the temperature at a specific time of day.
3. The probability distribution of a continuous random variable is described by a
probability density function (PDF).
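As a small sketch (not from the slides), the "number of heads in 10 flips" random variable can be simulated and its PMF estimated with a Counter:

import random
from collections import Counter

def num_heads_in_10_flips():
    """a discrete random variable: number of heads in 10 fair coin flips"""
    return sum(1 if random.random() < 0.5 else 0 for _ in range(10))

samples = [num_heads_in_10_flips() for _ in range(10000)]
pmf = {value: count / len(samples) for value, count in Counter(samples).items()}
sorted(pmf.items())  # estimated P(X = k) for k = 0, ..., 10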
Continuous Distributions
• Continuous distributions are probability distributions that describe the likelihood of
continuous outcomes—those that can take any value within a certain range.
• Unlike discrete distributions, where outcomes are countable, continuous distributions are
used when the set of possible outcomes forms a continuum, such as measurements of time,
temperature, or distance.
• we represent a continuous distribution with a probability density function (pdf) such that
the probability of seeing a value in a certain interval equals the integral of the density
function over the interval.

The density function for the uniform distribution is just:

def uniform_pdf(x):
    return 1 if x >= 0 and x < 1 else 0
We will often be more interested in the cumulative distribution function (cdf), which gives
the probability that a random variable is less than or equal to a certain value. It’s not hard
to create the cumulative distribution function for the uniform distribution

def uniform_cdf(x):
    """returns the probability that a uniform random variable is <= x"""
    if x < 0:   return 0  # uniform random is never less than 0
    elif x < 1: return x  # e.g. P(X <= 0.4) = 0.4
    else:       return 1  # uniform random is always less than 1
The Normal Distribution
• The normal distribution is the king of distributions. It is the classic bell curve–shaped
distribution and is completely determined by two parameters: its mean (mu) and its
standard deviation (sigma).
• The mean indicates where the bell is centered, and the standard deviation how “wide” it is.

The normal pdf is

    f(x | mu, sigma) = exp(-(x - mu)^2 / (2 * sigma^2)) / (sqrt(2 * pi) * sigma)

which we can implement as:

def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return (math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) /
            (sqrt_two_pi * sigma))
We plot some of these pdfs to see what they look like:
xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs,[normal_pdf(x,sigma=1) for x in xs],'-',label='mu=0,sigma=1')
plt.plot(xs,[normal_pdf(x,sigma=2) for x in xs],'--',label='mu=0,sigma=2')
plt.plot(xs,[normal_pdf(x,sigma=0.5) for x in xs],':',label='mu=0,sigma=0.5')
plt.plot(xs,[normal_pdf(x,mu=-1) for x in xs],'-.',label='mu=-1,sigma=1')
plt.legend()
plt.title("Various Normal pdfs")
plt.show()
When µ = 0 and σ = 1, it's called the standard normal distribution. If Z is a standard normal random variable, then it turns out that

    X = σZ + µ

is also normal, but with mean µ and standard deviation σ. Conversely, if X is a normal random variable with mean µ and standard deviation σ, then

    Z = (X − µ) / σ

is a standard normal variable.

The cumulative distribution function for the normal distribution cannot be written in an
“elementary” manner, but we can write it using Python’s math.erf:

def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2
xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs,[normal_cdf(x,sigma=1) for x in xs],'-',label='mu=0,sigma=1')
plt.plot(xs,[normal_cdf(x,sigma=2) for x in xs],'--',label='mu=0,sigma=2')
plt.plot(xs,[normal_cdf(x,sigma=0.5) for x in xs],':',label='mu=0,sigma=0.5')
plt.plot(xs,[normal_cdf(x,mu=-1) for x in xs],'-.',label='mu=-1,sigma=1')
plt.legend(loc=4) # bottom right
plt.title("Various Normal cdfs")
plt.show()

Figure 6-3. Various normal cdfs


The Central Limit Theorem
• One reason the normal distribution is so useful is the central limit theorem, which says
(in essence) that a random variable defined as the average of a large number of independent
and identically distributed random variables is itself approximately normally distributed.
• In particular, if x_1, ..., x_n are random variables with mean µ and standard deviation σ, and if n is large, then

    (1/n)(x_1 + ... + x_n)

is approximately normally distributed with mean µ and standard deviation σ/√n.

Equivalently (but often more usefully),

    (x_1 + ... + x_n − nµ) / (σ√n)

is approximately normally distributed with mean 0 and standard deviation 1.


• An easy way to illustrate this is by looking at binomial random variables, which have two parameters n and p.
• A Binomial(n, p) random variable is simply the sum of n independent Bernoulli(p) random variables, each of which equals 1 with probability p and 0 with probability 1 − p.

import random

def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))

The mean of a Bernoulli(p) variable is p, and its standard deviation is √(p(1 − p)). The central limit theorem says that as n gets large, a Binomial(n, p) variable is approximately a normal random variable with mean µ = np and standard deviation σ = √(np(1 − p)).

If we plot both, you can easily see the resemblance:
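A plotting sketch of that comparison, reusing binomial, normal_cdf, Counter, math and plt from earlier (the function name and bucketing choices are illustrative):

def make_hist(p, n, num_points):
    """histogram of Binomial(n, p) samples with the normal approximation drawn on top"""
    data = [binomial(n, p) for _ in range(num_points)]

    # bar chart of how often each outcome occurred
    histogram = Counter(data)
    plt.bar([x - 0.4 for x in histogram.keys()],
            [v / num_points for v in histogram.values()],
            0.8, color='0.75')

    mu = p * n
    sigma = math.sqrt(n * p * (1 - p))

    # approximate P(X == i) by the normal cdf mass on [i - 0.5, i + 0.5]
    xs = range(min(data), max(data) + 1)
    ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma) for i in xs]
    plt.plot(xs, ys)

    plt.title("Binomial Distribution vs. Normal Approximation")
    plt.show()

make_hist(0.75, 100, 10000)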


Some of the questions from module 1
1. What is data Visualization? Explain bar chart and line chart
2. Explain the following i) vector addition ii) vector sum iii) vector mean iv) vector
multiplication
3. Explain the following statistical techniques i) mean ii) median iii) mode iv)
interquartile range
4. Explain Simpson’s Paradox with an example.
5. Write a Python code segment to visualize a line chart and a scatterplot with an example
6. Briefly summarize the difference between variance and covariance. Write Python
code for finding covariance
7. Describe Normal Distribution with a Python routine for PDF and CDF
8. Explain the difference between correlation and causation.
9. Describe the statement “Correlation is not Causation” with an example in detail.
10. Describe Bayes's Theorem in detail with an example.
11.Discuss random variables with an example in detail.
12.Explain standard deviation and interquartile range and write python code to
compute standard deviation and interquartile range.
13. Discuss the Central Limit Theorem and its significance in relation to the Normal Distribution.