Module1 DS Ppt
What is Data Science? Visualizing Data, matplotlib, Bar Charts, Line Charts,
Scatterplots, Linear Algebra, Vectors, Matrices, Statistics, Describing a Single Set of
Data, Correlation, Simpson's Paradox, Some Other Correlational Caveats,
Correlation and Causation, Probability, Dependence and Independence, Conditional
Probability, Bayes's Theorem, Random Variables, Continuous Distributions, The
Normal Distribution, The Central Limit Theorem.
The Ascendance of Data
• We live in a world that’s drowning in data. Websites track every user’s every click. Your
smartphone is building up a record of your location and speed every second of every day.
• “Quantified selfers” wear pedometers-on-steroids that are ever recording their heart rates,
movement habits, diet, and sleep patterns.
• Smart cars collect driving habits, smart homes collect living habits, and smart marketers
collect purchasing habits.
• The Internet itself represents a huge graph of knowledge that contains (among other things)
an enormous cross-referenced encyclopedia; domain-specific databases about movies,
music, sports results, pinball machines, memes, and cocktails; and much more.
What is Data Science?
• Facebook asks you to list your hometown and your current location, ostensibly to make it
easier for your friends to find and connect with you. But it also analyzes these locations to
identify global migration patterns.
• As a large retailer, Target tracks your purchases and interactions, both online and in-store.
And it uses the data to predictively model which of its customers are pregnant, to better
market baby-related purchases to them.
• In 2012, the Obama campaign employed dozens of data scientists who data-mined and
experimented their way to identifying voters who needed extra attention, choosing optimal
donor-specific fundraising appeals and programs, and focusing get-out-the-vote efforts
where they were most likely to be useful.
Key Components of Data Science:
matplotlib
• A wide variety of tools exist for visualizing data. We will be using the matplotlib
library, which is widely used.
• matplotlib is not part of the core Python library. With your virtual environment
activated (to set one up, go back to “Virtual Environments” and follow the
instructions), install it using this command:
python -m pip install matplotlib
from matplotlib import pyplot as plt
# add a title
plt.title("Nominal GDP")
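The import and title lines above are fragments of a longer line-chart example. A minimal runnable version might look like this (the GDP figures are illustrative):

from matplotlib import pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart: years on the x-axis, gdp on the y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

# add a title and label the y-axis
plt.title("Nominal GDP")
plt.ylabel("Billions of $")
plt.show()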
To begin with, we’ll frequently need to add two vectors. Vectors add componentwise: if two
vectors v and w are the same length, their sum is the vector whose first element is
v[0] + w[0], whose second element is v[1] + w[1], and so on.
We can easily implement this by zip-ing the vectors together and using a list
comprehension to add the corresponding elements:
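A sketch of that implementation:

def vector_add(v, w):
    """adds corresponding elements of v and w"""
    assert len(v) == len(w), "vectors must be the same length"
    return [v_i + w_i for v_i, w_i in zip(v, w)]

vector_add([1, 2], [2, 1])  # [3, 3]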
1. Central Tendency
Central tendency measures indicate where the center of the data lies.
Mean: The average of the data values.
Median: The middle value when the data is sorted.
Mode: The most frequent value in the data.
def mean(x):
    """the average of the values in x"""
    return sum(x) / len(x)

mean(num_friends)  # num_friends is the running example dataset of per-user friend counts
def median(v):
"""finds the 'middle-most' value of v"""
n = len(v)
sorted_v = sorted(v)
midpoint = n // 2
if n % 2 == 1:
# if odd, return the middle value
return sorted_v[midpoint]
else:
# if even, return the average of the middle values
lo = midpoint - 1
hi = midpoint
return (sorted_v[lo] + sorted_v[hi]) / 2
median(num_friends) # 6.0
• Clearly, the mean is simpler to compute, and it varies smoothly as our data changes.
• If we have n data points and one of them increases by some small amount e, then necessarily
the mean will increase by e / n. (This makes the mean amenable to all sorts of calculus tricks.)
• Whereas in order to find the median, we have to sort our data. And changing one of our
data points by a small amount e might increase the median by e, by some number less than
e, or not at all (depending on the rest of the data).
• At the same time, the mean is very sensitive to outliers in our data.
• If outliers are likely to be bad data (or otherwise unrepresentative of whatever phenomenon
we’re trying to understand), then the mean can sometimes give us a misleading picture.
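A quick illustration of this sensitivity, using made-up numbers:

data = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]

mean(data), median(data)                  # (3.0, 3)
mean(with_outlier), median(with_outlier)  # (102.0, 3): one outlier drags the mean, not the median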
A generalization of the median is the quantile, which represents the value less than which
a certain percentile of the data lies. (The median represents the value less than which 50%
of the data lies.)
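A sketch in the same style as the other functions here:

def quantile(x, p):
    """returns the value below which a fraction p of the data lies"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

quantile(num_friends, 0.25)  # first quartile; used by interquartile_range below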
2. Dispersion
Dispersion measures how spread out the data values are.
Variance: The average squared deviation from the mean.

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    mean_x = mean(x)
    # average squared deviation from the mean, using n - 1 for a sample
    return sum((x_i - mean_x) ** 2 for x_i in x) / (n - 1)
Standard Deviation: The square root of the variance, representing roughly how far data
values typically fall from the mean.
import math

def standard_deviation(x):
    return math.sqrt(variance(x))

standard_deviation(num_friends)  # 9.03
def interquartile_range(x):
return quantile(x, 0.75) - quantile(x, 0.25)
interquartile_range(num_friends) # 6
Correlation
• Correlation is a statistical measure that describes the degree and
direction of the linear relationship between two variables. In simpler
terms, it tells us how changes in one variable are associated with
changes in another.
Types of Correlation:
• Positive Correlation: Both variables move in the same direction.
• Negative Correlation: As one variable increases, the other decreases.
• No Correlation: No predictable relationship between the variables.
Methods to Calculate Correlation:
• Pearson Correlation: Measures the linear relationship between two continuous variables
(a sketch follows this list).
• Spearman Rank Correlation: Measures the relationship between two variables based on the
rank of the data rather than the actual values. Useful when the data is not normally
distributed or when dealing with ordinal data.
• Kendall Tau Correlation: Another rank-based correlation method, useful for small sample
sizes and for ordinal data.
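A sketch of the Pearson version, reusing the mean and standard_deviation functions defined earlier:

def covariance(x, y):
    """measures how x and y vary together about their means"""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    return sum((x_i - mean_x) * (y_i - mean_y)
               for x_i, y_i in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance rescaled to lie between -1 and 1"""
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / (stdev_x * stdev_y)
    else:
        return 0  # if either variable is constant, correlation is zero by convention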
Simpson’s Paradox
• Simpson's Paradox is a phenomenon in probability and statistics
where a trend that appears in several different groups of data reverses
or disappears when the groups are combined.
• This paradox highlights how misleading aggregated data can be and
the importance of analyzing data in its proper context.
• Simpson's Paradox occurs when the relationship between two variables
reverses upon considering a third variable, often due to lurking
variables or confounding factors.
• Correlations can be misleading when confounding variables are ignored.
For example, imagine that you can identify all of your members as either East
Coast data scientists or West Coast data scientists. You decide to examine which
coast’s data scientists are friendlier by comparing each group’s average number of
friends. From the aggregate averages, it certainly looks like the West Coast data
scientists are friendlier than the East Coast data scientists.
When playing with the data you discover something very strange. If you only look at people
with PhDs, the East Coast data scientists have more friends on average. And if you only look at
people without PhDs, the East Coast data scientists also have more friends on average!
Once you account for the users’ degrees, the correlation goes in the
opposite direction!
Bucketing the data as East Coast/West Coast disguised the fact that the
East Coast data scientists skew much more heavily toward PhD types.
Simpson's Paradox can occur due to several reasons:
1. Confounding Variables: These are variables that affect both the independent and
dependent variables, potentially misleading the results.
2. Data Aggregation: Aggregating data without considering underlying groups can obscure
the true relationship between variables.
3. Unequal Group Sizes: When the sizes of the groups are very different, the aggregated
data might give more weight to the larger group, skewing the results.
Some Other Correlational Caveats
A correlation of zero indicates that there is no linear relationship between the two
variables. However, there may be other sorts of relationships. For example, if:
x = [-2, -1, 0, 1, 2]
y = [ 2, 1, 0, 1, 2]
then x and y have zero correlation. But they certainly have a relationship — each
element of y equals the absolute value of the corresponding element of x.
What they don’t have is a relationship in which knowing how x_i compares to mean(x)
gives us information about how y_i compares to mean(y). That is the sort of relationship
that correlation looks for.
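Using the correlation sketch above, we can verify this numerically:

x = [-2, -1, 0, 1, 2]
y = [ 2,  1, 0, 1, 2]

correlation(x, y)  # 0.0, even though y[i] == abs(x[i]) for every i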
In addition, correlation tells you nothing about how large the relationship is. The
variables:
x = [-2, -1, 0, 1, 2]
y = [99.98, 99.99, 100, 100.01, 100.02]
are perfectly correlated, but (depending on what you’re measuring) it’s quite
possible that this relationship isn’t all that interesting.
A caveat is essentially a "heads-up" or caution, suggesting that there are additional factors to
consider before proceeding or making a decision.
Correlation and Causation
• Correlation: Two variables move together, but neither is established as the cause of the other.
• Cause and Effect: In a causal relationship, there is a clear cause-and-effect link. For example,
smoking causes an increased risk of lung cancer.
Key Differences
Directionality:
Correlation: Indicates that two variables are related, but doesn't specify whether one causes
the other.
Causation: Implies that changes in one variable directly lead to changes in another.
Confounding Variables:
Correlation: Might be influenced by a third variable that affects both of the correlated
variables, leading to a spurious or misleading correlation.
Causation: Requires ruling out external factors, establishing that the relationship is not a
byproduct of some confounding variable.
Examples:
Correlation Example: There is a correlation between ice cream sales and drowning
incidents. However, this doesn’t mean ice cream sales cause drowning. A third variable (hot
weather) increases both.
Causation Example: Exposure to sunlight causes sunburn. Here, the relationship is direct
and causal.
Probability
• The concept of probability has a long history that goes back thousands of years when words
like “probably”, “likely”, “maybe”, “perhaps” and “possibly” were introduced into spoken
languages.
• However, the mathematical theory of probability was formulated only in the 17th century.
• Probability can also be defined as a scientific measure of chance
• Probability can be expressed mathematically as a numerical index ranging from zero
(an absolute impossibility) to one (an absolute certainty).
• Most events have a probability strictly between 0 and 1, which means that each such event
has at least two possible outcomes: a favorable outcome (success) and an unfavorable
outcome (failure).
• For our purposes, you should think of probability as a way of quantifying the uncertainty
associated with events chosen from some universe of events.
• Ex: Tossing a coin
• Notationally, we write P(E) to mean “the probability of the event E.”
• We’ll use probability theory to build models, to evaluate models, and, more generally, all
over the place.
Dependence and Independence
• Roughly speaking, we say that two events E and F are dependent if knowing something
about whether E happens gives us information about whether F happens (and vice versa).
Otherwise they are independent.
• Mathematically, we say that two events E and F are independent if the probability that
they both happen is the product of the probabilities that each one happens:
P(E, F) = P(E) P(F)
For example, flip a fair coin twice. The probability of “first flip heads” is 1/2, and the
probability of “both flips tails” is 1/4, but the probability of “first flip heads and both flips
tails” is 0, so those two events are not independent.
Conditional Probability
• Let A be an event in the world and B be another event. Suppose that events A and B are not
mutually exclusive, but that one may occur conditionally on the occurrence of the other.
• The probability that event A will occur if event B occurs is called the Conditional
Probability
• Conditional probability is denoted mathematically as p(A|B) in which the vertical bar
represents “given” and the complete probability expression is interpreted as “Conditional
probability of event A occurring given that event B has occurred”
• The number of times A and B can occur, or the probability that both A and B will occur, is
called the joint probability of A and B. It is represented mathematically as p(AB)
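The definition p(A|B) = p(AB) / p(B) is easy to check by simulation. A minimal sketch, where the events are illustrative (A = "a die roll shows 6", B = "the roll is even"):

import random

random.seed(0)
trials = 100_000
count_B = 0   # times event B ("roll is even") occurred
count_AB = 0  # times A and B both occurred ("roll is 6", which is also even)

for _ in range(trials):
    roll = random.randint(1, 6)
    if roll % 2 == 0:
        count_B += 1
        if roll == 6:
            count_AB += 1

print(count_AB / count_B)  # estimate of p(A|B) = p(AB) / p(B); the exact value is 1/3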
Bayesian Theorem
The concept of the Bayesian Theorem is based on Conditional Probability.
Conditional Probability: Conditional probability is defined as the likelihood of an event or
outcome occurring, based on the occurrence of a previous event or outcome.
[Venn diagram: two overlapping events A and B; the overlapping region represents the joint event AB.]
Bayesian Theorem
The conditional probability of events A and B can be written as follows, assuming the
event that occurred last was B:
P(A|B) = P(AB) / P(B)    (1)
[Figure: a sample space partitioned into events A1, A2, A3, each overlapping the event B.]
If the events A1, A2, A3 partition the sample space, the total probability of B is:
P(B) = P(B|A1) P(A1) + P(B|A2) P(A2) + P(B|A3) P(A3)
Substituting this expansion of P(B) into (1) yields Bayes's theorem.
Bayesian Theorem: Example
For example, a doctor knows that the disease meningitis causes the
patient to have a stiff neck, say, 70% of the time. The doctor also
knows some unconditional facts: the prior probability that a patient has
meningitis is 1/50,000, and the prior probability that any patient has a
stiff neck is 1%. Letting s be the proposition that the patient has a stiff
neck and m be the proposition that the patient has meningitis, we
have
P(s|m) = 0.7
P(m) = 1/50000
P(s) = 0.01
P(m|s) = P(s|m) P(m) / P(s)
P(m|s) = 0.7 × (1/50,000) / 0.01 = 0.0014
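The same computation takes a few lines of Python:

# Bayes's theorem applied to the meningitis example above
p_s_given_m = 0.7   # P(s|m): stiff neck given meningitis
p_m = 1 / 50_000    # P(m): prior probability of meningitis
p_s = 0.01          # P(s): prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # ≈ 0.0014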
Another example: you wake up to a cloudy morning. What is the chance of rain during
the day? We will use Rain to mean rain during the day, and Cloud to mean a cloudy
morning. The chance of Rain given Cloud is written P(Rain | Cloud).
So let's put that in the formula:
P(Rain | Cloud) = P(Rain)*P(Cloud | Rain)/P(Cloud)
P(Rain) is Probability of Rain = 10%
P(Cloud | Rain) is Probability of Cloud, given that Rain happens =
50%
P(Cloud) is Probability of Cloud = 40%
P(Rain | Cloud) = 0.1 × 0.5 / 0.4 = 0.125, or a 12.5% chance of rain.
Bayes’s Theorem
One of the data scientist’s best friends is Bayes’s theorem, which is a way of “reversing”
conditional probabilities.
Let’s say we need to know the probability of some event E conditional on some other event
F occurring. But we only have information about the probability of F conditional on E
occurring. Using the definition of conditional probability twice tells us that:
P(E|F) = P(E, F) / P(F) = P(F|E) P(E) / P(F)
The event F can be split into the two mutually exclusive events “F and E” and “F and not
E.” If we write ¬E for “not E” (i.e., “E doesn’t happen”), then:
P(F) = P(F, E) + P(F, ¬E)
so that:
P(E|F) = P(F|E) P(E) / [P(F|E) P(E) + P(F|¬E) P(¬E)]
Random Variables
A random variable is a variable whose possible values have an associated probability
distribution.
A random variable is a function that assigns a real number to each outcome in a sample space
of a random experiment.
A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up
tails.
A more complicated one might measure the number of heads observed when flipping a coin 10
times or a value picked from range(10) where each number is equally likely.
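A sketch of these examples in plain Python:

import random

def coin_flip():
    """equals 1 if a coin flip turns up heads, 0 otherwise"""
    return 1 if random.random() < 0.5 else 0

def num_heads(n=10):
    """number of heads observed when flipping a coin n times"""
    return sum(coin_flip() for _ in range(n))

random.randrange(10)  # a value from range(10), each number equally likely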
There are two main types of random variables:
1. Discrete Random Variable:
   1. Takes on a countable number of distinct values.
   2. Examples include the number of heads in 10 coin flips, the number of students
      present in a class, or the roll of a die.
   3. The probability distribution of a discrete random variable is described by a
      probability mass function (PMF).
2. Continuous Random Variable:
   1. Takes on an infinite number of possible values, typically within a given range.
   2. Examples include the time it takes to run a mile, the exact height of students in a
      classroom, or the temperature at a specific time of day.
   3. The probability distribution of a continuous random variable is described by a
      probability density function (PDF).
Continuous Distributions
• Continuous distributions are probability distributions that describe the likelihood of
continuous outcomes—those that can take any value within a certain range.
• Unlike discrete distributions, where outcomes are countable, continuous distributions are
used when the set of possible outcomes forms a continuum, such as measurements of time,
temperature, or distance.
• We represent a continuous distribution with a probability density function (pdf) such that
the probability of seeing a value in a certain interval equals the integral of the density
function over the interval.
def uniform_pdf(x):
    """density of the uniform distribution on [0, 1)"""
    return 1 if 0 <= x < 1 else 0
We will often be more interested in the cumulative distribution function (cdf), which gives
the probability that a random variable is less than or equal to a certain value. It’s not hard
to create the cumulative distribution function for the uniform distribution:
def uniform_cdf(x):
    """returns the probability that a uniform random variable is <= x"""
    if x < 0:   return 0  # uniform random is never less than 0
    elif x < 1: return x  # e.g. P(X <= 0.4) = 0.4
    else:       return 1  # uniform random is always less than 1
The Normal Distribution
• The normal distribution is the king of distributions. It is the classic bell curve–shaped
distribution and is completely determined by two parameters: its mean (mu) and its
standard deviation (sigma).
• The mean indicates where the bell is centered, and the standard deviation how “wide” it is.
• If Z is a standard normal random variable, then X = σZ + µ is also normal, but with mean µ
and standard deviation σ. Conversely, if X is a normal random variable with mean µ and
standard deviation σ, then Z = (X − µ) / σ is a standard normal variable.
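The density function itself is simple to write in plain Python. A sketch, with mu and sigma defaulting to the standard normal:

import math

def normal_pdf(x, mu=0, sigma=1):
    """density of the normal distribution with mean mu and standard deviation sigma"""
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt_two_pi * sigma)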
The cumulative distribution function for the normal distribution cannot be written in an
“elementary” manner, but we can write it using Python’s math.erf:
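A sketch of that function:

import math

def normal_cdf(x, mu=0, sigma=1):
    # math.erf computes the error function, a rescaled integral of the Gaussian density
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2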