Government First Grade College
Unit - 3
HOD, BCA
R as a set of statistical tables
R is a language and environment specifically designed for
statistical computing and data analysis.
It provides a wide variety of statistical and graphical
techniques, and it is extensible through packages.
In R, you can represent statistical information in the form of
data frames, which are essentially tables containing rows and
columns of data.
You can perform various statistical analyses and generate
graphical representations of the data using R.
In this sense, R serves as a comprehensive set of statistical tools and tables.
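For example, a small data frame can hold tabular statistical data directly in R (the student names and marks below are made up for illustration):

```r
# A small data frame: one row per student, one column per variable
scores <- data.frame(
  name  = c("Asha", "Ravi", "Meena"),
  marks = c(78, 85, 91)
)

# Summary statistics for the numeric column
mean(scores$marks)   # 84.66667
nrow(scores)         # 3
```

Each column of a data frame is a vector, so all of R's statistical functions apply to it directly.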
Government First Grade College, Thirtahalli, Department of Computer Science, Yogeesh C S
Statistics And Probability
Statistics and probability are closely related fields, and statistics can
be used to analyze data and make inferences about probabilistic
events. Both are important fields in data analysis, and R is a popular
programming language and environment for working with data.
Statistics: The science of collecting, analyzing, presenting, and
interpreting data.
Statistics is a branch of mathematics and a scientific discipline that
deals with the collection, analysis, interpretation, presentation, and
organization of data. It plays a crucial role in understanding and
summarizing information from data, making decisions, and drawing
inferences or conclusions based on empirical evidence.
Installing and Loading Packages:
To perform statistical and probabilistic analysis, you may need to install and load specific packages. The stats package ships with base R and needs no installation; the distributions package, as named in these notes, is an add-on.
# Install add-on packages (only needed once; "stats" is part of base R)
install.packages("distributions")
# Load packages
library(stats)
library(distributions)
Descriptive Statistics
In descriptive statistics, the data is described in a summarized way. The summarization is done from a sample of the population using parameters such as the mean or the standard deviation. Descriptive statistics use charts, graphs, and summary measures to organize, represent, and explain a set of data.
• Data is typically arranged and displayed in tables or graphs such as histograms, pie charts, bar charts, or scatter plots.
• Descriptive statistics are just descriptive; they do not generalize beyond the data collected.
Inferential Statistics
In inferential statistics, we try to interpret the meaning of the descriptive statistics. After the data has been collected, analyzed, and summarized, we use inferential statistics to describe the meaning of the collected data.
• Inferential statistics use the principles of probability to assess whether trends found in the research sample can be generalized to the larger population from which the sample was originally drawn.
• Inferential statistics are intended to test hypotheses and investigate relationships between variables, and can be used to make predictions about populations.
• Inferential statistics are used to draw conclusions and inferences, i.e., to make valid generalizations from samples.
Process of Descriptive Analysis:
Descriptive analysis refers to the process of summarizing and exploring a dataset to gain
a better understanding of its basic characteristics. The primary
objective of descriptive analysis is to describe and visualize the data
in a way that helps researchers or analysts recognize patterns,
trends, and important features. This analysis provides a foundation
for more advanced statistical analysis and hypothesis testing.
Step 1: Data Collection
Before conducting any analysis, you must first collect relevant data.
This process involves identifying data sources, selecting appropriate
data-collecting methods, and verifying that the data acquired
accurately represents the population or topic of interest.
You can collect data through surveys, experiments, observations,
existing databases, or other methods
Step 2: Data Preparation
Data preparation is crucial for ensuring the dataset is clean,
consistent, and ready for analysis. This step covers the following
tasks:
Data Cleaning: Handle missing values, outliers, and errors in the dataset. Impute missing values or apply appropriate statistical techniques for dealing with them.
Data Transformation: Convert data into an appropriate format.
Examples of this are changing data types, encoding categorical
variables, or scaling numerical variables.
Data Reduction: For large datasets, try reducing their size by
sampling or aggregation to make the analysis more manageable.
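The preparation steps above can be sketched in R; the vector below and its values are hypothetical:

```r
# Hypothetical raw vector with a missing value (NA)
raw <- c(12, NA, 18, 15, 200)

# Cleaning: the missing value makes a plain mean undefined
mean(raw)                # NA
mean(raw, na.rm = TRUE)  # 61.25, missing value ignored
raw[is.na(raw)] <- mean(raw, na.rm = TRUE)   # simple mean imputation

# Transformation: scale a numeric variable to mean 0, sd 1
scaled <- scale(raw)

# Reduction: take a random sample from a large vector
set.seed(1)
big <- rnorm(10000)
sample_small <- sample(big, 100)
```

Mean imputation is only one of many strategies; the right choice depends on why the values are missing.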
Step 3: Apply Methods
In this step, you will analyze and describe the data using a variety of
methodologies and procedures. The following are some common
descriptive analysis methods:
Frequency Distribution Analysis: Create frequency tables or bar
charts to show the number or proportion of occurrences for each
category for categorical variables.
Measures of Central Tendency: Calculate numerical variables’ mean,
median, and mode to determine the center or usual value.
Measures of Dispersion: Calculate the range, variance, and standard
deviation to examine the dispersion or variability of the data.
Measures of Position: Identify the position of a single value relative to the others, for example using percentiles or quartiles.
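These measures can be computed directly in R on a small example vector (the values are chosen for illustration):

```r
# Descriptive measures on a small hypothetical dataset
marks <- c(45, 52, 52, 60, 68, 75, 90)

table(marks)      # frequency distribution of each value
mean(marks)       # central tendency: 63.14286
median(marks)     # 60
range(marks)      # dispersion: 45 90
var(marks)        # variance
sd(marks)         # standard deviation
quantile(marks)   # measures of position (quartiles)
```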
Identify which variables are important to your descriptive analysis
and research questions. Various methods are used for numerical and
categorical variables, so it is essential to distinguish between them.
After the dataset has been analyzed, researchers may interpret the findings in light of their goals. The analysis is successful if the conclusions match what was anticipated; otherwise, they must search for weaknesses in their strategy and repeat these steps to obtain better outcomes.
Covariance and Correlation in R Programming
Covariance and correlation are terms used in statistics to measure the relationship between two random variables. Both of these terms measure the linear dependency between a pair of random variables, or bivariate data.
Covariance in R Programming Language
In R programming, covariance can be measured using the cov() function. Covariance is a
statistical term used to measure the direction of the linear relationship between the data
vectors. Mathematically,

cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (N − 1)

where,
x represents the x data vector
y represents the y data vector
x̄ represents the mean of the x data vector
ȳ represents the mean of the y data vector
N represents the total number of observations
Covariance Syntax in R
Syntax: cov(x, y, method)
where,
x and y represents the data vectors
method defines the type of method to be used to compute covariance. Default is “pearson”.
Example:
# Data vectors
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

# Print covariance using different methods
print(cov(x, y))
print(cov(x, y, method = "pearson"))
print(cov(x, y, method = "kendall"))
print(cov(x, y, method = "spearman"))
Output:
[1] 30.66667
[1] 30.66667
[1] 12
[1] 1.666667
Correlation in R Programming Language
cor() function in R programming measures the correlation coefficient value. Correlation is a
relationship term in statistics that uses the covariance method to measure how strongly the vectors
are related. Mathematically,

cor(x, y) = cov(x, y) / (s_x · s_y) = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

where,
x represents the x data vector
y represents the y data vector
x̄ represents the mean of the x data vector
ȳ represents the mean of the y data vector
s_x and s_y represent the standard deviations of x and y
Correlation in R
Syntax: cor(x, y, method)
where,
x and y represents the data vectors
method defines the type of method to be used to compute the correlation. Default is “pearson”.
Example:
# Data vectors
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
# Print correlation using different methods
print(cor(x, y))
print(cor(x, y, method = "pearson"))
print(cor(x, y, method = "kendall"))
print(cor(x, y, method = "spearman"))
Output:
[1] 0.9724702
[1] 0.9724702
[1] 1
[1] 1
Difference between Covariance and Correlation
We can discuss some of the main differences between them as below:

Covariance:
• Covariance quantifies the interdependence of two variables. It measures the strength of the relationship between changes in one variable and changes in another.
• Covariance is not scaled, and its value depends on the units used to measure the variables. Comparing covariances across many datasets or variables is therefore challenging.
• The scales of the variables affect covariance, which is not standardised; as a result, comparing the size of covariances across several datasets or variables is difficult.
• Covariance helps in understanding the combined variability of two variables and their potential link. It is frequently used in statistical models, risk analysis, and portfolio analysis.

Correlation:
• Correlation standardises the covariance by dividing it by the product of the standard deviations of the variables.
• Correlation is a standardised measure with a range of −1 to 1. It enables meaningful comparisons between several datasets or variables and is independent of the magnitude of the variables.
• This standardisation allows meaningful interpretation of the strength and direction of the association and makes correlation values comparable.
• Correlation is frequently used to assess how strongly two variables are linearly related. It frequently appears in data analysis, regression models, prediction models, and multicollinearity assessments.
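This relationship can be verified directly in R: dividing the covariance by the product of the standard deviations reproduces the value returned by cor():

```r
# Correlation is covariance standardised by the product of the
# standard deviations: cor(x, y) = cov(x, y) / (sd(x) * sd(y))
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

cov(x, y) / (sd(x) * sd(y))   # 0.9724702
cor(x, y)                     # 0.9724702, the same value
```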
Probability distribution:
• Probability denotes the possibility of something happening.
• A probability distribution is a statistical function that describes all the possible values and
probabilities for a random variable within a given range.
• This range will be bound by the minimum and maximum possible values, but where the
possible value would be plotted on the probability distribution will be determined by a
number of factors.
• The mean (average), standard deviation, skewness, and kurtosis of the distribution are
among these factors.
• A probability distribution is a mathematical function or model that describes how the possible outcomes of a random variable are distributed, that is, the likelihood of those outcomes occurring. In other words, it specifies the probability associated with each possible value of the random variable.
• Probability distributions are fundamental in statistics and probability theory and are used to
model a wide range of real-world phenomena, from the outcomes of rolling dice to the
measurements of physical quantities.
Discrete Probability Distributions
A discrete probability distribution can assume only a countable set of values. For example, coin tosses
and counts of events are discrete functions. These are discrete distributions because there are no in-
between values. For example, you can have only heads or tails in a coin toss. Similarly, if you’re
counting the number of books that a library checks out per hour, you can count 21 or 22 books, but
nothing in between.
A probability mass function (PMF) mathematically describes a probability distribution for a discrete
variable. You can display a PMF with an equation or graph.
What is a Probability Mass Function?
A probability mass function (PMF) is a mathematical function that calculates the probability a
discrete random variable will be a specific value. PMFs also describe the probability distribution for
the full range of values for a discrete variable.
The standard notation for a probability mass function is P(X = k) = f(k), where:
• X is the discrete random variable.
• k is one of the possible discrete values.
• f(k) is a mathematical function that calculates the likelihood of the value k.
So, putting it all together, P(X = k) = f(k) means: the chance of variable X assuming the specific value k equals f(k).
For example, the Poisson distribution, a common discrete distribution, has the PMF P(X = k) = λ^k · e^(−λ) / k!, where:
• λ (lambda) is the mean
• e is Euler's number (approximately 2.718)
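As one concrete illustration of a PMF, the Poisson distribution with mean λ can be evaluated both from its formula, λ^k e^(−λ) / k!, and with R's built-in dpois(); the λ and k below are arbitrary example values:

```r
lambda <- 4   # mean of the distribution
k <- 2

# PMF computed directly from the formula
manual <- lambda^k * exp(-lambda) / factorial(k)

# Built-in Poisson PMF: gives the same value as `manual`
dpois(k, lambda)

# Probabilities over the full range of values sum to 1
sum(dpois(0:100, lambda))
```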
Continuous Probability Distributions
• Continuous probability functions are also known as probability density functions. You know
that you have a continuous distribution if the variable can assume an infinite number of
values between any two values. Continuous variables are often measurements on a scale,
such as height, weight, and temperature.
• Unlike discrete probability distributions where each particular value has a non-zero
likelihood, specific values in continuous probability distribution functions have a zero
probability. For example, the likelihood of measuring a temperature that is exactly 32
degrees is zero.
• A probability density function (PDF) is a mathematical function that describes a continuous
probability distribution. It provides the probability density of each value of a variable, which
can be greater than one.
• A cumulative distribution function is another type of function that describes a continuous
probability distribution.
R Functions for Probability Distributions
Every distribution that R handles has four functions. There is a root name; for example, the root name for the normal distribution is norm. This root is prefixed by one of the following letters:
• p for "probability", the cumulative distribution function (c. d. f.)
• q for "quantile", the inverse c. d. f.
• d for "density", the density function (p. f. or p. d. f.)
• r for "random", a random variable having the specified distribution
For the normal distribution, these functions are pnorm, qnorm, dnorm, and rnorm.
For the binomial distribution, these functions are pbinom, qbinom, dbinom, and rbinom. And so
forth.
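A quick sketch of the four prefixes for the normal distribution, showing that the q* function inverts the p* function:

```r
# The four normal-distribution functions share the root name "norm"
dnorm(0)      # density at 0: 1/sqrt(2*pi) = 0.3989423
pnorm(1.96)   # c.d.f.: P(Z <= 1.96) = 0.9750021
qnorm(0.975)  # inverse c.d.f.: 1.959964
rnorm(3)      # three random draws from the standard normal

# q* inverts p*: qnorm(pnorm(z)) returns z
qnorm(pnorm(1.5))   # 1.5
```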
Normal distribution: also known as a Gaussian distribution, is a continuous probability distribution
that is symmetric and bell-shaped. In a normal distribution, the data is centered around a mean
(average) value, and it follows a specific pattern where most of the data points are close to the
mean, and as you move away from the mean, the number of data points decreases symmetrically.
The properties of the normal distribution are well-defined, and it is widely used in statistics and data
analysis.
The probability density function (PDF) of a normal distribution is given by the formula:

f(x) = (1 / (σ √(2π))) · e^(−(x − μ)² / (2σ²))

where:
x = the value of the variable being examined, and f(x) is the probability density
μ = the mean
σ = the standard deviation
Normal Distribution is a probability function used in statistics that tells about how the data
values are distributed. It is the most important probability distribution function used in statistics
because of its advantages in real case scenarios.
• For example: the height of a population, shoe size, IQ level, and many more.
• It is generally observed that data distribution is normal when there is a random collection of
data from independent sources.
• The graph produced by plotting the values of the variable on the x-axis and their counts on the y-axis is a bell-shaped curve.
• The peak of the graph is at the mean of the dataset; half of the values lie to the left of the mean and the other half to the right, so the distribution is symmetric.
• Normal distribution is widely used in statistics and provides a convenient way to model a
variety of phenomena in the natural and social sciences. It's important to note that not all
real-world phenomena follow a perfect normal distribution, but many approximate it well.
• In R, you can work with normal distributions using the dnorm(), pnorm(), qnorm(), and
rnorm() functions.
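A small sketch of these functions in action; the mean of 50 and standard deviation of 5 are arbitrary example parameters:

```r
# Draw a large sample from N(mean = 50, sd = 5) and check that the
# sample statistics are close to the chosen parameters
set.seed(42)
heights <- rnorm(10000, mean = 50, sd = 5)

mean(heights)   # close to 50
sd(heights)     # close to 5

# Roughly 68% of values lie within one sd of the mean
mean(heights > 45 & heights < 55)

# Theoretical P(X <= 55) for N(50, 5): pnorm(1) = 0.8413447
pnorm(55, mean = 50, sd = 5)
```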
Binomial Distribution in R Programming
Binomial distribution in R is a probability distribution used in statistics. The binomial distribution is a discrete distribution with only two outcomes, i.e., success or failure. All trials are independent, the probability of success remains the same from trial to trial, and the outcome of one trial does not affect the next. The binomial distribution helps us to find individual probabilities as well as cumulative probabilities over a certain range.
Functions for Binomial Distribution
We have four functions for handling binomial distribution in R namely:
dbinom()
dbinom(k, n, p)
pbinom()
pbinom(k, n, p)
where n is total number of trials, p is probability of success, k is the value at which the probability
has to be found out.
qbinom()
qbinom(P, n, p)
Where P is the probability, n is the total number of trials and p is the probability of success.
rbinom()
rbinom(n, N, p)
Where n is numbers of observations, N is the total number of trials, p is the probability of success.
dbinom() Function
This function is used to find probability at a particular value for a data that follows binomial
distribution i.e. it finds:
P(X = k)
Syntax:
dbinom(k, n, p)
pbinom() Function
The function pbinom() is used to find the cumulative probability, for data following a binomial distribution, up to a given value, i.e., it finds
P(X <= k)
Syntax:
pbinom(k, n, p)
qbinom() Function
This function is used to find the quantile: if P(X <= k) is given, it finds the smallest such k.
Syntax:
qbinom(P, n, p)
rbinom() Function
This function generates n random values from a binomial distribution with N trials and success probability p.
Syntax:
rbinom(n, N, p)
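A worked example of all four functions for 10 tosses of a fair coin (n = 10 trials, p = 0.5):

```r
# 10 coin tosses with a fair coin
n <- 10
p <- 0.5

dbinom(3, n, p)         # P(X = 3)  = choose(10, 3) / 2^10 = 0.1171875
pbinom(3, n, p)         # P(X <= 3) = 0.171875
qbinom(0.171875, n, p)  # inverse of pbinom: returns 3
rbinom(5, n, p)         # 5 simulated counts of heads

# The probabilities P(X = k) over all k sum to 1
sum(dbinom(0:n, n, p))
```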
Examples of binomial distributions: Binomial distributions are commonly used to model situations involving a fixed number of trials with two possible outcomes (success or failure).
• Coin Flipping: Model the number of heads (successes) in a fixed number of coin flips.
• Manufacturing Defects: Determine the probability of a certain number of defective items in
a batch produced with a known defect rate.
• Quality Control: Estimate the probability of a specific number of defective products in a
sample from a production line.
Average
The average, also known as the mean, is a measure of central tendency that represents the "typical" value in a dataset. To calculate the mean, you sum all the values in the dataset and divide by the total number of values.
• Formula: Mean = (Sum of all values) / (Number of values).
• The mean is sensitive to extreme values (outliers).
In R, you can calculate the mean using the mean() function:
> data <- c(10, 15, 20, 25, 30)
> # Calculate the mean
> mean_value <- mean(data)
> mean_value
[1] 20
Median:
The median is the middle value in a dataset when the data is ordered from smallest to largest. If there is an even number of values, the median is the average of the two middle values.
• It is a measure of central position that is not affected by extreme values (outliers).
• The median is less affected by outliers than the mean.
In R, you can calculate the median using the median() function:
> # Example dataset
> data <- c(10, 15, 20, 25, 30)
> # Calculate the median
> median_value <- median(data)
> median_value
[1] 20
> data <- c(10, 15, 5, 20, 25, 30)
> median(data)
[1] 17.5
Mode:
The mode is the value that appears most frequently in a dataset.
• A dataset can have one mode (unimodal), multiple modes (multimodal), or no mode at all.
• The mode is often used for categorical or discrete data, but it can also be applied to continuous data.
Example dataset: 2, 3, 3, 5, 6, 7, 7, 9
Mean: (2 + 3 + 3 + 5 + 6 + 7 + 7 + 9) / 8 = 42 / 8 = 5.25
Median: Since there are 8 values, the median is the average of the 4th and 5th values: (5 + 6) / 2 =
5.5
Mode: The mode is 3 and 7 because they both occur twice, making this dataset bimodal.
# Example dataset
> data <- c(2, 3, 3, 5, 6, 7, 7, 9)
> # Custom function to calculate the mode
> # (R has no built-in mode function; this returns the first mode
> # in case of ties)
> calculate_mode <- function(x) {
+   unique_x <- unique(x)
+   unique_x[which.max(tabulate(match(x, unique_x)))]
+ }
> # Calculate the mode
> mode_value <- calculate_mode(data)
> mode_value
[1] 3
Variance:
Variance is a measure of the spread or dispersion of data points in a dataset. It quantifies how much individual data points deviate from the mean.
• A high variance indicates that the data points are spread out over a wider range.
• Formula: Variance = (Sum of the squared differences from the mean) / (Number of values − 1).

s² = Σ (x_i − x̄)² / (n − 1)

where:
s² = the sample variance
x_i = the value of one observation
x̄ = the mean value of all observations
n = the number of observations

• Variance is always a non-negative value.
• It is calculated by summing the squared differences between each data point and the mean, then dividing by n − 1.
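In R, the variance is computed with the var() function, which matches the sample formula above:

```r
# Example dataset
data <- c(10, 15, 20, 25, 30)

# Sample variance: sum of squared deviations / (n - 1)
var(data)                                          # 62.5
sum((data - mean(data))^2) / (length(data) - 1)    # 62.5, the same formula
```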
Standard Deviation:
The standard deviation is closely related to the variance and measures the average amount of variation or dispersion in a dataset.
• It is the square root of the variance.
• It provides a measure of the "typical" or "average" amount of deviation of data points from the mean.
• Formula: Standard Deviation = √(Variance).
• It shares the same units as the data, making it more interpretable than the variance.
In R, you can calculate the standard deviation using the sd() function:
# Example dataset
> data <- c(10, 15, 20, 25, 30)
> # Calculate the standard deviation
> std_deviation_value <- sd(data)
> std_deviation_value
[1] 7.905694
For the population standard deviation, the formula is

σ = √( Σ (x_i − μ)² / N )

where:
σ (sigma) = the population standard deviation
N = the size of the population
x_i = each value from the population
μ = the population mean
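As a sketch of the difference between the two formulas: R's sd() divides by n − 1 (the sample formula), while the population formula divides by N:

```r
data <- c(10, 15, 20, 25, 30)

# R's sd() uses the sample formula (divides by n - 1)
sd(data)                                          # 7.905694

# Population standard deviation (divides by N)
sqrt(sum((data - mean(data))^2) / length(data))   # 7.071068
```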