What Is Data Science?
Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions. Data science uses complex machine learning algorithms to build
predictive models. More specifically, data science is used for complex data analysis,
predictive modeling, recommendation generation and data visualization.
1. Analysis of Complex Data
Data science allows for quick and precise analysis. With various software tools and
techniques at their disposal, data analysts can easily identify trends and detect patterns
within even the largest and most complex datasets. This enables businesses to make
better decisions, whether it’s regarding how to best segment customers or conducting a
thorough market analysis.
2. Predictive Modeling
Data science can also be used for predictive modeling. In essence, by finding patterns in
data through the use of machine learning, analysts can forecast possible future
outcomes with some degree of accuracy. These models are especially useful in
industries like insurance, marketing, healthcare and finance, where anticipating the
likelihood of certain events happening is central to the success of the business.
3. Recommendation Generation
Some companies, such as Netflix, Amazon and Spotify, rely on data science and big data
to generate recommendations for their users based on their past behavior. It’s thanks to
data science that users of these and similar platforms can be served up content that is
uniquely tailored to their preferences and interests.
4. Data Visualization
Data science is also used to create data visualizations — think graphs, charts,
dashboards — and reporting, which helps non-technical business leaders and busy
executives easily understand otherwise complex information about the state of their
business.
Data Science Lifecycle
Data science can be thought of as having a five-stage lifecycle:
1. Capture
This stage is when data scientists gather raw and unstructured data. The capture stage
typically includes data acquisition, data entry, signal reception and data extraction.
2. Maintain
This stage is when data is put into a form that can be utilized. The maintenance stage
includes data warehousing, data cleansing, data staging, data processing and data
architecture.
3. Process
This stage is when data is examined for patterns and biases to see how it will work as a
predictive analysis tool. The process stage includes data mining, clustering and
classification, data modeling and data summarization.
4. Analyze
This stage is when multiple types of analyses are performed on the data. The analysis stage involves exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.
5. Communicate
This stage is when data scientists and analysts showcase the data through reports, charts and graphs. The communication stage typically includes data reporting, data visualization, business intelligence and decision making.
Big Data:
Big data is the huge, voluminous data, information, or relevant statistics acquired by large organizations and ventures. Because such data is too large to process manually, dedicated software tools and data stores have been created to handle it. Big data is used to discover patterns and trends and to make decisions related to human behavior and interaction with technology.
Advantages of Big Data:
1. Able to handle and process large and complex data sets that cannot be easily
managed with traditional database systems
2. Provides a platform for advanced analytics and machine learning applications
3. Enables organizations to gain insights and make data-driven decisions based on
large amounts of data
4. Offers potential for significant cost savings through efficient data management
and analysis.
Disadvantages of Big Data:
1. Requires specialized skills and expertise in data engineering, data management,
and big data tools and technologies
2. Can be expensive to implement and maintain due to the need for specialized
infrastructure and software
3. May face privacy and security concerns when handling sensitive data
4. Can be challenging to integrate with existing systems and processes.
Data Science vs Big Data: Major Comparison

Definition
Data Science: A discipline that covers all things data-related, including how to make the best use of big data. Data science is the main method for utilizing the potential of big data.
Big Data: A term used to describe massive quantities of data that are too complex and vast to be stored and handled by traditional data processing software. Big data encompasses all types of data, which aid in providing the appropriate information, to the appropriate person, in the appropriate quantity, to aid in making educated decisions.

Concept
Data Science: The capacity to collect data electronically led to the development of the field of data science, which combines the study of statistics with computer science to evaluate extremely large amounts of data that could result in the discovery of new information.
Big Data: Volume, variety, and velocity are the main Vs of big data. It represents a variety of factors, including data volumes, the complexity of data kinds and structures, and the rate at which new data is produced. Big data refers to data or information that may be used to examine insights and produce strategic business decisions and well-informed conclusions.

Purpose
Data Science: Utilizing new data structures, ideas, tools, and algorithms, data science aims to take advantage of big data’s potential.
Big Data: The ability of analysts to evaluate such enormous and complicated data sets was previously impossible; this is the true worth of big data. The goal is to assist organizations in developing fresh growth opportunities or gaining a sizable advantage over conventional company methods.

Formation (Tools)
Data Science: The primary tools used in data science include SAS, R, Python, etc.
Big Data: Hadoop, Spark, Flink, etc., are among the tools mostly used in big data.

Application Areas
Data Science: Mainly used for scientific purposes such as internet searches, digital advertisements, risk detection, etc.
Big Data: Mostly employed for commercial objectives and client satisfaction. A few application areas of big data are research and development, health and sports, telecommunication, etc.

Main Focus
Data Science: The science of the data.
Big Data: The process of handling voluminous data.

Approach
Data Science: It supports business decisions by using mathematics and statistics together with programming skills, which further helps create a model to test a hypothesis.
Big Data: With the help of big data, businesses track their market presence, which helps them develop agility.
Datafication
In business, datafication can be defined as a process that “aims to transform most
aspects of a business into quantifiable data that can be tracked, monitored, and
analyzed. It refers to the use of tools and processes to turn an organization into a data-
driven enterprise.”
There are three areas of business where datafication can really make an impact:
1. Analytics: In today’s data-driven world, analytics is king. By collecting and analyzing data, businesses can gain valuable insights into consumer behavior, trends, and preferences, allowing them to make informed decisions that drive growth and success.
2. Marketing Campaigns: Marketing campaigns can be supercharged with datafication, allowing companies to personalize ads and offers for specific customers based on their interests and behaviors.
3. Forecasting: Predictive analytics can help businesses forecast future trends and stay ahead of the competition by anticipating changes in consumer demand.
The Data Science Landscape
Data science is part of the computer sciences. It comprises the disciplines of (i) analytics, (ii) statistics and (iii) machine learning.
Analytics
Analytics generates insights from data using simple presentation, manipulation,
calculation or visualization of data. In the context of data science, it is also sometimes
referred to as exploratory data analytics. It often serves to familiarize oneself with the subject matter and to obtain some initial hints for further analysis. To this end, analytics is often used to formulate appropriate questions for a data science project.
The limitation of analytics is that it does not necessarily provide any conclusive evidence
for a cause-and-effect relationship. Also, the analytics process is typically a manual and
time-consuming process conducted by a human with limited opportunity for automation.
In today’s business world, many corporations do not go beyond descriptive analytics,
even though more sophisticated analytical disciplines can offer much greater value, such
as those laid out in the analytic value escalator.
Statistics
In many instances, analytics may be sufficient to address a given problem. In other
instances, the issue is more complex and requires a more sophisticated approach to
provide an answer, especially if there is a high-stakes decision to be made under
uncertainty. This is when statistics comes into play. Statistics provides a methodological
approach to answer questions raised by the analysts with a certain level of confidence.
Analysts help you ask good questions, whereas statisticians bring you good answers.
Statisticians bring rigor to the table.
Sometimes simple descriptive statistics are sufficient to provide the necessary insight.
Yet, on other occasions, more sophisticated inferential statistics — such as regression
analysis — are required to reveal relationships between cause and effect for a certain
phenomenon. The limitation of statistics is that it is traditionally conducted with software
packages, such as SPSS and SAS, which require a distinct calculation for a specific
problem by a statistician or trained professional. The degree of automation is rather
limited.
Machine Learning
Artificial intelligence refers to the broad idea that machines can perform tasks normally
requiring human intelligence, such as visual perception, speech recognition, decision-
making and translation between languages. In the context of data science, machine
learning can be considered a sub-field of artificial intelligence that is concerned with
decision making. In fact, in its most essential form, machine learning is decision making
at scale. Machine learning is the field of study of computer algorithms that allow
computer programs to identify and extract patterns from data. A common purpose of
machine learning algorithms is therefore to generalize and learn from data in order to
perform certain tasks.
In traditional programming, input data and a hand-crafted model (the program) are given to a computer in order to produce a desired output. In machine learning, an algorithm is applied to input and output data in order to identify the most suitable model. Machine learning can thus be complementary to traditional programming, as it can provide a useful model to explain a phenomenon.
Figure: Traditional Programming vs. Machine Learning (own illustration adapted from Prince Barpaga)
Machine Learning vs. Data Mining
The terms machine learning and data mining are closely related and often used
interchangeably. Data mining is a concept that pre-dates the current field of machine
learning. The idea of data mining — also referred to as Knowledge Discovery in Databases
(KDD) in the academic context — emerged in the late 1980s and early 1990s when the
need for analysing large datasets became apparent. Essentially, data mining refers to a
structured way of extracting insight from data which draws on machine learning
algorithms. The main difference lies in the fact that data mining is a rather manual
process that requires human intervention and decision making, while machine learning
— apart from the initial setup and fine-tuning — runs largely independently.
Supervised and Unsupervised Learning
The majority of machine learning algorithms can be categorized into supervised and
unsupervised learning. The main distinction between these types of machine learning is
that supervised learning is conducted on data which includes both the input and output
data. It is also often referred to as “labeled data” where the label is the target attribute.
The algorithm can therefore validate its model by checking against the correct output
value. Typical supervised machine learning algorithms perform regression and classification analysis. Conversely, in unsupervised machine learning, the dataset does not include the target attribute; the data is thus unlabeled. The most common type of unsupervised learning is cluster analysis.
Other than the main streams of supervised and unsupervised ML algorithms, there are
additional variations, such as semi-supervised and reinforcement learning algorithms. In
semi-supervised learning a small amount of labeled data is used to bolster a larger set of
unlabeled data. Reinforcement learning trains an algorithm with a reward system,
providing feedback when an artificial intelligence agent performs the best action in a
particular situation.
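As an illustration, the following sketch contrasts supervised and unsupervised learning on a small synthetic dataset. It assumes scikit-learn is available; the data, model choices and parameters are made up purely for demonstration.

    # Minimal sketch contrasting supervised and unsupervised learning,
    # assuming scikit-learn is installed; the data here is synthetic.
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    # Synthetic data: X are the input features, y the (known) labels.
    X, y = make_blobs(n_samples=200, centers=3, random_state=0)

    # Supervised learning: the algorithm sees both inputs X and labels y.
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print("Training accuracy (supervised):", clf.score(X, y))

    # Unsupervised learning: only X is used; the labels are never shown.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("Cluster labels of first 10 points (unsupervised):", km.labels_[:10])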
Types of ML Problems — Regression, Classification and Clustering
In order to structure the field of machine learning, the vast number of ML algorithms are
often grouped by similarity in terms of their function (how they work), e.g. tree-based and
neural network-inspired methods. Given the large number of different algorithms, this
approach is rather complex. Instead, it is considered more useful to group ML algorithms
by the type of problem they are supposed to solve. The most common types of ML
problems are regression, classification and clustering. There are numerous specific ML
algorithms, most of which come with a lot of different variations to address these
problems. Some algorithms are capable of solving more than one problem.
Regression
Regression is a supervised ML approach used to predict a continuous value. The outcome of a regression analysis is a formula (or model) that relates one or many independent variables to a dependent target value. There are many different types of regression models, such as linear regression, logistic regression, ridge regression, lasso regression and polynomial regression. However, by far the most popular model for making predictions is the linear regression model. The basic formula for a univariate linear regression model is shown underneath:

y = β0 + β1x + ε

where y is the dependent target value, x the independent variable, β0 the intercept, β1 the slope and ε the error term.
Other regression models, although they share some resemblance to linear regression,
are more suited for classification, such as logistic regression [1]. Regression problems, i.e. forecasting or predicting a numerical value, can also be solved by artificial neural
networks which are inspired by the structure and/or function of biological neural
networks. They are an enormous subfield comprised of hundreds of algorithms and
variations used commonly for regression and classification problems. A neural network
is favored over regression models if there is a large number of variables. Like artificial
neural networks, regression and classification tasks can also be achieved by the k-
nearest neighbor algorithm.
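To make this concrete, here is a minimal sketch of fitting a univariate linear regression, assuming NumPy and scikit-learn are available; the data is synthetic and the true coefficients are invented for illustration.

    # Univariate linear regression on synthetic data (illustrative only).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, size=100).reshape(-1, 1)    # independent variable
    y = 1.0 + 2.5 * x.ravel() + rng.normal(0, 1, 100)  # y = b0 + b1*x + noise

    model = LinearRegression().fit(x, y)
    print("Estimated intercept b0:", model.intercept_)  # close to 1.0
    print("Estimated slope b1:", model.coef_[0])        # close to 2.5
    print("Prediction for x = 4:", model.predict([[4.0]])[0])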
Matrices in Data Science
Matrices are a foundational concept in data science that underpins a wide range of
mathematical and computational operations used for analyzing and manipulating data.
They provide a structured and organized way to represent information, making it easier to
process and extract meaningful insights. In this comprehensive explanation, we’ll delve
deeper into matrices in the context of data science, exploring their properties,
operations, and applications.
1. Matrix Basics
A matrix is a two-dimensional array of numbers arranged in rows and columns. Each
element in a matrix is identified by its row and column index. A matrix with “m” rows and
“n” columns is often referred to as an “m x n” matrix. Matrices are used to represent
datasets, where each row corresponds to an observation or sample, and each column
represents a feature or attribute of that sample. This structured representation makes it
convenient to apply mathematical operations and transformations to the data.
2. Data Representation
In data science, matrices serve as a powerful tool for representing datasets. Consider a
dataset containing information about various individuals, such as age, income, and
education level. By organizing this data into a matrix, where each row corresponds to an
individual and each column represents a different attribute, we create a structured
representation that facilitates analysis. This tabular arrangement simplifies operations like finding averages and correlations, and performing statistical analyses.
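As a small illustration, the sketch below represents such a dataset as a NumPy matrix; the numbers are made up and NumPy is assumed to be installed.

    # A made-up 3 x 3 data matrix: rows are individuals, columns are
    # age, income and years of education.
    import numpy as np

    data = np.array([
        [25, 40000, 16],
        [32, 52000, 18],
        [47, 61000, 12],
    ])

    print("Shape (m x n):", data.shape)        # (3, 3)
    print("Column means:", data.mean(axis=0))  # average of each attribute
    print("Row 1, column 2:", data[1, 2])      # education of the 2nd person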
3. Linear Transformation
Matrices are key players in the realm of linear transformations, which are fundamental to
data manipulation and feature engineering. These transformations involve scaling,
rotating, reflecting, and translating data points. In data science, linear transformations
are utilized for data preprocessing and dimensionality reduction. For example, Principal
Component Analysis (PCA) leverages matrices to identify orthogonal axes that maximize
the variance in data, leading to effective dimensionality reduction.
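The following sketch shows how such a dimensionality reduction might look with scikit-learn's PCA on synthetic data; the dataset, the induced correlation and the number of components are all assumptions made for illustration.

    # Illustrative PCA: project 5-dimensional synthetic data onto 2 axes.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                            # 100 samples, 5 features
    X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # make two features correlated

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print("Reduced shape:", X_reduced.shape)                  # (100, 2)
    print("Explained variance ratio:", pca.explained_variance_ratio_)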
4. Matrix Operations
Matrices support a multitude of operations that are essential in data science:
Addition and subtraction
Matrices with the same dimensions can be added or subtracted element-wise,
facilitating tasks such as aggregating data from multiple sources.
5. Scalar Multiplication
Each element of a matrix can be multiplied by a scalar value, which can be useful for
scaling data.
6. Matrix Multiplication
Matrix multiplication is a central operation that combines the rows and columns of
matrices to produce a new matrix. The element at position (i, j) in the resulting matrix is
the dot product of the “i”-th row of the first matrix and the “j”-th column of the second
matrix. Matrix multiplication is crucial for composing linear transformations and forms
the foundation of various machine learning algorithms.
7. Transpose
The transpose of a matrix is obtained by interchanging its rows and columns. This
operation is valuable for solving systems of linear equations and for extracting features in
certain algorithms.
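A short NumPy sketch of these operations, using two small made-up matrices, might look as follows.

    # Element-wise and matrix operations on two small matrices.
    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])

    print(A + B)   # element-wise addition
    print(A - B)   # element-wise subtraction
    print(3 * A)   # scalar multiplication
    print(A @ B)   # matrix multiplication (dot products of rows and columns)
    print(A.T)     # transpose: rows and columns interchanged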
8. Eigenvectors and eigenvalues
Eigenvalues and eigenvectors are intrinsic properties of matrices with far-reaching
implications in data science. An eigenvector is a non-zero vector that remains in the same
direction after a linear transformation defined by a matrix. The corresponding eigenvalue
indicates the scaling factor of the eigenvector during this transformation. In data science,
eigenvalues and eigenvectors are employed in dimensionality reduction, such as in the
aforementioned PCA. By selecting the top eigenvectors, it is possible to capture the most
important information while reducing data dimensionality.
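The sketch below computes eigenvalues and eigenvectors of a small symmetric matrix with NumPy and verifies the defining property; the matrix itself is made up for illustration.

    # Eigen-decomposition of a small symmetric (covariance-like) matrix.
    import numpy as np

    C = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    eigenvalues, eigenvectors = np.linalg.eig(C)
    print("Eigenvalues:", eigenvalues)            # the scaling factors
    print("Eigenvectors (columns):", eigenvectors)

    # Check the defining property C v = lambda v for the first pair.
    v, lam = eigenvectors[:, 0], eigenvalues[0]
    print(np.allclose(C @ v, lam * v))            # True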
9. Matrix Factorization
Matrix factorization involves breaking down a matrix into the product of two or more
matrices. This technique has broad applications, from recommendation systems to
image processing. In collaborative filtering, matrices are factorized to uncover latent
factors that explain user-item relationships, forming the basis for personalized
recommendations.
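As an illustration of the idea, the sketch below factorizes a tiny, made-up user-item rating matrix with a truncated singular value decomposition (SVD); real recommendation systems use more elaborate factorization methods, so treat this purely as a sketch.

    # Truncated SVD of a made-up user-item rating matrix
    # (rows = users, columns = items).
    import numpy as np

    ratings = np.array([[5, 3, 0, 1],
                        [4, 0, 0, 1],
                        [1, 1, 0, 5],
                        [0, 0, 5, 4]], dtype=float)

    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    k = 2                                          # keep 2 latent factors
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(approx, 1))                     # low-rank approximation of the ratings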
10. Solving Linear Equations
Matrices are instrumental in solving systems of linear equations, which arise in various
data science scenarios. In regression analysis, for example, matrices are employed to
find the optimal parameters that best fit a linear model to the data. This forms the
foundation of predictive modeling.
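For instance, the parameters of a simple linear model can be obtained by solving a least-squares problem in matrix form; the sketch below does this with NumPy on synthetic data, with the true intercept and slope chosen arbitrarily.

    # Least-squares solution of a linear model in matrix form.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 50)
    y = 1.5 + 0.8 * x + rng.normal(0, 0.5, 50)

    # Design matrix: a column of ones (for the intercept) and the x values.
    X = np.column_stack([np.ones_like(x), x])
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("Estimated intercept and slope:", params)   # close to 1.5 and 0.8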
11. Image and Signal Processing
In image and signal processing, matrices are used to represent images and signals as
pixel values in a grid. Operations like convolution are applied to matrices to perform tasks
such as edge detection and feature extraction in images. Convolutional Neural Networks
(CNNs) use matrix convolutions to learn and recognize patterns in images.
12. Graphs and Networks
Matrices are used to represent relationships in graphs and networks. The adjacency
matrix, for instance, represents connections between nodes in a graph. Matrices like the
Laplacian matrix help analyze graph properties and identify clusters or communities
within networks.
Introduction: Descriptive statistics
In descriptive statistics you describe, present, summarize, and organize your data, either through numerical calculations or through graphs and tables. Some of the common measurements in descriptive statistics concern the central tendency and the variability of the dataset.
Descriptive statistical analysis helps us to understand our data and is a very important part of machine learning. Doing a descriptive statistical analysis of our dataset is absolutely crucial.
Measure of Central Tendency:
It describes a whole set of data with a single value that represents the centre of its
distribution. There are three main measures of central tendency:
1. Mean: It is the sum of the observations divided by the sample size. It is not a robust statistic, as it is affected by extreme values; very large or very low values (i.e. outliers) can distort the answer.
2. Median: It is the middle value of the data. It splits the data in half and is also called the 50th percentile. It is much less affected by outliers and skewed data than the mean. If the number of elements in the dataset is odd, the middle element is the median; if the number of elements is even, the median is the average of the two central elements.
3. Mode: It is the value that occurs most frequently in a dataset. A dataset has no mode if no value repeats, and it is also possible for a dataset to have more than one mode. It is the only measure of central tendency that can be used for categorical variables. (A short computational sketch of these measures follows this list.)
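A minimal sketch of these three measures, using Python's standard statistics module on a made-up sample:

    # Mean, median and mode of a small, made-up sample.
    import statistics

    data = [2, 3, 3, 5, 7, 9, 100]             # 100 is an outlier

    print("Mean:", statistics.mean(data))      # pulled upward by the outlier
    print("Median:", statistics.median(data))  # robust middle value: 5
    print("Mode:", statistics.mode(data))      # most frequent value: 3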
Measures of Variability
Measures of variability, also known as the spread of the data, describe how similar or varied the set of observations is. The most popular variability measures are the range, interquartile range (IQR), variance, and standard deviation.
1. Range: The range is the difference between the largest and the smallest points in your data. The bigger the range, the more spread out the data is.
2. IQR: The interquartile range (IQR) is a measure of statistical dispersion between the upper quartile (75th percentile, Q3) and the lower quartile (25th percentile, Q1). While the range measures where the beginning and end of your data points are, the interquartile range is a measure of where the majority of the values lie (a short computational sketch follows this list).
3. Variance: It is the average squared deviation from the mean. The variance is computed by finding the difference between every data point and the mean, squaring these differences, summing them up and then taking the average of those numbers. The problem with variance is that, because of the squaring, it is not in the same unit of measurement as the original data.
4. Standard Deviation: Standard Deviation is used more often because it is in the
original unit. It is simply the square root of the variance and because of that, it is
returned to the original unit of measurement.
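The sketch below computes the four variability measures with NumPy on a made-up sample; note that NumPy's var and std default to the population versions of these formulas.

    # Range, IQR, variance and standard deviation of a small sample.
    import numpy as np

    data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

    q1, q3 = np.percentile(data, [25, 75])
    print("Range:", data.max() - data.min())   # largest minus smallest value
    print("IQR:", q3 - q1)                     # Q3 minus Q1
    print("Variance:", data.var())             # average squared deviation from the mean
    print("Std deviation:", data.std())        # square root of the variance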
When you have a low standard deviation, your data points tend to be close to the mean.
A high standard deviation means that your data points are spread out over a wide range.
Standard deviation is best used when data is unimodal. In a normal distribution, approximately 34% of the data points lie between the mean and one standard deviation above or below the mean. Since a normal distribution is symmetrical, 68% of
the data points fall between one standard deviation above and one standard deviation
below the mean. Approximately 95% fall between two standard deviations below the
mean and two standard deviations above the mean. And approximately 99.7% fall
between three standard deviations above and three standard deviations below the
mean.
With the so-called "Z-score", you can check how many standard deviations below (or above) the mean a specific data point lies.
Probability
I will just give a brief introduction to probability. Before going to the actual definition of probability, let’s look at some terminology.
• Experiment: An experiment could be something like — whether it rains in
Delhi on a daily basis or not.
• Outcome: Outcome is the result of a single trial. If it rains today, the
outcome of today’s trial is “it rained”.
• Event: An event is one or more outcomes of an experiment. For the
experiment of whether it rains in Delhi every day the event could be “it
rained” or it didn’t rain.
• Probability: This is simply the likelihood of an event. So if there’s a 60% chance of it raining today, the probability of rain is 0.6.
Bernoulli Trials
An experiment which has exactly two outcomes, like a coin toss, is called a Bernoulli trial. The probability distribution of the number of successes in n Bernoulli trials is known as a binomial distribution.
The formula for the binomial distribution is given below:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)

where n is the number of trials, k the number of successes, p the probability of success in a single trial and C(n, k) the number of ways to choose k successes out of n trials.
Figure: Probability mass function of the binomial distribution for different probabilities of success and 100 random variables.
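Assuming SciPy is available, binomial probabilities can be computed directly rather than by hand; the numbers below (10 tosses of a fair coin) are just an example.

    # Binomial probabilities for n = 10 trials with success probability p = 0.5.
    from scipy.stats import binom

    n, p = 10, 0.5
    print("P(exactly 3 successes):", binom.pmf(3, n, p))
    print("P(at most 3 successes):", binom.cdf(3, n, p))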
The probability distribution of a continuous random variable (a variable that can assume any possible value between two points) is described by a probability density function. There are infinitely many possible values in the case of a continuous random variable.
Area under a probability density function gives the probability for the random variable to
be in that range.
If I have population data and I take random samples of equal size from the data, the sample means are approximately normally distributed.
Normal Distribution
It basically describes what large samples of data look like when they are plotted. It is sometimes called the "bell curve" or the "Gaussian curve".
Many methods in inferential statistics and in the calculation of probabilities assume that the data follow a normal distribution. This basically means that if your data is not normally distributed, you need to be very careful about which statistical tests you apply, since they could lead to wrong conclusions.
In a perfect normal distribution, each side is an exact mirror of the other, as in the figure below.
Figure: Normal distribution
In a normal distribution, the mean, mode and median are all equal and fall at the same
midline point.
The normal distribution function (probability density) is:

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard
normal distribution. Area under the standard normal distribution curve would be 1.
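Assuming SciPy is installed, areas under the standard normal curve, such as those in the 68-95-99.7 rule described earlier, can be computed directly:

    # Probability of falling within 1, 2 and 3 standard deviations of the mean.
    from scipy.stats import norm

    for k in (1, 2, 3):
        print(k, "sd:", norm.cdf(k) - norm.cdf(-k))   # ~0.68, ~0.95, ~0.997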
Central Limit Theorem
• If we take means of random samples from a distribution and plot those means, the resulting graph approaches a normal distribution when we have taken a sufficiently large number of such samples.
• The theorem also says that the mean of the sample means will be approximately equal to the population mean.
Normal distributions with higher standard deviations are flatter, i.e. more spread out, compared to those with lower standard deviations.
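A quick simulation sketch of the central limit theorem with NumPy, using an exponential distribution and sample sizes chosen arbitrarily for illustration:

    # Sample means of a skewed (exponential) distribution become
    # approximately normal, with mean close to the population mean.
    import numpy as np

    rng = np.random.default_rng(0)
    sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(5000)]

    print("Population mean:", 2.0)
    print("Mean of sample means:", np.mean(sample_means))   # close to 2.0
    print("Std of sample means:", np.std(sample_means))     # roughly 2 / sqrt(50)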
Z scores
The distance of an observed value from the mean, measured in number of standard deviations, is called the standard score or Z score. A positive Z score indicates that the observed value is Z standard deviations above the mean; a negative Z score indicates that the value is below the mean.
Observed value = µ + zσ [µ is the mean and σ is the standard deviation]
As seen above, the area within two standard deviations of the mean is about 0.95, which means there is a 0.95 probability of data lying within that range.
For a particular z score, we can look in the Z-table to find the probability that values fall below that particular z value.
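In practice the Z-table lookup can be replaced by a function call; the sketch below, assuming SciPy is available and using made-up numbers (mean 100, standard deviation 15, observed value 130), computes a z score and the corresponding probability.

    # Z score of an observation and the corresponding cumulative probability.
    from scipy.stats import norm

    mu, sigma, observed = 100.0, 15.0, 130.0
    z = (observed - mu) / sigma
    print("Z score:", z)                        # 2.0 standard deviations above the mean
    print("P(value < observed):", norm.cdf(z))  # about 0.977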