What Is Data Science?
Data science is the domain of study that deals with vast volumes of data using modern
tools and techniques to find unseen patterns, derive meaningful information, and make
business decisions. Data science uses complex machine learning algorithms to build
predictive models. More specifically, data science is used for complex data analysis,
predictive modeling, recommendation generation and data visualization.
1. Analysis of Complex Data
Data science allows for quick and precise analysis. With various software tools and
techniques at their disposal, data analysts can easily identify trends and detect patterns
within even the largest and most complex datasets. This enables businesses to make
better decisions, whether it’s regarding how to best segment customers or conducting a
thorough market analysis.
2. Predictive Modeling
Data science can also be used for predictive modeling. In essence, by finding patterns in
data through the use of machine learning, analysts can forecast possible future
outcomes with some degree of accuracy. These models are especially useful in
industries like insurance, marketing, healthcare and finance, where anticipating the
likelihood of certain events happening is central to the success of the business.
3. Recommendation Generation
Some companies, such as Netflix, Amazon and Spotify, rely on data science and big data
to generate recommendations for their users based on their past behavior. It’s thanks to
data science that users of these and similar platforms can be served up content that is
uniquely tailored to their preferences and interests.
4. Data Visualization
Data science is also used to create data visualizations — think graphs, charts,
dashboards — and reporting, which helps non-technical business leaders and busy
executives easily understand otherwise complex information about the state of their
business.
Data Science Lifecycle
Data science can be thought of as having a five-stage lifecycle:
1. Capture
This stage is when data scientists gather raw and unstructured data. The capture stage
typically includes data acquisition, data entry, signal reception and data extraction.
2. Maintain
This stage is when data is put into a form that can be utilized. The maintenance stage
includes data warehousing, data cleansing, data staging, data processing and data
architecture.
3. Process
This stage is when data is examined for patterns and biases to see how it will work as a
predictive analysis tool. The process stage includes data mining, clustering and
classification, data modeling and data summarization.
4. Analyze
This stage is when multiple types of analyses are performed on the data. The analysis stage involves exploratory and confirmatory analysis, predictive analysis, regression, text mining and qualitative analysis.
5. Communicate
This stage is when data scientists and analysts showcase the data through reports, charts and graphs. The communication stage typically includes data reporting, data visualization, business intelligence and decision making.
Big Data:
Big data is the huge, voluminous data, information, or relevant statistics acquired by large organizations and ventures. Because such data is too large to process manually, dedicated software tools and data stores have been created to handle it. Big data is used to discover patterns and trends and to make decisions related to human behavior and interaction with technology.
Advantages of Big Data:
1. Able to handle and process large and complex data sets that cannot be easily
managed with traditional database systems
2. Provides a platform for advanced analytics and machine learning applications
3. Enables organizations to gain insights and make data-driven decisions based on
large amounts of data
4. Offers potential for significant cost savings through efficient data management
and analysis.
Disadvantages of Big Data:
1. Requires specialized skills and expertise in data engineering, data management,
and big data tools and technologies
2. Can be expensive to implement and maintain due to the need for specialized
infrastructure and software
3. May face privacy and security concerns when handling sensitive data
4. Can be challenging to integrate with existing systems and processes.
Data Science vs Big Data: Major Comparison

Definition
Data Science: A discipline that covers all things data-related, including how to make the best use of big data. Data science is the main method for utilizing the potential of big data.
Big Data: A term used to describe massive quantities of data that are too complex and vast to be stored and handled by traditional data processing software. Big data encompasses all types of data, which aid in providing the appropriate information, to the appropriate person, in the appropriate quantity, to aid in making educated decisions.

Concept
Data Science: The capacity to collect data electronically led to the development of the field of data science, which combines the study of statistics with computer science to evaluate extremely large amounts of data that could result in the discovery of new information.
Big Data: Volume, variety, and velocity are the main Vs of big data. It represents a variety of factors, including data volumes, the complexity of data kinds and structures, and the rate at which new data is produced. Big data refers to data or information that may be used to examine insights and produce strategic business decisions and well-informed conclusions.

Purpose
Data Science: Utilizing new data structures, ideas, tools, and algorithms, data science aims to take advantage of big data’s potential.
Big Data: The ability of analysts to evaluate such enormous and complicated data sets was previously impossible; this is the true worth of big data. The goal is to assist organizations in developing fresh growth opportunities or gaining a sizable advantage over conventional company methods.

Formation (Tools)
Data Science: The primary tools used in data science include SAS, R, Python, etc.
Big Data: Hadoop, Spark, Flink, etc., are among the tools mostly used in big data.

Application Areas
Data Science: Mainly used for scientific purposes such as internet searches, digital advertisements, risk detection, etc.
Big Data: Mostly employed for commercial objectives and client satisfaction. A few application areas of big data are research and development, health and sports, telecommunication, etc.

Main Focus
Data Science: The science of the data.
Big Data: The process of handling voluminous data.

Approach
Data Science: It supports business decisions by using mathematics and statistics together with programming skills, which further helps create a model to test a hypothesis.
Big Data: With the help of big data, businesses track their market presence, which helps them develop agility.
Datafication
In business, datafication can be defined as a process that “aims to transform most
aspects of a business into quantifiable data that can be tracked, monitored, and
analyzed. It refers to the use of tools and processes to turn an organization into a data-
driven enterprise.”
There are three areas of business where datafication can really make an impact:
1. Analytics: In today’s data-driven world, analytics is king. By collecting and analyzing data, businesses can gain valuable insights into consumer behavior, trends, and preferences, allowing them to make informed decisions that drive growth and success.
2. Marketing Campaigns: Marketing campaigns can be supercharged with datafication, allowing companies to personalize ads and offers for specific customers based on their interests and behaviors.
3. Forecasting: Predictive analytics can help businesses forecast future trends and stay ahead of the competition by anticipating changes in consumer demand.
The Data Science Landscape
Data science is part of the computer sciences. It comprises the disciplines of (i) analytics, (ii) statistics and (iii) machine learning.
Analytics
Analytics generates insights from data using simple presentation, manipulation,
calculation or visualization of data. In the context of data science, it is also sometimes
referred to as exploratory data analytics. It often serves to familiarize oneself with the subject matter and to obtain some initial hints for further analysis. To this end, analytics is often used to formulate appropriate questions for a data science project.
The limitation of analytics is that it does not necessarily provide any conclusive evidence
for a cause-and-effect relationship. Also, the analytics process is typically a manual and
time-consuming process conducted by a human with limited opportunity for automation.
In today’s business world, many corporations do not go beyond descriptive analytics,
even though more sophisticated analytical disciplines can offer much greater value, such
as those laid out in the analytic value escalator.
Statistics
In many instances, analytics may be sufficient to address a given problem. In other
instances, the issue is more complex and requires a more sophisticated approach to
provide an answer, especially if there is a high-stakes decision to be made under
uncertainty. This is when statistics comes into play. Statistics provides a methodological
approach to answer questions raised by the analysts with a certain level of confidence.
Analysts help you ask good questions, whereas statisticians bring you good answers.
Statisticians bring rigor to the table.
Sometimes simple descriptive statistics are sufficient to provide the necessary insight.
Yet, on other occasions, more sophisticated inferential statistics — such as regression
analysis — are required to reveal relationships between cause and effect for a certain
phenomenon. The limitation of statistics is that it is traditionally conducted with software
packages, such as SPSS and SAS, which require a distinct calculation for a specific
problem by a statistician or trained professional. The degree of automation is rather
limited.
Machine Learning
Artificial intelligence refers to the broad idea that machines can perform tasks normally
requiring human intelligence, such as visual perception, speech recognition, decision-
making and translation between languages. In the context of data science, machine
learning can be considered a sub-field of artificial intelligence that is concerned with
decision making. In fact, in its most essential form, machine learning is decision making
at scale. Machine learning is the field of study of computer algorithms that allow
computer programs to identify and extract patterns from data. A common purpose of
machine learning algorithms is therefore to generalize and learn from data in order to
perform certain tasks.
In traditional programming, input data and a hand-crafted model (the program) are given to a computer in order to produce a desired output. In machine learning, an algorithm is applied to input and output data in order to identify the most suitable model. Machine learning can thus be complementary to traditional programming, as it can provide a useful model to explain a phenomenon.
Figure: Traditional Programming vs. Machine Learning (own illustration adapted from Prince Barpaga)
Machine Learning vs. Data Mining
The terms machine learning and data mining are closely related and often used
interchangeably. Data mining is a concept that pre-dates the current field of machine
learning. The idea of data mining — also referred to as Knowledge Discovery in Databases
(KDD) in the academic context — emerged in the late 1980s and early 1990s when the
need for analysing large datasets became apparent. Essentially, data mining refers to a
structured way of extracting insight from data which draws on machine learning
algorithms. The main difference lies in the fact that data mining is a rather manual
process that requires human intervention and decision making, while machine learning
— apart from the initial setup and fine-tuning — runs largely independently.
Supervised and Unsupervised Learning
The majority of machine learning algorithms can be categorized into supervised and
unsupervised learning. The main distinction between these types of machine learning is
that supervised learning is conducted on data which includes both the input and output
data. It is also often referred to as “labeled data” where the label is the target attribute.
The algorithm can therefore validate its model by checking against the correct output
value. Typical supervised machine learning algorithms perform regression and classification analysis. Conversely, in unsupervised machine learning, the dataset does not include the target attribute; the data is thus unlabeled. The most common type of unsupervised learning is cluster analysis.
Other than the main streams of supervised and unsupervised ML algorithms, there are
additional variations, such as semi-supervised and reinforcement learning algorithms. In
semi-supervised learning a small amount of labeled data is used to bolster a larger set of
unlabeled data. Reinforcement learning trains an algorithm with a reward system,
providing feedback when an artificial intelligence agent performs the best action in a
particular situation.
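As an illustration, the following sketch contrasts supervised and unsupervised learning on a small synthetic dataset. It assumes scikit-learn is available; the data, model choices and parameters are made up purely for demonstration.

    # Minimal sketch contrasting supervised and unsupervised learning,
    # assuming scikit-learn is installed; the data here is synthetic.
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    # Synthetic data: X are the input features, y the (known) labels.
    X, y = make_blobs(n_samples=200, centers=3, random_state=0)

    # Supervised learning: the algorithm sees both inputs X and labels y.
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print("Training accuracy (supervised):", clf.score(X, y))

    # Unsupervised learning: only X is used; the labels are never shown.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("Cluster labels of first 10 points (unsupervised):", km.labels_[:10])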
Types of ML Problems — Regression, Classification and Clustering
In order to structure the field of machine learning, the vast number of ML algorithms are
often grouped by similarity in terms of their function (how they work), e.g. tree-based and
neural network-inspired methods. Given the large number of different algorithms, this
approach is rather complex. Instead, it is considered more useful to group ML algorithms
by the type of problem they are supposed to solve. The most common types of ML
problems are regression, classification and clustering. There are numerous specific ML
algorithms, most of which come with a lot of different variations to address these
problems. Some algorithms are capable of solving more than one problem.
Regression
Regression is a supervised ML approach used to predict a continuous value. The outcome of a regression analysis is a formula (or model) that relates one or many independent variables to a dependent target value. There are many different types of regression models, such as linear regression, logistic regression, ridge regression, lasso regression and polynomial regression. However, by far the most popular model for making predictions is the linear regression model. The basic formula for a univariate linear regression model is shown underneath:

y = β0 + β1x + ε

where y is the dependent target value, x the independent variable, β0 the intercept, β1 the slope and ε the error term.
Other regression models, although they share some resemblance to linear regression,
are more suited for classification, such as logistic regression [1]. Regression problems, i.e. forecasting or predicting a numerical value, can also be solved by artificial neural
networks which are inspired by the structure and/or function of biological neural
networks. They are an enormous subfield comprised of hundreds of algorithms and
variations used commonly for regression and classification problems. A neural network
is favored over regression models if there is a large number of variables. Like artificial
neural networks, regression and classification tasks can also be achieved by the k-
nearest neighbor algorithm.
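To make this concrete, here is a minimal sketch of fitting a univariate linear regression, assuming NumPy and scikit-learn are available; the data is synthetic and the true coefficients are invented for illustration.

    # Univariate linear regression on synthetic data (illustrative only).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(42)
    x = rng.uniform(0, 10, size=100).reshape(-1, 1)    # independent variable
    y = 1.0 + 2.5 * x.ravel() + rng.normal(0, 1, 100)  # y = b0 + b1*x + noise

    model = LinearRegression().fit(x, y)
    print("Estimated intercept b0:", model.intercept_)  # close to 1.0
    print("Estimated slope b1:", model.coef_[0])        # close to 2.5
    print("Prediction for x = 4:", model.predict([[4.0]])[0])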
Matrices in Data Science
Matrices are a foundational concept in data science that underpins a wide range of
mathematical and computational operations used for analyzing and manipulating data.
They provide a structured and organized way to represent information, making it easier to
process and extract meaningful insights. In this comprehensive explanation, we’ll delve
deeper into matrices in the context of data science, exploring their properties,
operations, and applications.
1. Matrix Basics
A matrix is a two-dimensional array of numbers arranged in rows and columns. Each
element in a matrix is identified by its row and column index. A matrix with “m” rows and
“n” columns is often referred to as an “m x n” matrix. Matrices are used to represent
datasets, where each row corresponds to an observation or sample, and each column
represents a feature or attribute of that sample. This structured representation makes it
convenient to apply mathematical operations and transformations to the data.
2. Data Representation
In data science, matrices serve as a powerful tool for representing datasets. Consider a
dataset containing information about various individuals, such as age, income, and
education level. By organizing this data into a matrix, where each row corresponds to an
individual and each column represents a different attribute, we create a structured
representation that facilitates analysis. This tabular arrangement simplifies operations like finding averages and correlations, and performing statistical analyses.
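As a small illustration, the sketch below represents such a dataset as a NumPy matrix; the numbers are made up and NumPy is assumed to be installed.

    # A made-up 3 x 3 data matrix: rows are individuals, columns are
    # age, income and years of education.
    import numpy as np

    data = np.array([
        [25, 40000, 16],
        [32, 52000, 18],
        [47, 61000, 12],
    ])

    print("Shape (m x n):", data.shape)        # (3, 3)
    print("Column means:", data.mean(axis=0))  # average of each attribute
    print("Row 1, column 2:", data[1, 2])      # education of the 2nd person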
3. Linear Transformation
Matrices are key players in the realm of linear transformations, which are fundamental to
data manipulation and feature engineering. These transformations involve scaling,
rotating, reflecting, and translating data points. In data science, linear transformations
are utilized for data preprocessing and dimensionality reduction. For example, Principal
Component Analysis (PCA) leverages matrices to identify orthogonal axes that maximize
the variance in data, leading to effective dimensionality reduction.
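The following sketch shows how such a dimensionality reduction might look with scikit-learn's PCA on synthetic data; the dataset, the induced correlation and the number of components are all assumptions made for illustration.

    # Illustrative PCA: project 5-dimensional synthetic data onto 2 axes.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                            # 100 samples, 5 features
    X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # make two features correlated

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print("Reduced shape:", X_reduced.shape)                  # (100, 2)
    print("Explained variance ratio:", pca.explained_variance_ratio_)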
4. Matrix Operations
Matrices support a multitude of operations that are essential in data science:
Addition and subtraction
Matrices with the same dimensions can be added or subtracted element-wise,
facilitating tasks such as aggregating data from multiple sources.
5. Scalar Multiplication
Each element of a matrix can be multiplied by a scalar value, which can be useful for
scaling data.
6. Matrix Multiplication
Matrix multiplication is a central operation that combines the rows and columns of
matrices to produce a new matrix. The element at position (i, j) in the resulting matrix is
the dot product of the “i”-th row of the first matrix and the “j”-th column of the second
matrix. Matrix multiplication is crucial for composing linear transformations and forms
the foundation of various machine learning algorithms.
7. Transpose
The transpose of a matrix is obtained by interchanging its rows and columns. This
operation is valuable for solving systems of linear equations and for extracting features in
certain algorithms.
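A short NumPy sketch of these operations, using two small made-up matrices, might look as follows.

    # Element-wise and matrix operations on two small matrices.
    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])

    print(A + B)   # element-wise addition
    print(A - B)   # element-wise subtraction
    print(3 * A)   # scalar multiplication
    print(A @ B)   # matrix multiplication (dot products of rows and columns)
    print(A.T)     # transpose: rows and columns interchanged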
8. Eigenvectors and eigenvalues
Eigenvalues and eigenvectors are intrinsic properties of matrices with far-reaching
implications in data science. An eigenvector is a non-zero vector that remains in the same
direction after a linear transformation defined by a matrix. The corresponding eigenvalue
indicates the scaling factor of the eigenvector during this transformation. In data science,
eigenvalues and eigenvectors are employed in dimensionality reduction, such as in the
aforementioned PCA. By selecting the top eigenvectors, it is possible to capture the most
important information while reducing data dimensionality.
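The sketch below computes eigenvalues and eigenvectors of a small symmetric matrix with NumPy and verifies the defining property; the matrix itself is made up for illustration.

    # Eigen-decomposition of a small symmetric (covariance-like) matrix.
    import numpy as np

    C = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    eigenvalues, eigenvectors = np.linalg.eig(C)
    print("Eigenvalues:", eigenvalues)            # the scaling factors
    print("Eigenvectors (columns):", eigenvectors)

    # Check the defining property C v = lambda v for the first pair.
    v, lam = eigenvectors[:, 0], eigenvalues[0]
    print(np.allclose(C @ v, lam * v))            # True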
9. Matrix Factorization
Matrix factorization involves breaking down a matrix into the product of two or more
matrices. This technique has broad applications, from recommendation systems to
image processing. In collaborative filtering, matrices are factorized to uncover latent
factors that explain user-item relationships, forming the basis for personalized
recommendations.
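As an illustration of the idea, the sketch below factorizes a tiny, made-up user-item rating matrix with a truncated singular value decomposition (SVD); real recommendation systems use more elaborate factorization methods, so treat this purely as a sketch.

    # Truncated SVD of a made-up user-item rating matrix
    # (rows = users, columns = items).
    import numpy as np

    ratings = np.array([[5, 3, 0, 1],
                        [4, 0, 0, 1],
                        [1, 1, 0, 5],
                        [0, 0, 5, 4]], dtype=float)

    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    k = 2                                          # keep 2 latent factors
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(approx, 1))                     # low-rank approximation of the ratings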
10. Solving Linear Equations
Matrices are instrumental in solving systems of linear equations, which arise in various
data science scenarios. In regression analysis, for example, matrices are employed to
find the optimal parameters that best fit a linear model to the data. This forms the
foundation of predictive modeling.
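For instance, the parameters of a simple linear model can be obtained by solving a least-squares problem in matrix form; the sketch below does this with NumPy on synthetic data, with the true intercept and slope chosen arbitrarily.

    # Least-squares solution of a linear model in matrix form.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 50)
    y = 1.5 + 0.8 * x + rng.normal(0, 0.5, 50)

    # Design matrix: a column of ones (for the intercept) and the x values.
    X = np.column_stack([np.ones_like(x), x])
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("Estimated intercept and slope:", params)   # close to 1.5 and 0.8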
11. Image and Signal Processing
In image and signal processing, matrices are used to represent images and signals as
pixel values in a grid. Operations like convolution are applied to matrices to perform tasks
such as edge detection and feature extraction in images. Convolutional Neural Networks
(CNNs) use matrix convolutions to learn and recognize patterns in images.
12. Graphs and Networks
Matrices are used to represent relationships in graphs and networks. The adjacency
matrix, for instance, represents connections between nodes in a graph. Matrices like the
Laplacian matrix help analyze graph properties and identify clusters or communities
within networks.
Introduction: Descriptive statistics
In descriptive statistics you describe, present, summarize, and organize your data, either through numerical calculations or through graphs and tables. Some of the common measurements in descriptive statistics concern the central tendency and the variability of the dataset.
Descriptive statistical analysis helps us to understand our data and is a very important part of machine learning. Doing a descriptive statistical analysis of our dataset is absolutely crucial.
Measure of Central Tendency:
It describes a whole set of data with a single value that represents the centre of its
distribution. There are three main measures of central tendency:
1. Mean: It is the sum of the observations divided by the sample size. It is not a robust statistic, as it is affected by extreme values; very large or very low values (i.e. outliers) can distort the answer.
2. Median: It is the middle value of the data. It splits the data in half and is also called the 50th percentile. It is much less affected by outliers and skewed data than the mean. If the number of elements in the dataset is odd, the middle element is the median; if the number of elements is even, the median is the average of the two central elements.
3. Mode: It is the value that occurs most frequently in a dataset. A dataset has no mode if no value repeats, and it is also possible for a dataset to have more than one mode. It is the only measure of central tendency that can be used for categorical variables. (A short computational sketch of these measures follows this list.)
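A minimal sketch of these three measures, using Python's standard statistics module on a made-up sample:

    # Mean, median and mode of a small, made-up sample.
    import statistics

    data = [2, 3, 3, 5, 7, 9, 100]             # 100 is an outlier

    print("Mean:", statistics.mean(data))      # pulled upward by the outlier
    print("Median:", statistics.median(data))  # robust middle value: 5
    print("Mode:", statistics.mode(data))      # most frequent value: 3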
Measures of Variability
Measures of variability, also known as the spread of the data, describe how similar or varied the set of observations is. The most popular variability measures are the range, interquartile range (IQR), variance, and standard deviation.
1. Range: The range is the difference between the largest and the smallest points in your data. The bigger the range, the more spread out the data is.
2. IQR: The interquartile range (IQR) is a measure of statistical dispersion between the upper quartile (75th percentile, Q3) and the lower quartile (25th percentile, Q1). While the range measures where the beginning and end of your data points are, the interquartile range is a measure of where the majority of the values lie (a short computational sketch follows this list).
3. Variance: It is the average squared deviation from the mean. The variance is computed by finding the difference between every data point and the mean, squaring these differences, summing them up and then taking the average of those numbers. The problem with variance is that, because of the squaring, it is not in the same unit of measurement as the original data.
4. Standard Deviation: Standard Deviation is used more often because it is in the
original unit. It is simply the square root of the variance and because of that, it is
returned to the original unit of measurement.
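The sketch below computes the four variability measures with NumPy on a made-up sample; note that NumPy's var and std default to the population versions of these formulas.

    # Range, IQR, variance and standard deviation of a small sample.
    import numpy as np

    data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

    q1, q3 = np.percentile(data, [25, 75])
    print("Range:", data.max() - data.min())   # largest minus smallest value
    print("IQR:", q3 - q1)                     # Q3 minus Q1
    print("Variance:", data.var())             # average squared deviation from the mean
    print("Std deviation:", data.std())        # square root of the variance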
When you have a low standard deviation, your data points tend to be close to the mean.
A high standard deviation means that your data points are spread out over a wide range.
Standard deviation is best used when data is unimodal. In a normal distribution, approximately 34% of the data points lie between the mean and one standard deviation above or below the mean. Since a normal distribution is symmetrical, 68% of
the data points fall between one standard deviation above and one standard deviation
below the mean. Approximately 95% fall between two standard deviations below the
mean and two standard deviations above the mean. And approximately 99.7% fall
between three standard deviations above and three standard deviations below the
mean.
With the so-called "Z-score", you can check how many standard deviations below (or above) the mean a specific data point lies.
Probability
I will just give a brief introduction to probability. Before going to the actual definition of probability, let’s look at some terminology.
• Experiment: An experiment could be something like — whether it rains in
Delhi on a daily basis or not.
• Outcome: Outcome is the result of a single trial. If it rains today, the
outcome of today’s trial is “it rained”.
• Event: An event is one or more outcomes of an experiment. For the
experiment of whether it rains in Delhi every day the event could be “it
rained” or it didn’t rain.
• Probability: This is simply the likelihood of an event. So if there’s a 60% chance of it raining today, the probability of rain is 0.6.
Bernoulli Trials
An experiment which has exactly two outcomes, like a coin toss, is called a Bernoulli trial. The probability distribution of the number of successes in n Bernoulli trials is known as a binomial distribution.
The formula for the binomial distribution is given below:

P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)

where n is the number of trials, k the number of successes, p the probability of success in a single trial and C(n, k) the number of ways to choose k successes out of n trials.
Figure: Probability mass function of the binomial distribution for different probabilities of success and 100 random variables.
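Assuming SciPy is available, binomial probabilities can be computed directly rather than by hand; the numbers below (10 tosses of a fair coin) are just an example.

    # Binomial probabilities for n = 10 trials with success probability p = 0.5.
    from scipy.stats import binom

    n, p = 10, 0.5
    print("P(exactly 3 successes):", binom.pmf(3, n, p))
    print("P(at most 3 successes):", binom.cdf(3, n, p))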
The probability distribution of a continuous random variable (a variable that can assume any possible value between two points) is described by a probability density function. There are infinitely many possible values in the case of a continuous random variable.
Area under a probability density function gives the probability for the random variable to
be in that range.
If I have population data and I take random samples of equal size from the data, the sample means are approximately normally distributed.
Normal Distribution
It basically describes what large samples of data look like when they are plotted. It is sometimes called the "bell curve" or the "Gaussian curve".
Many methods in inferential statistics and in the calculation of probabilities assume that the data follow a normal distribution. This basically means that if your data is not normally distributed, you need to be very careful about which statistical tests you apply, since they could lead to wrong conclusions.
In a perfect normal distribution, each side is an exact mirror of the other, as in the figure below.
Figure: Normal distribution
In a normal distribution, the mean, mode and median are all equal and fall at the same
midline point.
The normal distribution function (probability density) is:

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard
normal distribution. Area under the standard normal distribution curve would be 1.
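Assuming SciPy is installed, areas under the standard normal curve, such as those in the 68-95-99.7 rule described earlier, can be computed directly:

    # Probability of falling within 1, 2 and 3 standard deviations of the mean.
    from scipy.stats import norm

    for k in (1, 2, 3):
        print(k, "sd:", norm.cdf(k) - norm.cdf(-k))   # ~0.68, ~0.95, ~0.997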
Central Limit Theorem
• If we take means of random samples from a distribution and plot those means, the resulting graph approaches a normal distribution when we have taken a sufficiently large number of such samples.
• The theorem also says that the mean of the sample means will be approximately equal to the population mean.
Normal distributions with higher standard deviations are flatter, i.e. more spread out, compared to those with lower standard deviations.
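A quick simulation sketch of the central limit theorem with NumPy, using an exponential distribution and sample sizes chosen arbitrarily for illustration:

    # Sample means of a skewed (exponential) distribution become
    # approximately normal, with mean close to the population mean.
    import numpy as np

    rng = np.random.default_rng(0)
    sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(5000)]

    print("Population mean:", 2.0)
    print("Mean of sample means:", np.mean(sample_means))   # close to 2.0
    print("Std of sample means:", np.std(sample_means))     # roughly 2 / sqrt(50)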
Z scores
The distance of an observed value from the mean, measured in number of standard deviations, is called the standard score or Z score. A positive Z score indicates that the observed value is Z standard deviations above the mean; a negative Z score indicates that the value is below the mean.
Observed value = µ + zσ [µ is the mean and σ is the standard deviation]
As seen above, the area within two standard deviations of the mean is about 0.95, which means there is a 0.95 probability of data lying within that range.
For a particular z score, we can look in the Z-table to find the probability that values fall below that particular z value.
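In practice the Z-table lookup can be replaced by a function call; the sketch below, assuming SciPy is available and using made-up numbers (mean 100, standard deviation 15, observed value 130), computes a z score and the corresponding probability.

    # Z score of an observation and the corresponding cumulative probability.
    from scipy.stats import norm

    mu, sigma, observed = 100.0, 15.0, 130.0
    z = (observed - mu) / sigma
    print("Z score:", z)                        # 2.0 standard deviations above the mean
    print("P(value < observed):", norm.cdf(z))  # about 0.977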