Savitribai Phule Pune University
Honours* in Data Science
Third Year of Engineering (2019 Course)
310501: DATA SCIENCE AND
VISUALIZATION
Teaching Scheme: TH: 04 Hours/Week, Credit: 04
Examination Scheme: In-Sem (Paper): 30 Marks, End-Sem (Paper): 70 Marks
Department of Engineering
Course Objectives:
• To learn data collection and preprocessing techniques for data science
• To understand and practice analytical methods for solving real-life problems
• To study data exploration techniques
• To learn different types of data and its visualization
• To study different data visualization techniques and tools
• To map elements of visualization effectively to perceive information
AGENDA
Unit I
INTRODUCTION: DATA SCIENCE AND VISUALIZATION
07 Hours
Defining data science and big data, Recognizing the different types of data,
Gaining insight into the data science process,
Data Science Process: Overview, Different steps, Machine Learning Definition
and Relation with Data Science
Data All Around
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social networks
– Cloud
Data and Big Data
•“90% of the world’s data was generated in the last few years.”
• Due to the advent of new technologies, devices, and communication
means like social networking sites, the amount of data produced by
mankind is growing rapidly every year.
• The amount of data produced by mankind from the beginning of time until 2003 was 5 billion gigabytes. Piled up in the form of disks, it could fill an entire football field.
• The same amount was created every two days in 2011, and every six minutes by 2016. This rate is still growing enormously.
Big Data Definition
No single standard definition… “Big Data” is data whose scale, diversity,
and complexity require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden knowledge from it…
What is Big Data
• Big Data is a collection of large datasets that cannot be
processed using traditional computing techniques.
• It is not a single technique or a tool; rather, it involves many areas of business and technology.
Big Data
• Big Data is any data that is expensive to manage
and hard to extract value from
– Volume
• The size of the data
– Velocity
• The latency of data processing relative to the
growing demand for interactivity
– Variety and Complexity
• The diversity of sources, formats, quality,
structures.
Big Data
• By mining huge quantities of structured and unstructured data, organizations can:
– reduce costs
– raise efficiency
– identify new market opportunities
– enhance their competitive advantage
DATA SCIENCE AND BIG DATA
Data Scientists
❖ Convert the organization's raw data into useful information.
❖ Manage and understand large amounts of data.
❖ Create data visualization models that facilitate demonstrating the business value of digital information.
❖ Can illustrate digital information easily with the help of smartphones, Internet of Things devices, and social media.
What Kinds of Problems Does Data Science Solve?
Data and its structure
• Data comes in many forms, but at a high level, it falls into three
categories: structured, semi-structured, and unstructured.
• Structured data:
- highly organized data
- exists within a repository such as a database (or a comma-
separated values [CSV] file).
- easily accessible.
- format of the data makes it appropriate for queries and
computation (by using languages such as Structured Query
Language (SQL)).
• Unstructured data: lacks any content structure at all (for example, an audio stream or natural language text).
• Semi-structured data: includes metadata, or data that can be more easily processed than unstructured data by using semantic tagging.
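A small Python sketch of the three categories, using invented records:

```python
import io
import json

import pandas as pd

# Structured: rows and columns in a CSV, directly queryable.
csv_text = "id,name,age\n1,Asha,34\n2,Ravi,29"
df = pd.read_csv(io.StringIO(csv_text))
print(df[df["age"] > 30])               # SQL-like filtering on a column

# Semi-structured: JSON keys act as semantic tags (metadata).
record = json.loads('{"id": 3, "name": "Meera", "tags": ["vip"]}')
print(record["tags"])                   # ['vip']

# Unstructured: free text has no inherent structure to query.
review = "Delivery was late, but the product itself is excellent."
```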
Data and its structure
Data engineering
• Data wrangling:
• The process of manipulating raw data to make it useful for data analytics or to train a machine learning model.
• It includes:
– sourcing the data from one or more data sets (in addition to reducing the set to the required data),
– normalizing the data so that data merged from multiple data sets is consistent,
– parsing data into some structure or storage for further use.
• It is the process by which you identify, collect, merge, and preprocess one or more data sets in preparation for data cleansing.
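A minimal wrangling sketch in Python with pandas, assuming two hypothetical files orders.csv and customers.csv that share a cust_id column:

```python
import pandas as pd

# Source the data from two hypothetical data sets.
orders = pd.read_csv("orders.csv")        # e.g., order_id, cust_id, amount
customers = pd.read_csv("customers.csv")  # e.g., cust_id, region

# Reduce the set to the required data before merging.
orders = orders[["order_id", "cust_id", "amount"]]

# Merge the data sets on their shared key.
merged = orders.merge(customers, on="cust_id", how="inner")

# Normalize so that merged values are consistent (uniform casing here).
merged["region"] = merged["region"].str.strip().str.lower()

# Parse into a typed structure/storage for further use.
merged["amount"] = pd.to_numeric(merged["amount"], errors="coerce")
merged.to_csv("wrangled.csv", index=False)
```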
Data cleansing
• After you have collected and merged your data set, the next step
is cleansing.
• Data sets in the wild are typically messy and infected with any number of common issues, including missing values (or too many values), bad or incorrect delimiters (which segregate the data), inconsistent records, or insufficient parameters.
• Once the data set is syntactically correct, the next step is to ensure that it is semantically correct.
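A brief cleansing sketch, assuming a hypothetical raw.csv with ';' delimiters, duplicate rows, and inconsistent records:

```python
import pandas as pd

# Read a messy file whose fields are segregated by ';' delimiters.
df = pd.read_csv("raw.csv", sep=";", skipinitialspace=True)

# Remove exact duplicate records.
df = df.drop_duplicates()

# Standardize inconsistent categorical values ("NY", " ny " -> "ny").
df["state"] = df["state"].str.strip().str.lower()

# Semantic check: ages outside a plausible range are marked missing.
df.loc[~df["age"].between(0, 120), "age"] = float("nan")
```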
Data preparation/preprocessing
• The final step in data engineering.
• This step assumes that you have a cleansed data set that might not yet be ready for processing by a machine learning algorithm.
• Using normalization, you transform an input feature to distribute the data evenly into a range acceptable to the machine learning algorithm.
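A sketch of min-max normalization, one common way to rescale a feature into an acceptable range (the income figures are invented):

```python
import numpy as np

def min_max_normalize(x, lo=0.0, hi=1.0):
    """Rescale a feature linearly into the range [lo, hi]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:                        # constant feature: map to lo
        return np.full_like(x, lo)
    return lo + (x - x.min()) * (hi - lo) / span

incomes = [25_000, 48_000, 110_000, 62_000]
print(min_max_normalize(incomes))        # all values now lie in [0, 1]
```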
Machine learning
• Create and validate a machine learning model.
• Sometimes, the machine learning model is the product, which is deployed in the context of an application to provide some capability (such as classification or prediction).
• In other cases, the product isn't the trained machine learning algorithm but rather the data that it produces.
Model learning
• In one model, the algorithm processes the data and creates a new data product as the result.
• But in a production sense, the machine learning model is the product itself, deployed to provide insight or add value (such as the deployment of a neural network to provide prediction capabilities for an insurance market).
Machine learning approaches
• Supervised learning
• Unsupervised learning
• Reinforcement learning
1. Supervised learning:
– The algorithm is trained to produce the correct class, and the model is altered when it fails to do so.
– The model is trained until it reaches some level of accuracy.
2. Unsupervised learning:
– Has no class labels; instead, the algorithm inspects the data and groups it based on some structure that is hidden within the data.
– These types of algorithms can be used in recommendation systems, for example by grouping customers based on their viewing or purchasing history.
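For instance, an unsupervised algorithm such as k-means (chosen here for illustration; the purchase-history features are invented) can group customers without any labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per customer: [orders per month, average basket value].
X = np.array([[1, 20], [2, 25], [1, 22],       # occasional buyers
              [9, 210], [8, 190], [10, 230]])  # frequent big spenders

# No classes are given; the algorithm finds the hidden grouping.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)   # cluster id per customer, usable for recommendations
```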
Reinforcement learning
• A semi-supervised learning approach.
• Provides a reward after the model makes some number of decisions that lead to a satisfactory result.
Model validation
• Used to understand how the model will behave in production, after the model is trained.
• For that purpose, a small amount of the available data is reserved to be tested against the final model (called test data).
• The training data is used to train the machine learning model; the test data is used when the model is complete, to validate how well it generalizes to unseen data.
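A minimal validation sketch with scikit-learn, reserving 20% of a standard labelled data set as test data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve a small amount of the available data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the training data only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score on unseen data to estimate how well the model generalizes.
print(model.score(X_test, y_test))
```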
Operations
• The end goal of the data science pipeline.
• Creating a visualization for the data product.
• Deploying the machine learning model in a production environment to operate on unseen data to provide prediction or classification.
• Model deployment:
• When the product of the machine learning phase is a model, it is deployed into some production environment to apply to new data.
• This model could be, for example, a prediction system.
Operations
• Model visualization:
• In smaller-scale data science, the product is data rather than a model produced in the machine learning phase.
• The data product answers some questions about the original data set.
• Options for visualization are vast and can be produced with, for example, the R programming language.
Business Intelligence vs. Data Science
Data Science:
I. Data science is a field in which information and knowledge are extracted from data by using various scientific methods, algorithms, and processes.
II. It can thus be defined as a combination of mathematical tools, algorithms, statistics, and machine learning techniques that are used to find hidden patterns and insights in data, which helps in the decision-making process.
III. Data science deals with both structured and unstructured data.
IV. It is related to both data mining and big data.
V. Data science involves studying historic trends and using its conclusions to redefine present trends as well as predict future trends.
Business Intelligence:
I. Business intelligence (BI) is a set of technologies, applications, and processes that are used by enterprises for business data analysis.
II. It is used to convert raw data into meaningful information, which is then used for business decision making and profitable actions.
III. It deals with the analysis of structured and sometimes unstructured data, which paves the way for new and profitable business opportunities.
IV. It supports decision making based on facts rather than on assumptions.
V. Thus it has a direct impact on the business decisions of an enterprise. Business intelligence tools enhance the chances of an enterprise entering a new market, and they also help in studying the impact of marketing efforts.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.
• Steps Involved in Data Preprocessing:
• 1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
• (a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
• Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.
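A short pandas sketch of both options, on an invented data frame:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 41, None, 33],
                   "city": ["Pune", "Mumbai", None, "Pune", "Pune"]})

# Option 1: ignore (drop) the tuples with missing values -- only
# sensible when the data set is large and few rows are affected.
dropped = df.dropna()

# Option 2: fill numeric gaps with the attribute mean, and
# categorical gaps with the most probable (modal) value.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```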
• (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately: one can replace all data in a segment by its mean, or boundary values can be used to complete the task.
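A sketch of both smoothing variants on a small sorted sample, with a segment size of 4 (an arbitrary choice):

```python
import numpy as np

# Sorted data divided into equal-size segments (bins) of 4 values.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = values.reshape(-1, 4)

# Smoothing by bin means: each value becomes its segment's mean.
by_means = np.repeat(bins.mean(axis=1), 4)
print(by_means)

# Smoothing by bin boundaries: each value snaps to the nearer of the
# segment's two boundary values (ties go to the upper boundary).
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo < hi - bins, lo, hi)
print(by_bounds)
```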
• Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
• Clustering:
This approach groups similar data into clusters. Outliers may then go undetected, or they will fall outside the clusters.
• 2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
• Normalization:
It is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0).
• Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
• Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels (see the sketch after this list).
• Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
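A sketch of both steps in pandas; the age ranges and the city-to-country table are invented:

```python
import pandas as pd

# Discretization: replace raw numeric ages with conceptual levels.
ages = pd.Series([5, 17, 23, 38, 45, 61, 70, 83])
levels = pd.cut(ages, bins=[0, 18, 60, 120],
                labels=["minor", "adult", "senior"])
print(levels.tolist())

# Concept hierarchy generation: lift "city" up to "country".
city_to_country = {"Pune": "India", "Mumbai": "India", "Paris": "France"}
cities = pd.Series(["Pune", "Paris", "Mumbai"])
print(cities.map(city_to_country).tolist())   # ['India', 'France', 'India']
```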
• 3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data, and while working with huge volumes of data, analysis becomes harder. To address this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
• The various steps to data reduction are:
• Data Cube Aggregation:
Aggregation operations are applied to the data to construct the data cube, as in the sketch below.
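One way to picture cube aggregation is a group-by that rolls detail rows up to a coarser level; the sales figures below are invented:

```python
import pandas as pd

sales = pd.DataFrame({"year": [2022, 2022, 2023, 2023],
                      "region": ["east", "west", "east", "west"],
                      "amount": [120, 90, 150, 110]})

# Aggregate the (year, region) detail up one level: totals per year.
cube_year = sales.groupby("year")["amount"].sum()
print(cube_year)   # 2022 -> 210, 2023 -> 260
```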
• Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute having a p-value greater than the significance level can be discarded.
• Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models, as in the sketch below.
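A numerosity-reduction sketch: fit a linear regression to invented readings and store only its two coefficients instead of all 5,000 points:

```python
import numpy as np

# Hypothetical readings: 5,000 noisy (x, y) points along a line.
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 5000)
y = 3.2 * x + 7.5 + rng.normal(0, 2, x.size)

# Keep only the fitted model (slope, intercept), not the raw data.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)              # close to 3.2 and 7.5

# Any point can later be approximated from the stored model alone.
estimate = slope * 42.0 + intercept
```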
• Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
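A PCA sketch with scikit-learn, projecting the 4-feature iris data onto its 2 strongest principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)           # 150 samples, 4 features

# Keep the 2 components that retain the most variance (lossy reduction).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```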
• There are many online data sources where you can get free data sets to use in your projects. Some of these sources are listed below; you can download their data and use it for free. So whether you want to do data visualization, data cleaning, machine learning, or any other type of project, there is a data set for you!
• 1. Google Cloud Public Datasets
• Google is not just a search engine, it’s much more! There are many public
data sets that you can access on the Google cloud and analyze to obtain
new insights from this data. There are more than 100 datasets and all of
them are hosted by BigQuery and Cloud Storage. You can also use Google’s
Machine Learning capabilities to analyze the data sets such as BigQuery ML,
Vision AI, Cloud AutoML, etc. You can also use Google Data Studio to create
data visualizations and interactive dashboards so that you can obtain better
insights and find patterns in the data. Google Cloud Public Datasets has data
from various data providers such as GitHub, United States Census Bureau,
NASA, Bitcoin, US Department of Transportation, etc. You can access these data sets for free and get free query access to about 1 TB of data per month in BigQuery.
• 2. Amazon Web Services Open Data Registry
• 3. Kaggle:
There are around 23,000 public datasets on Kaggle that you can download
for free. In fact, many of these datasets have been downloaded millions of
times already. You can use the search box to search for public datasets on
whatever topic you want ranging from health to science to popular
cartoons! You can also create new public datasets on Kaggle and those may
earn you medals and also lead you towards advanced Kaggle titles like
Expert, Master, and Grandmaster. You can also download competition data sets from Kaggle while participating in these competitions. The competition data sets are much more detailed, curated, and well cleaned than the public data sets available on Kaggle, so you might have to sort through the public ones.
But all in all, if you are interested in Data Science, then Kaggle is the place
for you!
• 4. UCI Machine Learning Repository
The UCI Machine Learning Repository is a great place to look for interesting
data sets as it is one of the first and oldest data sources available on the
internet (It was created in 1987!). These data sets are great for machine
learning and you can easily download the data sets from the repository
without any registration. All of the data sets on the UCI Machine Learning
Repository are contributed by different users, so they tend to be a little small, with varying levels of data cleanliness. But most of the data sets
are well maintained and you can easily use them for machine learning
algorithms.
• 5. National Center for Environmental Information
• 6. Global Health Observatory