Savitribai Phule Pune University
Honours* in Data Science
Third Year of Engineering (2019 Course)
310501: DATA SCIENCE AND
VISUALIZATION
Teaching Scheme: TH: 04 Hours/Week, Credit: 04
Examination Scheme: In-Sem (Paper): 30 Marks, End-Sem (Paper): 70 Marks
Department of Engineering
Course Objectives:
• To learn data collection and preprocessing techniques for data science
• To understand and practice analytical methods for solving real-life problems
• To study data exploration techniques
• To learn different types of data and its visualization
• To study different data visualization techniques and tools
• To map elements of visualization effectively to perceive information
AGENDA
Unit I
INTRODUCTION: DATA SCIENCE AND VISUALIZATION
07 Hours
Defining data science and big data, Recognizing the different types of data,
Gaining insight into the data science process,
Data Science Process: Overview, Different steps, Machine Learning Definition
and Relation with Data Science
Data All Around
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social networks
– Cloud
Data and Big Data
•“90% of the world’s data was generated in the last few years.”
• Due to the advent of new technologies, devices, and communication
means like social networking sites, the amount of data produced by
mankind is growing rapidly every year.
• The amount of data produced by mankind from the beginning of time until 2003 was 5 billion gigabytes. Piled up in the form of disks, it could fill an entire football field.
• The same amount was created every two days in 2011, and every six minutes by 2016. This rate is still growing enormously.
Big Data Definition
No single standard definition… “Big Data” is data whose scale, diversity,
and complexity require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden knowledge from it…
What is Big Data
• Big Data is a collection of large datasets that cannot be
processed using traditional computing techniques.
• It is not a single technique or a tool; rather, it involves many areas of business and technology.
Big Data
• Big Data is any data that is expensive to manage
and hard to extract value from
– Volume
• The size of the data
– Velocity
• The latency of data processing relative to the
growing demand for interactivity
– Variety and Complexity
• The diversity of sources, formats, quality,
structures.
Big Data
• By mining huge quantities of structured and unstructured data, organizations can:
– reduce costs
– raise efficiency
– identify new market opportunities
– enhance their competitive advantage
DATA SCIENCE AND BIG DATA
Data Scientists
❖ Convert the organization's raw data into useful information.
❖ Manage and understand large amounts of data.
❖ Create data visualization models that facilitate demonstrating the business value of digital information.
❖ Can illustrate digital information easily with the help of smartphones, Internet of Things devices, and social media.
What Kinds of Problems Does Data Science Solve?
Data and its structure
• Data comes in many forms, but at a high level, it falls into three
categories: structured, semi-structured, and unstructured.
• Structured data:
- highly organized data
- exists within a repository such as a database (or a comma-
separated values [CSV] file).
- easily accessible.
- format of the data makes it appropriate for queries and
computation (by using languages such as Structured Query
Language (SQL)).
• Unstructured data: lacks any content structure at all (for example, an audio stream or natural language text).
• Semi-structured data: includes metadata, or data that can be more easily processed than unstructured data by using semantic tagging.
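A small Python sketch of the three categories, using invented records:

```python
import io
import json

import pandas as pd

# Structured: rows and columns in a CSV, directly queryable.
csv_text = "id,name,age\n1,Asha,34\n2,Ravi,29"
df = pd.read_csv(io.StringIO(csv_text))
print(df[df["age"] > 30])               # SQL-like filtering on a column

# Semi-structured: JSON keys act as semantic tags (metadata).
record = json.loads('{"id": 3, "name": "Meera", "tags": ["vip"]}')
print(record["tags"])                   # ['vip']

# Unstructured: free text has no inherent structure to query.
review = "Delivery was late, but the product itself is excellent."
```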
Data and its structure
Data engineering
• Data wrangling:
• The process of manipulating raw data to make it useful for data analytics or to train a machine learning model.
• It includes:
– sourcing the data from one or more data sets (in addition to reducing the set to the required data),
– normalizing the data so that data merged from multiple data sets is consistent,
– parsing data into some structure or storage for further use.
• It is the process by which you identify, collect, merge, and preprocess one or more data sets in preparation for data cleansing.
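A minimal wrangling sketch in Python with pandas, assuming two hypothetical files orders.csv and customers.csv that share a cust_id column:

```python
import pandas as pd

# Source the data from two hypothetical data sets.
orders = pd.read_csv("orders.csv")        # e.g., order_id, cust_id, amount
customers = pd.read_csv("customers.csv")  # e.g., cust_id, region

# Reduce the set to the required data before merging.
orders = orders[["order_id", "cust_id", "amount"]]

# Merge the data sets on their shared key.
merged = orders.merge(customers, on="cust_id", how="inner")

# Normalize so that merged values are consistent (uniform casing here).
merged["region"] = merged["region"].str.strip().str.lower()

# Parse into a typed structure/storage for further use.
merged["amount"] = pd.to_numeric(merged["amount"], errors="coerce")
merged.to_csv("wrangled.csv", index=False)
```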
Data cleansing
• After you have collected and merged your data set, the next step
is cleansing.
• Data sets in the wild are typically messy and infected with any number of common issues, including missing values (or too many values), bad or incorrect delimiters (which segregate the data), inconsistent records, or insufficient parameters.
• Once the data set is syntactically correct, the next step is to ensure that it is semantically correct.
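A brief cleansing sketch, assuming a hypothetical raw.csv with ';' delimiters, duplicate rows, and inconsistent records:

```python
import pandas as pd

# Read a messy file whose fields are segregated by ';' delimiters.
df = pd.read_csv("raw.csv", sep=";", skipinitialspace=True)

# Remove exact duplicate records.
df = df.drop_duplicates()

# Standardize inconsistent categorical values ("NY", " ny " -> "ny").
df["state"] = df["state"].str.strip().str.lower()

# Semantic check: ages outside a plausible range are marked missing.
df.loc[~df["age"].between(0, 120), "age"] = float("nan")
```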
Data preparation/preprocessing
• The final step in data engineering.
• This step assumes that you have a cleansed data set that might not yet be ready for processing by a machine learning algorithm.
• Using normalization, you transform an input feature to distribute the data evenly into a range acceptable to the machine learning algorithm.
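A sketch of min-max normalization, one common way to rescale a feature into an acceptable range (the income figures are invented):

```python
import numpy as np

def min_max_normalize(x, lo=0.0, hi=1.0):
    """Rescale a feature linearly into the range [lo, hi]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:                        # constant feature: map to lo
        return np.full_like(x, lo)
    return lo + (x - x.min()) * (hi - lo) / span

incomes = [25_000, 48_000, 110_000, 62_000]
print(min_max_normalize(incomes))        # all values now lie in [0, 1]
```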
Machine learning
• Create and validate a machine learning model.
• Sometimes, the machine learning model is the product, which is deployed in the context of an application to provide some capability (such as classification or prediction).
• In other cases, the product isn't the trained machine learning algorithm but rather the data that it produces.
Model learning
• In one model, the algorithm processes the data and creates a new data product as the result.
• But in a production sense, the machine learning model is the product itself, deployed to provide insight or add value (such as the deployment of a neural network to provide prediction capabilities for an insurance market).
Machine learning approaches
• Supervised learning
• Unsupervised learning
• Reinforcement learning
1. Supervised learning:
– The algorithm is trained to produce the correct class, and the model is altered when it fails to do so.
– The model is trained until it reaches some level of accuracy.
2. Unsupervised learning:
– Has no class labels; instead, the algorithm inspects the data and groups it based on some structure that is hidden within the data.
– These types of algorithms can be used in recommendation systems, for example by grouping customers based on their viewing or purchasing history.
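For instance, an unsupervised algorithm such as k-means (chosen here for illustration; the purchase-history features are invented) can group customers without any labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per customer: [orders per month, average basket value].
X = np.array([[1, 20], [2, 25], [1, 22],       # occasional buyers
              [9, 210], [8, 190], [10, 230]])  # frequent big spenders

# No classes are given; the algorithm finds the hidden grouping.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)   # cluster id per customer, usable for recommendations
```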
Reinforcement learning
• A semi-supervised learning approach.
• Provides a reward after the model makes some number of decisions that lead to a satisfactory result.
Model validation
• Used to understand how the model will behave in production, after the model is trained.
• For that purpose, a small amount of the available data is reserved to be tested against the final model (called test data).
• The training data is used to train the machine learning model; the test data is used when the model is complete, to validate how well it generalizes to unseen data.
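A minimal validation sketch with scikit-learn, reserving 20% of a standard labelled data set as test data:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve a small amount of the available data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the training data only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score on unseen data to estimate how well the model generalizes.
print(model.score(X_test, y_test))
```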
Operations
• The end goal of the data science pipeline.
• Creating a visualization for the data product.
• Deploying the machine learning model in a production environment to operate on unseen data to provide prediction or classification.
• Model deployment:
• When the product of the machine learning phase is a model, it is deployed into some production environment to apply to new data.
• This model could be, for example, a prediction system.
Operations
• Model visualization:
• In smaller-scale data science, the product is data rather than a model produced in the machine learning phase.
• The data product answers some questions about the original data set.
• Options for visualization are vast and can be produced with, for example, the R programming language.
Business Intelligence vs. Data Science
Data Science:
I. Data science is a field in which information and knowledge are extracted from data by using various scientific methods, algorithms, and processes.
II. It can thus be defined as a combination of mathematical tools, algorithms, statistics, and machine learning techniques that are used to find hidden patterns and insights in data, which helps in the decision-making process.
III. Data science deals with both structured and unstructured data.
IV. It is related to both data mining and big data.
V. Data science involves studying historic trends and using its conclusions to redefine present trends as well as predict future trends.
Business Intelligence:
I. Business intelligence (BI) is a set of technologies, applications, and processes that are used by enterprises for business data analysis.
II. It is used to convert raw data into meaningful information, which is then used for business decision making and profitable actions.
III. It deals with the analysis of structured and sometimes unstructured data, which paves the way for new and profitable business opportunities.
IV. It supports decision making based on facts rather than on assumptions.
V. Thus it has a direct impact on the business decisions of an enterprise. Business intelligence tools enhance the chances of an enterprise entering a new market, and they also help in studying the impact of marketing efforts.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.
• Steps Involved in Data Preprocessing:
• 1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
• (a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
• Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.
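A short pandas sketch of both options, on an invented data frame:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 41, None, 33],
                   "city": ["Pune", "Mumbai", None, "Pune", "Pune"]})

# Option 1: ignore (drop) the tuples with missing values -- only
# sensible when the data set is large and few rows are affected.
dropped = df.dropna()

# Option 2: fill numeric gaps with the attribute mean, and
# categorical gaps with the most probable (modal) value.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
```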
• (b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
• Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and then various methods are performed to complete the task. Each segment is handled separately: one can replace all data in a segment by its mean, or boundary values can be used to complete the task.
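A sketch of both smoothing variants on a small sorted sample, with a segment size of 4 (an arbitrary choice):

```python
import numpy as np

# Sorted data divided into equal-size segments (bins) of 4 values.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = values.reshape(-1, 4)

# Smoothing by bin means: each value becomes its segment's mean.
by_means = np.repeat(bins.mean(axis=1), 4)
print(by_means)

# Smoothing by bin boundaries: each value snaps to the nearer of the
# segment's two boundary values (ties go to the upper boundary).
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(bins - lo < hi - bins, lo, hi)
print(by_bounds)
```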
• Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
• Clustering:
This approach groups similar data into clusters. Outliers may then go undetected, or they will fall outside the clusters.
• 2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
• Normalization:
It is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0).
• Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
• Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels (see the sketch after this list).
• Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
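A sketch of both steps in pandas; the age ranges and the city-to-country table are invented:

```python
import pandas as pd

# Discretization: replace raw numeric ages with conceptual levels.
ages = pd.Series([5, 17, 23, 38, 45, 61, 70, 83])
levels = pd.cut(ages, bins=[0, 18, 60, 120],
                labels=["minor", "adult", "senior"])
print(levels.tolist())

# Concept hierarchy generation: lift "city" up to "country".
city_to_country = {"Pune": "India", "Mumbai": "India", "Paris": "France"}
cities = pd.Series(["Pune", "Paris", "Mumbai"])
print(cities.map(city_to_country).tolist())   # ['India', 'France', 'India']
```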
• 3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data, and while working with huge volumes of data, analysis becomes harder. To address this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
• The various steps to data reduction are:
• Data Cube Aggregation:
Aggregation operations are applied to the data to construct the data cube, as in the sketch below.
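One way to picture cube aggregation is a group-by that rolls detail rows up to a coarser level; the sales figures below are invented:

```python
import pandas as pd

sales = pd.DataFrame({"year": [2022, 2022, 2023, 2023],
                      "region": ["east", "west", "east", "west"],
                      "amount": [120, 90, 150, 110]})

# Aggregate the (year, region) detail up one level: totals per year.
cube_year = sales.groupby("year")["amount"].sum()
print(cube_year)   # 2022 -> 210, 2023 -> 260
```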
• Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute having a p-value greater than the significance level can be discarded.
• Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models, as in the sketch below.
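A numerosity-reduction sketch: fit a linear regression to invented readings and store only its two coefficients instead of all 5,000 points:

```python
import numpy as np

# Hypothetical readings: 5,000 noisy (x, y) points along a line.
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 5000)
y = 3.2 * x + 7.5 + rng.normal(0, 2, x.size)

# Keep only the fitted model (slope, intercept), not the raw data.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)              # close to 3.2 and 7.5

# Any point can later be approximated from the stored model alone.
estimate = slope * 42.0 + intercept
```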
• Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or lossless: if the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
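A PCA sketch with scikit-learn, projecting the 4-feature iris data onto its 2 strongest principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)           # 150 samples, 4 features

# Keep the 2 components that retain the most variance (lossy reduction).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```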
• There are many online data sources where you can get free data sets to use in your projects. Some of these sources are listed below; you can download their data and use it for free. So whether you want to do data visualization, data cleaning, machine learning, or any other type of project, there is a data set for you!
• 1. Google Cloud Public Datasets
• Google is not just a search engine, it’s much more! There are many public
data sets that you can access on the Google cloud and analyze to obtain
new insights from this data. There are more than 100 datasets and all of
them are hosted by BigQuery and Cloud Storage. You can also use Google’s
Machine Learning capabilities to analyze the data sets such as BigQuery ML,
Vision AI, Cloud AutoML, etc. You can also use Google Data Studio to create
data visualizations and interactive dashboards so that you can obtain better
insights and find patterns in the data. Google Cloud Public Datasets has data
from various data providers such as GitHub, United States Census Bureau,
NASA, Bitcoin, US Department of Transportation, etc. You can access these data sets for free and get free query access to about 1 TB of data per month in BigQuery.
• 2. Amazon Web Services Open Data Registry
• 3. Kaggle:
There are around 23,000 public datasets on Kaggle that you can download
for free. In fact, many of these datasets have been downloaded millions of
times already. You can use the search box to search for public datasets on
whatever topic you want ranging from health to science to popular
cartoons! You can also create new public datasets on Kaggle and those may
earn you medals and also lead you towards advanced Kaggle titles like
Expert, Master, and Grandmaster. You can also download competition data sets from Kaggle while participating in these competitions. The competition data sets are much more detailed, curated, and well cleaned than the public data sets available on Kaggle, so you might have to sort through the public ones.
But all in all, if you are interested in Data Science, then Kaggle is the place
for you!
• 4. UCI Machine Learning Repository
The UCI Machine Learning Repository is a great place to look for interesting
data sets as it is one of the first and oldest data sources available on the
internet (It was created in 1987!). These data sets are great for machine
learning and you can easily download the data sets from the repository
without any registration. All of the data sets on the UCI Machine Learning
Repository are contributed by different users, so they tend to be a little small, with varying levels of data cleanliness. But most of the data sets
are well maintained and you can easily use them for machine learning
algorithms.
• 5. National Center for Environmental Information
• 6. Global Health Observatory