DS-BDS (Unit 1) Technical
Syllabus
Introduction to Data Science and Big Data, Defining Data Science and Big Data, Big Data examples, Data Explosion : Data Volume, Data Variety, Data Velocity and Veracity, Big data infrastructure and challenges. Big Data Processing Architectures : Data Warehouse, Re-Engineering the Data Warehouse, shared everything and shared nothing architecture, Big data learning approaches. Data Science - The Big Picture : Relation between AI, Statistical Learning, Machine Learning, Data Mining and Big Data Analytics.
Contents
1.1 Introduction to Data Science .............. April-20 .................... Marks 4
1.2 Defining Big Data ......................... April-18 .................... Marks 5
1.3 Data Explosion ............................ April-18, 19, Dec.-18, 19 ... Marks 6
1.4 Big Data Examples
1.5 Data Processing Infrastructure Challenges . May-18 ...................... Marks 6
1.6 Big Data Processing Architectures ......... April-18, 19, 20, May-18 .... Marks 6
1.7 Big Data Learning Approaches .............. April-18, 20 ................ Marks 6
1.8 Data Science : The Big Picture ............ Dec.-19 ..................... Marks 6
Data Science and Big Data Analytics - Introduction : Data Science and Big Data

1.1 Introduction to Data Science    SPPU : April-20

Data is a collection of facts and figures which relay something specific, but which are not organized in any way. It can be numbers, words, measurements, observations or even just descriptions of things. We can say, data is the raw material in the production of information.
Types of data are record data, data matrix, document data, transaction data, graph data and ordered data.
Data science extracts insights from various forms of data. At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions.
Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future. Instead of only knowing from historical data how many products were sold in the previous quarter, data science helps in forecasting future product sales and revenue more accurately.
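The forecasting idea can be sketched with a simple moving-average model, a deliberately minimal stand-in for the time-series methods mentioned above (the quarterly figures below are invented for illustration):

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    recent = history[-window:]
    return sum(recent) / window

# Hypothetical quarterly unit sales (illustrative numbers only).
quarterly_sales = [120, 135, 128, 140, 152, 149]
forecast = moving_average_forecast(quarterly_sales)
```

A real forecast would use a proper time-series model such as exponential smoothing or ARIMA; the point here is only that a prediction is computed as a function of historical observations.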
Data science is the domain of study that deals with vast volumes of data using
modern tools and techniques to find unseen patterns, derive meaningful
information and make business decisions. Data science uses complex machine
learning algorithms to build predictive models.
Data science enables businesses to process huge amounts of structured and
unstructured big data to detect patterns.
Examples include a search engine returning useful results or talking to a chatbot for customer service. These are all real-life applications of data science.
Following are some main reasons for using data science technology :
- With the help of data science technology, we can convert the massive amount of raw and unstructured data into meaningful insights.
- Data science technology is opted for by various companies, whether a big brand or a startup. Google, Amazon and Netflix, which handle huge amounts of data, use data science algorithms for a better customer experience.
- Data science is working on automating transportation, such as creating self-driving cars, which are the future of transportation.
- Data science can help in different predictions, such as various surveys, elections, flight ticket confirmation, etc.
5. Predict future market trends : Collecting and analyzing data on a larger scale can enable you to identify emerging trends in your market. Tracking purchase data, celebrities and influencers and search engine queries can reveal what products people are interested in.
6. Recommendation systems : Netflix and Amazon give movie and product recommendations based on what you like to watch, purchase or browse on their platforms.
7. Streamline manufacturing: Another way you can use data science in business is
to identify inefficiencies in manufacturing processes. Manufacturing machines
gather data from production processes at high volumes. In cases where the
volume of data collected is too high for a human to manually analyze it, an
algorithm can be written to clean, sort and interpret it quickly and accurately to
gather insights.
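As a toy illustration of such an algorithm, the sketch below cleans and sorts a batch of machine sensor readings; the record format, field names and plausibility range are assumptions for the example, not from the text:

```python
def clean_sensor_readings(readings, low=0.0, high=200.0):
    """Drop corrupt entries (missing or out-of-range values),
    then sort the survivors by timestamp."""
    valid = [
        r for r in readings
        if r.get("value") is not None and low <= r["value"] <= high
    ]
    return sorted(valid, key=lambda r: r["t"])

# Hypothetical raw readings from a production machine.
raw = [
    {"t": 3, "value": 95.2},
    {"t": 1, "value": None},   # corrupt: missing value
    {"t": 2, "value": 1e9},    # corrupt: outside plausible range
    {"t": 0, "value": 91.7},
]
cleaned = clean_sensor_readings(raw)
```

At production volumes the same clean-filter-sort logic would run inside a streaming or batch pipeline rather than over an in-memory list.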
1.1.2 Relationship between Data Science and Information Science
Data science, as an interdisciplinary field, employs techniques and theories drawn
from many fields within the context of mathematics, statistics, information science
and computer science. Data science and information science are twin disciplines by
nature. The mission, task and nature of data science are consistent with those of
information science.
Data science is heavy on computer science and mathematics. Information science is
used in areas such as knowledge management, data management and interaction
design.
Information science is the science and practice dealing with the effective collection,
storage, retrieval and use of information. It is concerned with recordable
information and knowledge and the technologies and related services that facilitate
their management and use.
Fig. 1.1.1 Data science life cycle (business understanding, data mining, data exploration, feature engineering, data visualization, data cleaning and predictive modeling)
a) Business understanding : Understand the basic problem you are trying to solve.
b) Data exploration : Understand the patterns and bias in your data.
c) Data visualization : Create and study the visual representation of data.
TECHNICAL PUBLICATIONS - an up-thrust for knowledge
d) Predictive modeling : It is the stage where machine learning finally comes into play on your data.
e) Data cleaning : Detecting and correcting corrupt or inaccurate records.
f) Feature engineering : It is the process of cutting down the features.
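A minimal sketch of cutting down features, assuming a variance threshold as the selection rule (one of several possible criteria; the data and column names are invented for illustration):

```python
def variance(xs):
    """Population variance of a sequence of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def select_features(rows, names, min_var=0.01):
    """Keep only the columns whose variance exceeds min_var;
    near-constant features carry little information and can be cut."""
    cols = list(zip(*rows))
    return [n for n, c in zip(names, cols) if variance(c) > min_var]

# Toy dataset: three rows, three candidate features.
rows = [(1.0, 5.0, 0.0),
        (2.0, 5.0, 0.0),
        (3.0, 5.0, 1.0)]
kept = select_features(rows, ["age", "constant", "flag"])
```

The "constant" column never varies, so it is dropped; the other two survive the threshold.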
Review Question
1. Explain data science and its various applications. SPPU : April-20 (In Sem), Marks 4
1.2 Defining Big Data SPPU: April-18
Big data can be defined as very large volumes of data available at various sources, in varying degrees of complexity, generated at different speeds, i.e., velocities.
'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.
The processing of big data begins with the raw data that isn't aggregated or
organized and is most often impossible to store in the memory of a single
computer.
Big data processing is a set of techniques or programming models to access
large-scale data to extract useful information for supporting and providing
decisions. Hadoop is the open-source implementation of MapReduce and is widely
used for big data processing.
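The MapReduce model that Hadoop implements can be illustrated with a single-process Python sketch of word count, the canonical example; a real Hadoop job would run the map and reduce phases in parallel across cluster nodes:

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit (word, 1) pairs, as a MapReduce mapper would."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two tiny "documents" standing in for files spread across a cluster.
docs = ["big data big insights", "data science"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(pairs)
```

In a real deployment the framework also shuffles pairs so that all values for one key reach the same reducer; here that grouping happens implicitly inside the single-process reduce.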
1.2.1 Difference between Data Science and Big Data
- Data science is used in domains such as biotech, energy and gaming; big data is used in retail, education, healthcare and social media.
- Big data helps in early identification of risk to the product / services, if any.
- Data scientists must understand and work with big data and big data tools.
Review Question
1. Justify your answer with example : "Data science and big data are same or different". SPPU : April-18 (In Sem), Marks 5
1.3 Data Explosion    SPPU : April-18, 19, Dec.-18, 19

The essence of computer applications is to store things in the real world into
computer systems in the form of data, ie., it is a process of producing data. Some
data are the records related to culture and society and others are the descriptions
of phenomena of the universe and life. The large scale of data is rapidly generated
and stored in computer systems, which is called data explosion.
Data is generated automatically by mobile devices and computers - think Facebook, search queries, directions and GPS locations and image capture.
Sensors also generate volumes of data, including medical data and commercial location-based sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even
storage of all this data is expensive. Analysis gets more important and more
expensive every year.
Fig. 1.3.1 shows the big data explosion driven by the current data boom and how critical it is for us to be able to extract meaning from all of this data.
Fig. 1.3.1 Data explosion
[Figure : data volume - machine data, geographical information systems and geo-spatial data]
Fig. 1.3.3 (a) Data velocity (web-based companies such as Amazon, Facebook, Yahoo and Google; social media)
[Figure : data variety - structured, semi-structured and unstructured data]
Cloud Computing | Big Data
Cloud is used to store data and information on remote servers. | It is used to describe a huge volume of data and information.
Cloud computing is economical : low maintenance costs, a centralized platform, no upfront cost and disaster-safe implementation. | Big data is a highly scalable, robust and cost-effective ecosystem.
Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. | Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.
The main focus of cloud computing is to provide computer resources and services with the help of a network connection. | The main focus of big data is solving problems when a huge amount of data is being generated and processed.
Review Questions
1. State one example of big data and explain how all V's are applied for the big data example. SPPU : Dec.-18 (End Sem), Marks 4
2. Explain 5 V's for defining big data along with the factors responsible for data explosion. SPPU : April-19 (In Sem), Marks 5
3. Explain big data along with 5 V's. SPPU : April-18 (In Sem), Marks 5; Dec.-19 (End Sem), Marks 6
1.4 Big Data Examples
Machine data consists of information generated from industrial equipment, real-time data from sensors that track parts and monitor machinery, and even web logs that track user behavior online.
1.5 Data Processing Infrastructure Challenges    SPPU : May-18

2. Transportation : Data is transferred from one place to another and processed, then loaded into memory for manipulation. The data is transported between compute and storage layers. An increase in bandwidth is not a solution to this problem.
3. Processing : Data processing needs to combine logic and mathematical computation in one cycle. This processing is accomplished by the CPU or processor, memory and software. With each generation, CPU processing speed has increased, giving improved processing capabilities. Memory is required for compute and processing, and has become cheaper and faster with the evolution of processor capability. Software is used to write the programs. Layers responsible for storage are added, and each layer has its own limitations, which causes challenges in big data processing.

Review Question
1. List and explain data processing infrastructure challenges in big data. SPPU : May-18 (End Sem), Marks 6
1.6 Big Data Processing Architectures    SPPU : April-18, 19, 20, May-18

[Figure : three-tier data warehouse architecture - bottom tier : data warehouse server with metadata repository and data marts, populated by Extract / Clean / Transform / Load / Refresh; middle tier : OLAP servers; with monitoring and administration tools]
The top tier represents the front-end client layer : the client level which includes the tools and Application Programming Interfaces (APIs) used for high-level data analysis, querying and reporting. Users can use reporting tools, query, analysis or data mining tools.
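To illustrate the kind of aggregate query a top-tier reporting tool might issue against the warehouse, here is a small sketch using Python's built-in sqlite3 as a stand-in (the fact-table schema and rows are invented for the example):

```python
import sqlite3

# In-memory stand-in for a warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "a", 100.0), ("east", "b", 50.0), ("west", "a", 75.0)],
)

# The kind of roll-up query a front-end reporting tool would issue.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

A real warehouse would run such GROUP BY roll-ups through an OLAP server over far larger fact tables; the query shape is the same.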
In a shared everything architecture, all processors share resources such as main memory and disk. The main idea behind such a system is maximizing resource utilization. The disadvantage is that shared resources also lead to reduced performance due to contention.

In a shared nothing architecture, nodes communicate over an interconnection network and do not share resources. Each node is under the control of its own copy of the operating system and thus can be viewed as a local site.
Shared nothing is also known as Massively Parallel Processing (MPP); such solutions are typically employed by large data warehouse systems.

Data is horizontally partitioned across nodes, such that each node has a subset of the rows from each table that was distributed, along with all the replicated tables.

Shared nothing can be made to scale to hundreds or even thousands of machines. Because of this, it is generally regarded as the best-scaling architecture.
Shared-nothing architecture scales better and is well suited for a cloud data
warehouse considering very low-cost commodity PCs and networking hardware.
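Horizontal partitioning in a shared-nothing system can be sketched as hash-partitioning rows by key, so that each node owns a disjoint subset of the table (the node count and row format below are illustrative assumptions):

```python
def partition_rows(rows, num_nodes, key=lambda row: row[0]):
    """Horizontally partition rows across nodes by hashing the key;
    each node ends up owning a disjoint subset of the table."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(key(row)) % num_nodes].append(row)
    return nodes

# A toy customer table distributed over four hypothetical nodes.
table = [(i, f"customer-{i}") for i in range(10)]
shards = partition_rows(table, num_nodes=4)
```

Hashing on the key means a query for one customer can be routed to exactly one node, while scans fan out to all nodes in parallel.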
[Figure : data engineering - re-engineering the data warehouse]
Review Questions
1. Explain the role of shared everything and shared nothing architecture in big data. SPPU : April-18 (In Sem), Marks 5
2. What is data warehouse ? Explain design and architecture of data warehouse. SPPU : May-18 (End Sem), Marks 6
3. Explain shared-everything and shared-nothing architectures in detail with respect to big data. SPPU : April-19 (In Sem), Marks 5
1.7 Big Data Learning Approaches    SPPU : April-18, 20

Fig. 1.7.1 Machine learning approaches : classification (KNN, Naive Bayes, support vector machines, decision trees, neural networks), regression (linear regression, lasso regression, decision trees, neural networks) and clustering (K-means, K-medoid, DBSCAN, CLARA, OPTICS, neural networks)
a) Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. The task of the supervised learner is to predict the output behavior of a system for any set of input values, after an initial training phase. Supervised learning is also called classification.
b) Unsupervised learning sifts through unlabelled data to try and find unusual
patterns that might escape the human eye.
c)Reinforcement learning uses trial and error to learn from mistakes and get
closer to achieving a specific goal.
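As a concrete instance of supervised learning, here is a minimal 1-nearest-neighbour classifier (KNN appears in Fig. 1.7.1; the toy points and labels below are invented for illustration):

```python
def nearest_neighbor_predict(train, query):
    """1-NN: predict the label of the training example closest to the
    query point, using squared Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(train, key=lambda ex: dist(ex[0], query))
    return best[1]

# Toy labelled data: two well-separated clusters.
train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
         ((8.0, 9.0), "high"), ((9.0, 8.5), "high")]
pred = nearest_neighbor_predict(train, (8.5, 8.8))
```

The "initial training phase" here is trivial (the model simply memorizes the examples); methods like SVMs or neural networks instead fit parameters, but the predict-from-labelled-examples contract is the same.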
Here is just one example of how data science can intersect with AI based on machine learning. Let us assume that an Internet search engine company wants to provide and monetize the most relevant online searches in response to a user's query.
Big data analytics is the often complex process of examining big data to uncover
information such as hidden patterns, correlations, market trends and customer
preferences that can help organizations make informed business decisions.
Review Question
1. Explain machine learning approaches in big data. SPPU Dec.-19 (End Sem), Marks 6
a) information    b) attributes    c) characteristics    d) none

Q.3 In big data, ______ refers to heterogeneous sources and the nature of data, both structured and unstructured.
a) volume    b) variety    c) velocity    d) all of these

Q.4 Various types of data analytics are ______.
a) descriptive model    b) predictive model    c) prescriptive model    d) all of these