DS-BDS (Unit 1) - Technical

UNIT I

1    Introduction : Data Science and Big Data

Syllabus
Introduction to Data Science and Big Data, Defining Data Science and Big Data, Big Data examples, Data Explosion : Data Volume, Data Variety, Data Velocity and Veracity. Big data infrastructure and challenges. Big Data Processing Architectures : Data Warehouse, Re-Engineering the Data Warehouse, shared everything and shared nothing architecture, Big data learning approaches. Data Science - The Big Picture : Relation between AI, Statistical Learning, Machine Learning, Data Mining and Big Data Analytics.

Contents
1.1 Introduction to Data Science ............ April-20, Marks 4
1.2 Defining Big Data ............ April-18, Marks 5
1.3 Data Explosion ............ April-18, 19, Dec.-18, 19, Marks 6
1.4 Big Data Examples
1.5 Data Processing Infrastructure Challenges ............ May-18, Marks 6
1.6 Big Data Processing Architectures ............ April-18, 19, 20, May-18, Marks 6
1.7 Big Data Learning Approaches ............ April-18, 20, Marks 6
1.8 Data Science : The Big Picture ............ Dec.-19, Marks 6
1.9 Multiple Choice Questions

Data Science and Big Data Analytics
1.1 Introduction to Data Science    SPPU : April-20

Data is a collection of facts and figures which relay something specific, but which are not organized in any way. It can be numbers, words, measurements, observations or even just descriptions of things. We can say, data is raw material in the production of information.
Types of data are record data, data matrix, document data, transaction data, graph data and ordered data.

Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. At its core, data science aims to discover and extract actionable knowledge from data that can be used to make sound business decisions and predictions.
Data science uses advanced analytical theory and various methods, such as time series analysis, for predicting the future. Instead of merely knowing how many products were sold in the previous quarter, data science uses historical data to forecast future product sales and revenue more accurately.
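As a purely illustrative sketch of the forecasting idea (the quarterly figures below are made up), a simple moving average can project next-quarter sales from historical data:

```python
# Hypothetical sketch: forecast next-quarter sales as the mean of the
# most recent `window` quarters of historical sales.
def moving_average_forecast(history, window=4):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        window = len(history)
    recent = history[-window:]
    return sum(recent) / len(recent)

quarterly_sales = [120, 135, 128, 150, 160, 155, 170, 180]  # invented data
print(moving_average_forecast(quarterly_sales))  # 166.25
```

Real data science work would use richer time-series models, but the principle of learning from historical data is the same.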
Data science is the domain of study that deals with vast volumes of data using
modern tools and techniques to find unseen patterns, derive meaningful
information and make business decisions. Data science uses complex machine
learning algorithms to build predictive models.
Data science enables businesses to process huge amounts of structured and
unstructured big data to detect patterns.

1.1.1 Applications of Data Science


Asking a personal assistant like Alexa or Siri for a recommendation demands data science. So does operating a self-driving car, using a search engine that provides useful results or talking to a chatbot for customer service. These are all real-life applications of data science.
Following are some main reasons for using data science technology :
o With the help of data science technology, we can convert the massive amount of raw and unstructured data into meaningful insights.
o Data science technology is adopted by various companies, whether big brands or startups. Google, Amazon and Netflix, which handle huge amounts of data, use data science algorithms for a better customer experience.
o Data science is working towards automating transportation, such as creating self-driving cars, which are the future of transportation.
o Data science can help in different predictions such as various surveys, elections, flight ticket confirmation, etc.

TECHNICAL PUBLICATIONS - an up-thrust for knowledge



1. Healthcare : Healthcare companies are using data science to build sophisticated medical instruments to detect and cure diseases.
2. Gaming : Video and computer games are now being created with the help of data science, which has taken the gaming experience to the next level.
3. Image recognition : Identifying patterns in images and detecting objects in an image is one of the most popular data science applications.
4. Logistics: Data science is used by logistics companies to optimize routes to
ensure faster delivery of products and increase operational efficiency.

5. Predict future market trends : Collecting and analyzing data on a larger scale
can enable you to identify emerging trends in your market. Tracking purchase
data, celebrities and influencers and search engine queries can reveal what
products people are interested in.
6. Recommendation systems : Netflix and Amazon give movie and product
recommendations based on what you like to watch, purchase or browse on
their platforms.
7. Streamline manufacturing: Another way you can use data science in business is
to identify inefficiencies in manufacturing processes. Manufacturing machines
gather data from production processes at high volumes. In cases where the
volume of data collected is too high for a human to manually analyze it, an
algorithm can be written to clean, sort and interpret it quickly and accurately to
gather insights.
1.1.2 Relationship between Data Science and Information Science

Data science, as an interdisciplinary field, employs techniques and theories drawn
from many fields within the context of mathematics, statistics, information science
and computer science. Data science and information science are twin disciplines by
nature. The mission, task and nature of data science are consistent with those of
information science.
Data science is heavy on computer science and mathematics. Information science is
used in areas such as knowledge management, data management and interaction
design.
Information science is the science and practice dealing with the effective collection,
storage, retrieval and use of information. It is concerned with recordable
information and knowledge and the technologies and related services that facilitate
their management and use.

1.1.3 Business Intelligence versus Data Science

Business Intelligence (BI) | Data Science
BI tends to provide reports, dashboards and queries on business questions for the current period or in the past. | Data science tends to use disaggregated data in a more forward-looking, exploratory way, focusing on analyzing the present and enabling informed decisions about the future.
BI systems make it easy to answer questions related to quarter-to-date revenue, progress toward quarterly targets and how much of a given product was sold in a prior quarter or year. | Data science tends to be more exploratory in nature and may use scenario optimization to deal with more open-ended questions.
BI helps monitor the current state of business data to understand the historical performance of a business. | Data science, as used in business, is basically data-driven, where many interdisciplinary sciences are applied together to extract meaning.
BI is designed to handle static and highly structured data. | Data science can handle high-speed, high-volume and complex, multi-structured data from a wide variety of data sources.

1.1.4 Data Science Life Cycle

A data science life cycle is an iterative set of data science steps you take to deliver a project or analysis. Fig. 1.1.1 shows the data science life cycle.

Fig. 1.1.1 Data science life cycle (stages : business understanding, data mining, data exploration, data cleaning, feature engineering, predictive modeling, data visualization)
a) Business understanding : Understand the basic problem you are trying to solve.
b) Data exploration : Understand the pattern and bias in your data.
c) Data visualization : Create and study the visual representation of data.
d) Predictive modeling : It is the stage where machine learning finally comes into play on your data.
e) Data cleaning : Detecting and correcting corrupt or inaccurate records.
f) Feature engineering : It is the process of cutting down the features.
g) Data mining : Gathering your data from different sources.
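The life-cycle stages above can be sketched end-to-end as a toy example (the records, field names and the least-squares model are all hypothetical, purely for illustration):

```python
# Hypothetical walk through the life-cycle stages on a tiny in-memory dataset.
raw_records = [  # data mining: gather records (hard-coded here for illustration)
    {"hours": 1, "score": 52}, {"hours": 2, "score": 58},
    {"hours": 3, "score": None}, {"hours": 4, "score": 71},
]

# Data cleaning: drop records with missing values.
clean = [r for r in raw_records if r["score"] is not None]

# Data exploration: inspect a simple summary statistic.
mean_score = sum(r["score"] for r in clean) / len(clean)

# Feature engineering / predictive modeling: fit y = a*x + b by least squares.
n = len(clean)
sx = sum(r["hours"] for r in clean)
sy = sum(r["score"] for r in clean)
sxx = sum(r["hours"] ** 2 for r in clean)
sxy = sum(r["hours"] * r["score"] for r in clean)
a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b = (sy - a * sx) / n

def predict(hours):
    """Predict a score for an unseen input using the fitted line."""
    return a * hours + b
```

Each comment marks the stage it illustrates; a real project would of course use proper tooling for each step.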

Review Question
1. Explain data science and its various applications. SPPU : April-20 (In Sem), Marks 4
1.2 Defining Big Data SPPU: April-18
Big data can be defined as very large volumes of data available at various sources,
in varying degrees of complexity, generated at different speed i.e., velocities and

varying degrees of ambiguity, which cannot be processed using traditional


technologies, processing methods, algorithms or any commercial off-the-shelf
solutions.

'Big data' is a term used to describe a collection of data that is huge in size and
yet growing exponentially with time. In short, such data is so large and complex
that none of the traditional data management tools are able to store it or process it

efficiently.
The processing of big data begins with the raw data that isn't aggregated or
organized and is most often impossible to store in the memory of a single

computer.
Big data processing is a set of techniques or programming models to access
large-scale data to extract useful information for supporting and providing
decisions. Hadoop is the open-source implementation of MapReduce and is widely
used for big data processing.
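The MapReduce model that Hadoop implements can be illustrated with a minimal single-process sketch (this toy code is not the Hadoop API; Hadoop distributes the same three phases across a cluster):

```python
from collections import defaultdict

# Toy MapReduce word count: map emits (word, 1) pairs, shuffle groups
# pairs by key, and reduce sums each group.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big tools", "data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'needs': 1, 'tools': 2}
```

In Hadoop, the map and reduce functions run on many nodes in parallel and the shuffle moves data between them over the network.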
1.2.1 Difference between Data Science and Big Data

Data science | Big data
It is a field of scientific analysis of data in order to solve analytically complex problems, including the significant and necessary activity of cleansing and preparing data for applications. | Big data is storing and processing large volumes of structured and unstructured data that cannot be handled with traditional applications.
It is used in biotech, energy, gaming, etc. | Used in retail, education, healthcare and social media.
Goals : Data classification, anomaly detection, prediction, scoring and ranking. | Goals : To provide better customer service, identifying new revenue opportunities, effective marketing, etc.


1.2.2 Benefits of Big Data Processing

Benefits of big data processing :
1. Improved customer service.
2. Business can utilize outside intelligence while taking decisions.
3. Reducing maintenance costs.
4. Re-develop your products : Big data can also help you understand how others perceive your products, so that you can adapt them, or your marketing, if need be.
5. Early identification of risk to the product / services, if any.
6. Better operational efficiency.

1.2.3 Big Data Challenges


Collecting, storing and processing big data comes with its own set of challenges :
1. Big data is growing exponentially and existing data management solutions have to be constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can understand and work with big data and big data tools.

Review Question
1. Justify your answer with an example : "Data science and big data are same or different".
SPPU : April-18 (In Sem), Marks 5

1.3 Data Explosion    SPPU : April-18, 19, Dec.-18, 19

The essence of computer applications is to store things in the real world into
computer systems in the form of data, ie., it is a process of producing data. Some
data are the records related to culture and society and others are the descriptions
of phenomena of the universe and life. The large scale of data is rapidly generated
and stored in computer systems, which is called data explosion.
Data is generated automatically by mobile devices and computers : think Facebook search queries, directions and GPS locations and image capture.
Sensors also generate volumes of data, including medical data and commercial location-based sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even
storage of all this data is expensive. Analysis gets more important and more
expensive every year.
Fig. 1.3.1 shows the big data explosion caused by the current data boom and how critical it is for us to be able to extract meaning from all of this data.

Fig. 1.3.1 Data explosion

The phenomenon of exponential multiplication of data that gets stored is termed as


"Data Explosion". Continuous inflow of real-time data from various processes,
machinery and manual inputs keeps flooding the storage servers every second.
Sending emails, making phone calls, collecting information for campaigns; each
day we create a massive amount of data just by going about our normal business
and this data explosion does not seem to be slowing down. In fact, 90 % of the
data that currently exists was created in just the last two years.
The reason for this data explosion is innovation.
1. Business model transformation : Innovation changed the way in which we do business and provide services. The data world is governed by three fundamental trends : business model transformation, globalization and personalization of services.
o Organizations have traditionally treated data as a legal or compliance
requirement, supporting limited management reporting requirements.
Consequently, organizations have treated data as a cost to be minimized.
o The businesses are required to produce more data related to products and provide services to cater to each sector and channel of customers.


2. Globalization : Globalization is an emerging trend in business where
organizations start operating on an international scale. From manufacturing to
customer service, globalization has changed the commerce of the world. Variety
and different formats of data are generated due to globalization.
3. Personalization of services : To enhance customer service, one-to-one marketing in the form of personalized services is adopted. Customers expect communication through various channels, which increases the speed of data generation.
4. New sources of data : The shift to online advertising supported by the likes of
Google, Yahoo and others is a key driver in the data boom. Social media,
mobile devices, sensor networks and new media are on the fingertips of
customers or users. The data generated through this is used by corporations for
decision support systems like business intelligence and analytics. The growth of
technology helped to emerge new business models over the last decade or
more. Integration of all the data across the enterprise is used to create business
decision support platform.


1.3.1 V's of Big Data


We differentiate big data characteristics from traditional data by one or more of
the five V's: Volume, velocity, variety, veracity and value.
1. Volume : Volumes of data are larger than what conventional relational database infrastructure can cope with, consisting of terabytes or petabytes of data.

Fig. 1.3.2 shows big data volume.

Fig. 1.3.2 Big data volume (sources : clickstream logs, application logs, machine data, emails, contracts, geographical information systems and geo-spatial data)


2. Velocity : The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet the demands determines the real potential in the data. It is being created in or near real-time.
3. Variety : It refers to heterogeneous sources and the nature of data, both structured and unstructured.
o Fig. 1.3.3 (a) and Fig. 1.3.3 (b) show big data velocity and data variety.

Fig. 1.3.3 (a) Data velocity (sources : sensor data, mobile networks, social media, web based companies such as Amazon, Facebook, Yahoo and Google)


Fig. 1.3.3 (b) Data variety (structured, semi-structured and unstructured data)


4. Value : It represents the business value to be derived from big data.
The ultimate objective of any big data project should be to generate some
sort of value for the company doing all the analysis. Otherwise, you're just
performing some technological task for technology's sake.
For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.
Exploration of data trends can include spatial proximities and
relationships. Once spatial big data are structured, formal spatial analytics
can be applied, such as spatial autocorrelation, overlays, buffering, spatial
cluster techniques and location quotients.
5. Veracity : Big data must be fed with relevant and true data. We will not be
able to perform useful analytics if much of the incoming data comes from false
sources or has errors. Veracity refers to the level of trustworthiness or messiness of data; the higher the trustworthiness of the data, the lower the messiness, and vice versa. It relates to the assurance of the data's quality, integrity, credibility and
accuracy. We must evaluate the data for accuracy before using it for business
insights because it is obtained from multiple sources.
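A veracity check can be sketched as a simple validation pass over incoming records before analysis; the field names and plausibility rules below are hypothetical:

```python
# Hypothetical veracity check: flag records that are missing fields or
# contain out-of-range values before they reach the analytics stage.
REQUIRED = ("customer_id", "amount")

def is_trustworthy(record):
    if any(record.get(field) is None for field in REQUIRED):
        return False                       # incomplete record
    if not (0 <= record["amount"] <= 1_000_000):
        return False                       # implausible transaction amount
    return True

incoming = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 2, "amount": None},    # missing value
    {"customer_id": 3, "amount": -50.0},   # out of range
]
clean = [r for r in incoming if is_trustworthy(r)]
print(len(clean))  # 1
```

Production systems apply far richer rules (cross-source reconciliation, schema checks), but the principle of filtering untrustworthy data before analysis is the same.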

1.3.2 Compare Cloud Computing and Big Data

Cloud computing | Big data
It provides resources on demand. | It provides a way to handle huge volumes of data and generate insights.
It refers to internet services, from SaaS and PaaS to IaaS. | It refers to data, which can be structured, semi-structured or unstructured.
Cloud is used to store data and information on remote servers. | It is used to describe a huge volume of data and information.
Cloud computing is economical as it has low maintenance costs, a centralized platform, no upfront cost and disaster-safe implementation. | Big data is a highly scalable, robust and cost-effective ecosystem.
Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. | Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.
The main focus of cloud computing is to provide computing resources and services with the help of a network connection. | The main focus of big data is solving problems when a huge amount of data is being generated and processed.

Review Questions
1. State one example of big data and explain how all V's are applied for the big data example.
SPPU Dec.-18 (End Sem), Marks 4
2. Explain 5 V's for defining big data along with the factors responsible for data explosion.
SPPU: April-19 (In Sem), Marks 5
3. Explain big data along with 5 V's.
SPPU: April-18 (In Sem), Marks 5, Dec.-19 (End Sem),Marks 6
1.4 Big Data Examples
Machine data consists of information generated from industrial equipment,
real-time data from sensors that track parts and monitor machinery and even web

logs that track user behavior online.


At arcplan client CERN, the largest particle physics research center in the world,
the Large Hadron Collider (LHC) generates 40 terabytes of data every second
during experiments.
Regarding transactional data, large retailers and even B2B companies can generate
multitudes of data on a regular basis considering that their transactions consist of
one or many items, product IDs, prices, payment information, manufacturer and
distributor data and much more.

Factors responsible for data volume in big data are as follows :


1. Machine data : Machine data contains a definitive record of all activity and behavior of your customers, users, transactions, applications, servers, networks, factory machinery and so on. It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.


2. Application log : Most homegrown and packaged applications write local logfiles, as do logging services built into application servers like WebLogic, WebSphere and JBoss. These files are critical for day-to-day debugging of production applications by developers and application support. When developers put timing information into their log events, they can also be used to monitor and report on application performance.
3. Business process logs : Complex event processing and business process management system logs are treasure troves of business and IT relevant data. These logs will generally include definitive records of customer activity across multiple channels such as the web, IVR / contact center or retail.
4. Clickstream data : User activity on the Internet is captured in clickstream data. This provides insight into a user's website and web page activity. This information is valuable for usability analysis, marketing and general research.
5. Third party data : The sensitive data that's not in databases is on file systems.
In some industries such as healthcare, the biggest data leakage risk is consumer
records on shared file systems. Different OS, third-party tools and storage
technologies provide different options for auditing read access to sensitive data
at the file system level. This audit data is a vital data source for monitoring and
investigating access to sensitive data.
6. Electronic mails : Every company has a large collection of emails generated by customers, employees and executives on a daily basis. These email communications are an important asset to an organization; they are audited on a case-by-case basis and the entire life cycle management of emails is done.
Some of the examples of big data are
1. Social media : Social media is one of the biggest contributors to the flood of
data we have today. Facebook generates around 500+ terabytes of data every day in the form of content generated by users, like status messages, photos and video uploads, messages, comments, etc.
2. Stock exchange : Data generated by stock exchanges is also in terabytes per
day. Most of this data is the trade data of users and companies.
3. Aviation industry: A single jet engine can generate around 10 terabytes of data
during a 30 minute flight.
4. Survey data : Online or offline surveys conducted on various topics typically have hundreds or thousands of responses and need to be processed for analysis and visualization by creating clusters of the population and their associated responses.
5. Compliance data : Many organizations like healthcare, hospitals, life sciences, finance, etc., have to file compliance reports.


1.5 Data Processing Infrastructure Challenges    SPPU : May-18

Data processing infrastructure challenges are storage, transportation, processing and throughput.
1. Storage : The increase in the volume of data increases the need for storing and processing data. Big data technology has changed the way we gather and store data, including data storage devices, data storage architecture and data access techniques. It requires more sophisticated storage media with higher I/O speed to meet the challenges of big data issues. Direct-Attached Storage (DAS), Network-Attached Storage (NAS) and Storage Area Network (SAN) are the enterprise storage architectures commonly in use.

2. Transportation : Data is transferred from one place to another, processed, and then loaded into memory for manipulation. The data is transported between the computer and storage layers. Increasing bandwidth alone is not a solution to this problem.
3. Processing : Data processing needs to combine logic and mathematical computation in one cycle. This processing is accomplished by the CPU or processor, memory and software. With each generation, CPU processing speed has increased, giving improved processing capabilities. Memory is required for compute and processing; it has become cheaper and faster with the evolution of processor capability. Software is used to write the programs for transforming and processing data.


4. Speed and throughput : This is a major challenge for data processing. Various architectural layers like hardware, software, networking and storage are involved, and each layer has its own limitations, which cause limitations in the overall throughput of data processing.

Review Question
1. List and explain data processing infrastructure challenges in big data.
SPPU : May-18 (End Sem), Marks 6

1.6 Big Data Processing Architectures    SPPU : April-18, 19, 20, May-18

1.6.1 Data Warehouse

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process. A data warehouse stores historical data for purposes of decision support.

Fig. 1.6.1 shows three tier architecture. Three tier architecture is sometimes called multi-tier architecture.
Fig. 1.6.1 Three tier architecture (bottom tier : data warehouse server with metadata repository and data marts, loaded via extract / clean / transform / load / refresh from operational databases and external sources, with monitoring and administration; middle tier : OLAP servers; top tier : query / report, analytics and data mining tools)
The bottom tier is the database of the warehouse, where the cleansed and
transformed data is loaded. The bottom tier is a warehouse database server.
The middle tier is the application layer giving an abstracted view of the
database.
It arranges the data to make it more suitable for
analysis. This is done with an
OLAP server, implemented using the ROLAP or MOLAP model.
OLAPS can interact with both relational databases and multidimensional
databases, which lets them collect data better based on broader parameters.
The top tier is the front-end of an organization's overall business
intelligence suite
The top-tier is where the user accesses and interacts with data via queries, data
visualizations and data analytics tools.

TECHNICAL PUBLICATIONS an up-thrust for knowledge


Date Science and Big Data Analytics 1-15 Introduction: Data Science and Big Data

The top tier represents the front-end client layer. The client level includes the tools and Application Programming Interfaces (APIs) used for high-level data analysis, querying and reporting. Users can use reporting tools, query, analysis or data mining tools.
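The bottom-tier loading path (extract, clean, transform, load) can be sketched as a toy ETL job; the table name, fields and in-memory SQLite "warehouse" are assumptions for illustration only:

```python
import sqlite3

# Toy ETL sketch: extract rows from an "operational" source, clean and
# transform them, then load into a warehouse fact table (in-memory SQLite).
source = [("2023-01-05", "  Widget ", 3), ("2023-01-06", "Gadget", None)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_date TEXT, product TEXT, qty INTEGER)")

for sale_date, product, qty in source:
    if qty is None:                # clean: skip incomplete records
        continue
    product = product.strip()      # transform: normalize the product name
    conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 (sale_date, product, qty))
conn.commit()

rows = conn.execute("SELECT product, qty FROM sales_fact").fetchall()
print(rows)  # [('Widget', 3)]
```

A real warehouse load would run on a database server and handle refresh scheduling, but the extract / clean / transform / load sequence is the same.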

1.6.2 Shared Everything Architecture


This architectural model consists of nodes that share all resources within the
system. Each node has access to the same computing resources and shared storage.
Shared-everything architecture refers to system architecture where all resources are
shared including storage, memory and the processor.

Fig. 1.6.2 shows shared everything architecture.

Fig. 1.6.2 Shared everything architecture (CPU 1 to CPU 4 sharing main memory and disk)

The main idea behind such a system is maximizing resource utilization. The disadvantage is that shared resources also lead to reduced performance due to contention. Scalability is the main problem. Oracle RAC uses this architecture.
Symmetric Multiprocessing (SMP) and Distributed Shared Memory (DSM) are the types of shared everything architecture.
In the SMP architecture, all the CPUs share a single pool of memory for read-write access concurrently and uniformly without latency. Sometimes this is referred to as Uniform Memory Access (UMA) architecture.


The DSM architecture addresses the scalability problem by providing multiple pools of memory for processors to use. In the DSM architecture, the latency to access memory depends on the relative distances of the processors and their dedicated memory pools. This architecture is also referred to as Nonuniform Memory Access (NUMA) architecture.

1.6.3 Shared Nothing Architecture


Shared nothing architecture is a distributed computing architecture that consists of multiple separated nodes that don't share resources. The nodes are independent and self-sufficient as they have their own disk space and memory.
Each node has its own private memory (M), processor (CPU) and storage devices independent of any other node in the configuration. This means that every node stores its own lock table and buffer pool.
Fig. 1.6.3 shows shared nothing architecture.

Fig. 1.6.3 Shared nothing architecture (Proc. 1 to Proc. N, each with its own memory, connected by an interconnection network)

The key feature of shared-nothing architecture is that the operating system, not the application server, owns responsibility for controlling and sharing hardware resources. Each node is under the control of its own copy of the operating system and thus can be viewed as a local site.
Shared nothing is also known as Massively Parallel Processing (MPP); such solutions are typically employed by large data warehouse systems.
Data is horizontally partitioned across nodes, such that each node has a subset of the rows from each distributed table, plus all the replicated tables.
Shared nothing can be made to scale to hundreds or even thousands of machines. Because of this, it is generally regarded as the best-scaling architecture.
Shared-nothing architecture scales better and is well suited for a cloud data warehouse, considering very low-cost commodity PCs and networking hardware.
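Horizontal partitioning in a shared-nothing system is often done by hashing a distribution key so each node holds a disjoint subset of rows; the node count and key name below are hypothetical:

```python
# Toy horizontal (hash) partitioning: each row is routed to exactly one node
# based on a hash of its key, so nodes hold disjoint subsets of the table.
NUM_NODES = 3

def node_for(key):
    # Python's built-in hash is used here for simplicity; real systems use
    # stable hash functions so routing survives process restarts.
    return hash(key) % NUM_NODES

nodes = {i: [] for i in range(NUM_NODES)}
rows = [{"customer_id": cid, "amount": 10 * cid} for cid in range(1, 11)]
for row in rows:
    nodes[node_for(row["customer_id"])].append(row)

# Every row lands on exactly one node.
assert sum(len(part) for part in nodes.values()) == len(rows)
```

Queries on the distribution key can then be answered by a single node, while scans run on all nodes in parallel.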


1.6.4 Re-engineering the Data Warehouse

Re-engineering the data warehouse means building a next generation data warehouse. Fig. 1.6.4 shows re-engineering the data warehouse.

Fig. 1.6.4 Re-engineering the data warehouse (methods : replatforming, data engineering and platform engineering)

Various methods of re-engineering are replatforming, platform engineering and data engineering.
1. Replatforming : Replatforming means new infrastructure and hardware. Depending on an organization's requirements, new technologies such as warehouse appliances, tiered storage and private cloud can be deployed.
Advantages : Scalability, reliability, security, lower maintenance, code optimization.
Disadvantages : It is time consuming and leads to disruption of business activities.
2. Data engineering : It is re-engineering of data structures to create better performance. The initial data model of the data warehouse is changed to make a new data model. It includes partitioning tables into vertical or horizontal partitions, colocation of related tables in the same storage region, distribution of data, adding new data types and adding new database functions for a performance boost.
3. Platform engineering : It is related to modifying some parts of the infrastructure, which helps to gain better scalability and improved performance. Platform engineering is a popular concept in the automotive industry, where product parts are crafted to offer improved quality, service and cost.
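The table-partitioning choices mentioned under data engineering can be shown with a toy example in Python. The orders table and the column split are invented purely for illustration; real warehouses apply these splits at the storage layer, not in application code:

```python
# A toy illustration of the two table-partitioning styles used in
# data engineering, applied to plain lists of dicts.

orders = [
    {"id": 1, "region": "east", "amount": 100, "notes": "rush order"},
    {"id": 2, "region": "west", "amount": 200, "notes": ""},
    {"id": 3, "region": "east", "amount": 150, "notes": "gift wrap"},
]

# Horizontal partitioning: split ROWS by a predicate (here, region),
# so each partition holds complete rows for a subset of the data.
east = [r for r in orders if r["region"] == "east"]
west = [r for r in orders if r["region"] == "west"]

# Vertical partitioning: split COLUMNS, keeping the key in both pieces
# so the original rows can be reassembled by joining on "id".
hot_cols = [{"id": r["id"], "amount": r["amount"]} for r in orders]
cold_cols = [{"id": r["id"], "notes": r["notes"]} for r in orders]

# No rows are lost by the horizontal split.
assert len(east) + len(west) == len(orders)
```

Vertical partitioning pays off when queries touch only the frequently accessed ("hot") columns, since less data is read per scan; horizontal partitioning pays off when queries filter on the partitioning predicate.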

Review Questions

1. Explain the role of shared everything and shared nothing architecture in big data. SPPU : April-18 (In Sem), Marks 5
2. What is data warehouse ? Explain design and architecture of data warehouse. SPPU : May-18 (End Sem), Marks 6
3. Explain shared-everything and shared-nothing architectures in detail with respect to big data. SPPU : April-19 (In Sem), Marks 5
4. List and explain choices for re-engineering the data warehouse. SPPU : April-19 (In Sem), Marks 5
5. Discuss the processing complexities associated with big data. SPPU : April-19 (In Sem), Marks 5
6. What are the pitfalls of data warehouse ? Why are companies shifting to big data using Hadoop ? SPPU : April-20 (In Sem), Marks 4
7. Draw and explain big data processing architecture with technologies used at each stage of big data processing. SPPU : April-20 (In Sem), Marks 6


1.7 Big Data Learning Approaches SPPU:April-18, 20
. Data is a boon for machine learning systems. The more data a system receives, the
more it learns to function better for businesses. Hence, using machine learning for
big data analytics happens to be a logical step for companies to maximize the
potential of big data adoption.
Big data refers to extremely large sets of structured and unstructured data that
cannot be handled with traditional methods. Big data analytics can make sense of
the data by uncovering trends and patterns. Machine learning can accelerate this
process with the help of decision-making algorithms. It can categorize the
incoming data, recognize patterns and translate the data into insights helpful for
business operations.
Machine learning algorithms are useful for collecting, analyzing and integrating
data for large organizations. They can be implemented in all elements of big data
operation, including data labeling and segmentation, data analytics and scenario
simulation.
Machine Learning (ML) is considered a fundamental and vital component of data analytics. In fact, ML is predicted to be one of the main drivers of the big data revolution, owing to its ability to learn from data and provide data-driven insights, decisions and predictions.
Machine learning algorithms can figure out how to perform important tasks by generalizing from examples.
Machine learning provides business insight and intelligence. Decision makers are
provided with greater insights into their organizations. This adaptive technology is
being used by global enterprises to gain a competitive edge.
Supervised and unsupervised learning are the different types of machine learning
methods.




Machine learning methods :
- Supervised learning
  - Classification : k-NN, Naive Bayes, support vector machines, decision trees, neural networks
  - Regression : linear regression, lasso regression, decision trees, neural networks
- Unsupervised learning : k-means, k-medoid, DBSCAN, CLARA, OPTICS, neural networks

Fig. 1.7.1 Machine learning methods
Supervised learning is the machine learning task of inferring a function from supervised training data. The training data consist of a set of training examples. The task of the supervised learner is to predict the output behavior of a system for any set of input values, after an initial training phase. Supervised learning is also called classification.

Unsupervised learning algorithms aim to learn rapidly and can be used in real-time. Unsupervised learning is frequently employed for data clustering, feature extraction etc. Unsupervised learning is also called clustering.
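The contrast between the two learning styles can be made concrete with tiny from-scratch sketches: a 1-nearest-neighbour classifier (supervised, learns from labelled examples) and a minimal k-means clusterer (unsupervised, finds structure in unlabelled points). The data points are invented for illustration:

```python
# Supervised vs. unsupervised learning in miniature, on 2-D points.

def nn_classify(train, point):
    """Supervised: predict the label of the closest labelled training point."""
    closest = min(train, key=lambda t: (t[0][0] - point[0]) ** 2 +
                                       (t[0][1] - point[1]) ** 2)
    return closest[1]

def kmeans(points, centers, iters=10):
    """Unsupervised: move k centers toward the mean of their assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda j: (centers[j][0] - p[0]) ** 2 +
                                  (centers[j][1] - p[1]) ** 2)
            clusters[i].append(p)
        centers = [
            ((sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
             if c else centers[i])
            for i, c in enumerate(clusters)
        ]
    return centers

# Supervised: labelled examples -> predict a label for a new point.
labelled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
print(nn_classify(labelled, (5, 6)))   # -> b

# Unsupervised: unlabelled points -> discover two cluster centers.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(kmeans(points, centers=[(0, 0), (8, 8)]))
```

The supervised model needs the labels "a" and "b" during training; the unsupervised one is given only raw points and recovers the two groupings on its own.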
Review Questions
1. Enlist the impact of learning approaches in big data. Explain different kinds of learning approaches. SPPU : April-18 (In Sem), Marks 5
2. Explain different learning approaches in big data. Explain with example. SPPU : April-20 (In Sem), Marks 6

1.8 Data Science: The Big Picture SPPU Dec.-19


The field of data science is fundamentally about organizing and using data to
provide insights for human decision-making. Developing the capability to glean
insights from data has become crucial for start-ups and fortune 500 companies
alike.
Many organizations have been collecting unprecedented amounts of data from both physical sensors and the online activity of millions of people, but a pile of unorganized data will not yield insights on its own.

This is where data scientists come into play, by cleaning up and organizing data to make it suitable for analysis. They also build the statistical models necessary for analyzing data to reveal notable patterns or trends.

Several key types of data analysis :
a) Descriptive analytics aims to gain insight into either current or historical data trends.
b) Predictive analytics looks to gain insight into future unknowns by using the best available data to make predictions.
c) Prescriptive analytics can recommend what humans should do given the available data insights.
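The difference between descriptive and predictive analytics can be shown on a made-up monthly sales series: descriptive analysis summarises what happened, while predictive analysis fits a least-squares trend line and extrapolates one step ahead. The numbers below are illustrative only:

```python
# Descriptive vs. predictive analytics on an invented sales series.

sales = [100, 110, 125, 130, 145]          # past monthly sales (illustrative)
months = list(range(len(sales)))

# Descriptive: summarise historical data.
mean_sales = sum(sales) / len(sales)

# Predictive: fit y = a + b*x by ordinary least squares, then forecast.
n = len(sales)
mx = sum(months) / n
my = mean_sales
b = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / \
    sum((x - mx) ** 2 for x in months)
a = my - b * mx
forecast_next = a + b * n                  # predicted sales for the next month

print(round(mean_sales, 1), round(forecast_next, 1))   # -> 122.0 155.0
```

Prescriptive analytics would go one step further, e.g. recommending a stock level given that forecast, which requires a decision rule on top of the prediction.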

1.8.1 Relation between AI and Machine Learning


The emergence of modern AI based on machine learning has significantly boosted predictive and prescriptive analytics.
AI refers to a set of tools that can automate machine actions in ways that mimic intelligent behaviors. A small sample of such applications might include :
a) Visually identifying cats and dogs in social media photos.
b) Translating between different languages in the text found on websites.
c) Detecting possible signs of cancer in patients' X-ray images.
Machine learning in modern AI systems
Most modern AI is based on machine learning: a category of computer algorithms that can automatically learn from data. Instead of relying on humans to program each step, ML models train on large datasets to identify notable patterns within the data and make their own predictions based on that information.
They can then apply the lessons learned from their training datasets to analyzing completely new and unfamiliar datasets in the real world.
The importance of the training data means that ML performance depends greatly upon having access to large and diverse datasets of high quality.
For example, a machine learning model that trains to recognize dogs by only looking at 100 images of Siberian Huskies is unlikely to perform well when suddenly tasked with identifying tens of thousands of images from a diverse array of dog breeds.
ML models can follow several different approaches :
a) Supervised learning relies heavily upon hand-labeled training datasets and is the most common type of machine learning.
b) Unsupervised learning sifts through unlabelled data to try and find unusual patterns that might escape the human eye.
c) Reinforcement learning uses trial and error to learn from mistakes and get closer to achieving a specific goal.

Here is just one example of how data science can intersect with AI based on machine learning. Let us assume that an Internet search engine company wants to provide and monetize the most relevant online searches in response to the query :

"Allergy medicine for kids."


Data scientists help collect and organize large datasets containing millions of user search results related to allergy medication for kids. Then, they work with software developers and engineers to build machine learning models that learn from these datasets.
Through training, machine learning models can identify user preferences for
various search results, such as information about what allergy medications come in
the form of syrups and chewable tablets. This helps to continuously update the
search engine, so that it delivers more relevant results and ranks them higher.
The search and click trends identified by the machine learning models also provide information about people's medical needs and shopping habits, such as certain allergy medicine brands being more popular among families in a specific geographic area at a certain time of year.
Data scientists analyze these trends to find business insights that they can share
with corporate leaders and online advertisers.

1.8.2 Data Mining and Big Data Analytics


Data mining refers to extracting or mining knowledge from large amounts of data. It is the process of discovering interesting patterns or knowledge from a large amount of data stored either in databases, data warehouses or other information repositories.
It is the computational process of discovering patterns in huge data sets involving methods at the intersection of AI, machine learning, statistics and database systems.
To make predictions, predictive mining tasks perform inference on the current data. Predictive analysis provides answers to queries about the future, using historical data as the chief basis for decisions.
It involves the supervised learning functions used for the prediction of the target value. The methods that fall under this mining category are classification, time-series analysis and regression.
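One of the predictive mining tasks named above, time-series analysis, can be sketched with the simplest possible forecaster: predicting the next value as the moving average of the last k observations. The demand series below is invented for illustration:

```python
# A minimal time-series forecast: the next value is estimated as the
# mean of the last k observations (a simple moving average).

def moving_average_forecast(series, k=3):
    """Predict the next point as the mean of the last k observations."""
    window = series[-k:]
    return sum(window) / len(window)

demand = [20, 22, 21, 25, 24, 27]
print(moving_average_forecast(demand, k=3))   # mean of the last 3 values
```

Real time-series methods (exponential smoothing, ARIMA and so on) weight history more carefully, but all share this shape: infer a model from past values, then project it forward.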

Descriptive analytics is the conventional form of business intelligence and data analysis. It seeks to provide a depiction or "summary view" of facts and figures in an understandable format, to either inform or prepare data for further analysis.
Descriptive analytics helps organizations to understand what happened in the past. It helps to understand the relationship between product and customers.

Big data analytics is the often complex process of examining big data to uncover
information such as hidden patterns, correlations, market trends and customer
preferences that can help organizations make informed business decisions.

Review Question
1. Explain machine learning approaches in big data. SPPU Dec.-19 (End Sem), Marks 6

1.9 Multiple Choice Questions

Q.1 Three characteristics of big data are ______.
a) volume, velocity, variety    b) value, variable, variance
c) volume, vanish, various      d) velocity, volume, vault

Q.2 Data is a collection of data objects and their ______.
a) information    b) attributes
c) characteristics    d) none

Q.3 In big data, ______ refers to heterogeneous sources and the nature of data, both structured and unstructured.
a) volume    b) variety
c) velocity    d) all of these

Q.4 Various types of data analytics are ______.
a) descriptive model    b) predictive model
c) prescriptive model    d) all of these

Q.5 Machine learning is inherently a ______ field.
a) interdisciplinary    b) multidisciplinary
c) single    d) none
