Chapter1 FDS

The document discusses the key concepts of data science and big data. It defines data science as using methods to analyze massive amounts of data to extract knowledge. It describes the relationship between big data and data science, noting data science evolved from statistics. The document outlines different types of data in big data sets including structured, unstructured, natural language, machine-generated, graph-based, audio/video/images, and streaming data. It provides examples to illustrate each type of data.

CS – 354: Foundations of Data Science
Dr. Reena Bharathi
Syllabus
• C:\Users\RRB\FDS_TY\T.Y.B.Sc. (Computer Science)_07.07.2021.pdf

Foundations of data Science ==> Dr. Reena Bharathi
Introduction to Data Science
• Data science is what makes us humans what we are today.
• Abilities of our brain
– See connections
– Draw conclusions from facts
– Learn from past experience
• Limitations of our biological body
– Performing huge amounts of raw computation
– Storing the huge amount of data that we have captured so far, due to our curiosity

• Hence the need for help from machines that can
– Help to recognize patterns
– Create connections
– Provide us with ready and correct answers, as per our need

WHAT IS DATA SCIENCE?

What is Big Data
• Massive sets of unstructured/semi-structured data from Web traffic, social media, sensors, etc.
• Petabytes, exabytes of data
• In the last minute there were ……
– 204 million emails sent
– 100,000 tweets
– 61,000 hours of music listened to on Pandora
– 6 million views and 277,000 Facebook logins
– 20 million photo views
– 2+ million Google searches
– 3 million uploads on Flickr
Big Data Sources
• Web access
• Human-machine communication
• Social media
• RFID tags
• Human-human communication
• Machine-to-machine communication (IoT)
• Sensors
Big Data: 3 V’s
Variety (Complexity)
• Relational data (tables/transactions/legacy data)
• Text data (Web)
• Semi-structured data (XML)
• Graph data
– Social network, Semantic Web (RDF), …
• Streaming data
– You can only scan the data once
• A single application can be generating/collecting many types of data
• Big public data (online, weather, finance, etc.)
• To extract knowledge, all these types of data need to be linked together
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions mean missed opportunities
• Examples
– E-promotions: based on your current location, your purchase history and what you like, send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction
Real-time/Fast Data
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
Uncertainty of Data (Veracity)
4 V’s of big data: Volume, Variety, Velocity, Veracity
What is Data Science
• Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains.
• The relationship between big data and data science is like the relationship between crude oil and an oil refinery.
• Data science and big data evolved from statistics and traditional data management, but are now considered to be distinct disciplines.

• Data science is an evolutionary extension of statistics, capable of handling massive amounts of data.
• Main differences between a data scientist and a statistician:
– A data scientist is able to work with big data sets and has experience in machine learning (ML), computing and algorithm building.
– Tools used by data scientists: Hadoop, Pig, Python, R, Java etc.

Significance of data science and Big data
• Both are used in almost all commercial and non-commercial settings.
• Commercial establishments use them to gain insights into their customers, processes, staff, products etc.
• They are used for providing personalized offerings to customers, better user experience etc. (Google AdSense collects data from internet users to match relevant commercial messages to the person browsing the net.)
• HR professionals use people analytics and text mining to screen candidates, monitor the mood of employees, detect and study informal networks among coworkers etc.
• Financial institutions use data science to predict the stock market, determine the risk of lending money, find new ways to attract more customers etc.
• Government organizations use it to gain insights into projects, optimize project funding, and monitor millions of individuals by collecting and distilling information from social media and other data sources.

• NGOs use it to raise money and defend their causes. Eg : WWF (World Wildlife Fund) employs data scientists to increase the effectiveness of its fund-raising efforts.
• Educational institutions use it to improve the study experience of students. Massive Open Online Courses (MOOCs) produce lots of data that can be used to check how this type of learning can complement traditional classes.

Different types of data in Big data sets
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming

• Structured Data :
– Depends on a data model and resides in a fixed field within a record.
– Easy to store and retrieve. Eg : Relational tables
• Unstructured Data :
– Not easy to fit into a data model, since the content is context-specific or varying.
– Eg : Regular email.
– Email is also an example of natural language data.
– Even though email contains structured elements like sender, subject etc, it is difficult to analyze the contents written in the email body.
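The email example can be sketched with Python's standard `email` module. The message below is invented for illustration; the point is only that the headers land in fixed, named fields while the body stays free text.

```python
from email import message_from_string

# Hypothetical message, invented for illustration.
raw = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the numbers look better than last quarter, but I am
still worried about the delay in shipping. Can we talk tomorrow?
"""

msg = message_from_string(raw)

# Structured part: fixed, named fields that fit a data model.
headers = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
}

# Unstructured part: free text whose meaning depends on context.
body = msg.get_payload()

print(headers)
print(len(body.split()), "words of free text in the body")
```

The headers could go straight into a relational table; the body would need natural language techniques before it can be analyzed.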

• Natural language:
– A special type of unstructured data
– Requires knowledge of specific data science techniques and linguistics to process.
• Machine-generated data:
– Data created automatically, without human intervention.
– A major data source for big data sets.
– Needs highly scalable tools for analytics, due to its volume and speed (velocity)
– Eg : Web server logs, call detail records, network event logs, sensor data etc
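As a small illustration of machine-generated data, the sketch below parses a few web server log lines with a regular expression. The log lines are invented, and the Common Log Format layout is an assumption made for the example; real servers may be configured differently.

```python
import re
from collections import Counter

# Hypothetical log lines in the Common Log Format, invented for illustration.
log_lines = [
    '192.0.2.1 - - [10/Oct/2021:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.7 - - [10/Oct/2021:13:55:40 +0000] "GET /missing.html HTTP/1.1" 404 512',
    '192.0.2.1 - - [10/Oct/2021:13:56:02 +0000] "POST /login HTTP/1.1" 200 128',
]

# One named group per field we want to keep.
pattern = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

# Turn each matching line into a record (a dict of field name -> text).
records = [m.groupdict() for m in map(pattern.match, log_lines) if m]

# A first, simple analysis: how many responses per status code?
status_counts = Counter(r["status"] for r in records)
print(status_counts)
```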

• Graph-based or network data:
– Represents pair-wise relationships between objects.
– Focuses on the relationships or adjacency of objects.
– Uses nodes, edges and properties to represent and store graph data.
– Eg : data on social media websites : follower list on Twitter, friends list on Facebook.
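A follower list can be sketched as a small directed graph held in plain dictionaries. The user names and the `city` property are invented for illustration; this is not the storage format of any particular graph database.

```python
# Hypothetical "follows" relationships, invented for illustration.
# Each pair is an edge: (follower, followed).
edges = [("asha", "ravi"), ("ravi", "meena"), ("asha", "meena"), ("meena", "asha")]

# Adjacency list: each node maps to the set of nodes it points to.
following = {}
for follower, followed in edges:
    following.setdefault(follower, set()).add(followed)
    following.setdefault(followed, set())

# Reverse the edges to get each user's follower list.
followers = {user: set() for user in following}
for follower, followed_set in following.items():
    for followed in followed_set:
        followers[followed].add(follower)

# Properties can be attached to nodes (or edges) as plain dictionaries.
properties = {"asha": {"city": "Pune"}}

print(sorted(followers["meena"]))
```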

• Audio, Image, Video:
– Poses challenges to a data scientist.
– Eg : Identifying objects within an image is difficult.
– Eg : An application that is capable of learning how to play video games uses deep learning to interpret the video screen it takes as input.
• Streaming data:
– Data flows into the system when an event happens, instead of being loaded into a data store in a batch.
– Eg : Live sports events, live stock market data.
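The difference between streaming and batch processing can be sketched with a generator that updates a statistic as each event arrives and never stores the history. The price ticks are invented for illustration.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one event at a time."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: only the count and the current mean are kept.
        mean += (value - mean) / count
        yield mean


def price_ticks():
    """Hypothetical stream of stock prices, invented for illustration."""
    for price in [100.0, 102.0, 101.0, 105.0]:
        yield price


means = list(running_mean(price_ticks()))
print(means[-1])
```

Each value is scanned exactly once, which is the constraint a streaming source imposes.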

Data science Process (Life Cycle)
• The data science process is made up of six steps:
– Setting the research goal
– Retrieving data
– Data preparation
– Data exploration
– Data modeling
– Presentation and automation.

• Setting the research goal:
– Data science is mostly applied in the context of an organization
– The first step in a data science project is to prepare a project charter, containing
• Information regarding what is to be researched
• The benefit to the organization from the research
• The data and resources needed
• A timetable
• Deliverables

• Retrieving data
– Collect the data.
– Check the existence and quality of the data, and access to it
– Data can be obtained from different sources in different forms
• Data preparation
– Enhance the quality of the data and prepare it for further processing
– Three subphases
• Data cleansing: remove false values, inconsistencies etc
• Data integration: combine data from multiple sources
• Data transformation: ensure that the data is in a suitable form to be used by our analytical models.
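The three subphases of data preparation can be sketched on a toy data set using only the Python standard library. All records, field names and the valid age range (0 to 120) are assumptions made for illustration.

```python
# Hypothetical records from two sources, invented for illustration.
customers = [
    {"id": 1, "name": "Asha", "age": "34"},
    {"id": 2, "name": "Ravi", "age": "-5"},    # impossible value
    {"id": 3, "name": "Meena", "age": "41"},
    {"id": 3, "name": "Meena", "age": "41"},   # duplicate row
]
orders = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 3, "amount": 90.0},
    {"customer_id": 3, "amount": 60.0},
]

# 1. Cleansing: drop duplicates and rows with impossible ages.
seen, clean = set(), []
for row in customers:
    age = int(row["age"])
    if row["id"] in seen or not 0 <= age <= 120:
        continue
    seen.add(row["id"])
    clean.append({"id": row["id"], "name": row["name"], "age": age})

# 2. Integration: combine with the orders source on the customer id.
totals = {}
for order in orders:
    key = order["customer_id"]
    totals[key] = totals.get(key, 0.0) + order["amount"]
combined = [dict(row, total_spent=totals.get(row["id"], 0.0)) for row in clean]

# 3. Transformation: rescale age to the range [0, 1] for a model.
ages = [row["age"] for row in combined]
lo, hi = min(ages), max(ages)
for row in combined:
    row["age_scaled"] = (row["age"] - lo) / (hi - lo) if hi > lo else 0.0

print(combined)
```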

• Data exploration
– Build a deeper understanding of your data
• Understand the relationships between variables in the data
• Understand the distribution of the data
• Check for any outliers
– Also called exploratory data analysis.
• Data modeling / model building
– Use models, domain knowledge and insights about the data to answer research questions
– The technique is selected from the fields of statistics, machine learning and operations research
– Model building is done in an iterative manner: selecting variables for the model, executing the model and performing model diagnostics.
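A minimal sketch of exploration followed by one modeling iteration, using only the standard library. The data is invented; the 1.5 × IQR rule is one common convention for flagging outliers, and whether a flagged point should really be dropped is a judgment call that depends on the domain.

```python
import statistics

# Hypothetical data, invented for illustration: ad spend (x) and sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 60.0]

# Exploration: look at the distribution of y.
print("mean:", statistics.mean(y), "median:", statistics.median(y))

# Exploration: flag outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(y, n=4)
iqr = q3 - q1
outliers = [v for v in y if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# Keep the remaining points for modeling.
pairs = [(a, b) for a, b in zip(x, y) if b not in outliers]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Modeling: execute the model on the selected variable.
slope, intercept = fit_line(pairs)

# Diagnostics: residuals show how far each point is from the fitted line.
residuals = [b - (slope * a + intercept) for a, b in pairs]
print(outliers, round(slope, 2), max(abs(r) for r in residuals))
```

If the residuals were large or showed a pattern, the next iteration would revisit the variables or the choice of model.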
• Presentation & automation
– Present the results to the business.
– Results can be presented in the form of presentations, charts, research reports etc
– The business will then decide whether to design an operational process that uses the outcomes from the model.

Data Scientist’s Toolbox
• Many big data tools and frameworks are available.
• The big data ecosystem can be grouped by technology, for easier understanding.
• The big data ecosystem consists of the following:
– Distributed file system
– Distributed programming
– Machine learning
– Data integration tools
– Security
– Service programming
– System deployment tools
– Benchmarking
– Scheduling
– NoSQL & NewSQL databases

• Distributed file systems:
– Run on multiple servers at once
• Distributed file systems have significant advantages:
– They can store files larger than any one computer disk.
– Files get automatically replicated across multiple servers for redundancy or parallel operations, while hiding the complexity of doing so from the user.
– The system scales easily: you’re no longer bound by the memory or storage restrictions of a single server (horizontal scaling).
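The ideas of splitting a file into blocks and replicating each block can be sketched as a toy in-memory model. The round-robin placement, the server names and both function names are assumptions made for illustration; this is not how any particular distributed file system places its blocks.

```python
def place_blocks(data, block_size, servers, replicas):
    """Split data into blocks and assign each block to `replicas` distinct servers."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for index in range(len(blocks)):
        # Round-robin: consecutive servers hold the copies of one block.
        placement[index] = [servers[(index + r) % len(servers)] for r in range(replicas)]
    return blocks, placement


def read_file(blocks, placement, failed):
    """Reassemble the file, as long as every block has a replica on a live server."""
    out = b""
    for index, block in enumerate(blocks):
        if all(server in failed for server in placement[index]):
            raise IOError(f"block {index} is unavailable")
        out += block
    return out


servers = ["s0", "s1", "s2", "s3"]  # hypothetical server names
blocks, placement = place_blocks(b"abcdefghij", 4, servers, replicas=2)

# One server failing does not lose the file, because each block has two copies.
restored = read_file(blocks, placement, failed={"s0"})
print(placement, restored)
```

Adding a server to the list increases capacity without changing the calling code, which is the essence of horizontal scaling.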
• Distributed programming
– One important aspect of working on a DFS: data is not moved towards the program; instead, the program is moved to the data.
– A normal general-purpose programming language such as C, Python, or Java doesn’t provide enough support for the complexities that come with distributed programming, such as restarting jobs that have failed, tracking the results from the different subprocesses etc
– Many open-source frameworks are available to work with distributed data and to overcome these challenges.
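What such a framework takes care of can be sketched as a single-process simulation of a map/shuffle/reduce word count, with a small retry helper standing in for job restarts. The chunks and all function names are invented for illustration; a real framework would run the map tasks on the servers that hold each chunk.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def run_with_retry(task, arg, attempts=3):
    """Re-run a failed task, as a framework would restart a failed job."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == attempts - 1:
                raise


# Hypothetical chunks, as if each were stored on a different server.
chunks = ["big data big", "data science", "big science"]

mapped = []
for chunk in chunks:
    mapped.extend(run_with_retry(map_phase, chunk))

counts = reduce_phase(shuffle(mapped))
print(counts)
```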

• Data integration frameworks:
– For integrating data from different sources and moving data from one source to another.
– Eg : Apache Sqoop, Apache Flume etc.
– The process is similar to ETL (extract, transform, load) in a traditional data warehouse.
• Machine learning frameworks:
– Used to extract the coveted insights from big data sets.
– Machine learning, statistics and mathematics are applied.
– Most popular ML library in Python: Scikit-learn
– Other libraries: PyBrain for neural networks, NLTK for natural language, Pylearn2 and TensorFlow for deep learning.

• NoSQL databases :
– NoSQL: Not Only SQL
– For large amounts of data
– Different types of NoSQL databases
• Column databases
• Document stores
• Key-value stores
• Graph databases
• NewSQL combines the scalability of NoSQL databases with the advantages of an RDBMS
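The four NoSQL data models can be contrasted by writing the same small data set in each shape. The sketch below uses plain Python structures only; the records and field names are invented, and none of this is the API of a real database.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:1": json.dumps({"name": "Asha", "city": "Pune"})}
user_1 = json.loads(key_value["user:1"])

# Document store: nested, self-describing records that can be filtered by field.
documents = [
    {"id": 1, "name": "Asha", "city": "Pune", "orders": [{"item": "book", "qty": 2}]},
    {"id": 2, "name": "Ravi", "city": "Mumbai", "orders": []},
]
in_pune = [doc["name"] for doc in documents if doc["city"] == "Pune"]

# Column database: the values of one column are stored together,
# so reading a single column does not touch the others.
columns = {"name": ["Asha", "Ravi"], "city": ["Pune", "Mumbai"]}
cities = set(columns["city"])

# Graph database: nodes with properties, plus labeled edges between them.
nodes = {1: {"name": "Asha"}, 2: {"name": "Ravi"}}
edges = [(1, "FOLLOWS", 2)]
followed_by_asha = [nodes[dst]["name"] for src, label, dst in edges if src == 1]

print(user_1, in_pune, cities, followed_by_asha)
```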

• Scheduling tools:
– Automate repetitive tasks and trigger jobs based on events, such as a new file or folder being added.
– Eg : Scheduling a MapReduce task every time a file is added to the DFS.
• Benchmarking tools
– Help optimize a big data installation by providing standardized profiling suites.
– A profiling suite is taken from a representative set of big data jobs.
– Involves benchmarking and optimizing the big data infrastructure and configurations.
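Event-based triggering can be sketched with a simple polling check on a local directory: whenever a new file appears, a job is run on it. The file name, the word-count `job` and the use of a temporary directory are assumptions made for illustration; a real scheduler would watch the DFS itself.

```python
import os
import tempfile


def new_files(directory, already_seen):
    """Return the files that appeared in `directory` since the last check."""
    current = set(os.listdir(directory))
    added = sorted(current - already_seen)
    already_seen |= current  # remember them for the next check
    return added


def job(path):
    """Placeholder job: count the words in the new file."""
    with open(path) as handle:
        return len(handle.read().split())


with tempfile.TemporaryDirectory() as watched:
    seen = set()
    results = {}

    # Event: a new file is added to the watched directory.
    with open(os.path.join(watched, "part-0001.txt"), "w") as handle:
        handle.write("big data big insights")

    # First check finds the file and triggers the job.
    for name in new_files(watched, seen):
        results[name] = job(os.path.join(watched, name))

    # Second check finds nothing new, so no job is triggered.
    second_check = new_files(watched, seen)

print(results, second_check)
```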

• System deployment:
– System deployment tools assist in deploying new applications into the big data cluster.
– They automate the installation and configuration of big data components
• Service programming
– Service tools are used to expose big data applications to other applications as a service.
– Eg : a REST (Representational State Transfer) service
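Exposing results as a REST service can be sketched with Python's built-in `http.server`. The `/predictions/<id>` URL layout, the `PREDICTIONS` table and the `get_prediction` helper are all hypothetical names chosen for this example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model output, invented for illustration.
PREDICTIONS = {"customer-1": 0.82, "customer-2": 0.17}


def get_prediction(path):
    """Map a URL path like /predictions/customer-1 to (status, body)."""
    parts = path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "predictions" and parts[1] in PREDICTIONS:
        return 200, {"id": parts[1], "score": PREDICTIONS[parts[1]]}
    return 404, {"error": "not found"}


class Handler(BaseHTTPRequestHandler):
    """Answer GET requests with a JSON representation of the resource."""

    def do_GET(self):
        status, body = get_prediction(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


# To serve requests locally (blocks until interrupted):
# HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()

print(get_prediction("/predictions/customer-1"))
```

Keeping the lookup in a plain function separate from the handler makes the logic easy to check without starting a server.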

Types of Data
• Structured
• Unstructured
• Semi-structured

Problems with Unstructured data
• Data keeps expanding in volume
• Compilation and organization are time consuming
– “It was found that a female with a height between 65
inches and 67 inches had an IQ of 125–130. However,
it was not clear looking at a person shorter or taller
than this observation if the change in IQ score could
be different, and, even if it was, it could not be
possibly concluded that the change was solely due to
the difference in one’s height.”
• Uncertainty in the quality of data
• Cannot be analyzed using conventional systems

Data Sources
• A data source is the location from where the raw data is obtained.
• Data collection is the process of acquiring, collecting, extracting and storing huge amounts of data.
• Different data sources
– Open data sources
– Social media data sources
– Multimodal data sources
– Standard data sets

• Open data sources
– Open data sets are available to the public
– Local/federal governments, NGOs and academic communities lead in providing open data sets.
– Open Government Data Platform India: a platform supporting the open data initiative of the Government of India (GOI)
– List of principles associated with open data
• Public: agencies must adopt a presumption in favour of openness, to the extent permitted by law and subject to privacy, confidentiality, security etc
• Accessible: data is made available in convenient, modifiable and open formats that can be retrieved, downloaded, indexed and searched. Data should be provided in multiple formats for consumption.

• Described: data should be described fully, so that consumers of the data have sufficient information to understand its strengths, weaknesses, analytical limitations, security requirements etc. Metadata should be available.
• Reusable: data should be available under an open license, with no restrictions on its use.
• Complete: open data is published in primary form, as collected from the source, with the finest level of granularity allowed. Derived or aggregate data must reference the primary data.
• Timely: data should be made available as quickly as necessary, so as to preserve its value.
• Managed post-release: a point of contact must be designated to assist with data use and to respond to complaints about adherence to these open data requirements.
• Social media sources
– Social media are interactive web-based applications.
– They allow the creation, sharing and exchange of information, ideas etc via virtual communication networks.
– Social media data is useful for research and marketing purposes.
– APIs are provided by social media companies to facilitate access to vast amounts of data.
– An API is a set of rules and methods for requesting and sending data.
– Eg : Twitter API, Facebook API, Instagram API, YouTube API etc
– Yelp.com, a popular crowd-sourced review platform for local businesses, has released data sets that have been used in a wide range of research: NLP, graph mining etc

• Multimodal data sets
– Huge data sets with different formats and data types, e.g. from IoT
– Need to collect and explore different forms of data (multimodal) and multimedia data.
• Standard data sets
– Simple data sets: tabular data, spreadsheets etc.

Data Formats
• Integers
• Floats
• Text data (strings)
• Dense numerical arrays (array storage)
• Compressed data
• CSV data (comma-separated values): headers, quotes, non-data rows, comments
• HTML files
• JSON (JavaScript Object Notation): key-value pairs, ordered lists of values

• XML files: eXtensible Markup Language
• Tar files: Tape Archive
• Gzip files
• Zip files
• Image files
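A few of the formats above can be read with the Python standard library alone. The sketch below handles a CSV text with a comment row, a blank non-data row and quoted fields, then a JSON object, then a gzip round trip. All sample content is invented for illustration.

```python
import csv
import gzip
import json

# Hypothetical CSV text with a comment, a header, quoted fields and a blank row.
raw_csv = '''# exported for illustration (comment row)
name,city,note
"Rao, Asha",Pune,"teaches ""FDS"""

Ravi,Mumbai,student
'''

# Skip comment rows and blank non-data rows before handing lines to the reader.
lines = [
    line for line in raw_csv.splitlines()
    if line.strip() and not line.startswith("#")
]
rows = list(csv.DictReader(lines))  # the header row supplies the field names

# JSON: an object of key-value pairs, one of which holds an ordered list.
record = json.loads('{"course": "CS-354", "topics": ["big data", "data science"]}')

# Gzip: compress the CSV text and restore it unchanged.
compressed = gzip.compress(raw_csv.encode("utf-8"))
restored = gzip.decompress(compressed).decode("utf-8")

print(rows[0], record["topics"][0], restored == raw_csv)
```

The quotes let a field contain the delimiter itself (`Rao, Asha`), and a doubled quote inside a quoted field stands for one literal quote character.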
