INTRODUCTION TO BIG DATA ANALYTICS
Utkarsh Sharma
Asst. Prof. (CSE)
Jaypee University Of Engineering & Technology
Big Data Overview
Several industries have led the way in developing their ability to
gather and exploit data:
• Credit card companies monitor every purchase their customers make and
can identify fraudulent purchases with a high degree of accuracy using
rules derived by processing billions of transactions.
• Mobile phone companies analyze subscribers' calling patterns to
determine whether a rival network is offering an attractive promotion that might
cause the subscriber to defect.
• For companies such as LinkedIn and Facebook, data itself is their primary
product.
Big Data Overview
Three attributes stand out as defining Big Data characteristics:
• Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows
and millions of columns.
• Complexity of data types and structures: Big Data reflects the variety of new data sources, formats,
and structures, including digital traces being left on the web and other digital repositories for
subsequent analysis.
• Speed of new data creation and growth: Big Data can describe high velocity data, with rapid data
ingestion and near real time analysis.
Another definition of Big Data comes from the McKinsey Global Institute's 2011 report:
• Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable
insights that unlock new sources of business value.
McKinsey's definition of Big Data implies that organizations will need new data architectures and
analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the
new role of the data scientist.
Data Deluge
An Example (Genomic Sequencing)
While the data has grown, the cost to perform this work has fallen dramatically. The cost to sequence one
human genome fell from $100 million in 2001 to $10,000 in 2011, and it continues to drop. Now,
websites such as 23andMe offer genotyping for less than $100.
Data Structures
• Big Data can come in multiple forms, including structured and
unstructured data such as financial data, text files, multimedia
files, and genetic mappings.
• Most Big Data is unstructured or semi-structured in
nature, which requires different techniques and tools to process
and analyze it.
• Distributed computing environments and massively parallel
processing (MPP) architectures that enable parallelized data
ingest and analysis are the preferred approach to process such
complex data.
Data Structures
Structured Data
• Data containing a defined data type, format, and structure (that is, transaction data, online analytical
processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
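As a minimal illustration (the inline CSV contents and column names below are hypothetical), structured data can be loaded directly into a typed, tabular form because its schema is known in advance:

```python
import io
import pandas as pd

# Inline stand-in for a structured source (a CSV export, an RDBMS table, etc.).
csv_data = io.StringIO(
    "customer_id,timestamp,amount,category\n"
    "101,2021-03-01,49.99,electronics\n"
    "102,2021-03-02,12.50,grocery\n"
)

transactions = pd.read_csv(
    csv_data,
    dtype={"customer_id": "int64", "amount": "float64", "category": "string"},
    parse_dates=["timestamp"],
)
print(transactions.dtypes)  # every column has a defined type and format
```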
Semi-structured data
• Textual data files with a discernible pattern that enables parsing (such as Extensible Markup
Language [XML] data files that are self-describing and defined by an XML schema).
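A minimal parsing sketch, assuming a small hypothetical XML document whose tags describe its own structure:

```python
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1"><item>keyboard</item><price>49.99</price></order>
  <order id="2"><item>monitor</item><price>199.00</price></order>
</orders>
"""

root = ET.fromstring(doc)
for order in root.findall("order"):
    # The discernible pattern (consistent tags) lets us pull out fields reliably.
    print(order.get("id"), order.findtext("item"), order.findtext("price"))
```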
Quasi-structured data
• Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance,
web clickstream data that may contain inconsistencies in data values and formats).
• Consider the following example. A user attends the EMC World conference and subsequently runs
a Google search online to find information related to EMC and Data Science. This would produce a
URL such as https://www.google.com/#q=EMC+data+science
• After doing this search, the user may choose the second link, to read more about the headline "Data
Scientist - EMC Education, Training, and Certification." This brings the user to an emc.com site
focused on this topic and a new URL, https://education.emc.com/guest/campaign/data_science.aspx
• Arriving at this site, the user may decide to click to learn more about the process of becoming
certified in data science. The user chooses a link toward the top of the page on Certifications,
bringing the user to a new URL: https://education.emc.com/guest/certification/framework/stf/data_science.aspx
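A minimal sketch of how such clickstream URLs might be parsed, using the three URLs from the example above; note that the search terms may sit in the URL fragment or in the query string, a typical format inconsistency in quasi-structured data:

```python
from urllib.parse import urlparse, parse_qs

clickstream = [
    "https://www.google.com/#q=EMC+data+science",
    "https://education.emc.com/guest/campaign/data_science.aspx",
    "https://education.emc.com/guest/certification/framework/stf/data_science.aspx",
]

for url in clickstream:
    parts = urlparse(url)
    # Query terms may live in the fragment (as in Google's #q=...) or the
    # query string, so both must be checked -- a typical inconsistency.
    query = parse_qs(parts.fragment) or parse_qs(parts.query)
    print(parts.netloc, parts.path, query.get("q", []))
```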
Unstructured data
• Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
• All of these heterogeneous types of data structures created the need
for specialized data storage and retrieval techniques, such as
data warehouses and the analytics sandbox.
Data Warehouse
• A data warehouse is a central repository of information that can be analyzed to make more informed
decisions.
• Data flows into a data warehouse from transactional systems, relational databases, and other sources,
typically on a regular cadence.
• Business analysts, data engineers, data scientists, and decision makers access the data
through business intelligence (BI) tools, SQL clients, and other analytics applications.
Intro. to Data Warehouse
• The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data
warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of data. This
data helps analysts to take informed decisions in an organization.
• An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place whereas a Data Warehouse keeps historical data also.
• A data warehouses provides us generalized and consolidated data in multidimensional view.
Along with generalized and consolidated view of data, a data warehouses also provides us Online
Analytical Processing (OLAP) tools.
Understanding a Data Warehouse
• A data warehouse is a database that is kept separate from the organization's operational
database.
• A data warehouse is not updated frequently.
• It holds consolidated historical data, which helps the organization analyze its business.
• A data warehouse helps executives organize, understand, and use their data to make strategic
decisions.
• Data warehouse systems help integrate a diverse set of application systems.
• A data warehouse system supports consolidated historical data analysis.
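As a minimal sketch of the kind of consolidated, historical query a warehouse serves, the following uses SQLite as a stand-in for a real warehouse (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", 2019, 120.0), ("North", 2020, 150.0),
     ("South", 2019, 90.0), ("South", 2020, 110.0)],
)

# Consolidated view across the time dimension: revenue by region and year.
for row in conn.execute(
    "SELECT region, year, SUM(revenue) FROM sales GROUP BY region, year"
):
    print(row)
```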
Analytics sandbox
• A workspace in which data assets are gathered from multiple sources
and technologies for analysis.
• To lessen the performance burden of the analysis, the workspace may
use in-database processing and is considered to be owned by the
analysts rather than database administrators.
• Often, this workspace is created by using a sampling of the dataset
rather than the entire dataset.
• The sandbox may also reduce the proliferation of stove-piped, partial versions of
the true data that may have developed in individual business units.
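A minimal sketch of building a sandbox from a sample rather than the full dataset (the DataFrame here is synthetic):

```python
import numpy as np
import pandas as pd

full_dataset = pd.DataFrame({
    "customer_id": np.arange(1_000_000),
    "spend": np.random.default_rng(0).gamma(2.0, 50.0, 1_000_000),
})

# Pull a 1% sample into the analyst-owned workspace.
sandbox = full_dataset.sample(frac=0.01, random_state=42)
print(len(sandbox), "rows in the sandbox")
```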
Types of Data Repositories
Business Intelligence vs Data Science
Examples of Big Data Analytics
• As mentioned earlier, Big Data presents many opportunities to improve sales and marketing
analytics.
• An example of this is the U.S. retailer Target. After analyzing consumer purchasing behavior,
Target's statisticians determined that the retailer made a great deal of money from three main life-
event situations:
• Marriage, when people tend to buy many new products.
• Divorce, when people buy new products and change their spending habits.
• Pregnancy, when people have many new things to buy and an urgency to buy them.
• Target determined that the most lucrative of these life events is the third: pregnancy. Using
data collected from shoppers, Target was able to identify this fact and predict which of its shoppers
were pregnant. In one case, Target knew a female shopper was pregnant even before her family
knew.
Data Science Project Lifecycle
• 1. Obtain Data
• Skills required
• how to use MySQL, PostgreSQL or MongoDB
• 2. Scrub Data
• Skills required
• You will need scripting tools like Python or R to help you to scrub the data.
• 3. Explore Data
• Skills required
• If you are using Python: NumPy, Matplotlib, Pandas, or SciPy; if you are using R:
ggplot2 or the data-exploration Swiss Army knife dplyr. On top of that, you need knowledge
and skills in inferential statistics and data visualization.
• 4. Model Data
• Skills required
• In machine learning, you will need skills in both supervised and unsupervised algorithms.
• 5. Interpreting Data
• Skills required
• You will need strong business domain knowledge to present your findings in a way that
answers the business questions you set out to answer. (A minimal end-to-end sketch of steps 1-4 follows this list.)
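The sketch below runs steps 1-4 on a synthetic dataset; in practice the Obtain step would query MySQL, PostgreSQL, or MongoDB rather than construct a DataFrame in memory:

```python
import numpy as np
import pandas as pd

# 1. Obtain: stubbed out with an in-memory DataFrame.
rng = np.random.default_rng(1)
df = pd.DataFrame({"age": rng.integers(18, 70, 200).astype(float),
                   "income": rng.normal(50_000, 15_000, 200)})
df.loc[rng.choice(200, 10, replace=False), "age"] = np.nan  # inject dirty values

# 2. Scrub: impute the missing values.
df["age"] = df["age"].fillna(df["age"].median())

# 3. Explore: summary statistics and a quick correlation check.
print(df.describe())
print(df.corr())

# 4. Model: a one-line fit, e.g. income as a linear function of age.
slope, intercept = np.polyfit(df["age"], df["income"], deg=1)
print(f"income ~ {slope:.1f}*age + {intercept:.0f}")
```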
The Analytics Process
An analytics process contains all or some of the following phases:
• Business understanding: Identifying and understanding the business objectives
• Data Collection: Collection of data from different sources and its representation
in terms of its application.
• Data Preparation: Removing the unnecessary and unwanted data
• Data Modelling: Create a model to analyze the different relationships between
the objects.
• Data Evaluation: Evaluation and preparation of the analysis report
• Deployment: Finalizing the plan for deployment
Types of Analytics
On the basis of the problem description, four types of data analytics are used:
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Descriptive analytics : What is happening?
• This is the most common of all forms. In business it provides the analyst a view of
key metrics and measures within the business.
• Descriptive analytics juggles raw data from
multiple data sources to give valuable insights
into the past.
• However, these findings simply signal that something
is wrong or right, without explaining why.
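A minimal sketch of descriptive analytics, aggregating synthetic raw records into key metrics about the past:

```python
import pandas as pd

orders = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "revenue": [100.0, 150.0, 90.0, 60.0, 200.0],
})

# "What is happening?" -- totals, averages, and counts per month.
print(orders.groupby("month", sort=False)["revenue"].agg(["sum", "mean", "count"]))
```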
Diagnostic: Why is it happening?
• At this stage, historical data can be measured against other data to answer the question
of why something happened.
• Diagnostic analytics gives in-depth insights into a
particular problem.
• On assessment of the descriptive data, diagnostic
analytical tools will empower an analyst to drill down
and in so doing isolate the root-cause of a problem.
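A minimal drill-down sketch on synthetic data: February revenue dropped, so segmenting the change by channel helps isolate a likely root cause:

```python
import pandas as pd

feb = pd.DataFrame({
    "channel": ["web", "web", "store", "store"],
    "revenue": [40.0, 35.0, 10.0, 5.0],
    "prior_month_revenue": [45.0, 40.0, 30.0, 35.0],
})

feb["change"] = feb["revenue"] - feb["prior_month_revenue"]
# Drill down: which channel explains most of the decline?
print(feb.groupby("channel")["change"].sum().sort_values())
```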
Predictive: What is likely to happen?
• Predictive analytics tells what is likely to happen. It uses the findings
of descriptive and diagnostic analytics to detect clusters and
exceptions, and to predict future trends.
• Predictive models typically utilize
a variety of variable data to make
the prediction.
• Predictive analytics belongs to
advanced analytics types and brings
many advantages like sophisticated
analysis based on machine or deep
learning.
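A minimal predictive sketch using scikit-learn on synthetic data, predicting a binary outcome (say, churn) from usage features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))  # synthetic usage features
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```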
Prescriptive: What do I need to do?
• The purpose of prescriptive analytics is to literally prescribe what action to take to
eliminate a future problem or take full advantage of a promising trend.
• The prescriptive model utilizes an understanding of what has
happened, why it has happened and a variety of
“what-might-happen” analysis to help the user determine
the best course of action to take.
• In addition, this state-of-the-art type of data analytics requires not
only historical internal data but also external information, due
to the nature of the algorithms it is based on.
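A minimal sketch of prescriptive logic on a hypothetical payoff model: score candidate actions against a "what-might-happen" estimate and recommend the best one:

```python
def expected_profit(discount: float, predicted_demand: float) -> float:
    # Hypothetical model: demand rises with discount, margin falls with it.
    demand = predicted_demand * (1 + 2.0 * discount)
    margin = 10.0 * (1 - discount)
    return demand * margin

candidate_actions = [0.0, 0.05, 0.10, 0.20]
best = max(candidate_actions, key=lambda d: expected_profit(d, predicted_demand=100))
print(f"recommended discount: {best:.0%}")
```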
Big Data Analytics (One more categorization)
• Basic Analytics
Slicing & Dicing
Basic monitoring
Anomaly identification
• Advanced Analytics
Predictive Modelling
Text Analytics
Statistics and data mining algorithms
• Operational Analytics
• Monetized Analytics
Data Analytics Lifecycle
Brief Overview
• The Data Analytics Lifecycle is designed specifically for Big Data problems and data
science projects.
• The lifecycle has six phases, and project work can occur in several phases at once.
• For most phases in the lifecycle, the movement can be either forward or backward.
• In recent years, substantial attention has been placed on the emerging role of the data
scientist.
• Despite this strong focus on the emerging role of the data scientist specifically, there are
actually seven key roles that need to be fulfilled for a high-functioning data science team
to execute analytic projects successfully.
Key Roles for a Successful Analytics Project
• For a small, versatile team, the seven roles may be fulfilled by only three people, but a very large
project may require 20 or more people. The seven roles follow:
• Business User :- A business analyst, line manager, or deep subject matter expert in the project
domain.
• Project Sponsor :- Provides the funding and gauges the degree of value from the final outputs of the working team.
• Project Manager :- Ensures that key milestones and objectives are met on time and at the expected
quality.
• Business Intelligence Analyst :- Provides business domain expertise based on a deep
understanding of the data and key performance indicators (KPIs).
• Database Administrator (DBA) :- Provisions and configures the database environment to support
the analytics needs of the working team.
• Data Engineer :- Leverages deep technical skills to assist with tuning SQL queries for data
management and data extraction, and provides support for data ingestion into the analytic sandbox.
• Data Scientist :- Provides subject matter expertise for analytical techniques, data modeling, and
applying valid analytical techniques to given business problems.
Data Analytics Lifecycle
Phase 1: Discovery
• Learning the Business Domain
• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
Phase 2: Data Preparation
• Preparing the Analytic Sandbox
• Performing ETLT
• Learning About the Data
• Data Conditioning
• Survey and Visualize
Phase 3: Model Planning
• Data Exploration and Variable Selection
• Model Selection
Phase 4: Model Building
• The team develops data sets for testing, training, and production purposes.
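A minimal sketch of carving one synthetic dataset into training, testing, and a production-like holdout:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# 60% train, 20% test, 20% reserved to mimic production scoring.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_test, X_prod, y_test, y_prod = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_test), len(X_prod))
```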
Phase 5: Communicate Results
• The team, in collaboration with major stakeholders, determines if the results of the project
are a success or a failure based on the criteria developed in Phase 1.
Phase 6: Operationalize
• The team delivers final reports, briefings, code, and technical documents.
• In addition, the team may run a pilot project to implement the models in a production
environment.
Key Outputs from a Successful Analytic Project
Big Data Pre-processing
• Data preprocessing for data mining refers to the set of techniques applied
to data before a data mining method is used.
• The larger the amounts of data collected, the more sophisticated the
mechanisms required to analyze them.
• Data preprocessing adapts the data to the requirements posed by each
data mining algorithm, making it feasible to process data that could
not be handled otherwise.
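A minimal preprocessing sketch on synthetic data, showing three common adaptations: imputing missing values, encoding categories, and scaling numeric features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame({
    "age": [25, None, 47, 31],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

raw["age"] = raw["age"].fillna(raw["age"].mean())      # impute missing values
encoded = pd.get_dummies(raw, columns=["city"])        # one-hot encode categories
encoded[["age"]] = StandardScaler().fit_transform(encoded[["age"]])  # scale numerics
print(encoded)
```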