Data Science
Department of Information Technology
University of the Punjab – Gujranwala Campus
Data Preprocessing, Statistical Inference, EDA, and the Data Science Process – 2
Compiled & Edited by
Babar Yaqoob Khan
Visiting Lecturer – Data Science
❖ Data
▪ Definition, Types (Structured Data, Semi-Structured Data, Un-Structured Data), Sources, Qualities & Importance
▪ The information processing cycle
▪ Data Preprocessing (Sampling, Cleansing, Aggregation, Dimensionality Reduction, Feature Subset Selection, Feature
Creation, Integration, Discretization and Binarization, and Transformation)
❖ Statistical Inference
▪ Definition & Objectives, Sampling, Statistical experiment and Probability
❖ Exploratory Data Analysis (EDA)
▪ Definition & Objectives, EDA Process and Example
❖ The Data Science Process
▪ Definition & Objectives, the Process diagram
❖ Data Analytical Life Cycle
▪ Discovery, Data preparation, Model planning, Model building, Model evaluation, Communicate results, Operationalize
WHAT IS IN IT FOR YOU?
❖ Data
▪ The facts and figures in raw or unorganized form (such as alphabets, numbers, or symbols) that refer to,
or represent, conditions, ideas, or objects.
▪ Different types or formats of data:
• Numbers, Characters or Strings, Time and Date
• Pictures/Images, Graphs, and Maps
• Documents, E-mails, Tweets, and Newsfeeds etc.
• Audio and Video streams
• Formats: XML, CSV, TSV, SQL, JSON, Text etc.
• Records: user-level data, timestamped event data
▪ Data can be stored in files, data repositories or in databases
Data – Definition, Types, Sources, Qualities & Importance
❑ Nominal scale
❑ Categorical scale
❑ Ordinal scale
❑ Interval scale
❑ Ratio scale
Types of Data Measurements
[Figure: Qualitative vs. Quantitative (Discrete or Continuous) measurement scales, with information content increasing from nominal to ratio]
Nominal:
ID numbers, Names of people, Gender, Blood type, Eye colour, Political Party
Categorical:
Fruits, vegetables, juices, zip codes, sales
Ordinal:
Rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval:
Calendar dates, temperatures in Celsius or Fahrenheit, GRE and IQ scores
Ratio:
Mass, length, counts, money
Types of Data Measurements: Examples
❖ Data Sources
▪ Business activities – sale & purchase of products
▪ Manufacturing process – production and assembling of products
▪ Transportation – transportation of people and products from place to place
▪ Sensing & monitoring – data from sensors (in space, oceans, etc.) and CCTV cameras
▪ Human interaction – emails, audio, video and textual communication
▪ … … ….
Data – Definition, Types, Sources, Qualities & Importance
Data Pre-processing
Data Science Process
[Figure: Reality / Business Problem → Data Sources (instrument) → Raw Data Collection → Data Pre-Processing → Clean Dataset → Exploratory Data Analysis → Data Processing → Visualization / Communicate Results → Make Decisions / Data Product (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting / Prediction)]
❖ Population (N)
▪ Includes all of the elements from a set of data e.g.,
• The entire US population i.e., 341.97 million (341,963,408) or
• The entire Pakistan population i.e., 252.37 million (252,363,571)
• The entire world population i.e., 8.2 billion
• Set of objects, such as tweets or photographs
❖ Sample (n)
▪ Consists of one or more observations drawn from the population
n < N
Population vs. Sample
❖ Sampling
• A technique for selecting a subset of data from the population
• Often used both for preliminary investigation and the final data analysis
❖ Sampling Types
▪ Simple Random Sampling
• Equal probability of selecting any item
▪ Stratified Sampling
• Split the data into partitions and draw random samples from each partition
Sampling & Types
❖ Sampling
• A technique for selecting a subset of data from the population
• Often used both for preliminary investigation and the final data analysis
❖ Sampling Types
▪ Systematic Sampling
• Select every nth item from a list.
For instance, from a list of 1,000 people, choosing every 10th person yields a sample of 100
▪ Cluster Sampling
• The population is divided into clusters, usually based on geographical areas or natural groupings. A
few clusters are randomly selected, and all members within those clusters are surveyed
Sampling & Types
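To make these sampling types concrete, here is a minimal, hedged sketch in Python (pandas/numpy); the DataFrame `df`, the `region` column, and the sizes are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 1,000 people with a region attribute
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "person_id": range(1000),
    "region": rng.choice(["North", "South", "East", "West"], size=1000),
})

# Simple random sampling: every item has an equal chance of selection
simple_random = df.sample(n=100, random_state=42)

# Stratified sampling: split into partitions (strata) and draw a random sample from each
stratified = df.groupby("region", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42)
)

# Systematic sampling: select every 10th person from the list
systematic = df.iloc[::10]

# Cluster sampling: randomly pick a few regions and keep all of their members
chosen_regions = rng.choice(df["region"].unique(), size=2, replace=False)
cluster = df[df["region"].isin(chosen_regions)]

print(len(simple_random), len(stratified), len(systematic), len(cluster))
```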
Sample Size
Ideal Ratio: 70:30
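The slide's figure did not survive extraction; assuming the 70:30 ratio refers to the common training/testing split, a minimal scikit-learn sketch (with illustrative arrays `X` and `y`) looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1,000 samples with 5 features and a binary label
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Hold out 30% of the data for testing, keep 70% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (700, 5) (300, 5)
```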
❖ Data in the real world is dirty
❖ GIGO (garbage in, garbage out) – good data is a prerequisite for producing effective models of any type
Incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
e.g., occupation=“ ”
Noisy: containing errors or outliers
e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Data Pre-processing?
Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is analysed.
– Human/hardware/software problems
Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
Why is Data Dirty?
❖ Data Cleaning
▪ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
❖ Data Integration
▪ Integration of multiple databases or files
❖ Data Transformation
▪ Normalization and aggregation
❖ Data Reduction
▪ Obtains reduced representation in volume but produces the same or similar analytical
results
❖ Data Discretization & Binarization
▪ Part of data reduction but with particular importance for numerical data
Data Preprocessing – Major Tasks
Forms of Data Preprocessing
❖ Importance
▪ Garbage in Garbage out Principle (GIGO)
❖ Data Cleaning Tasks
• Fill in missing values
• Identify outliers and manage noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Data Preprocessing – Cleaning
❖ Missing Data
❑ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
❑ Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data not being registered
❑ Missing data may need to be inferred
Data Preprocessing – Cleaning
❖ How to Handle Missing Data?
❖ Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
❖ Fill in the missing value manually: tedious + infeasible?
❖ Fill it in automatically with
▪ a global constant : e.g. “unknown”, a new class?!
▪ the attribute mean for all data points belonging to the same class: smarter
▪ the most probable value: inference-based such as Bayesian formula or decision tree
Data Preprocessing – Cleaning
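A minimal pandas sketch of the automatic strategies above (the small DataFrame and its column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 42.0, None, 40.0],
    "city":   ["Lahore", None, "Multan", "Quetta", None],
})

# Ignore the tuple: drop rows that contain missing values
dropped = df.dropna()

# Fill with a global constant, e.g. "unknown"
df["city"] = df["city"].fillna("unknown")

# Fill with the attribute mean of data points belonging to the same class
df["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

# (The "most probable value" strategy would instead predict the missing entry
#  with a model, e.g. a decision tree, trained on the other attributes.)
print(df)
```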
❖ How to Handle Noisy Data ?
❖ Binning
▪ First sort data and partition into (equal-frequency) bins
▪ Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
❖ Regression
▪ Smooth by fitting the data into regression functions
❖ Clustering
▪ Detect and remove outliers
❖ Combined Computer and Human Inspection
▪ Detect suspicious values and check by human (e.g., deal with possible outliers)
Data Preprocessing – Cleaning
❖ Simple Discretization Methods: Binning
❖ Equal-width (distance) partitioning
▪ Divides the range into N intervals of equal size: uniform grid
▪ If A and B are the lowest and highest values of the attribute, the width of the intervals will be:
W = (B – A)/N
▪ The most straightforward, but outliers may dominate presentation
▪ Skewed data is not handled well
❖ Equal-depth (frequency) partitioning
▪ Divides the range into N intervals, each containing approximately the same number of data points
▪ Good data scaling
▪ Managing categorical attributes can be tricky
Data Preprocessing – Cleaning
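A minimal pandas sketch contrasting the two partitioning schemes (`pd.cut` gives equal-width bins, `pd.qcut` gives roughly equal-frequency bins; the price values are illustrative):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width (distance) partitioning: W = (max - min) / N
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: roughly the same number of points per bin
equal_depth = pd.qcut(prices, q=3)

# Smoothing by bin means: replace each value by the mean of its bin
smoothed = prices.groupby(equal_depth, observed=True).transform("mean")

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
print(smoothed.tolist())
```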
Data Preprocessing – Cleaning
[Figures: smoothing noisy data by Binning, Regression, and Clustering]
Data Preprocessing – Integration
Data integration:
❖ Combines data from multiple sources into a coherent store
❖ Schema integration: e.g., A.cust-id ≡ B.cust-#
▪ Integrate metadata from different sources
❖ Entity identification problem:
▪ Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
❖ Detecting and resolving data value conflicts
▪ For the same real world entity, attribute values from different sources are different
▪ Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
❖ Redundant data often occur when integrating multiple databases
▪ Object identification: The same attribute or object may have different names in
different databases
▪ Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue (from monthly income data)
❖ Redundant attributes may be detected by correlation analysis
❖ Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
Data Preprocessing – Data Integration
Correlation Analysis (Numerical Data)
Data Preprocessing – Data Integration
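The correlation formula on this slide was an image and did not survive extraction; assuming it was Pearson's product-moment coefficient (the usual choice for numerical data), a minimal numpy sketch with made-up attributes A and B:

```python
import numpy as np

# Illustrative numerical attributes A and B (e.g., height and weight)
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson's r = sum((A - mean_A) * (B - mean_B)) / ((n - 1) * std_A * std_B)
n = len(A)
r_manual = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

# The same value from numpy's correlation matrix
r_numpy = np.corrcoef(A, B)[0, 1]

print(r_manual, r_numpy)  # r > 0: positively correlated; r near 0: uncorrelated; r < 0: negatively correlated
```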
Data Preprocessing – Data Integration
Correlation Analysis (Categorical Data)
                           Play Chess   Not Play Chess   Sum (row)
Like Science Fiction           250            200           450
Not Like Science Fiction        50           1000          1050
Sum                            300           1200          1500
Probability to play chess: P(chess) = 300/1500 = 0.2
Probability to like science fiction: P(SciFi) = 450/1500 = 0.3
If science fiction and chess playing are independent attributes, then the
probability to like SciFi AND play chess is
P(SciFi, chess) = P(SciFi) · P(chess) = 0.06
That means, we expect 0.06 · 1500 = 90 such cases (if they are independent)
Correlation Analysis (Categorical Data)
                           Play Chess   Not Play Chess   Sum (row)
Like Science Fiction        250 (90)         200           450
Not Like Science Fiction        50           1000          1050
Sum                            300           1200          1500
Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):
Χ² = Σ (observed − expected)² / expected = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840 ≈ 507.93
Since 507.93 is far above the critical value of 10.828 (1 degree of freedom at the 0.001 significance level), it shows that like_science_fiction and play_chess are correlated in the group!
Correlation Analysis (Categorical Data)
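A minimal sketch that reproduces the χ² computation for this contingency table with scipy (Yates' continuity correction is disabled so the result matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts:       play chess   not play chess
observed = np.array([[250, 200],      # like science fiction
                     [50, 1000]])     # not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)   # [[ 90. 360.] [210. 840.]] -- expected counts under independence
print(chi2)       # ~507.93, far above the 0.001 critical value for 1 dof (10.828)
print(p_value)    # ~0, so like_science_fiction and play_chess are correlated
```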
❖ Data Reduction (Dimensionality Reduction)
• Obtains reduced representation in volume but produces the same or similar analytical
results
✓ Feature Subset Selection / Principal Component Analysis (PCA)
✓ Singular Value Decomposition (SVD)
❖ Data Discretization (Dimensionality Reduction)
• Part of data reduction but with particular importance for numerical data
• Also called “binning”
Data Preprocessing – Reduction, Discretization
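A minimal scikit-learn sketch of PCA as a dimensionality-reduction step; the synthetic, correlated feature matrix stands in for real data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative feature matrix: 200 samples, 10 correlated attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7)) + 0.01 * rng.normal(size=(200, 7))])

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # e.g. (200, 10) -> (200, 3)
print(pca.explained_variance_ratio_.round(3))  # variance captured by each component
```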
❖ Data Preprocessing – Transformation
• Maps entire set of values of an attribute to a new set of values
• Data standardization and normalization (by clustering and binning)
✓ Smoothing: remove noise from data
✓ Aggregation: summarization
✓ Generalization: concept hierarchy climbing
✓ Normalization: scaled to fall within a small, specified range
✓ min-max normalization
✓ z-score normalization
✓ normalization by decimal scaling
✓ Attribute/feature construction
✓ New attributes constructed from the given ones
Data Preprocessing – Transformation
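A minimal numpy sketch of the three normalization schemes listed above; the `values` array is illustrative and assumed to contain magnitudes of at least 1 for the decimal-scaling step:

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: scale into a new range, here [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, where j is the number of digits
# in the largest absolute value, so every scaled value is below 1
j = len(str(int(np.abs(values).max())))
decimal_scaled = values / (10 ** j)

print(min_max)
print(z_score.round(3))
print(decimal_scaled)  # j = 4 here, so 1000 -> 0.1
```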
❖ Feature Creation
• Original attributes not always best representation of information
• Creates new features which are more efficient/focused
❖ Methodologies
▪ Feature Extraction – Domain Specific
• Derived features
▪ Feature Construction
• Combine multiple features to construct new feature(s)
▪ Mapping Data to New Space
• Fourier Transform - what frequencies are present in your signal
• Wavelet Transform - what frequencies are present and where (or at what scale)
Data Preprocessing – Feature Creation
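A minimal numpy sketch of mapping data to a new space with the Fourier transform; the synthetic signal is assumed to contain 5 Hz and 12 Hz components:

```python
import numpy as np

# Synthetic signal sampled at 100 Hz: a 5 Hz and a 12 Hz sine wave plus noise
fs = 100
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
signal += 0.1 * np.random.default_rng(1).normal(size=t.size)

# Map to the frequency domain: which frequencies are present?
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# The dominant frequencies become new, more focused features
top = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top))  # approximately [5.0, 12.0]
```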
❖ Discretization & Binarization
• Converting the data into discrete form and, where needed, binarizing it to accommodate certain machine
learning algorithms/models
Data Preprocessing – Discretization & Binarization
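A minimal scikit-learn/pandas sketch: discretize a numeric attribute into ordinal bins, then binarize (one-hot encode) a categorical attribute; the `ages` and `colors` data are made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[22], [25], [31], [38], [45], [52], [63], [70]])

# Discretization: map continuous ages onto 3 ordinal bins of equal width
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = disc.fit_transform(ages).ravel()

# Binarization: one-hot encode a categorical attribute into 0/1 columns
colors = pd.Series(["red", "green", "blue", "red", "blue", "green", "red", "blue"])
one_hot = pd.get_dummies(colors, prefix="color")

print(age_bins)       # e.g. [0. 0. 0. 1. 1. 1. 2. 2.]
print(one_hot.head())
```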
Statistical Inference
❖ Population
▪ Includes all of the elements from a set of data e.g.,
• The entire Pakistani population i.e., 247 million or
• The entire world population i.e., 8 billion
• Set of objects, such as tweets or photographs
❖ Sample
▪ It consists of one or more items drawn from the population
▪ For example, 1000 Pakistanis selected from all provinces of Pakistan
▪ Size of sample (n) always less than size of the population (N)
▪ Sample may not be totally representative of the population
Statistical Inference – Population and Sample
Sample < Population
n < N
❖ Statistical inference
▪ It is the process of estimating the parameters of a population using random sampling.
▪ Inference also tests the reliability of the estimates by quantifying their uncertainty.
❖ Purpose and benefits
▪ Enables us to understand the population without studying all of its items.
▪ Minimizes the cost of understanding the population.
▪ It remains the only possible option when the whole population is not accessible.
Statistical Inference
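A minimal sketch of the idea: estimate a population parameter (the mean) from a random sample and attach a measure of uncertainty; the "population" here is synthetic, and the 1.96 factor assumes an approximate 95% normal confidence interval:

```python
import numpy as np

# Synthetic population of N = 1,000,000 incomes (parameters unknown in practice)
rng = np.random.default_rng(7)
population = rng.normal(loc=50_000, scale=12_000, size=1_000_000)

# Draw a random sample of n = 1,000 and estimate the population mean
sample = rng.choice(population, size=1_000, replace=False)
estimate = sample.mean()

# Quantify uncertainty: standard error and an approximate 95% confidence interval
std_err = sample.std(ddof=1) / np.sqrt(sample.size)
ci = (estimate - 1.96 * std_err, estimate + 1.96 * std_err)

print(round(estimate, 1), [round(x, 1) for x in ci])
print(round(population.mean(), 1))  # the true parameter, normally unknown
```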
Statistical Inference
[Figure: Population → (Sampling) → Sample → (Statistical Inference) → Parameter Estimation]
❖ A statistical experiment has three properties:
▪ The experiment can have more than one possible outcome
▪ Each possible outcome can be specified in advance
▪ The outcome of the experiment depends on chance
❖ For instance, toss a coin
▪ Outcomes are:
• More than one: Head or Tail
• Specified in advance: { Head, Tail }
• Depends on chance: unknown in advance unless the coin is tossed (50% chance of Head, and vice versa)
Statistical Experiment
❖ Variable or Parameter
▪ It represents the value of an attribute of an item in the population, e.g., the name or color of an item.
▪ A random variable can take on any of the specified values (domain).
▪ A random variable takes a value after a statistical experiment.
Variables or Parameters
[Figure: 𝒙 = 7 → 𝒙 is a Variable; 𝒙 taking its value from a statistical experiment → 𝒙 is a Random Variable]
❖ Probability
▪ It is the measure of the likelihood of an event happening.
▪ A quantitative measure that always takes a value between 0 and 1.
❖ Example:
▪ Tossing a coin is a statistical experiment.
▪ It can result in two outcomes:
• Head or a Tail
▪ Calculating the chance of getting a ‘head’ is calculating its probability
Probability
Probability = Favorable outcomes / Possible outcomes
For a fair coin with outcomes heads (H) and tails (T): P(H) = 1/2 = 0.5 and P(T) = 1/2 = 0.5
❖ Probability distribution
▪ It links each outcome of a statistical experiment with its probability of occurrence
▪ For instance, you toss a coin two times
▪ Possible outcomes = {HH, HT, TH, TT}
▪ Let X = number of Heads
▪ Possible outcomes = { 0, 1, 2 }
• P(X = 0) = 1/4 = 0.25 No Heads = { TT }
• P(X = 2) = 1/4 = 0.25 Two Heads = { HH }
• P(X = 1) = 2/4 = 0.50 One Head = { HT, TH }
Probability Distribution
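A minimal Python sketch that reproduces this distribution by enumerating the four equally likely outcomes of two coin tosses:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All equally likely outcomes of tossing a coin twice
outcomes = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

# X = number of Heads in each outcome
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution of X
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}
print({x: str(p) for x, p in distribution.items()})  # {0: '1/4', 1: '1/2', 2: '1/4'}
```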
❖ Modeling
▪ A model is a representation of a real object or situation. It presents a simplified version of something.
▪ It is an artificial construction to understand and represent the nature of real things.
• A model does not have unnecessary detail.
▪ Humans try to understand the world around them using different models.
• Architects use 3-D prints and models to design structures
• Biologists capture connections between amino acids to understand protein-protein interactions
• Statisticians and Data Scientists capture randomness to comprehend data-generating processes
❖ Data Modeling
▪ Data modeling is the analysis of data objects and their relationships to other data objects.
▪ The model helps us in defining and analyzing data requirements needed to support the business
processes in an organization.
Data Modeling
❖ Building a Model
▪ Do some Exploratory Data Analysis (EDA) and discover the relationship among the data.
▪ Try to describe the relationship using a mathematical formula.
❖ Model Fitting
▪ Model Fitting (Balance Fitting)
• When the model fits the training as well as the testing data well
▪ Underfitting
• When the model is unable to fit even the training data
▪ Overfitting
• When the model fits the training data well but the testing data poorly
✓ Noise (undesired data) and higher variability (inconsistency) in data cause overfitting
✓ Remove noise (data cleaning) and add more training data to train the model.
Data Modeling
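A minimal numpy sketch of underfitting, balanced fitting, and overfitting using polynomial models of increasing degree; the "reality" curve, the noise level, and the 70:30 split are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)   # noisy "reality"

# 70:30 split into training and testing data
idx = rng.permutation(x.size)
train, test = idx[:28], idx[28:]

def errors(degree):
    """Fit a polynomial of the given degree on training data, report train/test MSE."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x)
    return ((pred[train] - y[train]) ** 2).mean(), ((pred[test] - y[test]) ** 2).mean()

for degree in (1, 4, 15):
    train_err, test_err = errors(degree)
    print(degree, round(train_err, 3), round(test_err, 3))
# degree 1:  high train and test error (underfitting)
# degree 4:  both errors low (balanced fit)
# degree 15: train error near zero, test error typically much larger (overfitting)
```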
Fitting a Data Model
[Figure: overfitting ('too good to be true', forced-fitting) vs. underfitting ('too simple to explain the variation in data')]
Exploratory Data Analysis
❖ Exploratory Data Analysis (EDA)
▪ In statistics, EDA is used to analyze datasets to summarize their main characteristics.
▪ EDA often employs visual methods to see what the data can tell us beyond the formal modeling or
hypothesis-testing task.
▪ It is an effort to understand the process that generates the data under observation.
• ‘Exploration’ means your understanding of the problem changes as you go ahead.
• Plots, graphs and summary statistics are the basic tools of EDA.
What is EDA?
❖ EDA helps us to:
▪ Understand the data and its value in business
• Discover patterns in data
• Spot anomalies (outliers) in data
• Verify existing assumptions about data
• Make comparisons between the data distributions.
• Find suitable data formats
▪ Improve the accuracy of data products.
▪ Assure verification of data products.
EDA – Why we do it?
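A minimal pandas/matplotlib sketch of a first EDA pass (summary statistics, an outlier check, and a distribution plot); the dataset and its column names are synthetic:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=500),
    "income": np.append(rng.normal(50_000, 10_000, size=495), [250_000] * 5),  # a few outliers
})

# Summary statistics: the most basic EDA tool
print(df.describe())

# Spot anomalies: flag values more than 3 standard deviations from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
print("potential outliers:", (z.abs() > 3).sum())

# Discover patterns visually
df["income"].hist(bins=40)
plt.title("Income distribution")
plt.show()
```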
❖ The Data Science Process
▪ Say, we have data (raw data) on these things
▪ We want to process these data for better analysis
▪ Processing would give us a Clean Dataset to analyze
▪ We’ll be doing some EDA with the clean dataset
▪ EDA will lead us towards a Data Model and an Algorithm
▪ We get the results after using the model and interpret, visualize, or report them
▪ The results are used in decision making or as input for a ‘Data Product’.
▪ The products may be, for example:
• Recommender system
• Business forecasting system
• Spam classifier
The Data Science Process
Data Science Process
[Figure: Reality / Business Problem → Data Sources (instrument) → Raw Data Collection → Data Pre-Processing → Clean Dataset → Exploratory Data Analysis → Data Processing → Visualization / Communicate Results → Make Decisions / Data Product (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting / Prediction)]
❖ Big Data
▪ Big data is a field dedicated to the analysis,
processing, and storage of large collections of data
that frequently originate from various sources.
▪ It is used when traditional data analysis, processing, and storage technologies and techniques are
insufficient.
❖ Big Data Characteristics
▪ Volume
▪ Velocity
▪ Variety
▪ Veracity
▪ Value
The Big Data Approach
❖ Descriptive Analysis (What happened)
▪ It is done to answer questions about events that have already occurred.
❖ Diagnostic Analysis (Why did it happen)
▪ It is used to determine the cause of a phenomenon that occurred in the past using questions that focus
on the reason behind the event.
❖ Predictive Analysis (What will happen)
▪ It is an attempt to determine the outcome of an event that would occur in the future.
❖ Prescriptive Analysis (How can we make it happen)
▪ Prescriptive analytics builds upon the results of predictive analytics to prescribe the actions that
should be taken to improve the business.
Big Data – Analytics Types
Big Data – Analytics Types
[Figure: 1 Descriptive, 2 Diagnostic, 3 Predictive, 4 Prescriptive]
❖ Phase 1: Learning the business domain and problem discovery
▪ Understand the business process
• Study the similar past projects
• Identify available resources – people, required skills, technology, time, and data.
• Have right mix of domain experts, customers, analytic talent, and project management.
▪ Identifying key stakeholders
• Understand their interests in the project
• Propose and discuss more than one solution to the problem
▪ Discover the problem to be solved
• Write the problem statement and its justification.
• Discuss and refine the problem statement after discussion with the major stakeholder
• Establish the criteria for success and failure of the proposed solution
Data Analytics Life Cycle – Phase 1
Data Analytics Life Cycle – Key Roles
❖ Phase 2: Data preparation
▪ Define the steps to explore and preprocess data before its modeling and analysis.
▪ Prepare the analytics sandbox (setup for the experiments)
▪ Perform the Extract Transform Load (ETL) process (or ELT). → ETLT = ETL + ELT
▪ Understand the target data
▪ Data cleaning – data normalization and transformation
• For better understanding, utilize as much of the available data as possible
• Survey and visualize the test dataset
• Carefully complete the highly labor-intensive activity
▪ Data accessing strategies:
• Download snapshot of the production data
• Use the API facility, if available
Data Analytics Life Cycle – Phase 2
Phase 2 – Sample Dataset Inventory
❖ Phase 2: Data preparation tools
▪ Hadoop
• It can perform massively parallel loading and analysis of large datasets.
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining of massive
unstructured data feeds from multiple sources.
▪ Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
▪ Open Refine (Google Refine)
• A powerful tool for working with large and unstructured datasets. It is a popular GUI-based tool for
performing data transformations.
▪ Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset.
Phase 2 – Common tools for data preparation
❖ Phase 3: Planning the data model
▪ Data exploration and variable selection
• Perform Exploratory Data Analysis, if required.
• Explore associations & relationships among data
• Identify key performance indicators (KPIs)
▪ Selecting suitable data analytical method or model
• Keep in mind requirements of the business
• Consider the type and format of data attributes
• Consult the domain experts and follow the best practices
Data Analytics Life Cycle – Phase 3
Phase 3 – Selecting appropriate data analytical model
❖ Phase 3: Common tools for the model planning phase
▪ R - Analytical Software Package
• It has data modeling capabilities and a good environment for building interpretive models
• R has ability to interface with databases via an ODBC connection and execute statistical tests and
analyses against Big Data via an open source connection.
• R contains nearly 5,000 packages for data analysis and graphical representation.
▪ SQL Analysis services
• It can perform in-database analytics of common data mining functions, involved aggregations, and
basic predictive models.
▪ SAS/ACCESS
• Provides integration between SAS and the analytics sandbox via multiple data connectors such as
ODBC, JDBC, and OLE DB. Connectivity to relational databases (such as Oracle or Teradata) and data
warehousing applications (e.g., Greenplum or Aster)
• Enterprise applications such as SAP and Salesforce.
Data Analytics Life Cycle – Phase 3
❖ Phase 4: Model building
▪ Develop datasets for testing, training, and production purposes.
▪ Assess validity of the model and its results on small scale
• Verify result of the model from domain experts
▪ Evaluate the required hardware support to execute the model
Data Analytics Life Cycle – Phase 4
❖ Phase 4: Common tools for the model building phase
▪ SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data from across
the enterprise.
• It is built for enterprise-level computing and analytics by interoperating with large data stores.
▪ SPSS Modeler (IBM SPSS Modeler)
• Offers methods to explore and analyze data through a GUI.
▪ MatLab
• Provides a high-level language for performing a variety of data analytics and exploration.
▪ Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools.
Data Analytics Life Cycle – Phase 4
❖ Phase 4: Free or Open Source tools for the model building phase
▪ WEKA
• A free data mining software package with an analytic workbench. The functions created in WEKA
can be executed within Java code.
▪ Python
• It is a programming language that provides toolkits for machine learning and analysis, such as scikit-
learn, numpy, scipy, pandas, and related data visualization using matplotlib.
▪ R and PL/R
• R was described earlier in the model planning phase; PL/R is a procedural language for
PostgreSQL with R. Using this approach means that R commands can be executed in-database.
▪ Octave
• A programming language for computational modeling having some functionality of MatLab.
• Being freely available, Octave is used in major universities when teaching machine learning.
Data Analytics Life Cycle – Phase 4
❖ Phase 5: Communicate the results
▪ Collaborate with the major stakeholders and evaluate the results
• Identify key findings, quantify their business value.
• The deliverable of this phase will be decisive for the outside stakeholders and sponsors
• Summarize the findings and convey to the stakeholders.
• Make recommendations for future work or improvements to existing processes
▪ Accept failure of an analytical project
• A true failure means the data fails to accept or reject the hypotheses stated in Phase 1.
• The analyst should be rigorous enough with the data to determine whether it proves or disproves the
hypotheses
Data Analytics Life Cycle – Phase 5
❖ Phase 6: Operationalize
▪ Communicate the benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production environment.
• Learn from the deployment and make any needed adjustments.
▪ Properly document and deliver the final reports, briefings, code, and technical documents.
• Consult documentation of the similar past projects, if available.
• Follow the documentation standards to increase its effectiveness.
Data Analytics Life Cycle – Phase 6
❖ Data
▪ Definition, Importance, Characteristics, Sources and Types
▪ Structured Data, Semi-Structured Data, Un-Structured Data
▪ The information processing cycle
▪ Data Preprocessing (Integration, Cleansing, Reduction, and Transformation)
❖ Statistical Inference
▪ Definition & Objectives, Sampling, Statistical experiment and Probability
❖ Exploratory Data Analysis (EDA)
▪ Definition & Objectives, EDA Process and Example
❖ The Data Science Process
▪ Definition & Objectives, the Process diagram
❖ Data Analytical Life Cycle
▪ Discovery, Data preparation, Model planning, Model building, Communicate results, Operationalize
Contents Review
You are Welcome!
Questions? Comments! Suggestions!
Farewell to the Day ☺
Statistical Inference, Exploratory Data Analysis, and the Data Science Process.pdf

  • 1.
    Data Department of InformationTechnology University of the Punjab – Gujranwala Campus Science Data Preprocessing, Statistical Inference, EDA, and the Data Science Process 2 Compiled & Edited by Babar Yaqoob Khan Visiting Lecturer – Data Science
  • 2.
    ❖ Data ▪ Definition,Types (Structured Data, Semi-Structured Data, Un-Structured Data), Sources, Qualities & Importance ▪ The information processing cycle ▪ Data Preprocessing (Sampling, Cleansing, Aggregation, Dimensionality Reduction, Feature Subset Selection, Feature Creation, Integration, Discretization and Binarization, and Transformation) ❖ Statistical Inference ▪ Definition & Objectives, Sampling, Statistical experiment and Probability ❖ Exploratory Data Analysis (EDA) ▪ Definition & Objectives, EDA Process and Example ❖ The Data Science Process ▪ Definition & Objectives, the Process diagram ❖ Data Analytical Life Cycle ▪ Discovery, Data preparation , Model planning , Model building , Model Evaluation, Communicate results, Operationalize Statistical Inference, Exploratory Data Analysis, and the Data Science Process 2 WHAT IS IN IT FOR YOU?
  • 3.
    ❖ Data ▪ Thefacts and figures in raw or unorganized form (such as alphabets, numbers, or symbols) that refer to, or represent, conditions, ideas, or objects. ▪ Different types or formats of data: • Numbers, Characters or Strings, Time and Date • Pictures/Images, Graphs, and Maps • Documents, E-mails, Tweets, and Newsfeeds etc. • Audio and Video streams • Formats: XML, CSV, TSV, SQL, JSON, Text etc. • Records: user-level data, timestamped event data ▪ Data can be stored in files, data repositories or in databases Statistical Inference, Exploratory Data Analysis, and the Data Science Process 3 Data – Definition, Types, Sources, Qualities & Importance
  • 4.
    ❑ Nominal scale ❑Categoricalscale ❑ Ordinal scale ❑ Interval scale ❑ Ratio scale Statistical Inference, Exploratory Data Analysis, and the Data Science Process 4 Types of Data Measurements Qualitative Quantitative Discrete Continuous More Information Content
  • 5.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 5
  • 6.
    Nominal: ID numbers, Namesof people, Gender, Blood type, Eye colour, Political Party Categorical: Fruits, vegetables, juices, zip codes, sales Ordinal: Rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval: Calendar dates, temperatures in Celsius or Fahrenheit, GRE and IQ scores Ratio: Mass, length, counts, money Statistical Inference, Exploratory Data Analysis, and the Data Science Process 6 Types of Data Measurements: Examples
  • 7.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 7
  • 11.
    ❖ Data Sources ▪Business activities – sale & purchase of products ▪ Manufacturing process – production and assembling of products ▪ Transportation – transportation of people and products from place to place ▪ Sensing & monitoring – data from sensors (in space and oceans etc. ) and CCTV cameras ▪ Human interaction – emails, audio, video and textual communication ▪ … … …. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 11 Data – Definition, Types, Sources, Qualities & Importance
  • 12.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 15 Data Pre-processing
  • 13.
    Data Science Process RawData Collection Data Pre- Processing Clean Dataset Data Processing Visualization/ Communicate Results Data Product Exploratory Data Analysis Make Decisions Reality Business Problem Instrument Data Sources Decision Support Business Intelligence Recommender Systems Business Forecasting (Prediction)
  • 14.
    ❖ Population (N) ▪Includes all of the elements from a set of data e.g., • The entire US population i.e., 341.97 million (341,963,408) or • The entire Pakistan population i.e., 252.37 million (252,363,571) • The entire world population i.e., 8.2 billion • Set of objects, such as tweets or photographs ❖ Sample (n) ▪ Consists of one or more observations drawn from the population Statistical Inference, Exploratory Data Analysis, and the Data Science Process 17 n < N Population vs. Sample
  • 15.
    ❖ Sampling • Techniquemainly employed for data selection from population • Often used both for preliminary investigation and the final data analysis ❖ Sampling Types ▪ Simple Random Sampling • Equal probability of selecting any item ▪ Stratified Sampling • Split the data into partitions and draw random samples from each partition Statistical Inference, Exploratory Data Analysis, and the Data Science Process 18 Sampling & Types
  • 16.
    ❖ Sampling • Techniquemainly employed for data selection from population • Often used both for preliminary investigation and the final data analysis ❖ Sampling Types ▪ Systematic Sampling • Select every nth item from a list. For instance, if you have a list of 1,000 people and you choose every 10th person ▪ Cluster Sampling • The population is divided into clusters, usually based on geographical areas or natural groupings. A few clusters are randomly selected, and all members within those clusters are surveyed Statistical Inference, Exploratory Data Analysis, and the Data Science Process 19 Sampling & Types
  • 17.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 20 Sample Size Ideal Ratio: 70:30
  • 18.
    ❖ Data inthe real world is dirty ❖ GIGO - good data is a prerequisite for producing effective models of any type Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ” Noisy: containing errors or outliers e.g., Salary=“-10” Inconsistent: containing discrepancies in codes or names e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records Statistical Inference, Exploratory Data Analysis, and the Data Science Process 21 Why Data Pre-processing?
  • 19.
    Incomplete data maycome from – “Not applicable” data value when collected – Different considerations between the time when the data was collected and when it is analysed. – Human/hardware/software problems Noisy data (incorrect values) may come from – Faulty data collection instruments – Human or computer error at data entry – Errors in data transmission Inconsistent data may come from – Different data sources – Functional dependency violation (e.g., modify some linked data) Statistical Inference, Exploratory Data Analysis, and the Data Science Process 22 Why is Data Dirty?
  • 20.
    ❖ Data Cleaning ▪Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies ❖ Data Integration ▪ Integration of multiple databases or files ❖ Data Transformation ▪ Normalization and aggregation ❖ Data Reduction ▪ Obtains reduced representation in volume but produces the same or similar analytical results ❖ Data Discretization & Binarization ▪ Part of data reduction but with particular importance for numerical data Statistical Inference, Exploratory Data Analysis, and the Data Science Process 23 Data Preprocessing – Major Tasks
  • 21.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 24 Forms of Data Preprocessing n
  • 22.
    ❖ Importance ▪ Garbagein Garbage out Principle (GIGO) ❖ Data Cleaning Tasks • Fill in missing values • Identify outliers and Managing noisy data • Correct inconsistent data • Resolve redundancy caused by data integration Statistical Inference, Exploratory Data Analysis, and the Data Science Process 26 Data Preprocessing – Cleaning
  • 23.
    ❖ Missing Data ❑Data is not always available ▪ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data ❑ Missing data may be due to • equipment malfunction • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not be considered important at the time of entry • not register history or changes of the data ❑ Missing data may need to be inferred Statistical Inference, Exploratory Data Analysis, and the Data Science Process 27 Data Preprocessing – Cleaning
  • 24.
    ❖ How toHandle Missing Data? ❖ Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. ❖ Fill in the missing value manually: tedious + infeasible? ❖ Fill in it automatically with ▪ a global constant : e.g. “unknown”, a new class?! ▪ the attribute mean for all data points belonging to the same class: smarter ▪ the most probable value: inference-based such as Bayesian formula or decision tree Statistical Inference, Exploratory Data Analysis, and the Data Science Process 28 Data Preprocessing – Cleaning
  • 25.
    ❖ How toHandle Noisy Data ? ❖ Binning ▪ First sort data and partition into (equal-frequency) bins ▪ Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. ❖ Regression ▪ Smooth by fitting the data into regression functions ❖ Clustering ▪ Detect and remove outliers ❖ Combined Computer and Human Inspection ▪ Detect suspicious values and check by human (e.g., deal with possible outliers) Statistical Inference, Exploratory Data Analysis, and the Data Science Process 29 Data Preprocessing – Cleaning
  • 26.
    ❖ Simple DiscretizationMethods: Binning ❖ Equal-width (distance) partitioning ▪ Divides the range into N intervals of equal size: uniform grid ▪ If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (Max – Min)/N. ▪ The most straightforward, but outliers may dominate presentation ▪ Skewed data is not handled well ❖ Equal-depth (frequency) partitioning ▪ Divides the range into N intervals each containing approximately same number of data points ▪ Good data scaling ▪ Managing categorical attributes can be tricky Statistical Inference, Exploratory Data Analysis, and the Data Science Process 30 Data Preprocessing – Cleaning
  • 27.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 31 Data Preprocessing – Cleaning ❖ Binning
  • 28.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 32 Data Preprocessing – Cleaning ❖ Regression
  • 29.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 33 Data Preprocessing – Cleaning ❖ Clustering
  • 30.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 34 Data Preprocessing – Integration Data integration: ❖ Combines data from multiple sources into a coherent store ❖ Schema integration: e.g., A.cust-id ≡ B.cust-# ▪ Integrate metadata from different sources ❖ Entity identification problem: ▪ Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton ❖ Detecting and resolving data value conflicts ▪ For the same real world entity, attribute values from different sources are different ▪ Possible reasons: different representations, different scales, e.g., metric vs. British units
  • 31.
    Handling Redundancy inData Integration ❖ Redundant data occur often when integration of multiple databases ▪ Object identification: The same attribute or object may have different names in different databases ▪ Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue (from monthly income data) ❖ Redundant attributes may be able to be detected by correlation analysis ❖ Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality Statistical Inference, Exploratory Data Analysis, and the Data Science Process 35 Data Preprocessing – Data Integration
  • 32.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 36 Correlation Analysis (Numerical Data) Data Preprocessing – Data Integration
  • 33.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 37 Data Preprocessing - Data Integration Correlation Analysis (Categorical Data)
  • 34.
    Play Chess NotPlay Chess Sum (row) Like Science Fiction 250 200 450 Not Like Science Fiction 50 1000 1050 Sum 300 1200 1500 Statistical Inference, Exploratory Data Analysis, and the Data Science Process 38 Probability to play chess: P(chess) = 300/1500 = 0.2 Probability to like science fiction: P(SciFi) = 450/1500 = 0.3 If science fiction and chess playing are independent attributes, then the probability to like SciFi AND play chess is P(SciFi, chess) = P(SciFi) · P(chess) = 0.06 That means, we expect 0.06 · 1500 = 90 such cases (if they are independent) Correlation Analysis (Categorical Data)
  • 35.
    Play Chess NotPlay Chess Sum (row) Like Science Fiction 250 (90) 200 450 Not Like Science Fiction 50 1000 1050 Sum 300 1200 1500 Statistical Inference, Exploratory Data Analysis, and the Data Science Process 39 Χ2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) It shows that like_science_fiction and play_chess are correlated in the group! Correlation Analysis (Categorical Data)
  • 36.
    ❖ Data Reduction(Dimensionality Reduction) • Obtains reduced representation in volume but produces the same or similar analytical results ✓ Feature Subset Selection / Principal Component Analysis (PCA) ✓ Singular Value Decomposition (SVD) ❖ Data Discretization (Dimensionality Reduction) • Part of data reduction but with particular importance for numerical data • Also called “binning” Statistical Inference, Exploratory Data Analysis, and the Data Science Process 40 Data Preprocessing – Reduction, Discretization
  • 37.
    ❖ Data Preprocessing– Transformation • Maps entire set of values of an attribute to a new set of values • Data standardization and normalization (by clustering and binning) ✓ Smoothing: remove noise from data ✓ Aggregation: summarization ✓ Generalization: concept hierarchy climbing ✓ Normalization: scaled to fall within a small, specified range ✓ min-max normalization ✓ z-score normalization ✓ normalization by decimal scaling ✓ Attribute/feature construction ✓ New attributes constructed from the given ones 41 Data Preprocessing – Transformation
  • 38.
    ❖ Feature Creation •Original attributes not always best representation of information • Creates new features which are more efficient/focused ❖ Methodologies ▪ Features Extraction – Domain Specific • Derived features ▪ Feature Construction • Combine multiple features to construct new feature(s) ▪ Mapping Data to New Space • Fourier Transform - what frequencies are present in your signal • Wavelet Transform - what frequencies are present and where (or at what scale) Statistical Inference, Exploratory Data Analysis, and the Data Science Process 42 Data Preprocessing – Feature Creation
  • 39.
    ❖ Discretization &Binarization • Converting the data into discrete form and later to binarize it to accommodate certain machine learning algorithms/models Statistical Inference, Exploratory Data Analysis, and the Data Science Process 43 Data Preprocessing – Discretization & Binarization
  • 40.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 44 Statistical Inference
  • 41.
    ❖ Population ▪ Includesall of the elements from a set of data e.g., • The entire Pakistani population i.e., 247 million or • The entire world population i.e., 8 billion • Set of objects, such as tweets or photographs ❖ Sample ▪ It consists of one or more items drawn from the population ▪ For example, 1000 Pakistanis selected from all provinces of Pakistan ▪ Size of sample (n) always less than size of the population (N) ▪ Sample may not be totally representative of the population Statistical Inference, Exploratory Data Analysis, and the Data Science Process 45 Statistical Inference – Population and Sample Sample < Population n < N
  • 42.
    ❖ Statistical inference ▪It is process of estimating the parameters of a population, using the random sampling. ▪ The inference also tests reliability of the estimates with calculated uncertainty. ❖ Purpose and benefits ▪ Enable us to understand the population without studying its all items. ▪ Minimizes the cost of understanding the population. ▪ It remains the only possible option, when whole the population is not accessible. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 46 Statistical Inference
  • 43.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 47 Statistical Inference Population Sampling Sample Parameters Estimation Statistical Inference
  • 44.
    ❖ A statisticalexperiment has three properties: ▪ The experiment can have more than one possible outcome ▪ Each possible outcome can be specified in advance ▪ The outcome of the experiment depends on chance ❖ For instance, toss a coin ▪ Outcomes are: • More than one • Specified in advance • Depends on chance Statistical Inference, Exploratory Data Analysis, and the Data Science Process 48 Statistical Experiment Head or, Tail { Head, Tail } Unknown in advance, unless coin is tossed Or 50% chance of Head and vice versa Statistical Experiment
  • 45.
    ❖ Variable orParameter ▪ It represents value of an attribute of an item in the population. i.e. name, color of an item. ▪ A random variable can take on any of the specified values (domain). ▪ A random variable takes a value after a statistical experiment. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 49 Variables or Parameters 𝒙 = 7 𝒙 Statistical Experiment 𝐱 𝒙 = 7 𝒙 is a Variable 𝒙 is a Random Variable
  • 46.
    ❖ Probability ▪ Itis the measure of the likelihood of happening an event. ▪ A quantitative measure, always takes value between 0 and 1. ❖ Example: ▪ Tossing a coin is a statistical experiment. ▪ It can result two outcomes: • Head or a Tail ▪ Calculating the chance of a resulting a ‘head’ is its probability Statistical Inference, Exploratory Data Analysis, and the Data Science Process 50 Probability Probability = Favorable outcomes Possible outcomes P(H) = 1/2 = 0.5 P(T) = 1/2 = 0.5 heads (H) tails (T)
  • 47.
    ❖ Probability distribution ▪It links each outcome of a statistical experiment with its probability of occurrence ▪ For instance, you toss a coin two times ▪ Possible outcomes = {HH, HT, TH, TT} ▪ Let X = number of Heads ▪ Possible outcomes = { 0, 1, 2 } • P(X = 0) = 1/4 = 0.25 No Heads = { TT } • P(X = 2) = 1/4 = 0.25 Two Heads = { HH } • P(X = 1) = 2/4 = 0.50 One Heads = { HT, TH } Statistical Inference, Exploratory Data Analysis, and the Data Science Process 51 Probability Distribution
  • 48.
    ❖ Modeling ▪ Amodel is representation of a real object or situation. It presents a simplified version of something. ▪ It is an artificial construction to understand and represent the nature of real things. • Model does not has unnecessary detail. ▪ Humans try to understand the world around them using different models. • Architect capture 3-D prints to construct design structures • Biologists capture connection between amino acids to understand protein-protein interactions • Statisticians and Data Scientists capture randomness to comprehend data-generating processes ❖ Data Modeling ▪ Data modeling is the analysis of data objects and their relationships to other data objects. ▪ The model helps us in defining and analyzing data requirements needed to support the business processes in an organization. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 52 Data Modeling
  • 49.
    ❖ Building aModel ▪ Do some Exploratory Data Analysis (EDA) and discover the relationship among the data. ▪ Try to describe the relationship using a mathematical formula. ❖ Model Fitting ▪ Model Fitting (Balance Fitting) • When model fits the training as well as testing data pretty well ▪ Underfitting • When model is unable to fit even the training data ▪ Overfitting • When model fits the training data well but testing data too poor ✓ Noise (undesired data) and higher variability (inconsistency) in data cause the overfitting ✓ Remove noise (data cleaning) and add more training data to train the model. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 53 Data Modeling
  • 50.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 54 Fitting a Data Model Too good to be true. Forced-fitting Too simple to explain the variation in data
  • 51.
  • 52.
    ❖ Exploratory DataAnalysis (EDA) ▪ In statistics, EDA is used to analyze datasets to summarize their main characteristics. ▪ EDA often employ the visual methods to see what the data can tell us beyond the formal modeling or hypothesis testing task. ▪ It is an effort to understand the process that generate the data under observation. • ‘Exploration’ means your understanding of the problem is changing as you go ahead. • Plots, graphs and summary statistics are basic tools of the EDA. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 58 What is EDA?
  • 53.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 59 What is EDA?
  • 54.
    ❖ EDA helpsus to: ▪ Understand the data and its value in business • Discover patterns in data • Spot anomalies (outliers) in data • Verify existing assumptions about data • Make comparisons between the data distributions. • Finding suitable data formats ▪ Improve accuracy of the data-products. ▪ Assure verification of the data-products. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 60 EDA – Why we do it?
  • 55.
    ❖ The DataScience Process ▪ Say, we have data (raw data) on these things ▪ We want to process these data for better analysis ▪ Processing would give us a Clean Dataset to analyze ▪ We’ll be doing some EDA with the clean dataset ▪ EDA will lead us towards a Data Model and an Algorithm ▪ We get the results after using the model and interpret, visualize, or report them ▪ The results are in decision making or as input for a ‘Data Product’. ▪ The Products may be like as: • Recommender system • Business forecasting system • Spam classifier Statistical Inference, Exploratory Data Analysis, and the Data Science Process 61 The Data Science Process
  • 56.
    Data Science Process RawData Collection Data Pre- Processing Clean Dataset Data Processing Visualization/ Communicate Results Data Product Exploratory Data Analysis Make Decisions Reality Business Problem Instrument Data Sources Decision Support, Business Intelligence Recommender Systems Business Forecasting (Prediction)
  • 57.
    ❖ Big Data ▪Big data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from various sources. ▪ It is used when traditional data analysis, processing and storage technologies and techniquesare insufficient. ❖ Big Data Characteristics ▪ Volume ▪ Velocity ▪ Variety ▪ Veracity ▪ Value Statistical Inference, Exploratory Data Analysis, and the Data Science Process 63 The Big Data Approach
  • 58.
    ❖ Descriptive Analysis(What happened) ▪ It is done to answer questions about events that have already occurred. ❖ Diagnostic Analysis (Why did it happen) ▪ It is used to determine the cause of a phenomenon that occurred in the past using questions that focus on the reason behind the event. ❖ Predictive Analysis (What will happen) ▪ It is an attempt to determine the outcome of an event that would occur in the future. ❖ Prescriptive Analysis (How can we make it happen) ▪ Prescriptive analytics are build upon the results of predictive analytics to prescribe the actions that shouldbe taken to improve the business. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 64 Big Data – Analytics Types
  • 59.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 65 Big Data – Analytics Types 2 Diagnostic 3 Predictive 4 Prescriptive 1 Descriptive
  • 60.
    ❖ Phase 1:Learning the business domain and problem discovery ▪ Understand the business process • Study the similar past projects • Identify available resources – people, required skills, technology, time, and data. • Have right mix of domain experts, customers, analytic talent, and project management. ▪ Identifying key stakeholders • Understand their interests in the project • Propose and discuss more than one solutions to the problem ▪ Discover the problem to be solved • Write the problem statement and its justification. • Discuss and refine the problem statement after discussion with the major stakeholder • Establish the criteria for success and failure of the proposed solution Statistical Inference, Exploratory Data Analysis, and the Data Science Process 66 Data Analytics Life Cycle – Phase 1
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 67
Data Analytics Life Cycle – Key Roles
❖ Phase 2: Data preparation
▪ Define the steps to explore and preprocess data before its modeling and analysis
▪ Prepare the analytics sandbox (setup for the experiments)
▪ Perform the Extract-Transform-Load (ETL) process, or ELT; combined, ETLT = ETL + ELT (a minimal ETL sketch follows after this slide)
▪ Understand the target data
▪ Data cleaning – data normalization and transformation
• For better understanding, utilize the maximum of the available data
• Survey and visualize the test dataset
• Complete this highly labor-intensive activity carefully
▪ Data accessing strategies:
• Download a snapshot of the production data
• Use the API facility, if available
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 68
Data Analytics Life Cycle – Phase 2
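A minimal ETL sketch with pandas. The file names, column names, and the normalization rule are hypothetical; in a real project the extract step would pull a snapshot from a production system into the analytics sandbox rather than read a local CSV.

import pandas as pd

# Extract: read a snapshot of the (hypothetical) production data,
# e.g. columns: order_id, amount, country
raw = pd.read_csv("orders_snapshot.csv")

# Transform: clean and normalize before modeling
clean = (
    raw.dropna(subset=["amount"])                                  # drop rows with missing amounts
       .assign(country=lambda d: d["country"].str.strip().str.upper())
       .assign(amount_scaled=lambda d: (d["amount"] - d["amount"].mean())
                                        / d["amount"].std())       # z-score normalization
)

# Load: write the clean dataset into the analytics sandbox
clean.to_csv("orders_clean.csv", index=False)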
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 69
Phase 2 – Sample Dataset Inventory
❖ Phase 2: Data preparation tools
▪ Hadoop
• Can perform massively parallel loading and analysis of large datasets
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining massive unstructured data feeds from multiple sources
▪ Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
▪ OpenRefine (formerly Google Refine)
• A powerful tool for working with large and unstructured datasets; a popular GUI-based tool for performing data transformations
▪ Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 70
Phase 2 – Common tools for data preparation
❖ Phase 3: Planning the data model
▪ Data exploration and variable selection (see the sketch after this slide)
• Perform Exploratory Data Analysis, if required
• Explore associations and relationships among data
• Identify key performance indicators (KPIs)
▪ Select a suitable data analytical method or model
• Keep in mind the requirements of the business
• Consider the type and format of the data attributes
• Consult the domain experts and follow best practices
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 71
Data Analytics Life Cycle – Phase 3
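A minimal variable-selection sketch: rank candidate predictors by their correlation with a target KPI using pandas. The column names ("ad_spend", "site_visits", "discount", "revenue_kpi") and the data are fabricated for illustration only.

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ad_spend":    rng.normal(100, 20, 200),
    "site_visits": rng.normal(5000, 500, 200),
    "discount":    rng.normal(10, 2, 200),
})
# Fabricated KPI that mostly depends on ad_spend and site_visits
df["revenue_kpi"] = 3 * df["ad_spend"] + 0.1 * df["site_visits"] + rng.normal(0, 10, 200)

# Explore associations: absolute correlation of each variable with the KPI
correlations = (df.corr()["revenue_kpi"]
                  .drop("revenue_kpi")
                  .abs()
                  .sort_values(ascending=False))
print(correlations)   # higher values suggest stronger candidate predictors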
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 72
Phase 3 – Selecting an appropriate data analytical model
❖ Phase 3: Common tools for the model planning phase
▪ R – analytical software package
• Has data modeling capabilities and a good environment for building interpretive models
• Can interface with databases via an ODBC connection and execute statistical tests and analyses against Big Data via an open-source connection (a rough Python analog of this database-query pattern follows after this slide)
• Contains nearly 5,000 packages for data analysis and graphical representation
▪ SQL Analysis Services
• Can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models
▪ SAS/ACCESS
• Provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB
• Connects to relational databases (such as Oracle or Teradata), data warehousing applications (e.g., Greenplum or Aster), and enterprise applications such as SAP and Salesforce
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 73
Data Analytics Life Cycle – Phase 3
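The slide describes querying a database directly from the analysis environment (R over ODBC). As a rough analog in Python rather than R, the sketch below uses sqlite3 and pandas; the database file, table, and data are hypothetical and created inline so the example is self-contained.

import sqlite3
import pandas as pd

conn = sqlite3.connect("sandbox.db")              # hypothetical sandbox database

# Create and populate a toy table so the example is self-contained
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 135.5), ("North", 118.0)])
conn.commit()

# Run an in-database aggregation and load the result for further analysis
df = pd.read_sql_query(
    "SELECT region, AVG(amount) AS avg_amount FROM sales GROUP BY region", conn)
print(df)
conn.close()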
❖ Phase 4: Model building
▪ Develop datasets for testing, training, and production purposes (a minimal train/test sketch follows after this slide)
▪ Assess the validity of the model and its results on a small scale
• Have domain experts verify the results of the model
▪ Evaluate the hardware support required to execute the model
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 74
Data Analytics Life Cycle – Phase 4
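A minimal sketch of building a model on a training set and validating it on held-out test data with scikit-learn. The synthetic dataset and the choice of logistic regression are placeholders for whatever prepared data and model a real project would use.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fabricated dataset standing in for the prepared, clean data
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Develop separate training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Build the model on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Assess validity on the held-out test set before considering production use
predictions = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, predictions))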
❖ Phase 4: Common tools for the model building phase
▪ SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data from across the enterprise
• Built for enterprise-level computing and analytics, interoperating with large data stores
▪ IBM SPSS Modeler
• Offers methods to explore and analyze data through a GUI
▪ MATLAB
• Provides a high-level language for performing a variety of data analytics and exploration
▪ Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 75
Data Analytics Life Cycle – Phase 4
❖ Phase 4: Free or open-source tools for the model building phase
▪ WEKA
• A free data mining software package with an analytic workbench; functions created in WEKA can be executed within Java code
▪ Python
• A programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, and pandas, plus data visualization via matplotlib (a small visualization sketch follows after this slide)
▪ R and PL/R
• R was described earlier in the model planning phase; PL/R is a procedural language for PostgreSQL with R, so R commands can be executed in-database
▪ Octave
• A programming language for computational modeling with some of the functionality of MATLAB
• Being freely available, Octave is used in major universities for teaching machine learning
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 76
Data Analytics Life Cycle – Phase 4
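A small sketch of the Python toolkits mentioned above: NumPy to generate data and matplotlib to visualize it. The measurements are fabricated, and the output file name is an arbitrary placeholder.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=1000)   # hypothetical measurements

# Histogram: a quick look at the distribution of a variable
plt.hist(values, bins=30, edgecolor="black")
plt.title("Distribution of a hypothetical variable")
plt.xlabel("value")
plt.ylabel("frequency")
plt.savefig("distribution.png")   # or plt.show() in an interactive session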
❖ Phase 5: Communicate the results
▪ Collaborate with the major stakeholders and evaluate the results
• Identify key findings and quantify their business value (a minimal hypothesis-test sketch follows after this slide)
• The deliverables of this phase will be decisive for the outside stakeholders and sponsors
• Summarize the findings and convey them to the stakeholders
• Make recommendations for future work or improvements to existing processes
▪ Accept failure of an analytical project
• A true failure means the data fail to accept or reject the hypotheses stated in Phase 1
• The analyst should be rigorous enough with the data to determine whether it proves or disproves the hypotheses
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 77
Data Analytics Life Cycle – Phase 5
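A minimal sketch of testing a hypothesis stated in Phase 1 (did a change shift a business metric?) with SciPy's two-sample t-test. The "before" and "after" samples are fabricated; a real project would use the actual measurements and the significance level agreed with stakeholders.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
before = rng.normal(loc=100, scale=15, size=60)   # hypothetical metric before the change
after = rng.normal(loc=108, scale=15, size=60)    # hypothetical metric after the change

t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Communicate the result: reject or fail to reject the null hypothesis
alpha = 0.05
if p_value < alpha:
    print("Evidence of a change in the metric (reject the null hypothesis).")
else:
    print("No evidence of a change (fail to reject the null hypothesis).")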
❖ Phase 6: Operationalize
▪ Communicate the benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production environment
• Learn from the deployment and make any needed adjustments
▪ Properly document and deliver the final reports, briefings, code, and technical documents
• Consult documentation of similar past projects, if available
• Follow documentation standards to increase effectiveness
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 78
Data Analytics Life Cycle – Phase 6
❖ Data
▪ Definition, Importance, Characteristics, Sources and Types
▪ Structured Data, Semi-Structured Data, Un-Structured Data
▪ The information processing cycle
▪ Data Preprocessing (Integration, Cleansing, Reduction, and Transformation)
❖ Statistical Inference
▪ Definition & Objectives, Sampling, Statistical experiment and Probability
❖ Exploratory Data Analysis (EDA)
▪ Definition & Objectives, EDA Process and Example
❖ The Data Science Process
▪ Definition & Objectives, the Process diagram
❖ Data Analytical Life Cycle
▪ Discovery, Data preparation, Model planning, Model building, Communicate results, Operationalize
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 79
Content's Review
You are Welcome! Questions? Comments! Suggestions!!
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 80
Questions? Comments! Suggestions!! Farewell to the Day ☺