Data Science
Department of Information Technology
University of the Punjab – Gujranwala Campus
Data Preprocessing, Statistical Inference, EDA, and the Data Science Process – 2
Compiled & Edited by
Babar Yaqoob Khan
Visiting Lecturer – Data Science
❖ Data
▪ Definition, Types (Structured Data, Semi-Structured Data, Un-Structured Data), Sources, Qualities & Importance
▪ The information processing cycle
▪ Data Preprocessing (Sampling, Cleansing, Aggregation, Dimensionality Reduction, Feature Subset Selection, Feature
Creation, Integration, Discretization and Binarization, and Transformation)
❖ Statistical Inference
▪ Definition & Objectives, Sampling, Statistical experiment and Probability
❖ Exploratory Data Analysis (EDA)
▪ Definition & Objectives, EDA Process and Example
❖ The Data Science Process
▪ Definition & Objectives, the Process diagram
❖ Data Analytical Life Cycle
▪ Discovery, Data preparation, Model planning, Model building, Model evaluation, Communicate results, Operationalize
WHAT IS IN IT FOR YOU?
❖ Data
▪ The facts and figures in raw or unorganized form (such as alphabets, numbers, or symbols) that refer to,
or represent, conditions, ideas, or objects.
▪ Different types or formats of data:
• Numbers, Characters or Strings, Time and Date
• Pictures/Images, Graphs, and Maps
• Documents, E-mails, Tweets, and Newsfeeds etc.
• Audio and Video streams
• Formats: XML, CSV, TSV, SQL, JSON, Text etc.
• Records: user-level data, timestamped event data
▪ Data can be stored in files, data repositories or in databases
Data – Definition, Types, Sources, Qualities & Importance
❑ Nominal scale
❑ Categorical scale
❑ Ordinal scale
❑ Interval scale
❑ Ratio scale
Types of Data Measurements
[Figure: Qualitative vs. Quantitative (Discrete or Continuous) measurement scales, with information content increasing from nominal to ratio]
Nominal:
ID numbers, Names of people, Gender, Blood type, Eye colour, Political Party
Categorical:
Fruits, vegetables, juices, zip codes, sales
Ordinal:
Rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval:
Calendar dates, temperatures in Celsius or Fahrenheit, GRE and IQ scores
Ratio:
Mass, length, counts, money
Types of Data Measurements: Examples
❖ Data Sources
▪ Business activities – sale & purchase of products
▪ Manufacturing process – production and assembling of products
▪ Transportation – transportation of people and products from place to place
▪ Sensing & monitoring – data from sensors (in space, oceans, etc.) and CCTV cameras
▪ Human interaction – emails, audio, video and textual communication
▪ … … ….
Data – Definition, Types, Sources, Qualities & Importance
Data Pre-processing
Data Science Process
[Figure: Reality / Business Problem → Data Sources (instrument) → Raw Data Collection → Data Pre-Processing → Clean Dataset → Exploratory Data Analysis → Data Processing → Visualization / Communicate Results → Make Decisions / Data Product (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting / Prediction)]
❖ Population (N)
▪ Includes all of the elements from a set of data e.g.,
• The entire US population i.e., 341.97 million (341,963,408) or
• The entire Pakistan population i.e., 252.37 million (252,363,571)
• The entire world population i.e., 8.2 billion
• Set of objects, such as tweets or photographs
❖ Sample (n)
▪ Consists of one or more observations drawn from the population
n < N
Population vs. Sample
❖ Sampling
• A technique for selecting a subset of data from the population
• Often used both for preliminary investigation and the final data analysis
❖ Sampling Types
▪ Simple Random Sampling
• Equal probability of selecting any item
▪ Stratified Sampling
• Split the data into partitions and draw random samples from each partition
Sampling & Types
❖ Sampling
• A technique for selecting a subset of data from the population
• Often used both for preliminary investigation and the final data analysis
❖ Sampling Types
▪ Systematic Sampling
• Select every nth item from a list.
For instance, from a list of 1,000 people, choosing every 10th person yields a sample of 100
▪ Cluster Sampling
• The population is divided into clusters, usually based on geographical areas or natural groupings. A
few clusters are randomly selected, and all members within those clusters are surveyed
Sampling & Types
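To make these sampling types concrete, here is a minimal, hedged sketch in Python (pandas/numpy); the DataFrame `df`, the `region` column, and the sizes are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 1,000 people with a region attribute
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "person_id": range(1000),
    "region": rng.choice(["North", "South", "East", "West"], size=1000),
})

# Simple random sampling: every item has an equal chance of selection
simple_random = df.sample(n=100, random_state=42)

# Stratified sampling: split into partitions (strata) and draw a random sample from each
stratified = df.groupby("region", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42)
)

# Systematic sampling: select every 10th person from the list
systematic = df.iloc[::10]

# Cluster sampling: randomly pick a few regions and keep all of their members
chosen_regions = rng.choice(df["region"].unique(), size=2, replace=False)
cluster = df[df["region"].isin(chosen_regions)]

print(len(simple_random), len(stratified), len(systematic), len(cluster))
```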
Sample Size
Ideal Ratio: 70:30
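The slide's figure did not survive extraction; assuming the 70:30 ratio refers to the common training/testing split, a minimal scikit-learn sketch (with illustrative arrays `X` and `y`) looks like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1,000 samples with 5 features and a binary label
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Hold out 30% of the data for testing, keep 70% for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (700, 5) (300, 5)
```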
❖ Data in the real world is dirty
❖ GIGO (garbage in, garbage out) – good data is a prerequisite for producing effective models of any type
Incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
e.g., occupation=“ ”
Noisy: containing errors or outliers
e.g., Salary=“-10”
Inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
Why Data Pre-processing?
Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations between the time when the data was collected and when it is analysed.
– Human/hardware/software problems
Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modify some linked data)
Why is Data Dirty?
❖ Data Cleaning
▪ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
❖ Data Integration
▪ Integration of multiple databases or files
❖ Data Transformation
▪ Normalization and aggregation
❖ Data Reduction
▪ Obtains reduced representation in volume but produces the same or similar analytical
results
❖ Data Discretization & Binarization
▪ Part of data reduction but with particular importance for numerical data
Data Preprocessing – Major Tasks
Forms of Data Preprocessing
❖ Importance
▪ Garbage in Garbage out Principle (GIGO)
❖ Data Cleaning Tasks
• Fill in missing values
• Identify outliers and manage noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Data Preprocessing – Cleaning
❖ Missing Data
❑ Data is not always available
▪ E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
❑ Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• history or changes of the data not being registered
❑ Missing data may need to be inferred
Data Preprocessing – Cleaning
❖ How to Handle Missing Data?
❖ Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
❖ Fill in the missing value manually: tedious + infeasible?
❖ Fill it in automatically with
▪ a global constant : e.g. “unknown”, a new class?!
▪ the attribute mean for all data points belonging to the same class: smarter
▪ the most probable value: inference-based such as Bayesian formula or decision tree
Data Preprocessing – Cleaning
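A minimal pandas sketch of the automatic strategies above (the small DataFrame and its column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 42.0, None, 40.0],
    "city":   ["Lahore", None, "Multan", "Quetta", None],
})

# Ignore the tuple: drop rows that contain missing values
dropped = df.dropna()

# Fill with a global constant, e.g. "unknown"
df["city"] = df["city"].fillna("unknown")

# Fill with the attribute mean of data points belonging to the same class
df["income"] = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

# (The "most probable value" strategy would instead predict the missing entry
#  with a model, e.g. a decision tree, trained on the other attributes.)
print(df)
```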
❖ How to Handle Noisy Data ?
❖ Binning
▪ First sort data and partition into (equal-frequency) bins
▪ Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
❖ Regression
▪ Smooth by fitting the data into regression functions
❖ Clustering
▪ Detect and remove outliers
❖ Combined Computer and Human Inspection
▪ Detect suspicious values and check by human (e.g., deal with possible outliers)
Data Preprocessing – Cleaning
❖ Simple Discretization Methods: Binning
❖ Equal-width (distance) partitioning
▪ Divides the range into N intervals of equal size: uniform grid
▪ If A and B are the lowest and highest values of the attribute, the width of the intervals will be:
W = (B – A)/N
▪ The most straightforward, but outliers may dominate presentation
▪ Skewed data is not handled well
❖ Equal-depth (frequency) partitioning
▪ Divides the range into N intervals, each containing approximately the same number of data points
▪ Good data scaling
▪ Managing categorical attributes can be tricky
Data Preprocessing – Cleaning
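A minimal pandas sketch contrasting the two partitioning schemes (`pd.cut` gives equal-width bins, `pd.qcut` gives roughly equal-frequency bins; the price values are illustrative):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width (distance) partitioning: W = (max - min) / N
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: roughly the same number of points per bin
equal_depth = pd.qcut(prices, q=3)

# Smoothing by bin means: replace each value by the mean of its bin
smoothed = prices.groupby(equal_depth, observed=True).transform("mean")

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
print(smoothed.tolist())
```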
Data Preprocessing – Cleaning
[Figures: smoothing noisy data by Binning, Regression, and Clustering]
Data Preprocessing – Integration
Data integration:
❖ Combines data from multiple sources into a coherent store
❖ Schema integration: e.g., A.cust-id ≡ B.cust-#
▪ Integrate metadata from different sources
❖ Entity identification problem:
▪ Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
❖ Detecting and resolving data value conflicts
▪ For the same real world entity, attribute values from different sources are different
▪ Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration
❖ Redundant data often occur when integrating multiple databases
▪ Object identification: The same attribute or object may have different names in
different databases
▪ Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue (from monthly income data)
❖ Redundant attributes may be detected by correlation analysis
❖ Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
Data Preprocessing – Data Integration
Correlation Analysis (Numerical Data)
Data Preprocessing – Data Integration
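The correlation formula on this slide was an image and did not survive extraction; assuming it was Pearson's product-moment coefficient (the usual choice for numerical data), a minimal numpy sketch with made-up attributes A and B:

```python
import numpy as np

# Illustrative numerical attributes A and B (e.g., height and weight)
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson's r = sum((A - mean_A) * (B - mean_B)) / ((n - 1) * std_A * std_B)
n = len(A)
r_manual = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))

# The same value from numpy's correlation matrix
r_numpy = np.corrcoef(A, B)[0, 1]

print(r_manual, r_numpy)  # r > 0: positively correlated; r near 0: uncorrelated; r < 0: negatively correlated
```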
Data Preprocessing – Data Integration
Correlation Analysis (Categorical Data)
                           Play Chess   Not Play Chess   Sum (row)
Like Science Fiction           250            200           450
Not Like Science Fiction        50           1000          1050
Sum                            300           1200          1500
Probability to play chess: P(chess) = 300/1500 = 0.2
Probability to like science fiction: P(SciFi) = 450/1500 = 0.3
If science fiction and chess playing are independent attributes, then the
probability to like SciFi AND play chess is
P(SciFi, chess) = P(SciFi) · P(chess) = 0.06
That means, we expect 0.06 · 1500 = 90 such cases (if they are independent)
Correlation Analysis (Categorical Data)
                           Play Chess   Not Play Chess   Sum (row)
Like Science Fiction        250 (90)         200           450
Not Like Science Fiction        50           1000          1050
Sum                            300           1200          1500
Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):
Χ² = Σ (observed − expected)² / expected = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840 ≈ 507.93
Since 507.93 is far above the critical value of 10.828 (1 degree of freedom at the 0.001 significance level), it shows that like_science_fiction and play_chess are correlated in the group!
Correlation Analysis (Categorical Data)
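A minimal sketch that reproduces the χ² computation for this contingency table with scipy (Yates' continuity correction is disabled so the result matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts:       play chess   not play chess
observed = np.array([[250, 200],      # like science fiction
                     [50, 1000]])     # not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(expected)   # [[ 90. 360.] [210. 840.]] -- expected counts under independence
print(chi2)       # ~507.93, far above the 0.001 critical value for 1 dof (10.828)
print(p_value)    # ~0, so like_science_fiction and play_chess are correlated
```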
❖ Data Reduction (Dimensionality Reduction)
• Obtains reduced representation in volume but produces the same or similar analytical
results
✓ Feature Subset Selection / Principal Component Analysis (PCA)
✓ Singular Value Decomposition (SVD)
❖ Data Discretization (Dimensionality Reduction)
• Part of data reduction but with particular importance for numerical data
• Also called “binning”
Data Preprocessing – Reduction, Discretization
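A minimal scikit-learn sketch of PCA as a dimensionality-reduction step; the synthetic, correlated feature matrix stands in for real data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative feature matrix: 200 samples, 10 correlated attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7)) + 0.01 * rng.normal(size=(200, 7))])

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # e.g. (200, 10) -> (200, 3)
print(pca.explained_variance_ratio_.round(3))  # variance captured by each component
```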
❖ Data Preprocessing – Transformation
• Maps entire set of values of an attribute to a new set of values
• Data standardization and normalization (by clustering and binning)
✓ Smoothing: remove noise from data
✓ Aggregation: summarization
✓ Generalization: concept hierarchy climbing
✓ Normalization: scaled to fall within a small, specified range
✓ min-max normalization
✓ z-score normalization
✓ normalization by decimal scaling
✓ Attribute/feature construction
✓ New attributes constructed from the given ones
Data Preprocessing – Transformation
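A minimal numpy sketch of the three normalization schemes listed above; the `values` array is illustrative and assumed to contain magnitudes of at least 1 for the decimal-scaling step:

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: scale into a new range, here [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (values - values.mean()) / values.std()

# Decimal scaling: divide by 10^j, where j is the number of digits
# in the largest absolute value, so every scaled value is below 1
j = len(str(int(np.abs(values).max())))
decimal_scaled = values / (10 ** j)

print(min_max)
print(z_score.round(3))
print(decimal_scaled)  # j = 4 here, so 1000 -> 0.1
```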
❖ Feature Creation
• Original attributes not always best representation of information
• Creates new features which are more efficient/focused
❖ Methodologies
▪ Feature Extraction – Domain Specific
• Derived features
▪ Feature Construction
• Combine multiple features to construct new feature(s)
▪ Mapping Data to New Space
• Fourier Transform - what frequencies are present in your signal
• Wavelet Transform - what frequencies are present and where (or at what scale)
Data Preprocessing – Feature Creation
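A minimal numpy sketch of mapping data to a new space with the Fourier transform; the synthetic signal is assumed to contain 5 Hz and 12 Hz components:

```python
import numpy as np

# Synthetic signal sampled at 100 Hz: a 5 Hz and a 12 Hz sine wave plus noise
fs = 100
t = np.arange(0, 2, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
signal += 0.1 * np.random.default_rng(1).normal(size=t.size)

# Map to the frequency domain: which frequencies are present?
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# The dominant frequencies become new, more focused features
top = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top))  # approximately [5.0, 12.0]
```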
❖ Discretization & Binarization
• Converting the data into discrete form and, where needed, binarizing it to accommodate certain machine
learning algorithms/models
Data Preprocessing – Discretization & Binarization
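A minimal scikit-learn/pandas sketch: discretize a numeric attribute into ordinal bins, then binarize (one-hot encode) a categorical attribute; the `ages` and `colors` data are made up:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[22], [25], [31], [38], [45], [52], [63], [70]])

# Discretization: map continuous ages onto 3 ordinal bins of equal width
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = disc.fit_transform(ages).ravel()

# Binarization: one-hot encode a categorical attribute into 0/1 columns
colors = pd.Series(["red", "green", "blue", "red", "blue", "green", "red", "blue"])
one_hot = pd.get_dummies(colors, prefix="color")

print(age_bins)       # e.g. [0. 0. 0. 1. 1. 1. 2. 2.]
print(one_hot.head())
```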
Statistical Inference
❖ Population
▪ Includes all of the elements from a set of data e.g.,
• The entire Pakistani population i.e., 247 million or
• The entire world population i.e., 8 billion
• Set of objects, such as tweets or photographs
❖ Sample
▪ It consists of one or more items drawn from the population
▪ For example, 1000 Pakistanis selected from all provinces of Pakistan
▪ Size of sample (n) always less than size of the population (N)
▪ Sample may not be totally representative of the population
Statistical Inference – Population and Sample
Sample < Population
n < N
❖ Statistical inference
▪ It is the process of estimating the parameters of a population using random sampling.
▪ Inference also tests the reliability of the estimates by quantifying their uncertainty.
❖ Purpose and benefits
▪ Enables us to understand the population without studying all of its items.
▪ Minimizes the cost of understanding the population.
▪ It remains the only possible option when the whole population is not accessible.
Statistical Inference
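A minimal sketch of the idea: estimate a population parameter (the mean) from a random sample and attach a measure of uncertainty; the "population" here is synthetic, and the 1.96 factor assumes an approximate 95% normal confidence interval:

```python
import numpy as np

# Synthetic population of N = 1,000,000 incomes (parameters unknown in practice)
rng = np.random.default_rng(7)
population = rng.normal(loc=50_000, scale=12_000, size=1_000_000)

# Draw a random sample of n = 1,000 and estimate the population mean
sample = rng.choice(population, size=1_000, replace=False)
estimate = sample.mean()

# Quantify uncertainty: standard error and an approximate 95% confidence interval
std_err = sample.std(ddof=1) / np.sqrt(sample.size)
ci = (estimate - 1.96 * std_err, estimate + 1.96 * std_err)

print(round(estimate, 1), [round(x, 1) for x in ci])
print(round(population.mean(), 1))  # the true parameter, normally unknown
```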
Statistical Inference
[Figure: Population → (Sampling) → Sample → (Statistical Inference) → Parameter Estimation]
❖ A statistical experiment has three properties:
▪ The experiment can have more than one possible outcome
▪ Each possible outcome can be specified in advance
▪ The outcome of the experiment depends on chance
❖ For instance, toss a coin
▪ Outcomes are:
• More than one: Head or Tail
• Specified in advance: { Head, Tail }
• Depends on chance: unknown in advance unless the coin is tossed (50% chance of Head, and vice versa)
Statistical Experiment
❖ Variable or Parameter
▪ It represents the value of an attribute of an item in the population, e.g., the name or color of an item.
▪ A random variable can take on any of the specified values (domain).
▪ A random variable takes a value after a statistical experiment.
Variables or Parameters
[Figure: 𝒙 = 7 → 𝒙 is a Variable; 𝒙 taking its value from a statistical experiment → 𝒙 is a Random Variable]
❖ Probability
▪ It is the measure of the likelihood of an event happening.
▪ A quantitative measure that always takes a value between 0 and 1.
❖ Example:
▪ Tossing a coin is a statistical experiment.
▪ It can result in two outcomes:
• Head or a Tail
▪ Calculating the chance of getting a ‘head’ is calculating its probability
Probability
Probability = Favorable outcomes / Possible outcomes
For a fair coin with outcomes heads (H) and tails (T): P(H) = 1/2 = 0.5 and P(T) = 1/2 = 0.5
❖ Probability distribution
▪ It links each outcome of a statistical experiment with its probability of occurrence
▪ For instance, you toss a coin two times
▪ Possible outcomes = {HH, HT, TH, TT}
▪ Let X = number of Heads
▪ Possible outcomes = { 0, 1, 2 }
• P(X = 0) = 1/4 = 0.25 No Heads = { TT }
• P(X = 2) = 1/4 = 0.25 Two Heads = { HH }
• P(X = 1) = 2/4 = 0.50 One Head = { HT, TH }
Probability Distribution
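A minimal Python sketch that reproduces this distribution by enumerating the four equally likely outcomes of two coin tosses:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All equally likely outcomes of tossing a coin twice
outcomes = list(product("HT", repeat=2))   # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

# X = number of Heads in each outcome
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution of X
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}
print({x: str(p) for x, p in distribution.items()})  # {0: '1/4', 1: '1/2', 2: '1/4'}
```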
❖ Modeling
▪ A model is a representation of a real object or situation. It presents a simplified version of something.
▪ It is an artificial construction to understand and represent the nature of real things.
• A model does not have unnecessary detail.
▪ Humans try to understand the world around them using different models.
• Architects use 3-D prints and models to design structures
• Biologists capture connections between amino acids to understand protein-protein interactions
• Statisticians and Data Scientists capture randomness to comprehend data-generating processes
❖ Data Modeling
▪ Data modeling is the analysis of data objects and their relationships to other data objects.
▪ The model helps us in defining and analyzing data requirements needed to support the business
processes in an organization.
Data Modeling
❖ Building a Model
▪ Do some Exploratory Data Analysis (EDA) and discover the relationship among the data.
▪ Try to describe the relationship using a mathematical formula.
❖ Model Fitting
▪ Model Fitting (Balance Fitting)
• When the model fits the training as well as the testing data well
▪ Underfitting
• When the model is unable to fit even the training data
▪ Overfitting
• When the model fits the training data well but the testing data poorly
✓ Noise (undesired data) and higher variability (inconsistency) in data cause overfitting
✓ Remove noise (data cleaning) and add more training data to train the model.
Data Modeling
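A minimal numpy sketch of underfitting, balanced fitting, and overfitting using polynomial models of increasing degree; the "reality" curve, the noise level, and the 70:30 split are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)   # noisy "reality"

# 70:30 split into training and testing data
idx = rng.permutation(x.size)
train, test = idx[:28], idx[28:]

def errors(degree):
    """Fit a polynomial of the given degree on training data, report train/test MSE."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x)
    return ((pred[train] - y[train]) ** 2).mean(), ((pred[test] - y[test]) ** 2).mean()

for degree in (1, 4, 15):
    train_err, test_err = errors(degree)
    print(degree, round(train_err, 3), round(test_err, 3))
# degree 1:  high train and test error (underfitting)
# degree 4:  both errors low (balanced fit)
# degree 15: train error near zero, test error typically much larger (overfitting)
```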
Fitting a Data Model
[Figure: overfitting ('too good to be true', forced-fitting) vs. underfitting ('too simple to explain the variation in data')]
Exploratory Data Analysis
❖ Exploratory Data Analysis (EDA)
▪ In statistics, EDA is used to analyze datasets to summarize their main characteristics.
▪ EDA often employs visual methods to see what the data can tell us beyond the formal modeling or
hypothesis-testing task.
▪ It is an effort to understand the process that generates the data under observation.
• ‘Exploration’ means your understanding of the problem changes as you go ahead.
• Plots, graphs and summary statistics are the basic tools of EDA.
What is EDA?
❖ EDA helps us to:
▪ Understand the data and its value in business
• Discover patterns in data
• Spot anomalies (outliers) in data
• Verify existing assumptions about data
• Make comparisons between the data distributions.
• Find suitable data formats
▪ Improve the accuracy of data products.
▪ Assure verification of data products.
EDA – Why we do it?
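A minimal pandas/matplotlib sketch of a first EDA pass (summary statistics, an outlier check, and a distribution plot); the dataset and its column names are synthetic:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=500),
    "income": np.append(rng.normal(50_000, 10_000, size=495), [250_000] * 5),  # a few outliers
})

# Summary statistics: the most basic EDA tool
print(df.describe())

# Spot anomalies: flag values more than 3 standard deviations from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
print("potential outliers:", (z.abs() > 3).sum())

# Discover patterns visually
df["income"].hist(bins=40)
plt.title("Income distribution")
plt.show()
```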
❖ The Data Science Process
▪ Say, we have data (raw data) on these things
▪ We want to process these data for better analysis
▪ Processing would give us a Clean Dataset to analyze
▪ We’ll be doing some EDA with the clean dataset
▪ EDA will lead us towards a Data Model and an Algorithm
▪ We get the results after using the model and interpret, visualize, or report them
▪ The results are used in decision making or as input for a ‘Data Product’.
▪ The products may be, for example:
• Recommender system
• Business forecasting system
• Spam classifier
The Data Science Process
Data Science Process
[Figure: Reality / Business Problem → Data Sources (instrument) → Raw Data Collection → Data Pre-Processing → Clean Dataset → Exploratory Data Analysis → Data Processing → Visualization / Communicate Results → Make Decisions / Data Product (Decision Support, Business Intelligence, Recommender Systems, Business Forecasting / Prediction)]
❖ Big Data
▪ Big data is a field dedicated to the analysis,
processing, and storage of large collections of data
that frequently originate from various sources.
▪ It is used when traditional data analysis, processing, and storage technologies and techniques are
insufficient.
❖ Big Data Characteristics
▪ Volume
▪ Velocity
▪ Variety
▪ Veracity
▪ Value
The Big Data Approach
❖ Descriptive Analysis (What happened)
▪ It is done to answer questions about events that have already occurred.
❖ Diagnostic Analysis (Why did it happen)
▪ It is used to determine the cause of a phenomenon that occurred in the past using questions that focus
on the reason behind the event.
❖ Predictive Analysis (What will happen)
▪ It is an attempt to determine the outcome of an event that would occur in the future.
❖ Prescriptive Analysis (How can we make it happen)
▪ Prescriptive analytics builds upon the results of predictive analytics to prescribe the actions that
should be taken to improve the business.
Big Data – Analytics Types
Big Data – Analytics Types
[Figure: 1 Descriptive, 2 Diagnostic, 3 Predictive, 4 Prescriptive]
❖ Phase 1: Learning the business domain and problem discovery
▪ Understand the business process
• Study the similar past projects
• Identify available resources – people, required skills, technology, time, and data.
• Have right mix of domain experts, customers, analytic talent, and project management.
▪ Identifying key stakeholders
• Understand their interests in the project
• Propose and discuss more than one solution to the problem
▪ Discover the problem to be solved
• Write the problem statement and its justification.
• Discuss and refine the problem statement after discussion with the major stakeholder
• Establish the criteria for success and failure of the proposed solution
Data Analytics Life Cycle – Phase 1
Data Analytics Life Cycle – Key Roles
❖ Phase 2: Data preparation
▪ Define the steps to explore and preprocess data before its modeling and analysis.
▪ Prepare the analytics sandbox (setup for the experiments)
▪ Perform the Extract Transform Load (ETL) process (or ELT). → ETLT = ETL + ELT
▪ Understand the target data
▪ Data cleaning – data normalization and transformation
• For better understanding, utilize as much of the available data as possible
• Survey and visualize the test dataset
• Carefully complete the highly labor-intensive activity
▪ Data accessing strategies:
• Download snapshot of the production data
• Use the API facility, if available
Data Analytics Life Cycle – Phase 2
Phase 2 – Sample Dataset Inventory
❖ Phase 2: Data preparation tools
▪ Hadoop
• It can perform massively parallel loading and analysis of large datasets.
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining of massive
unstructured data feeds from multiple sources.
▪ Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
▪ Open Refine (Google Refine)
• A powerful tool for working with large and unstructured datasets. It is a popular GUI-based tool for
performing data transformations.
▪ Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset.
Phase 2 – Common tools for data preparation
❖ Phase 3: Planning the data model
▪ Data exploration and variable selection
• Perform Exploratory Data Analysis, if required.
• Explore associations & relationships among data
• Identify key performance indicators (KPIs)
▪ Selecting suitable data analytical method or model
• Keep in mind requirements of the business
• Consider the type and format of data attributes
• Consult the domain experts and follow the best practices
Data Analytics Life Cycle – Phase 3
Phase 3 – Selecting appropriate data analytical model
❖ Phase 3: Common tools for the model planning phase
▪ R - Analytical Software Package
• It has data modeling capabilities and a good environment for building interpretive models
• R has ability to interface with databases via an ODBC connection and execute statistical tests and
analyses against Big Data via an open source connection.
• R contains nearly 5,000 packages for data analysis and graphical representation.
▪ SQL Analysis services
• It can perform in-database analytics of common data mining functions, involved aggregations, and
basic predictive models.
▪ SAS/ACCESS
• Provides integration between SAS and the analytics sandbox via multiple data connectors such as
ODBC, JDBC, and OLE DB. Connectivity to relational databases (such as Oracle or Teradata) and data
warehousing applications (e.g., Greenplum or Aster)
• Enterprise applications such as SAP and Salesforce.
Data Analytics Life Cycle – Phase 3
❖ Phase 4: Model building
▪ Develop datasets for testing, training, and production purposes.
▪ Assess validity of the model and its results on small scale
• Verify result of the model from domain experts
▪ Evaluate the required hardware support to execute the model
Data Analytics Life Cycle – Phase 4
❖ Phase 4: Common tools for the model building phase
▪ SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data from across
the enterprise.
• It is built for enterprise-level computing and analytics by interoperating with large data stores.
▪ SPSS Modeler (IBM SPSS Modeler)
• Offers methods to explore and analyze data through a GUI.
▪ MatLab
• Provides a high-level language for performing a variety of data analytics and exploration.
▪ Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools.
Data Analytics Life Cycle – Phase 4
❖ Phase 4: Free or Open Source tools for the model building phase
▪ WEKA
• A free data mining software package with an analytic workbench. The functions created in WEKA
can be executed within Java code.
▪ Python
• It is a programming language that provides toolkits for machine learning and analysis, such as scikit-
learn, numpy, scipy, pandas, and related data visualization using matplotlib.
▪ R and PL/R
• R was described earlier in the model planning phase; PL/R is a procedural language for
PostgreSQL with R. Using this approach means that R commands can be executed in-database.
▪ Octave
• A programming language for computational modeling having some functionality of MatLab.
• Being freely available, Octave is used in major universities when teaching machine learning.
Data Analytics Life Cycle – Phase 4
❖ Phase 5: Communicate the results
▪ Collaborate with the major stakeholders and evaluate the results
• Identify key findings, quantify their business value.
• The deliverable of this phase will be decisive for the outside stakeholders and sponsors
• Summarize the findings and convey to the stakeholders.
• Make recommendations for future work or improvements to existing processes
▪ Accept failure of an analytical project
• A true failure means the data fails to accept or reject the hypotheses stated in Phase 1.
• The analyst should be rigorous enough with the data to determine whether it proves or disproves the
hypotheses
Data Analytics Life Cycle – Phase 5
❖ Phase 6: Operationalize
▪ Communicate the benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production environment.
• Learn from the deployment and make any needed adjustments.
▪ Properly document and deliver the final reports, briefings, code, and technical documents.
• Consult documentation of the similar past projects, if available.
• Follow the documentation standards to increase its effectiveness.
Data Analytics Life Cycle – Phase 6
❖ Data
▪ Definition, Importance, Characteristics, Sources and Types
▪ Structured Data, Semi-Structured Data, Un-Structured Data
▪ The information processing cycle
▪ Data Preprocessing (Integration, Cleansing, Reduction, and Transformation)
❖ Statistical Inference
▪ Definition & Objectives, Sampling, Statistical experiment and Probability
❖ Exploratory Data Analysis (EDA)
▪ Definition & Objectives, EDA Process and Example
❖ The Data Science Process
▪ Definition & Objectives, the Process diagram
❖ Data Analytical Life Cycle
▪ Discovery, Data preparation, Model planning, Model building, Communicate results, Operationalize
Contents Review
You are Welcome!
Questions? Comments! Suggestions!
Farewell to the Day ☺
Statistical Inference, Exploratory Data Analysis, and the Data Science Process.pdf

  • 1.
    Data Department of InformationTechnology University of the Punjab – Gujranwala Campus Science Data Preprocessing, Statistical Inference, EDA, and the Data Science Process 2 Compiled & Edited by Babar Yaqoob Khan Visiting Lecturer – Data Science
  • 2.
    ❖ Data ▪ Definition,Types (Structured Data, Semi-Structured Data, Un-Structured Data), Sources, Qualities & Importance ▪ The information processing cycle ▪ Data Preprocessing (Sampling, Cleansing, Aggregation, Dimensionality Reduction, Feature Subset Selection, Feature Creation, Integration, Discretization and Binarization, and Transformation) ❖ Statistical Inference ▪ Definition & Objectives, Sampling, Statistical experiment and Probability ❖ Exploratory Data Analysis (EDA) ▪ Definition & Objectives, EDA Process and Example ❖ The Data Science Process ▪ Definition & Objectives, the Process diagram ❖ Data Analytical Life Cycle ▪ Discovery, Data preparation , Model planning , Model building , Model Evaluation, Communicate results, Operationalize Statistical Inference, Exploratory Data Analysis, and the Data Science Process 2 WHAT IS IN IT FOR YOU?
  • 3.
    ❖ Data ▪ Thefacts and figures in raw or unorganized form (such as alphabets, numbers, or symbols) that refer to, or represent, conditions, ideas, or objects. ▪ Different types or formats of data: • Numbers, Characters or Strings, Time and Date • Pictures/Images, Graphs, and Maps • Documents, E-mails, Tweets, and Newsfeeds etc. • Audio and Video streams • Formats: XML, CSV, TSV, SQL, JSON, Text etc. • Records: user-level data, timestamped event data ▪ Data can be stored in files, data repositories or in databases Statistical Inference, Exploratory Data Analysis, and the Data Science Process 3 Data – Definition, Types, Sources, Qualities & Importance
  • 4.
    ❑ Nominal scale ❑Categoricalscale ❑ Ordinal scale ❑ Interval scale ❑ Ratio scale Statistical Inference, Exploratory Data Analysis, and the Data Science Process 4 Types of Data Measurements Qualitative Quantitative Discrete Continuous More Information Content
  • 5.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 5
  • 6.
    Nominal: ID numbers, Namesof people, Gender, Blood type, Eye colour, Political Party Categorical: Fruits, vegetables, juices, zip codes, sales Ordinal: Rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} Interval: Calendar dates, temperatures in Celsius or Fahrenheit, GRE and IQ scores Ratio: Mass, length, counts, money Statistical Inference, Exploratory Data Analysis, and the Data Science Process 6 Types of Data Measurements: Examples
  • 7.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 7
  • 11.
    ❖ Data Sources ▪Business activities – sale & purchase of products ▪ Manufacturing process – production and assembling of products ▪ Transportation – transportation of people and products from place to place ▪ Sensing & monitoring – data from sensors (in space and oceans etc. ) and CCTV cameras ▪ Human interaction – emails, audio, video and textual communication ▪ … … …. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 11 Data – Definition, Types, Sources, Qualities & Importance
  • 12.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 15 Data Pre-processing
  • 13.
    Data Science Process RawData Collection Data Pre- Processing Clean Dataset Data Processing Visualization/ Communicate Results Data Product Exploratory Data Analysis Make Decisions Reality Business Problem Instrument Data Sources Decision Support Business Intelligence Recommender Systems Business Forecasting (Prediction)
  • 14.
    ❖ Population (N) ▪Includes all of the elements from a set of data e.g., • The entire US population i.e., 341.97 million (341,963,408) or • The entire Pakistan population i.e., 252.37 million (252,363,571) • The entire world population i.e., 8.2 billion • Set of objects, such as tweets or photographs ❖ Sample (n) ▪ Consists of one or more observations drawn from the population Statistical Inference, Exploratory Data Analysis, and the Data Science Process 17 n < N Population vs. Sample
  • 15.
    ❖ Sampling • Techniquemainly employed for data selection from population • Often used both for preliminary investigation and the final data analysis ❖ Sampling Types ▪ Simple Random Sampling • Equal probability of selecting any item ▪ Stratified Sampling • Split the data into partitions and draw random samples from each partition Statistical Inference, Exploratory Data Analysis, and the Data Science Process 18 Sampling & Types
  • 16.
    ❖ Sampling • Techniquemainly employed for data selection from population • Often used both for preliminary investigation and the final data analysis ❖ Sampling Types ▪ Systematic Sampling • Select every nth item from a list. For instance, if you have a list of 1,000 people and you choose every 10th person ▪ Cluster Sampling • The population is divided into clusters, usually based on geographical areas or natural groupings. A few clusters are randomly selected, and all members within those clusters are surveyed Statistical Inference, Exploratory Data Analysis, and the Data Science Process 19 Sampling & Types
  • 17.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 20 Sample Size Ideal Ratio: 70:30
  • 18.
    ❖ Data inthe real world is dirty ❖ GIGO - good data is a prerequisite for producing effective models of any type Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ” Noisy: containing errors or outliers e.g., Salary=“-10” Inconsistent: containing discrepancies in codes or names e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records Statistical Inference, Exploratory Data Analysis, and the Data Science Process 21 Why Data Pre-processing?
  • 19.
    Incomplete data maycome from – “Not applicable” data value when collected – Different considerations between the time when the data was collected and when it is analysed. – Human/hardware/software problems Noisy data (incorrect values) may come from – Faulty data collection instruments – Human or computer error at data entry – Errors in data transmission Inconsistent data may come from – Different data sources – Functional dependency violation (e.g., modify some linked data) Statistical Inference, Exploratory Data Analysis, and the Data Science Process 22 Why is Data Dirty?
  • 20.
    ❖ Data Cleaning ▪Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies ❖ Data Integration ▪ Integration of multiple databases or files ❖ Data Transformation ▪ Normalization and aggregation ❖ Data Reduction ▪ Obtains reduced representation in volume but produces the same or similar analytical results ❖ Data Discretization & Binarization ▪ Part of data reduction but with particular importance for numerical data Statistical Inference, Exploratory Data Analysis, and the Data Science Process 23 Data Preprocessing – Major Tasks
  • 21.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 24 Forms of Data Preprocessing n
  • 22.
    ❖ Importance ▪ Garbagein Garbage out Principle (GIGO) ❖ Data Cleaning Tasks • Fill in missing values • Identify outliers and Managing noisy data • Correct inconsistent data • Resolve redundancy caused by data integration Statistical Inference, Exploratory Data Analysis, and the Data Science Process 26 Data Preprocessing – Cleaning
  • 23.
    ❖ Missing Data ❑Data is not always available ▪ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data ❑ Missing data may be due to • equipment malfunction • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not be considered important at the time of entry • not register history or changes of the data ❑ Missing data may need to be inferred Statistical Inference, Exploratory Data Analysis, and the Data Science Process 27 Data Preprocessing – Cleaning
  • 24.
    ❖ How toHandle Missing Data? ❖ Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. ❖ Fill in the missing value manually: tedious + infeasible? ❖ Fill in it automatically with ▪ a global constant : e.g. “unknown”, a new class?! ▪ the attribute mean for all data points belonging to the same class: smarter ▪ the most probable value: inference-based such as Bayesian formula or decision tree Statistical Inference, Exploratory Data Analysis, and the Data Science Process 28 Data Preprocessing – Cleaning
  • 25.
    ❖ How toHandle Noisy Data ? ❖ Binning ▪ First sort data and partition into (equal-frequency) bins ▪ Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. ❖ Regression ▪ Smooth by fitting the data into regression functions ❖ Clustering ▪ Detect and remove outliers ❖ Combined Computer and Human Inspection ▪ Detect suspicious values and check by human (e.g., deal with possible outliers) Statistical Inference, Exploratory Data Analysis, and the Data Science Process 29 Data Preprocessing – Cleaning
  • 26.
    ❖ Simple DiscretizationMethods: Binning ❖ Equal-width (distance) partitioning ▪ Divides the range into N intervals of equal size: uniform grid ▪ If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (Max – Min)/N. ▪ The most straightforward, but outliers may dominate presentation ▪ Skewed data is not handled well ❖ Equal-depth (frequency) partitioning ▪ Divides the range into N intervals each containing approximately same number of data points ▪ Good data scaling ▪ Managing categorical attributes can be tricky Statistical Inference, Exploratory Data Analysis, and the Data Science Process 30 Data Preprocessing – Cleaning
  • 27.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 31 Data Preprocessing – Cleaning ❖ Binning
  • 28.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 32 Data Preprocessing – Cleaning ❖ Regression
  • 29.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 33 Data Preprocessing – Cleaning ❖ Clustering
  • 30.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 34 Data Preprocessing – Integration Data integration: ❖ Combines data from multiple sources into a coherent store ❖ Schema integration: e.g., A.cust-id ≡ B.cust-# ▪ Integrate metadata from different sources ❖ Entity identification problem: ▪ Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton ❖ Detecting and resolving data value conflicts ▪ For the same real world entity, attribute values from different sources are different ▪ Possible reasons: different representations, different scales, e.g., metric vs. British units
  • 31.
    Handling Redundancy inData Integration ❖ Redundant data occur often when integration of multiple databases ▪ Object identification: The same attribute or object may have different names in different databases ▪ Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue (from monthly income data) ❖ Redundant attributes may be able to be detected by correlation analysis ❖ Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality Statistical Inference, Exploratory Data Analysis, and the Data Science Process 35 Data Preprocessing – Data Integration
  • 32.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 36 Correlation Analysis (Numerical Data) Data Preprocessing – Data Integration
  • 33.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 37 Data Preprocessing - Data Integration Correlation Analysis (Categorical Data)
  • 34.
    Play Chess NotPlay Chess Sum (row) Like Science Fiction 250 200 450 Not Like Science Fiction 50 1000 1050 Sum 300 1200 1500 Statistical Inference, Exploratory Data Analysis, and the Data Science Process 38 Probability to play chess: P(chess) = 300/1500 = 0.2 Probability to like science fiction: P(SciFi) = 450/1500 = 0.3 If science fiction and chess playing are independent attributes, then the probability to like SciFi AND play chess is P(SciFi, chess) = P(SciFi) · P(chess) = 0.06 That means, we expect 0.06 · 1500 = 90 such cases (if they are independent) Correlation Analysis (Categorical Data)
  • 35.
    Play Chess NotPlay Chess Sum (row) Like Science Fiction 250 (90) 200 450 Not Like Science Fiction 50 1000 1050 Sum 300 1200 1500 Statistical Inference, Exploratory Data Analysis, and the Data Science Process 39 Χ2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) It shows that like_science_fiction and play_chess are correlated in the group! Correlation Analysis (Categorical Data)
  • 36.
    ❖ Data Reduction(Dimensionality Reduction) • Obtains reduced representation in volume but produces the same or similar analytical results ✓ Feature Subset Selection / Principal Component Analysis (PCA) ✓ Singular Value Decomposition (SVD) ❖ Data Discretization (Dimensionality Reduction) • Part of data reduction but with particular importance for numerical data • Also called “binning” Statistical Inference, Exploratory Data Analysis, and the Data Science Process 40 Data Preprocessing – Reduction, Discretization
  • 37.
    ❖ Data Preprocessing– Transformation • Maps entire set of values of an attribute to a new set of values • Data standardization and normalization (by clustering and binning) ✓ Smoothing: remove noise from data ✓ Aggregation: summarization ✓ Generalization: concept hierarchy climbing ✓ Normalization: scaled to fall within a small, specified range ✓ min-max normalization ✓ z-score normalization ✓ normalization by decimal scaling ✓ Attribute/feature construction ✓ New attributes constructed from the given ones 41 Data Preprocessing – Transformation
  • 38.
    ❖ Feature Creation •Original attributes not always best representation of information • Creates new features which are more efficient/focused ❖ Methodologies ▪ Features Extraction – Domain Specific • Derived features ▪ Feature Construction • Combine multiple features to construct new feature(s) ▪ Mapping Data to New Space • Fourier Transform - what frequencies are present in your signal • Wavelet Transform - what frequencies are present and where (or at what scale) Statistical Inference, Exploratory Data Analysis, and the Data Science Process 42 Data Preprocessing – Feature Creation
  • 39.
    ❖ Discretization &Binarization • Converting the data into discrete form and later to binarize it to accommodate certain machine learning algorithms/models Statistical Inference, Exploratory Data Analysis, and the Data Science Process 43 Data Preprocessing – Discretization & Binarization
  • 40.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 44 Statistical Inference
  • 41.
    ❖ Population ▪ Includesall of the elements from a set of data e.g., • The entire Pakistani population i.e., 247 million or • The entire world population i.e., 8 billion • Set of objects, such as tweets or photographs ❖ Sample ▪ It consists of one or more items drawn from the population ▪ For example, 1000 Pakistanis selected from all provinces of Pakistan ▪ Size of sample (n) always less than size of the population (N) ▪ Sample may not be totally representative of the population Statistical Inference, Exploratory Data Analysis, and the Data Science Process 45 Statistical Inference – Population and Sample Sample < Population n < N
  • 42.
    ❖ Statistical inference ▪It is process of estimating the parameters of a population, using the random sampling. ▪ The inference also tests reliability of the estimates with calculated uncertainty. ❖ Purpose and benefits ▪ Enable us to understand the population without studying its all items. ▪ Minimizes the cost of understanding the population. ▪ It remains the only possible option, when whole the population is not accessible. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 46 Statistical Inference
  • 43.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 47 Statistical Inference Population Sampling Sample Parameters Estimation Statistical Inference
  • 44.
    ❖ A statisticalexperiment has three properties: ▪ The experiment can have more than one possible outcome ▪ Each possible outcome can be specified in advance ▪ The outcome of the experiment depends on chance ❖ For instance, toss a coin ▪ Outcomes are: • More than one • Specified in advance • Depends on chance Statistical Inference, Exploratory Data Analysis, and the Data Science Process 48 Statistical Experiment Head or, Tail { Head, Tail } Unknown in advance, unless coin is tossed Or 50% chance of Head and vice versa Statistical Experiment
  • 45.
    ❖ Variable orParameter ▪ It represents value of an attribute of an item in the population. i.e. name, color of an item. ▪ A random variable can take on any of the specified values (domain). ▪ A random variable takes a value after a statistical experiment. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 49 Variables or Parameters 𝒙 = 7 𝒙 Statistical Experiment 𝐱 𝒙 = 7 𝒙 is a Variable 𝒙 is a Random Variable
  • 46.
    ❖ Probability ▪ Itis the measure of the likelihood of happening an event. ▪ A quantitative measure, always takes value between 0 and 1. ❖ Example: ▪ Tossing a coin is a statistical experiment. ▪ It can result two outcomes: • Head or a Tail ▪ Calculating the chance of a resulting a ‘head’ is its probability Statistical Inference, Exploratory Data Analysis, and the Data Science Process 50 Probability Probability = Favorable outcomes Possible outcomes P(H) = 1/2 = 0.5 P(T) = 1/2 = 0.5 heads (H) tails (T)
  • 47.
    ❖ Probability distribution ▪It links each outcome of a statistical experiment with its probability of occurrence ▪ For instance, you toss a coin two times ▪ Possible outcomes = {HH, HT, TH, TT} ▪ Let X = number of Heads ▪ Possible outcomes = { 0, 1, 2 } • P(X = 0) = 1/4 = 0.25 No Heads = { TT } • P(X = 2) = 1/4 = 0.25 Two Heads = { HH } • P(X = 1) = 2/4 = 0.50 One Heads = { HT, TH } Statistical Inference, Exploratory Data Analysis, and the Data Science Process 51 Probability Distribution
  • 48.
    ❖ Modeling ▪ Amodel is representation of a real object or situation. It presents a simplified version of something. ▪ It is an artificial construction to understand and represent the nature of real things. • Model does not has unnecessary detail. ▪ Humans try to understand the world around them using different models. • Architect capture 3-D prints to construct design structures • Biologists capture connection between amino acids to understand protein-protein interactions • Statisticians and Data Scientists capture randomness to comprehend data-generating processes ❖ Data Modeling ▪ Data modeling is the analysis of data objects and their relationships to other data objects. ▪ The model helps us in defining and analyzing data requirements needed to support the business processes in an organization. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 52 Data Modeling
  • 49.
    ❖ Building aModel ▪ Do some Exploratory Data Analysis (EDA) and discover the relationship among the data. ▪ Try to describe the relationship using a mathematical formula. ❖ Model Fitting ▪ Model Fitting (Balance Fitting) • When model fits the training as well as testing data pretty well ▪ Underfitting • When model is unable to fit even the training data ▪ Overfitting • When model fits the training data well but testing data too poor ✓ Noise (undesired data) and higher variability (inconsistency) in data cause the overfitting ✓ Remove noise (data cleaning) and add more training data to train the model. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 53 Data Modeling
  • 50.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 54 Fitting a Data Model Too good to be true. Forced-fitting Too simple to explain the variation in data
  • 51.
  • 52.
    ❖ Exploratory DataAnalysis (EDA) ▪ In statistics, EDA is used to analyze datasets to summarize their main characteristics. ▪ EDA often employ the visual methods to see what the data can tell us beyond the formal modeling or hypothesis testing task. ▪ It is an effort to understand the process that generate the data under observation. • ‘Exploration’ means your understanding of the problem is changing as you go ahead. • Plots, graphs and summary statistics are basic tools of the EDA. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 58 What is EDA?
  • 53.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 59 What is EDA?
  • 54.
    ❖ EDA helpsus to: ▪ Understand the data and its value in business • Discover patterns in data • Spot anomalies (outliers) in data • Verify existing assumptions about data • Make comparisons between the data distributions. • Finding suitable data formats ▪ Improve accuracy of the data-products. ▪ Assure verification of the data-products. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 60 EDA – Why we do it?
  • 55.
    ❖ The DataScience Process ▪ Say, we have data (raw data) on these things ▪ We want to process these data for better analysis ▪ Processing would give us a Clean Dataset to analyze ▪ We’ll be doing some EDA with the clean dataset ▪ EDA will lead us towards a Data Model and an Algorithm ▪ We get the results after using the model and interpret, visualize, or report them ▪ The results are in decision making or as input for a ‘Data Product’. ▪ The Products may be like as: • Recommender system • Business forecasting system • Spam classifier Statistical Inference, Exploratory Data Analysis, and the Data Science Process 61 The Data Science Process
  • 56.
    Data Science Process RawData Collection Data Pre- Processing Clean Dataset Data Processing Visualization/ Communicate Results Data Product Exploratory Data Analysis Make Decisions Reality Business Problem Instrument Data Sources Decision Support, Business Intelligence Recommender Systems Business Forecasting (Prediction)
  • 57.
    ❖ Big Data ▪Big data is a field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from various sources. ▪ It is used when traditional data analysis, processing and storage technologies and techniquesare insufficient. ❖ Big Data Characteristics ▪ Volume ▪ Velocity ▪ Variety ▪ Veracity ▪ Value Statistical Inference, Exploratory Data Analysis, and the Data Science Process 63 The Big Data Approach
  • 58.
    ❖ Descriptive Analysis(What happened) ▪ It is done to answer questions about events that have already occurred. ❖ Diagnostic Analysis (Why did it happen) ▪ It is used to determine the cause of a phenomenon that occurred in the past using questions that focus on the reason behind the event. ❖ Predictive Analysis (What will happen) ▪ It is an attempt to determine the outcome of an event that would occur in the future. ❖ Prescriptive Analysis (How can we make it happen) ▪ Prescriptive analytics are build upon the results of predictive analytics to prescribe the actions that shouldbe taken to improve the business. Statistical Inference, Exploratory Data Analysis, and the Data Science Process 64 Big Data – Analytics Types
  • 59.
    Statistical Inference, ExploratoryData Analysis, and the Data Science Process 65 Big Data – Analytics Types 2 Diagnostic 3 Predictive 4 Prescriptive 1 Descriptive
  • 60.
    ❖ Phase 1:Learning the business domain and problem discovery ▪ Understand the business process • Study the similar past projects • Identify available resources – people, required skills, technology, time, and data. • Have right mix of domain experts, customers, analytic talent, and project management. ▪ Identifying key stakeholders • Understand their interests in the project • Propose and discuss more than one solutions to the problem ▪ Discover the problem to be solved • Write the problem statement and its justification. • Discuss and refine the problem statement after discussion with the major stakeholder • Establish the criteria for success and failure of the proposed solution Statistical Inference, Exploratory Data Analysis, and the Data Science Process 66 Data Analytics Life Cycle – Phase 1
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 67
Data Analytics Life Cycle – Key Roles
❖ Phase 2: Data preparation
▪ Define the steps to explore and preprocess data before its modeling and analysis
▪ Prepare the analytics sandbox (setup for the experiments)
▪ Perform the Extract-Transform-Load (ETL) process, or ELT; combined, ETLT = ETL + ELT (a minimal ETL sketch follows after this slide)
▪ Understand the target data
▪ Data cleaning – data normalization and transformation
• For better understanding, utilize the maximum of the available data
• Survey and visualize the test dataset
• Complete this highly labor-intensive activity carefully
▪ Data accessing strategies:
• Download a snapshot of the production data
• Use the API facility, if available
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 68
Data Analytics Life Cycle – Phase 2
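A minimal ETL sketch with pandas. The file names, column names, and the normalization rule are hypothetical; in a real project the extract step would pull a snapshot from a production system into the analytics sandbox rather than read a local CSV.

import pandas as pd

# Extract: read a snapshot of the (hypothetical) production data,
# e.g. columns: order_id, amount, country
raw = pd.read_csv("orders_snapshot.csv")

# Transform: clean and normalize before modeling
clean = (
    raw.dropna(subset=["amount"])                                  # drop rows with missing amounts
       .assign(country=lambda d: d["country"].str.strip().str.upper())
       .assign(amount_scaled=lambda d: (d["amount"] - d["amount"].mean())
                                        / d["amount"].std())       # z-score normalization
)

# Load: write the clean dataset into the analytics sandbox
clean.to_csv("orders_clean.csv", index=False)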
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 69
Phase 2 – Sample Dataset Inventory
❖ Phase 2: Data preparation tools
▪ Hadoop
• Can perform massively parallel loading and analysis of large datasets
• Used for web traffic parsing, GPS location analytics, genomic analysis, and combining massive unstructured data feeds from multiple sources
▪ Alpine Miner
• Provides a graphical user interface (GUI) for data manipulation and analysis
▪ OpenRefine (formerly Google Refine)
• A powerful tool for working with large and unstructured datasets; a popular GUI-based tool for performing data transformations
▪ Data Wrangler (Stanford University)
• An interactive tool for data cleaning and transformation on a given dataset
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 70
Phase 2 – Common tools for data preparation
❖ Phase 3: Planning the data model
▪ Data exploration and variable selection (see the sketch after this slide)
• Perform Exploratory Data Analysis, if required
• Explore associations and relationships among data
• Identify key performance indicators (KPIs)
▪ Select a suitable data analytical method or model
• Keep in mind the requirements of the business
• Consider the type and format of the data attributes
• Consult the domain experts and follow best practices
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 71
Data Analytics Life Cycle – Phase 3
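A minimal variable-selection sketch: rank candidate predictors by their correlation with a target KPI using pandas. The column names ("ad_spend", "site_visits", "discount", "revenue_kpi") and the data are fabricated for illustration only.

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ad_spend":    rng.normal(100, 20, 200),
    "site_visits": rng.normal(5000, 500, 200),
    "discount":    rng.normal(10, 2, 200),
})
# Fabricated KPI that mostly depends on ad_spend and site_visits
df["revenue_kpi"] = 3 * df["ad_spend"] + 0.1 * df["site_visits"] + rng.normal(0, 10, 200)

# Explore associations: absolute correlation of each variable with the KPI
correlations = (df.corr()["revenue_kpi"]
                  .drop("revenue_kpi")
                  .abs()
                  .sort_values(ascending=False))
print(correlations)   # higher values suggest stronger candidate predictors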
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 72
Phase 3 – Selecting an appropriate data analytical model
❖ Phase 3: Common tools for the model planning phase
▪ R – analytical software package
• Has data modeling capabilities and a good environment for building interpretive models
• Can interface with databases via an ODBC connection and execute statistical tests and analyses against Big Data via an open-source connection (a rough Python analog of this database-query pattern follows after this slide)
• Contains nearly 5,000 packages for data analysis and graphical representation
▪ SQL Analysis Services
• Can perform in-database analytics of common data mining functions, involved aggregations, and basic predictive models
▪ SAS/ACCESS
• Provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB
• Connects to relational databases (such as Oracle or Teradata), data warehousing applications (e.g., Greenplum or Aster), and enterprise applications such as SAP and Salesforce
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 73
Data Analytics Life Cycle – Phase 3
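The slide describes querying a database directly from the analysis environment (R over ODBC). As a rough analog in Python rather than R, the sketch below uses sqlite3 and pandas; the database file, table, and data are hypothetical and created inline so the example is self-contained.

import sqlite3
import pandas as pd

conn = sqlite3.connect("sandbox.db")              # hypothetical sandbox database

# Create and populate a toy table so the example is self-contained
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120.0), ("South", 135.5), ("North", 118.0)])
conn.commit()

# Run an in-database aggregation and load the result for further analysis
df = pd.read_sql_query(
    "SELECT region, AVG(amount) AS avg_amount FROM sales GROUP BY region", conn)
print(df)
conn.close()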
❖ Phase 4: Model building
▪ Develop datasets for testing, training, and production purposes (a minimal train/test sketch follows after this slide)
▪ Assess the validity of the model and its results on a small scale
• Have domain experts verify the results of the model
▪ Evaluate the hardware support required to execute the model
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 74
Data Analytics Life Cycle – Phase 4
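A minimal sketch of building a model on a training set and validating it on held-out test data with scikit-learn. The synthetic dataset and the choice of logistic regression are placeholders for whatever prepared data and model a real project would use.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fabricated dataset standing in for the prepared, clean data
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Develop separate training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Build the model on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Assess validity on the held-out test set before considering production use
predictions = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, predictions))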
❖ Phase 4: Common tools for the model building phase
▪ SAS Enterprise Miner
• Allows users to run predictive and descriptive models based on large volumes of data from across the enterprise
• Built for enterprise-level computing and analytics, interoperating with large data stores
▪ IBM SPSS Modeler
• Offers methods to explore and analyze data through a GUI
▪ MATLAB
• Provides a high-level language for performing a variety of data analytics and exploration
▪ Statistica and Mathematica
• Popular and well-regarded data mining and analytics tools
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 75
Data Analytics Life Cycle – Phase 4
❖ Phase 4: Free or open-source tools for the model building phase
▪ WEKA
• A free data mining software package with an analytic workbench; functions created in WEKA can be executed within Java code
▪ Python
• A programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, and pandas, plus data visualization via matplotlib (a small visualization sketch follows after this slide)
▪ R and PL/R
• R was described earlier in the model planning phase; PL/R is a procedural language for PostgreSQL with R, so R commands can be executed in-database
▪ Octave
• A programming language for computational modeling with some of the functionality of MATLAB
• Being freely available, Octave is used in major universities for teaching machine learning
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 76
Data Analytics Life Cycle – Phase 4
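A small sketch of the Python toolkits mentioned above: NumPy to generate data and matplotlib to visualize it. The measurements are fabricated, and the output file name is an arbitrary placeholder.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=1000)   # hypothetical measurements

# Histogram: a quick look at the distribution of a variable
plt.hist(values, bins=30, edgecolor="black")
plt.title("Distribution of a hypothetical variable")
plt.xlabel("value")
plt.ylabel("frequency")
plt.savefig("distribution.png")   # or plt.show() in an interactive session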
❖ Phase 5: Communicate the results
▪ Collaborate with the major stakeholders and evaluate the results
• Identify key findings and quantify their business value (a minimal hypothesis-test sketch follows after this slide)
• The deliverables of this phase will be decisive for the outside stakeholders and sponsors
• Summarize the findings and convey them to the stakeholders
• Make recommendations for future work or improvements to existing processes
▪ Accept failure of an analytical project
• A true failure means the data fail to accept or reject the hypotheses stated in Phase 1
• The analyst should be rigorous enough with the data to determine whether it proves or disproves the hypotheses
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 77
Data Analytics Life Cycle – Phase 5
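A minimal sketch of testing a hypothesis stated in Phase 1 (did a change shift a business metric?) with SciPy's two-sample t-test. The "before" and "after" samples are fabricated; a real project would use the actual measurements and the significance level agreed with stakeholders.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
before = rng.normal(loc=100, scale=15, size=60)   # hypothetical metric before the change
after = rng.normal(loc=108, scale=15, size=60)    # hypothetical metric after the change

t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Communicate the result: reject or fail to reject the null hypothesis
alpha = 0.05
if p_value < alpha:
    print("Evidence of a change in the metric (reject the null hypothesis).")
else:
    print("No evidence of a change (fail to reject the null hypothesis).")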
❖ Phase 6: Operationalize
▪ Communicate the benefits of the project more broadly
• If required, run a pilot project before implementing the models in a production environment
• Learn from the deployment and make any needed adjustments
▪ Properly document and deliver the final reports, briefings, code, and technical documents
• Consult documentation of similar past projects, if available
• Follow documentation standards to increase effectiveness
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 78
Data Analytics Life Cycle – Phase 6
❖ Data
▪ Definition, Importance, Characteristics, Sources and Types
▪ Structured Data, Semi-Structured Data, Un-Structured Data
▪ The information processing cycle
▪ Data Preprocessing (Integration, Cleansing, Reduction, and Transformation)
❖ Statistical Inference
▪ Definition & Objectives, Sampling, Statistical experiment and Probability
❖ Exploratory Data Analysis (EDA)
▪ Definition & Objectives, EDA Process and Example
❖ The Data Science Process
▪ Definition & Objectives, the Process diagram
❖ Data Analytical Life Cycle
▪ Discovery, Data preparation, Model planning, Model building, Communicate results, Operationalize
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 79
Content's Review
You are Welcome! Questions? Comments! Suggestions!!
Statistical Inference, Exploratory Data Analysis, and the Data Science Process 80
Questions? Comments! Suggestions!! Farewell to the Day ☺