
Unit 4

Data Pre-processing and analytics: Data pre-processing overview -Sampling

Data Pre-processing Overview


Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of
resolving such issues.
The collected datasets vary with respect to noise, redundancy, and consistency, etc., and it is
undoubtedly a waste to store meaningless data. In addition, some analytical methods have stringent
requirements on data quality. Therefore, data should be pre-processed under many circumstances to
integrate the data from different sources, so as to enable effective data analysis.
Pre-processing data not only reduces storage expense, but also improves analysis accuracy. The
following are some of the relational data pre-processing techniques.
● Integration:
○ This involves the combination of data from different sources and provides users
with a uniform view of data.
○ Two methods have been widely recognized: data warehouse and data
federation.

○ Data warehousing includes a process named ETL (Extract, Transform and Load).
○ Extraction involves connecting source systems, selecting, collecting, analyzing,
and processing necessary data.
○ Transformation is the execution of a series of rules to transform the extracted data
into standard formats.
○ Loading means importing extracted and transformed data into the target storage
infrastructure. It is the most complex procedure among the three, which includes
operations such as transformation, copy, clearing, standardization, screening, and
data organization.
● Cleaning:
○ It is a process to identify inaccurate, incomplete, or unreasonable data, and then
modify or delete such data to improve data quality. Generally, data cleaning
includes five complementary procedures:
○ defining and determining error types, searching and identifying errors,
correcting errors, documenting error examples and error types, and modifying data
entry procedures to reduce future errors.
○ During cleaning, data formats, completeness, rationality, and restriction shall be
inspected. Data cleaning is of vital importance to keep the data consistency, which
is widely applied in many fields, such as banking, insurance, retail industry,
telecommunications, and traffic control.
● Redundancy Elimination:
○ Data redundancy refers to data repetition or surplus, which commonly occurs in many datasets.
○ Data redundancy can increase the unnecessary data transmission expense and
cause defects on storage systems, e.g., waste of storage space, leading to data
inconsistency, reduction of data reliability, and data damage.
○ Therefore, various redundancy reduction methods have been proposed, such as
redundancy detection, data filtering, and data compression.

Such methods may apply to different datasets or application environments.
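The following is a minimal, illustrative sketch of these three steps in Python using pandas; the library choice, the column names, and the toy records are assumptions for the example, not part of the original material.

```python
# A minimal pre-processing sketch using pandas (column names and values are hypothetical).
import pandas as pd

# Integration: combine records from two hypothetical sources into one uniform view.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 51, 28]})
billing = pd.DataFrame({"customer_id": [1, 2, 2], "monthly_spend": [40.0, 55.5, 55.5]})
data = crm.merge(billing, on="customer_id", how="left")

# Cleaning: flag unreasonable values and drop incomplete rows.
data.loc[(data["age"] < 0) | (data["age"] > 120), "age"] = None
data = data.dropna(subset=["age"])

# Redundancy elimination: remove exact duplicate records.
data = data.drop_duplicates()
print(data)
```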

SAMPLING
The aim of sampling is to take a subset of past customer data and use that to build an analytical
model.
Generally, Sampling is a process used in statistical analysis in which a predetermined number of
observations are taken from a larger population.
A key requirement for a good sample is that it should be representative of the future customers on
which the analytical model will be run.
The sample should also be taken from an average business period to get a picture of the target
population that is as accurate as possible.
Stratified sampling is a type of sampling method in which the total population is divided into
smaller groups or strata to complete the sampling process.
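A small illustrative sketch of simple versus stratified sampling with pandas is given below; the toy customer table and the "churned" stratum column are hypothetical.

```python
# A minimal sketch of simple and stratified sampling with pandas.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": range(1, 11),
    "churned":     [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

# Simple random sample: 50% of the population.
simple = customers.sample(frac=0.5, random_state=42)

# Stratified sample: 50% drawn from each stratum, so the churn rate in the
# sample matches the churn rate in the population.
stratified = customers.groupby("churned").sample(frac=0.5, random_state=42)
print(stratified)
```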
Missing Values - Outlier Detection and Treatment

MISSING VALUES
Missing values can occur for various reasons. The information can be inapplicable, or it can be undisclosed; for example, a customer may decide not to disclose his or her income for privacy reasons. Missing data can also originate from an error during merging.
Some analytical techniques (e.g., decision trees) can directly deal with missing values. Other
techniques need some additional preprocessing. The following are the most popular schemes to deal
with missing values:
● Replace (impute):
○ This implies replacing the missing value with a known value.
● Delete:
○ This is the most straightforward option and consists of deleting observations or variables
with lots of missing values.
● Keep:
○ Missing values can be meaningful (e.g., a customer did not disclose his or her
income because he or she is currently unemployed).
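The three schemes can be sketched roughly as follows with pandas; the column names and values are hypothetical, and the median is just one possible imputation choice.

```python
# A minimal sketch of the replace / delete / keep schemes for missing values.
import pandas as pd

df = pd.DataFrame({"age": [25, 40, None, 33], "income": [None, 52000, 61000, None]})

# Replace (impute): fill missing income with the median of the observed values.
df["income_imputed"] = df["income"].fillna(df["income"].median())

# Delete: drop observations that still have a missing age.
df_deleted = df.dropna(subset=["age"])

# Keep: add an indicator so "missing" itself can act as predictive information.
df["income_missing"] = df["income"].isna().astype(int)
print(df)
```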
OUTLIER DETECTION AND TREATMENT
One of the most important steps in data pre-processing is outlier detection and treatment. Machine
learning algorithms are very sensitive to the range and distribution of data points. Data outliers can
deceive the training process resulting in longer training times and less accurate models. Outliers are
defined as samples that are significantly different from the remaining data. Those are points that lie outside
the overall pattern of the distribution. Statistical measures such as mean, variance, and correlation are very
susceptible to outliers.

A simple example of an outlier is a single data point that deviates from the overall pattern of the rest of the observations.
Nature of Outliers:
Outliers can occur in the dataset due to one of the following reasons,
1. Genuine extreme high and low values in the dataset
2. Introduced due to human or mechanical error
3. Introduced by replacing missing values

In some cases, the presence of outliers is informative and will require further study. For example,
outliers are important in use-cases related to transaction management, where an outlier might be used to
identify potentially fraudulent transactions.
Outliers are extreme observations that are very dissimilar to the rest of the population. Actually,
two types of outliers can be considered:
1. Valid observations (e.g., salary of boss is $1 million)
2. Invalid observations (e.g., age is 300 years)

Both are univariate outliers in the sense that they are outlying on one dimension. However, outliers can
be hidden in unidimensional views of the data.

Multivariate outliers are observations that are outlying in multiple dimensions.

For example, an observation may appear normal on income alone and on age alone, yet be outlying when both dimensions are considered together.
Two important steps in dealing with outliers are detection and treatment. Various graphical
tools can be used to detect outliers. Histograms are a first example.

In a histogram of age, for instance, bars lying far away from the main body of the distribution clearly represent outliers.
Another useful visual mechanism is the box plot.

Three different methods of dealing with outliers:


● Univariate method: This method looks for data points with extreme values on one
variable.
● Multivariate method: Here, we look for unusual combinations of all the variables.
● Minkowski error: This method reduces the contribution of potential outliers in the training
process.
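As a rough illustration, the univariate method can be sketched with two common rules, the z-score rule and the IQR (box-plot) rule; the age values below are made up, and the 3-standard-deviation and 1.5 x IQR cut-offs are conventional choices rather than requirements.

```python
# A minimal sketch of univariate outlier detection using z-scores and the IQR rule.
import numpy as np

age = np.array(list(range(21, 41)) + [42, 45, 300])  # 300 is an invalid extreme value

# Z-score rule: flag observations more than 3 standard deviations from the mean.
z = (age - age.mean()) / age.std()
z_outliers = age[np.abs(z) > 3]

# IQR rule (the box-plot rule): flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(age, [25, 75])
iqr = q3 - q1
iqr_outliers = age[(age < q1 - 1.5 * iqr) | (age > q3 + 1.5 * iqr)]

print(z_outliers)    # [300]
print(iqr_outliers)  # [300]
```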

Standardizing Data - Categorization


STANDARDIZING DATA
Data standardization is the process of converting data to a common format to enable users to
process and analyze it. Most organizations utilize data from a number of sources; this can include data
warehouses, lakes, cloud storage, and databases. However, data from disparate sources can be problematic
if it isn’t uniform, leading to difficulties down the line (e.g., when you use that data to produce dashboards
and visualizations, etc.).
Data standardization is crucial for many reasons. First of all, it helps you establish clear,
consistently defined elements and attributes, providing a comprehensive catalog of your data. Whatever
insights you’re trying to get or problems you’re attempting to solve, properly understanding your data is
a crucial starting point.
Getting there involves converting that data into a uniform format, with logical and consistent
definitions. These definitions will form your metadata — the labels that identify the what, how, why, who,
when, and where of your data. That’s the basis of your data standardization process.
From an accuracy perspective, standardizing the way you label data will improve access to the
most relevant and current information. This will help make your analytics and reporting easier. Security-
wise, mindful cataloging forms the basis of a powerful authentication and authorization approach, which
will apply security restrictions to data items and data users as appropriate.
For this purpose, data is generally processed in one of two ways: data standardization or data
normalization, sometimes referred to as min-max scaling.
Data normalization refers to shifting the values of your data so they fall between 0 and 1. Data
standardization, in this context, is used as a scaling technique to establish the mean and the standard
deviation at 0 and 1, respectively.
Standardization is especially useful for regression-based approaches, but is not needed for decision
trees.
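A minimal sketch of both scaling techniques, using made-up income values:

```python
# Min-max normalization vs. z-score standardization on a hypothetical income variable.
import numpy as np

income = np.array([1200.0, 1500.0, 1800.0, 2400.0, 5000.0])

# Normalization (min-max scaling): rescale values to the [0, 1] range.
normalized = (income - income.min()) / (income.max() - income.min())

# Standardization (z-scoring): shift and scale to mean 0 and standard deviation 1.
standardized = (income - income.mean()) / income.std()

print(normalized.round(2))
print(standardized.round(2))
```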

CATEGORIZATION
In the process of data mining, large data sets are first sorted, then patterns are identified and
relationships are established to perform data analysis and solve problems.
Classification: It is a data analysis task, i.e., the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of
categories (subpopulations) a new observation belongs, on the basis of a training set of data containing
observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is
required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the project and to further approve
it. It is a two-step process:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the
training set available. The model has to be trained for the prediction of accurate results.
2. Classification Step: The model is used to predict class labels for test data, thereby testing the
constructed model and estimating the accuracy of the classification rules.
Categorization (also known as coarse classification, classing, grouping, binning, etc.) can be
done for various reasons. For categorical variables, it is needed to reduce the number of categories.
With categorization, one would create categories of values such that fewer parameters will have to be
estimated and a more robust model is obtained.
For continuous variables, categorization may also be very beneficial.

Various methods can be used to do categorization. Two very basic methods are equal interval binning
and equal frequency binning.
Chi-squared analysis is a more sophisticated way to do coarse classification.
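A rough sketch of equal interval and equal frequency binning using pandas (the age values are made up, and three bins is an arbitrary illustrative choice):

```python
# Equal interval vs. equal frequency binning with pandas.
import pandas as pd

age = pd.Series([18, 22, 25, 29, 33, 38, 44, 51, 60, 72])

# Equal interval binning: 3 bins of equal width on the age axis.
equal_interval = pd.cut(age, bins=3)

# Equal frequency binning: 3 bins each containing (roughly) the same number of observations.
equal_frequency = pd.qcut(age, q=3)

print(pd.DataFrame({"age": age, "interval": equal_interval, "frequency": equal_frequency}))
```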
Weights of Evidence Coding

WEIGHTS OF EVIDENCE CODING


Logistic regression model is one of the most commonly used statistical techniques for solving
binary classification problems, and it is accepted in almost all domains. Two concepts, weight of
evidence (WOE) and information value (IV), evolved from this logistic regression technique. These
two terms have been in use in the credit scoring world for more than 4-5
decades. They have been used as a benchmark to screen variables in the credit risk modeling projects such
as probability of default. They help to explore data and screen variables. They are also used in marketing
analytics projects such as customer attrition models, campaign response models, etc.
The weight of evidence tells the predictive power of an independent variable in relation to the
dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure
of the separation of good and bad customers.
"Bad Customers" refers to the customers who defaulted on a loan and "Good Customers"
refers to the customers who paid back the loan.

Weight of Evidence (WOE) = ln (Distribution of Goods / Distribution of Bads)

where:
Distribution of Goods = % of Good Customers in a particular group
Distribution of Bads = % of Bad Customers in a particular group
ln = natural logarithm
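As an illustration, the sketch below computes WOE per group and the corresponding information value (IV); the age groups and good/bad counts are made up for the example.

```python
# A minimal sketch of weight of evidence coding for one categorical variable.
import numpy as np
import pandas as pd

bins = pd.DataFrame({
    "age_group": ["<30", "30-45", ">45"],
    "goods":     [400, 900, 700],   # customers who paid back the loan
    "bads":      [100,  60,  40],   # customers who defaulted
})

# Distribution of goods/bads: each group's share of all goods/bads.
bins["dist_goods"] = bins["goods"] / bins["goods"].sum()
bins["dist_bads"] = bins["bads"] / bins["bads"].sum()

# WOE = ln(distribution of goods / distribution of bads).
bins["woe"] = np.log(bins["dist_goods"] / bins["dist_bads"])

# Information value (IV) summarizes the predictive power of the whole variable.
iv = ((bins["dist_goods"] - bins["dist_bads"]) * bins["woe"]).sum()
print(bins, "\nIV =", round(iv, 3))
```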

Variable Selection and Segmentation

Variable Selection
Variable selection starts from a collection of candidate model variables that are tested for significance during
model training. Candidate model variables are also known as independent variables, predictors, attributes, model
factors, covariates, regressors, features, or characteristics.
Selecting an appropriate subset of variables is crucial for the quality of segmentation analyses.

Variable selection is a process that aims to identify a minimal set of predictors for maximum gain
(predictive accuracy).
Variable selection means choosing among many variables which to include in a particular model,
that is, to select appropriate variables from a complete list of variables by removing those that are
irrelevant or redundant.
In machine learning and statistics, feature selection, also known as variable selection, attribute
selection or variable subset selection, is the process of selecting a subset of relevant features for use in
model construction.
Classical variable selection methods include forward selection, backward elimination, and
stepwise selection.
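As a rough illustration of forward selection, the sketch below greedily adds the variable that most improves cross-validated accuracy and stops when no further improvement is found; the synthetic data set, the use of scikit-learn, and logistic regression as the base model are assumptions made only for this example.

```python
# A minimal forward-selection sketch using cross-validated accuracy as the gain criterion.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import pandas as pd

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])

selected, remaining = [], list(X.columns)
best_score = 0.0
while remaining:
    # Try adding each remaining variable and keep the one that improves the score most.
    scores = {v: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[selected + [v]], y, cv=5).mean() for v in remaining}
    best_var, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:        # stop when no variable improves performance
        break
    selected.append(best_var)
    remaining.remove(best_var)
    best_score = score

print("Selected variables:", selected, "CV accuracy:", round(best_score, 3))
```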
When selecting variables, there are a number of limitations.
● First, the model will usually contain some highly predictive variables — the use of which is
prohibited by legal, ethical or regulatory rules.
● Second, some variables might not be available or might be of poor quality during modeling or
production stages.
● In addition, there might be important variables that have not been recognized.
● And finally, the business will always have the last word and might insist that only business-sound
variables are included or request monotonically increasing or decreasing effects.
Segmentation
Segmentation refers to the act of segmenting data according to your company’s needs in order to
refine your analyses based on a defined context, using a tool for cross-calculating analyses. The purpose
of segmentation is to better understand your visitors, and to obtain actionable data in order to improve
your business. A segment enables you to filter your analyses based on certain elements (single or
combined).
Data Segmentation is the process of taking the data you hold and dividing it up and grouping
similar data together based on the chosen parameters so that you can use it more efficiently within
marketing and operations. Examples of data segmentation could be:
● Gender
● Customers vs prospects
● Industry
Benefits of data segmentation:
● Create messaging that is tailored and sophisticated to suit your target market -
appealing to their needs better.
● To easily conduct an analysis of your data stored in the database, helping to
identify potential opportunities and challenges based within it.
● To mass-personalize your marketing communications, reducing costs.

Sometimes the data is segmented before the analytical modeling starts. A first reason for this could
be strategic (e.g., banks might want to adopt special strategies to specific segments of customers). It
could also be motivated from an operational viewpoint (e.g., new customers must have separate models
because the characteristics in the standard model do not make sense operationally for them).
Segmentation could also be needed to take into account significant variable interactions (e.g., if one
variable strongly interacts with a number of others, it might be sensible to segment according to this
variable).
The segmentation can be conducted using the experience and knowledge from a business expert,
or it could be based on statistical analysis using, for example, decision trees, k-means, or self-organizing
maps.
Segmentation is a very useful preprocessing activity because one can now estimate different
analytical models each tailored to a specific segment.
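As a rough illustration of statistically driven segmentation, the sketch below groups customers with k-means; the customer features, the two-segment choice, and the use of scikit-learn are assumptions made for the example.

```python
# A minimal k-means segmentation sketch on hypothetical [age, monthly_spend] features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = np.array([[22, 30], [25, 35], [47, 210], [52, 190], [31, 60], [58, 230]])

# Standardize first so both features contribute comparably to the distance measure.
scaled = StandardScaler().fit_transform(customers)

# Group the customers into two segments; a separate model can then be built per segment.
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(segments)
```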

DATA ANALYTICS: Predictive Analytics (Regression, Decision Tree, Neural Networks) – 1
DATA ANALYTICS
It is the process of analyzing raw data to find trends and answer questions.

The data analytics process has some components that can help a variety of initiatives. By
combining these components, a successful data analytics initiative will provide a clear picture of where
you are, where you have been and where you should go.

Generally, this process begins with descriptive analytics.


● This is the process of describing historical trends in data.
● Descriptive analytics aims to answer the question “what happened?”
● Descriptive analytics does not make predictions or directly inform decisions.
● It focuses on summarizing data in a meaningful and descriptive way.

The next essential part of data analytics is advanced analytics.

● This part of data science takes advantage of advanced tools to extract data, make predictions
and discover trends.
● This information provides new insight from data.
● Advanced analytics addresses “what if?” questions.

Big data analytics enables businesses to draw meaningful conclusions from complex and varied
data sources, which has been made possible by advances in parallel processing and cheap computational
power.
Role of Data Analytics
Data analysts exist at the intersection of information technology, statistics and business. They
combine these fields in order to help businesses and organizations succeed. The primary goal of a data
analyst is to increase efficiency and improve performance by discovering patterns in data.
Data analytics is a broad field. There are four primary types of data analytics: descriptive,
diagnostic, predictive and prescriptive analytics. Each type has a different goal and a different place in the
data analysis process. These are also the primary data analytics applications in business.

Descriptive analytics helps answer questions about what happened. These techniques summarize large
datasets to describe outcomes to stakeholders. By developing key performance indicators (KPIs), these
strategies can help track successes or failures. Metrics such as return on investment (ROI) are used in
many industries. Specialized metrics are developed to track performance in specific industries. This
process requires the collection of relevant data, processing of the data, data analysis and data
visualization. This process provides essential insight into past performance.

Diagnostic analytics helps answer questions about why things happened. These techniques supplement
more basic descriptive analytics. They take the findings from descriptive analytics and dig deeper to find
the cause. The performance indicators are further investigated to discover why they got better or worse.
This generally occurs in three steps:
● Identify anomalies in the data. These may be unexpected changes in a metric or a particular
market.
● Data that is related to these anomalies is collected.
● Statistical techniques are used to find relationships and trends that explain these anomalies.

Predictive analytics helps answer questions about what will happen in the future. These
techniques use historical data to identify trends and determine if they are likely to recur. Predictive
analytical tools provide valuable insight into what may happen in the future, drawing on a variety of
statistical and machine learning techniques such as neural networks, decision trees, and regression.
Prescriptive analytics helps answer questions about what should be done. By using insights from
predictive analytics, data-driven decisions can be made. This allows businesses to make informed
decisions in the face of uncertainty. Prescriptive analytics techniques rely on machine learning strategies
that can find patterns in large datasets. By analyzing past decisions and events, the likelihood of different
outcomes can be estimated.
These types of data analytics provide the insight that businesses need to make effective and
efficient decisions. Used in combination they provide a well-rounded understanding of a
company’s needs and opportunities.

DATA ANALYTICS: Predictive Analytics (Regression, Decision Tree, Neural Networks) – 2


Predictive Analytics
In predictive analytics, the aim is to build an analytical model predicting a target measure of
interest. The target is then typically used to steer the learning process during an optimization procedure.

Two types of predictive analytics can be distinguished: regression and classification.

● In regression, the target variable is continuous. Popular examples are predicting stock prices,
loss given default (LGD), and customer lifetime value (CLV).
● In classification, the target is categorical. It can be binary (e.g., fraud, churn, credit risk)
or multiclass (e.g., predicting credit ratings).
LINEAR REGRESSION
Regression analysis is the method of using observations (data records) to quantify the relationship
between a target variable (a field in the record set), also referred to as a dependent variable, and a set of
independent variables, also referred to as covariates. For example, regression analysis can be used to
determine whether the dollar value of grocery shopping baskets (the target variable) is different for male
and female shoppers (gender being the independent variable). The regression equation estimates a
coefficient for each gender that corresponds to the difference in value.

How Does Linear Regression Work?

To illustrate how linear regression works, examine the relationship between the prices of vintage
wines and the number of years since vintage. Each year, many vintage wine buyers gather in
France and buy wines that will mature in 10 years. There are many stories and speculations on
how the buyers determine the future prices of wine. Is the wine going to be good 10 years from
now, and how much would it be worth? Imagine an application that could assist buyers in making
those decisions by forecasting the expected future value of the wines. This is exactly what
economists have done. They have collected data and created a regression model that estimates
this future price. The explanation of regression that follows is based on this model.
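A minimal sketch of such a regression, fitting a straight line to made-up prices and years since vintage (the numbers are purely illustrative, not the economists' actual data):

```python
# Ordinary least squares line: price ≈ intercept + slope * years_since_vintage.
import numpy as np

years_since_vintage = np.array([1, 2, 3, 5, 8, 10, 12, 15])
price = np.array([20, 24, 27, 35, 48, 55, 62, 75])  # hypothetical prices

# Fit a first-degree polynomial, i.e. a straight line.
slope, intercept = np.polyfit(years_since_vintage, price, deg=1)

# Use the fitted line to forecast the expected price of a 10-year-old wine.
forecast = intercept + slope * 10
print(round(slope, 2), round(intercept, 2), round(forecast, 2))
```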
DECISION TREES
Decision trees are recursive partitioning algorithms (RPAs) that come up with a tree-like structure
representing patterns in an underlying data set.

The top node is the root node specifying a testing condition of which the outcome corresponds to
a branch leading up to an internal node. The terminal nodes of the tree assign the classifications and are
also referred to as the leaf nodes.
Many algorithms have been suggested to construct decision trees. All these algorithms differ in
their way of answering the key decisions to build a tree, which are:
● Splitting decision: Which variable to split at what value (e.g., age < 30 or not; income < 1,000
or not; marital status = married or not)
● Stopping decision: When to stop growing a tree?
● Assignment decision: What class (e.g., good or bad customer) to assign to a leaf node?
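A rough sketch of these three decisions using a scikit-learn classification tree; the small customer data set, the Gini splitting criterion, and the depth-2 stopping rule are illustrative assumptions, not prescribed by the text.

```python
# A minimal classification tree: splitting criterion, stopping rule, leaf class assignment.
from sklearn.tree import DecisionTreeClassifier, export_text
import pandas as pd

X = pd.DataFrame({"age":    [25, 32, 45, 51, 23, 60, 35, 48],
                  "income": [800, 1500, 2200, 900, 600, 2500, 1800, 1200]})
y = ["bad", "good", "good", "bad", "bad", "good", "good", "bad"]  # class labels

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the fitted splitting rules
```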

NEURAL NETWORKS
Neural networks are a series of algorithms that mimic the operations of a human brain to recognize
relationships between vast amounts of data. They are used in a variety of applications in financial services,
from forecasting and marketing research to fraud detection and risk assessment.
Neural networks are sets of algorithms intended to recognize patterns and interpret data through
clustering or labeling. In other words, neural networks are algorithms. A training algorithm is the method
you use to execute the neural network's learning process.
Today, neural networks are used for solving many business problems such as sales
forecasting, customer research, data validation, and risk management. For example, at Statsbot we apply
neural networks for time-series predictions, anomaly detection in data, and natural language
understanding.

A first perspective states that they are mathematical representations inspired by the functioning of
the human brain. Another more realistic perspective sees neural networks as generalizations of existing
statistical models.

A complex algorithm used for predictive analysis, the neural network, is biologically inspired by
the structure of the human brain. A neural network provides a very simple model in comparison to the
human brain.
Widely used for data classification, neural networks process past and current data to estimate
future values — discovering any complex correlations hidden in the data.

Neural networks can be used to make predictions on time series data such as weather data. A neural
network can be designed to detect patterns in input data and produce an output free of noise.
The structure of a neural-network algorithm has three layers:

● The input layer feeds past data values into the next (hidden) layer. The black circles represent
nodes of the neural network.
● The hidden layer encapsulates several complex functions that create predictors; often those
functions are hidden from the user. A set of nodes (black circles) at the hidden layer represents
mathematical functions that modify the input data; these functions are called neurons.

The processing element or neuron in the middle basically performs two operations: it multiplies each
input by its weight and sums the results, and then passes this weighted sum through a (typically nonlinear)
transformation, or activation, function.

● The output layer collects the predictions made in the hidden layer and produces the final result:
the model’s prediction.
Neural networks can model very complex patterns and decision boundaries in the data and, as such,
are very powerful.
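As a rough illustration of the three-layer structure described above, the sketch below fits a network with one hidden layer of five neurons on synthetic data; the library, the data, and the layer size are assumptions made for this example.

```python
# A minimal neural network sketch: input layer -> one hidden layer -> output layer.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)   # neural networks are sensitive to feature scale

# Each hidden neuron multiplies its inputs by weights, sums them, and applies an
# activation function; the output layer combines the hidden neurons into a prediction.
model = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                      max_iter=2000, random_state=0).fit(X, y)
print("Training accuracy:", round(model.score(X, y), 3))
```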

Descriptive Analytics (Association Rules, Sequence Rules)

Descriptive analytics is the interpretation of historical data to better understand changes that have
occurred in a business. Descriptive analytics describes the use of a range of historic data to draw
comparisons. Most commonly reported financial metrics are a product of descriptive analytics—for
example, year-over-year pricing changes, month-over-month sales growth, the number of users, or the
total revenue per subscriber. These measures all describe what has occurred in a business during a set
period.
In descriptive analytics, the aim is to describe patterns of customer behavior. Contrary to predictive
analytics, there is no real target variable (e.g., churn or fraud indicator) available. Hence, descriptive
analytics is often referred to as unsupervised learning because there is no target variable to steer the
learning process.

The most common types of descriptive analytics are association rules and sequence rules.
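As a small illustration of association rules, the sketch below computes the support and confidence of one candidate rule over made-up market-basket transactions (the items and the rule are hypothetical).

```python
# Support and confidence of the candidate rule {bread} -> {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

antecedent, consequent = {"bread"}, {"milk"}
n = len(transactions)

# Support: fraction of all transactions containing both sides of the rule.
support = sum((antecedent | consequent) <= t for t in transactions) / n

# Confidence: of the transactions containing the antecedent, the fraction that
# also contain the consequent.
confidence = (sum((antecedent | consequent) <= t for t in transactions)
              / sum(antecedent <= t for t in transactions))

print("bread -> milk: support =", support, "confidence =", round(confidence, 2))
```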


Social Network Analytics (Social Network Learning, Relational Neighbour Classification) - 1
Social Network Analytics
SNA is used for modeling, visualizing, and analyzing the interactions between individuals within
groups and organizations. Disciplines such as sociology, business management, and public health have
made extensive use of SNA for a variety of organizational and network situations.
Many types of social networks exist. The most popular are undoubtedly Facebook, Twitter,
Google+, and LinkedIn. However, social networks are more than that. It could be any set of nodes (also
referred to as vertices) connected by edges in a particular business setting. Examples of social networks
could be:

● Web pages connected by hyperlinks


● Email traffic between people
● Research papers connected by citations
● Telephone calls between customers of a telco provider
● Banks connected by liquidity dependencies
● Spread of illness between patients

These examples clearly illustrate that social network analytics can be applied in a wide variety
of different settings.

SOCIAL NETWORK DEFINITIONS


A social network consists of both nodes (vertices) and edges. Both need to be clearly defined
at the outset of the analysis.
A node (vertex) could be defined as a customer (private/professional), household/ family,
patient, doctor, paper, author, terrorist, web page, and so forth.
An edge can be defined as a friend relationship, a call, transmission of a disease, reference, and
so on. Note that the edges can also be weighted based on interaction frequency, importance of
information exchange, intimacy, and emotional intensity.
Sociograms are good for small-scale networks. For larger-scale networks, the network will
typically be represented as an adjacency matrix, as sketched below.
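The sketch below builds such a matrix for a hypothetical four-node network (the node names and edges are made up for illustration).

```python
# Representing a small undirected social network as an adjacency matrix.
import numpy as np

nodes = ["Ann", "Bob", "Cara", "Dan"]
edges = [("Ann", "Bob"), ("Ann", "Cara"), ("Bob", "Cara"), ("Cara", "Dan")]

# Entry (i, j) is 1 if node i and node j are connected, 0 otherwise.
adjacency = np.zeros((len(nodes), len(nodes)), dtype=int)
for a, b in edges:
    i, j = nodes.index(a), nodes.index(b)
    adjacency[i, j] = adjacency[j, i] = 1

print(adjacency)
```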

SOCIAL NETWORK LEARNING


Social Network Analytics (Social Network Learning, Relational Neighbour Classification) - 2

Social Network Analytics

Social network analysis (SNA) is the process of investigating social structures through the use of
networks and graph theory.
● It characterizes networked structures in terms of nodes (individual actors, people, or things within
the network) and the ties, edges, or links (relationships or interactions) that connect them.
● Examples of social structures commonly visualized through social network analysis include
social media networks, memes spread, information circulation, friendship and acquaintance
networks, business networks, knowledge networks, difficult working relationships, social
networks, collaboration graphs, kinship, disease transmission, and sexual relationships.
● These networks are often visualized through sociograms in which nodes are represented as points
and ties are represented as lines.
● These visualizations provide a means of qualitatively assessing networks by varying the visual
representation of their nodes and edges to reflect attributes of interest.

A relational model is based on the idea that the behavior of connected nodes is correlated, meaning that
connected nodes have a propensity to belong to the same class. The relational neighbor classifier, in
particular, predicts a node's class from the classes of its neighboring nodes and the weights of the adjacent edges, as sketched below.
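A minimal sketch of this idea, using a hypothetical toy network, made-up edge weights, and made-up class labels:

```python
# Relational neighbor classifier: a node is assigned class probabilities proportional
# to the (edge-weighted) share of its labeled neighbors in each class.
neighbors = {                      # adjacency list with edge weights
    "Eve": [("Ann", 1.0), ("Bob", 1.0), ("Cara", 1.0)],
}
labels = {"Ann": "churner", "Bob": "non-churner", "Cara": "churner"}

def relational_neighbor(node):
    votes = {}
    for neighbor, weight in neighbors[node]:
        if neighbor in labels:                 # only labeled neighbors get a vote
            cls = labels[neighbor]
            votes[cls] = votes.get(cls, 0.0) + weight
    total = sum(votes.values())
    # Class probabilities are the weighted proportions of neighbors in each class.
    return {cls: w / total for cls, w in votes.items()}

print(relational_neighbor("Eve"))  # e.g. churner ~0.67, non-churner ~0.33
```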

The following social media analytics tools can help to track social presence.

1. Sprout Social.
2. HubSpot.
3. TapInfluence.
4. BuzzSumo.
5. Snaplytics.
6. Curalate.
7. Keyhole.
8. Google Analytics.
