CC&BD Unit 4
○ Data warehousing includes a process named ETL (Extract, Transform and Load).
○ Extraction involves connecting to source systems and selecting, collecting, analyzing,
and processing the necessary data.
○ Transformation is the execution of a series of rules to transform the extracted data
into standard formats.
○ Loading means importing extracted and transformed data into the target storage
infrastructure. It is the most complex procedure among the three and includes
operations such as transformation, copying, clearing, standardization, screening, and
data organization.
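As a rough illustration of this flow, the following minimal Python sketch separates the three steps; the file name, field names, and formatting rules are hypothetical assumptions, not taken from any particular system.

import csv

def extract(path):
    # Extraction: connect to the source system (here, a CSV file) and collect rows.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: apply rules to convert the extracted data into standard formats.
    standardized = []
    for r in rows:
        standardized.append({
            "customer_id": r["customer_id"].strip(),
            "amount": round(float(r["amount"]), 2),   # standardize the numeric format
            "country": r["country"].upper(),          # standardize the text format
        })
    return standardized

def load(rows, target):
    # Loading: import the transformed data into the target storage (here, a list).
    target.extend(rows)

warehouse = []
load(transform(extract("sales.csv")), warehouse)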
● Cleaning:
○ It is a process to identify inaccurate, incomplete, or unreasonable data, and then
modify or delete such data to improve data quality. Generally, data cleaning
includes five complementary procedures:
○ defining and determining error types, searching and identifying errors,
correcting errors, documenting error examples and error types, and modifying data
entry procedures to reduce future errors.
○ During cleaning, data formats, completeness, rationality, and restrictions should be
inspected. Data cleaning is of vital importance for keeping data consistent, and it is
widely applied in many fields, such as banking, insurance, retail, telecommunications,
and traffic control.
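A minimal pandas sketch of such inspections is shown below; the column names and the valid age range are assumptions made for illustration.

import pandas as pd

df = pd.DataFrame({
    "age": [25, -3, 40, None],
    "email": ["a@x.com", "bad-email", "c@y.com", "d@z.com"],
})

# Completeness check: identify records with missing values.
incomplete = df[df["age"].isna()]

# Rationality / restriction check: age must fall in a plausible range.
df.loc[~df["age"].between(0, 120), "age"] = pd.NA

# Format check: keep only syntactically valid e-mail addresses.
df = df[df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")]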
● Redundancy Elimination:
○ Data redundancy refers to data repetition or surplus, which occurs in many datasets.
○ Data redundancy can increase unnecessary data transmission costs and cause
defects in storage systems, e.g., wasted storage space, data inconsistency, reduced
data reliability, and data damage.
○ Therefore, various redundancy reduction methods have been proposed, such as
redundancy detection, data filtering, and data compression.
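For example, redundancy detection and compression can be sketched in a few lines of Python; the DataFrame contents and the output file name are hypothetical.

import gzip
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3, 3], "value": [10, 10, 20, 30, 30]})

# Redundancy detection and filtering: drop exact duplicate records.
deduplicated = df.drop_duplicates()

# Data compression: store the remaining records in compressed form.
with gzip.open("data.csv.gz", "wt") as f:
    deduplicated.to_csv(f, index=False)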
SAMPLING
The aim of sampling is to take a subset of past customer data and use that to build an analytical
model.
Generally, Sampling is a process used in statistical analysis in which a predetermined number of
observations are taken from a larger population.
A key requirement for a good sample is that it should be representative of the future customers on
which the analytical model will be run.
The sample should also be taken from an average business period to get a picture of the target
population that is as accurate as possible.
Stratified sampling is a type of sampling method in which the total population is divided into
smaller groups or strata to complete the sampling process.
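A minimal pandas sketch of stratified sampling follows; the "segment" strata and the 10% sampling fraction are illustrative assumptions.

import pandas as pd

population = pd.DataFrame({
    "customer_id": range(1000),
    "segment": ["retail"] * 700 + ["corporate"] * 300,
})

# Draw 10% from each stratum so the sample preserves the population proportions.
sample = population.groupby("segment", group_keys=False).sample(frac=0.1, random_state=42)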
Missing Values - Outlier Detection and Treatment
MISSING VALUES
Missing values can occur for various reasons. The information can be inapplicable, or it can be
undisclosed; for example, a customer may decide not to disclose his or her income because of privacy
concerns. Missing data can also originate from an error during merging.
Some analytical techniques (e.g., decision trees) can directly deal with missing values. Other
techniques need some additional preprocessing. The following are the most popular schemes to deal
with missing values:
● Replace (impute):
○ This implies replacing the missing value with a known value.
● Delete:
○ This is the most straightforward option and consists of deleting observations or variables
with lots of missing values.
● Keep:
○ Missing values can be meaningful (e.g., a customer did not disclose his or her
income because he or she is currently unemployed).
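The three schemes above can be sketched with pandas as follows; the column names and the choice of the median as the imputation value are assumptions for illustration.

import pandas as pd

df = pd.DataFrame({"income": [50000, None, 62000, None], "age": [30, 45, None, 52]})

# Replace (impute): fill missing income with the median of the known values.
df["income"] = df["income"].fillna(df["income"].median())

# Keep: record the missingness itself as an explicit, possibly meaningful indicator.
df["age_missing"] = df["age"].isna().astype(int)

# Delete: drop the observations that still contain missing values.
df_complete = df.dropna(subset=["age"])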
OUTLIER DETECTION AND TREATMENT
One of the most important steps in data pre-processing is outlier detection and treatment. Machine
learning algorithms are very sensitive to the range and distribution of data points. Data outliers can
deceive the training process resulting in longer training times and less accurate models. Outliers are
defined as samples that are significantly different from the remaining data. Those are points that lie outside
the overall pattern of the distribution. Statistical measures such as mean, variance, and correlation are very
susceptible to outliers.
A simple example of an outlier is a point that deviates from the overall pattern of the data.
Nature of Outliers:
Outliers can occur in the dataset due to one of the following reasons,
1. Genuine extreme high and low values in the dataset
2. Introduced due to human or mechanical error
3. Introduced by replacing missing values
In some cases, the presence of outliers is informative and will require further study. For example,
outliers are important in use cases related to transaction management, where an outlier might be used to
identify potentially fraudulent transactions.
Outliers are extreme observations that are very dissimilar to the rest of the population. Actually,
two types of outliers can be considered:
1. Valid observations (e.g., salary of boss is $1 million)
2. Invalid observations (e.g., age is 300 years)
Both are univariate outliers in the sense that they are outlying on one dimension. However, multivariate
outliers can remain hidden in one-dimensional views of the data.
A scatter plot of income versus age, for example, can reveal observations that are outlying only when
both dimensions are considered together.
Two important steps in dealing with outliers are detection and treatment. Various graphical
tools can be used to detect outliers. Histograms are a first example.
A histogram of the age distribution, for example, can reveal outliers as isolated bars that lie far from
the bulk of the data.
Another useful visual mechanism is the box plot.
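The interquartile-range rule that underlies box plots can also be applied programmatically; a minimal sketch with invented age values is shown below.

import pandas as pd

age = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 40, 300])  # 300 is an invalid outlier

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = age[(age < lower) | (age > upper)]   # detection
treated = age.clip(lower=lower, upper=upper)    # treatment: cap extreme values at the limits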
CATEGORIZATION
In the process of data mining, large data sets are first sorted, then patterns are identified and
relationships are established to perform data analysis and solve problems.
Classification: It is a data analysis task, i.e., the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of
categories (subpopulations) a new observation belongs, on the basis of a training set of data containing
observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is
required to predict class labels such as ‘Safe’ and ‘Risky’ in order to decide whether to adopt and approve
the project. Classification is a two-step process:
1. Learning Step (Training Phase): Construction of the classification model.
Different algorithms are used to build a classifier by making the model learn from the
available training set. The model has to be trained to produce accurate predictions.
2. Classification Step: The constructed model is used to predict class labels for test data,
and the accuracy of the classification rules is estimated on this test data.
Categorization (also known as coarse classification, classing, grouping, binning, etc.) can be
done for various reasons. For categorical variables, it is needed to reduce the number of categories.
With categorization, one would create categories of values such that fewer parameters will have to be
estimated and a more robust model is obtained.
For continuous variables, categorization may also be very beneficial.
Various methods can be used to do categorization. Two very basic methods are equal interval binning
and equal frequency binning.
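Both can be sketched with pandas; the income values and the choice of three bins are illustrative assumptions.

import pandas as pd

income = pd.Series([1200, 1500, 1800, 2500, 3200, 4000, 5200, 8000])

# Equal interval binning: each bin spans the same range of values.
equal_interval = pd.cut(income, bins=3)

# Equal frequency binning: each bin contains (roughly) the same number of observations.
equal_frequency = pd.qcut(income, q=3)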
Chi-squared analysis is a more sophisticated way to do coarse classification.
Weights of Evidence Coding
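As a minimal sketch, assuming the standard definition in which each category is replaced by WoE = ln(distribution of goods / distribution of bads), with invented category counts:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age_bin": ["young"] * 3 + ["middle"] * 3 + ["old"] * 3,
    "good":    [1, 0, 0,        1, 1, 0,          1, 1, 0],   # 1 = good customer, 0 = bad
})

grouped = df.groupby("age_bin")["good"].agg(goods="sum", total="count")
grouped["bads"] = grouped["total"] - grouped["goods"]

dist_good = grouped["goods"] / grouped["goods"].sum()
dist_bad = grouped["bads"] / grouped["bads"].sum()
grouped["woe"] = np.log(dist_good / dist_bad)   # each category is then encoded by its WoE value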
Variable Selection
Variable selection starts from a collection of candidate model variables that are tested for significance
during model training. Candidate model variables are also known as independent variables, predictors,
attributes, model factors, covariates, regressors, features, or characteristics.
Selecting an appropriate subset of variables is crucial for the quality of segmentation analyses.
Variable selection is a process that aims to identify a minimal set of predictors for maximum gain
(predictive accuracy).
Variable selection means choosing among many variables which to include in a particular model,
that is, to select appropriate variables from a complete list of variables by removing those that are
irrelevant or redundant.
In machine learning and statistics, feature selection, also known as variable selection, attribute
selection or variable subset selection, is the process of selecting a subset of relevant features for use in
model construction.
Classical variable selection methods include forward selection, backward elimination, and
stepwise selection.
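Forward selection, for instance, can be sketched with scikit-learn's SequentialFeatureSelector; the synthetic data set and the choice of logistic regression as the base model are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Forward selection: start from an empty model and greedily add the predictor that
# improves cross-validated performance the most, until four variables are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=4, direction="forward"
)
selector.fit(X, y)
selected = selector.get_support(indices=True)   # indices of the chosen variables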
When selecting variables, there are a number of limitations.
● First, the data will usually contain some highly predictive variables whose use is
prohibited by legal, ethical, or regulatory rules.
● Second, some variables might not be available or might be of poor quality during modeling or
production stages.
● In addition, there might be important variables that have not been recognized.
● And finally, the business will always have the last word and might insist that only business-sound
variables are included or request monotonically increasing or decreasing effects.
Segmentation
Segmentation refers to the act of dividing data according to your company’s needs in order to
refine your analyses based on a defined context, using a cross-analysis tool. The purpose
of segmentation is to better understand your visitors, and to obtain actionable data in order to improve
your business. A segment enables you to filter your analyses based on certain elements (single or
combined).
Data Segmentation is the process of taking the data you hold and dividing it up and grouping
similar data together based on the chosen parameters so that you can use it more efficiently within
marketing and operations. Examples of data segmentation could be:
● Gender
● Customers vs prospects
● Industry
Benefits of data segmentation:
● Create messaging that is tailored and sophisticated to suit your target market,
appealing to their needs better.
● To easily conduct an analysis of your data stored in the database, helping to
identify potential opportunities and challenges based within it.
● To mass-personalize your marketing communications, reducing costs.
Sometimes the data is segmented before the analytical modeling starts. A first reason for this could
be strategic (e.g., banks might want to adopt special strategies to specific segments of customers). It
could also be motivated from an operational viewpoint (e.g., new customers must have separate models
because the characteristics in the standard model do not make sense operationally for them).
Segmentation could also be needed to take into account significant variable interactions (e.g., if one
variable strongly interacts with a number of others, it might be sensible to segment according to this
variable).
The segmentation can be conducted using the experience and knowledge from a business expert,
or it could be based on statistical analysis using, for example, decision trees, k-means, or self-organizing
maps.
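A minimal k-means sketch with scikit-learn is given below; the two customer features and the choice of three segments are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
customers = np.column_stack([
    rng.normal(40, 12, 300),      # age
    rng.normal(3000, 900, 300),   # monthly spend
])

# Standardize the features so both contribute comparably, then cluster into segments.
X = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)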
Segmentation is a very useful preprocessing activity because one can now estimate different
analytical models each tailored to a specific segment.
The data analytics process has some components that can help a variety of initiatives. By
combining these components, a successful data analytics initiative will provide a clear picture of where
you are, where you have been and where you should go.
● Advanced analytics is the part of data science that takes advantage of advanced tools to extract
data, make predictions and discover trends.
● This information provides new insight from the data.
● Advanced analytics addresses “what if?” questions.
Big data analytics enables businesses to draw meaningful conclusions from complex and varied
data sources, which has been made possible by advances in parallel processing and cheap computational
power.
Role of Data Analytics
Data analysts exist at the intersection of information technology, statistics and business. They
combine these fields in order to help businesses and organizations succeed. The primary goal of a data
analyst is to increase efficiency and improve performance by discovering patterns in data.
Data analytics is a broad field. There are four primary types of data analytics: descriptive,
diagnostic, predictive and prescriptive analytics. Each type has a different goal and a different place in the
data analysis process. These are also the primary data analytics applications in business.
Descriptive analytics helps answer questions about what happened. These techniques summarize large
datasets to describe outcomes to stakeholders. By developing key performance indicators (KPIs), these
strategies can help track successes or failures. Metrics such as return on investment (ROI) are used in
many industries. Specialized metrics are developed to track performance in specific industries. This
process requires the collection of relevant data, processing of the data, data analysis and data
visualization. This process provides essential insight into past performance.
Diagnostic analytics helps answer questions about why things happened. These techniques supplement
more basic descriptive analytics. They take the findings from descriptive analytics and dig deeper to find
the cause. The performance indicators are further investigated to discover why they got better or worse.
This generally occurs in three steps:
● Identify anomalies in the data. These may be unexpected changes in a metric or a particular
market.
● Collect data related to these anomalies.
● Use statistical techniques to find relationships and trends that explain these anomalies.
Predictive analytics helps answer questions about what will happen in the future. These
techniques use historical data to identify trends and determine if they are likely to recur. Predictive
analytical tools provide valuable insight into what may happen in the future and its techniques include a
variety of statistical and machine learning techniques, such as: neural networks, decision trees, and
regression.
Prescriptive analytics helps answer questions about what should be done. By using insights from
predictive analytics, data-driven decisions can be made. This allows businesses to make informed
decisions in the face of uncertainty. Prescriptive analytics techniques rely on machine learning strategies
that can find patterns in large datasets. By analyzing past decisions and events, the likelihood of different
outcomes can be estimated.
These types of data analytics provide the insight that businesses need to make effective and
efficient decisions. Used in combination they provide a well-rounded understanding of a
company’s needs and opportunities.
● In regression, the target variable is continuous. Popular examples are predicting stock prices,
loss given default (LGD), and customer lifetime value (CLV).
● In classification, the target is categorical. It can be binary (e.g., fraud, churn, credit risk)
or multiclass (e.g., predicting credit ratings).
LINEAR REGRESSION
Regression analysis is the method of using observations (data records) to quantify the relationship
between a target variable (a field in the record set), also referred to as the dependent variable, and a set of
independent variables, also referred to as covariates. For example, regression analysis can be used to
determine whether the dollar value of grocery shopping baskets (the target variable) is different for male
and female shoppers (gender being the independent variable). The regression equation estimates a
coefficient for each gender that corresponds to the difference in value.
To illustrate how linear regression works, examine the relationship between the prices of vintage
wines and the number of years since vintage. Each year, many vintage wine buyers gather in
France and buy wines that will mature in 10 years. There are many stories and speculations on
how the buyers determine the future prices of wine. Is the wine going to be good 10 years from
now, and how much would it be worth? Imagine an application that could assist buyers in making
those decisions by forecasting the expected future value of the wines. This is exactly what
economists have done. They have collected data and created a regression model that estimates
this future price. The current explanation of regression is based on this model.
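In the same spirit, a minimal scikit-learn sketch is shown below; the years and prices are invented numbers, not the economists' actual data or model.

import numpy as np
from sklearn.linear_model import LinearRegression

years_since_vintage = np.array([[1], [3], [5], [8], [10], [15]])
price = np.array([40, 55, 70, 95, 110, 160])          # hypothetical prices

# Fit price ≈ intercept + coefficient * years_since_vintage.
model = LinearRegression().fit(years_since_vintage, price)
forecast = model.predict([[12]])                       # expected price after 12 years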
DECISION TREES
Decision trees are recursive partitioning algorithms (RPAs) that come up with a tree-like structure
representing patterns in an underlying data set.
The top node is the root node, which specifies a test condition whose outcome corresponds to a
branch leading to an internal node. The terminal nodes of the tree assign the classifications and are
also referred to as leaf nodes.
Many algorithms have been suggested to construct decision trees. All these algorithms differ in
their way of answering the key decisions to build a tree, which are:
● Splitting decision: Which variable to split at what value (e.g., age < 30 or not; income < 1,000
or not; marital status = married or not)?
● Stopping decision: When to stop growing a tree?
● Assignment decision: What class (e.g., good or bad customer) to assign to a leaf node?
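These three decisions can be seen in a minimal scikit-learn sketch; the toy records on age, income, and marital status and the depth limit are illustrative assumptions.

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income, married (1/0)]; target: 1 = good customer, 0 = bad customer.
X = [[25, 800, 0], [45, 2500, 1], [35, 1200, 1], [52, 3000, 1], [23, 600, 0], [40, 900, 0]]
y = [0, 1, 1, 1, 0, 0]

# Splitting is driven by impurity gain; stopping by max_depth / min_samples_leaf;
# assignment gives each leaf the majority class of its training observations.
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income", "married"]))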
NEURAL NETWORKS
Neural networks are a series of algorithms that mimic the operations of a human brain to recognize
relationships between vast amounts of data. They are used in a variety of applications in financial services,
from forecasting and marketing research to fraud detection and risk assessment.
Neural networks are sets of algorithms intended to recognize patterns and interpret data through
clustering or labeling. In other words, neural networks are algorithms. A training algorithm is the method
you use to execute the neural network's learning process.
Today, neural networks are used for solving many business problems such as sales
forecasting, customer research, data validation, and risk management. For example, at Statsbot we apply
neural networks for time-series predictions, anomaly detection in data, and natural language
understanding.
A first perspective states that they are mathematical representations inspired by the functioning of
the human brain. Another more realistic perspective sees neural networks as generalizations of existing
statistical models.
A complex algorithm used for predictive analysis, the neural network, is biologically inspired by
the structure of the human brain. A neural network provides a very simple model in comparison to the
human brain.
Widely used for data classification, neural networks process past and current data to estimate
future values — discovering any complex correlations hidden in the data.
Neural networks can be used to make predictions on time series data such as weather data. A neural
network can be designed to detect patterns in input data and produce an output free of noise.
The structure of a neural-network algorithm has three layers:
● The input layer feeds past data values into the next (hidden) layer. The black circles represent
nodes of the neural network.
● The hidden layer encapsulates several complex functions that create predictors; often those
functions are hidden from the user. A set of nodes (black circles) at the hidden layer represents
mathematical functions that modify the input data; these functions are called neurons.
The processing element, or neuron, basically performs two operations: it takes a weighted sum of its
inputs (multiplying each input by its weight) and then applies an activation function to that sum.
● The output layer collects the predictions made in the hidden layer and produces the final result:
the model’s prediction.
Neural networks can model very complex patterns and decision boundaries in the data and, as such,
are very powerful.
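A minimal NumPy sketch of a forward pass through the three layers described above follows; the weights, biases, and inputs are arbitrary illustrative values.

import numpy as np

def sigmoid(z):
    # Activation (transformation) function applied by each neuron.
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 1.5, -0.3])            # input layer: past data values

W_hidden = np.array([[0.2, -0.4, 0.1],    # weights into two hidden neurons
                     [0.7,  0.3, -0.5]])
b_hidden = np.array([0.1, -0.2])

W_out = np.array([[0.6, -0.8]])           # weights into the single output neuron
b_out = np.array([0.05])

# Each neuron takes a weighted sum of its inputs and applies the activation function.
hidden = sigmoid(W_hidden @ x + b_hidden)
prediction = sigmoid(W_out @ hidden + b_out)   # output layer: the model's prediction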
Descriptive analytics is the interpretation of historical data to better understand changes that have
occurred in a business. Descriptive analytics describes the use of a range of historic data to draw
comparisons. Most commonly reported financial metrics are a product of descriptive analytics—for
example, year-over-year pricing changes, month-over-month sales growth, the number of users, or the
total revenue per subscriber. These measures all describe what has occurred in a business during a set
period.
In descriptive analytics, the aim is to describe patterns of customer behavior. Contrary to predictive
analytics, there is no real target variable (e.g., churn or fraud indicator) available. Hence, descriptive
analytics is often referred to as unsupervised learning because there is no target variable to steer the
learning process.
Social network analytics can be applied in a wide variety of different settings.
Social network analysis (SNA) is the process of investigating social structures through the use of
networks and graph theory.
● It characterizes networked structures in terms of nodes (individual actors, people, or things within
the network) and the ties, edges, or links (relationships or interactions) that connect them.
● Examples of social structures commonly visualized through social network analysis include
social media networks, meme spread, information circulation, friendship and acquaintance
networks, business networks, knowledge networks, difficult working relationships, social
networks, collaboration graphs, kinship, disease transmission, and sexual relationships.
● These networks are often visualized through sociograms in which nodes are represented as points
and ties are represented as lines.
● These visualizations provide a means of qualitatively assessing networks by varying the visual
representation of their nodes and edges to reflect attributes of interest.
A relational model is based on the idea that the behavior of linked nodes is correlated, meaning that
connected nodes have a propensity to belong to the same class. The relational neighbor classifier, in
particular, predicts a node's class based on its neighboring nodes and adjacent edges.
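A minimal Python sketch of this idea is given below; the toy graph, the churn labels, and the majority-vote rule are illustrative assumptions.

from collections import Counter

edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
labels = {"A": "churner", "B": "non-churner", "C": "churner", "E": "churner"}  # "D" is unlabeled

def neighbors(node):
    # Nodes connected to the given node by any edge.
    return [v for u, v in edges if u == node] + [u for u, v in edges if v == node]

def relational_neighbor(node):
    # Predict the most common class among the node's labeled neighbors.
    votes = Counter(labels[n] for n in neighbors(node) if n in labels)
    return votes.most_common(1)[0][0] if votes else None

print(relational_neighbor("D"))   # predicted from neighbors B, C, and E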
The following social media analytics tools can help to track social presence.
1. Sprout Social.
2. HubSpot.
3. TapInfluence.
4. BuzzSumo.
5. Snaplytics.
6. Curalate.
7. Keyhole.
8. Google Analytics.