NCVRT Datamining

Data mining is the process of using machine learning and statistical analysis to extract valuable insights from large datasets, facilitating better decision-making across various industries. The Knowledge Discovery in Databases (KDD) process encompasses multiple steps including data selection, preprocessing, transformation, mining, evaluation, representation, and deployment to uncover hidden patterns and trends. While data mining offers significant benefits such as improved business performance and fraud detection, it also presents challenges like data quality requirements, complexity, and privacy concerns.


Unit - III

Data Mining Basics


What is data mining?
Data mining is the use of machine learning and statistical analysis to uncover patterns and other
valuable information from large data sets.

Given the evolution of machine learning (ML), data warehousing, and the growth of big data, the
adoption of data mining, also known as knowledge discovery in databases (KDD), has rapidly
accelerated over the last few decades. However, while this technology continuously evolves to handle
data at a large scale, leaders still might face challenges with scalability and automation.

The data mining techniques that underpin data analyses can be deployed for two main purposes.
They can either describe the target data set or they can predict outcomes by using machine
learning algorithms.

These methods are used to organize and filter data, surfacing the most useful information, from
fraud detection to user behaviors, bottlenecks and even security breaches. Using ML algorithms
and artificial intelligence (AI) enables automation of the analysis, which can greatly speed up the
process.

When combined with data analytics and visualization tools, such as Apache Spark, data mining
software is becoming easier to use, and relevant insights can be extracted more quickly than ever. Advances in AI continue to expedite adoption across industries.
Benefits and challenges
Benefits

Discover hidden insights and trends: Data mining takes raw data and finds order in the chaos:
seeing the forest for the trees. This can result in better-informed planning across corporate
functions and industries, including advertising, finance, government, healthcare, human
resources (HR), manufacturing, marketing, research, sales and supply chain management (SCM).
Save budget: By analyzing performance data from multiple sources, bottlenecks in business
processes can be identified to speed resolution and increase efficiency.

Solve multiple challenges: Data mining is a versatile tool. Data from almost any source and any
aspect of an organization can be analyzed to discover patterns and better ways of conducting
business. Almost every department in an organization that collects and analyzes data can benefit
from data mining.
Challenges

Complexity and risk: Useful insights require valid data, plus experts with coding experience.
Knowledge of data mining languages including Python, R and SQL is helpful. An insufficiently
cautious approach to data mining might result in misleading or dangerous results. Some
consumer data used in data mining might be personally identifiable information (PII) which
should be handled carefully to avoid legal or public relations issues.

Cost: For the best results, a wide and deep collection of data sets is often needed. If new
information is to be gathered by an organization, setting up a data pipeline might represent a new
expense. If data needs to be purchased from an outside source, that also imposes a cost.

Uncertainty: First, a major data mining effort might be well run, but produce unclear results,
with no major benefit. Or inaccurate data can lead to incorrect insights, whether incorrect data
was selected or the preprocessing was mishandled. Other risks include modeling errors or
outdated data from a rapidly changing market.

Another potential problem is that results might appear valid but are in fact random and not to be
trusted. It’s important to remember that “correlation is not causation.” A famous example of
“data dredging”—seeing an apparent correlation and overstating its importance—was recently
presented by blogger Tyler Vigen: “The price of Amazon.com stock closely matches the number
of children named ‘Stevie’ from 2002 to 2022.”1 But, of course, the naming of Stevies did not
influence the stock price or vice versa. Data mining applications find the patterns, but human
judgment is still significant.
The knowledge discovery process (KDD Process)

KDD stands for Knowledge Discovery in Databases, which is the process of extracting useful
knowledge from large amounts of data. It is an area of interest to researchers and professionals in
various fields, such as artificial intelligence, machine learning, pattern recognition, databases,
statistics, and data visualization. Data mining is a key component of the KDD process.

What is KDD in Data Mining

KDD (Knowledge Discovery in Databases) is a process of discovering useful knowledge and insights from large and complex datasets. The KDD process involves a range of techniques and
methodologies, including data preprocessing, data transformation, data mining, pattern
evaluation, and knowledge representation. KDD and data mining are closely related processes,
with data mining being a key component and subset of the KDD process.

The KDD process aims to identify hidden patterns, relationships, and trends in data that can be
used to make predictions, decisions, and recommendations. KDD is a broad and interdisciplinary
field used in various industries, such as finance, healthcare, marketing, e-commerce, etc. KDD is
very important for organizations and businesses as it enables them to derive new insights and
knowledge from their data.

KDD Process in Data Mining

The KDD process in data mining is a multi-step process that involves various stages to extract
useful knowledge from large datasets. The following are the main steps involved in the KDD
process -

 Data Selection - The first step in the KDD process is identifying and selecting the
relevant data for analysis. This involves choosing the relevant data sources, such as
databases, data warehouses, and data streams, and determining which data is required for
the analysis.
 Data Preprocessing - After selecting the data, the next step is data preprocessing. This
step involves cleaning the data, removing outliers, and removing missing, inconsistent, or
irrelevant data. This step is critical, as the data quality can significantly impact the
accuracy and effectiveness of the analysis.
 Data Transformation - Once the data is preprocessed, the next step is to transform it
into a format that data mining techniques can analyze. This step involves reducing the
data dimensionality, aggregating the data, normalizing it, and discretizing it to prepare it
for further analysis.
 Data Mining - This is the heart of the KDD process and involves applying various data
mining techniques to the transformed data to discover hidden patterns, trends,
relationships, and insights. A few of the most common data mining techniques include
clustering, classification, association rule mining, and anomaly detection.
 Pattern Evaluation - After the data mining, the next step is to evaluate the discovered
patterns to determine their usefulness and relevance. This involves assessing the quality
of the patterns, evaluating their significance, and selecting the most promising patterns
for further analysis.
 Knowledge Representation - This step involves representing the knowledge extracted
from the data in a way humans can easily understand and use. This can be done through
visualizations, reports, or other forms of communication that provide meaningful insights
into the data.
 Deployment - The final step in the KDD process is to deploy the knowledge and insights
gained from the data mining process to practical applications. This involves integrating
the knowledge into decision-making processes or other applications to improve
organizational efficiency and effectiveness.

In summary, the KDD process in data mining involves several steps to extract useful knowledge
from large datasets. It is a comprehensive and iterative process that requires careful consideration
of each step to ensure the accuracy and effectiveness of the analysis.
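
As a rough illustration only, the sketch below maps these KDD steps onto a short scikit-learn workflow; the file name, column names, cluster count, and model choice are all hypothetical:

```python
# Illustrative KDD pipeline skeleton (file and column names are hypothetical)
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Data selection: load only the columns relevant to the analysis
data = pd.read_csv("transactions.csv", usecols=["age", "income", "spend"])

# 2. Data preprocessing: drop missing values and implausible outliers
data = data.dropna()
data = data[data["age"].between(18, 100)]

# 3. Data transformation: normalize features to a comparable scale
features = StandardScaler().fit_transform(data)

# 4. Data mining: apply a clustering algorithm to discover groups
model = KMeans(n_clusters=3, random_state=0, n_init=10).fit(features)

# 5. Pattern evaluation: score how well-separated the clusters are
print("silhouette:", silhouette_score(features, model.labels_))

# 6. Knowledge representation: summarize each cluster for decision-makers
data["cluster"] = model.labels_
print(data.groupby("cluster").mean())
```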

Advantages of KDD in Data Mining

KDD in data mining is a powerful approach for extracting useful knowledge and insights from
large datasets. It is very important for organizations as it has a lot of advantages. Some of the
advantages of KDD in data mining are -

 Helps in Decision Making - KDD can help make informed and data-driven decisions by
discovering hidden patterns, trends, and relationships in data that might not be
immediately apparent.
 Improves Business Performance - KDD can help organizations improve their business
performance by identifying areas for improvement, optimizing processes, and reducing
costs.
 Saves Time and Resources - KDD can help save time and resources by automating the
data analysis process and identifying the most relevant and significant information or
knowledge.
 Increases Efficiency - KDD can help organizations streamline their processes, optimize
their resources, and increase their overall efficiency.
 Enhances Customer Experience - KDD can help organizations improve customer
experience by understanding customer behavior, preferences, and requirements and
giving personalized products and services.
 Fraud Detection - KDD can help detect fraud and identify fraudulent behavior by
analyzing patterns in data and identifying anomalies or unusual behavior.
 Enables Predictive Modeling - KDD can enable organizations to develop predictive
models that can forecast future trends and behaviors, providing a competitive advantage
in the market.

Disadvantages of KDD in Data Mining

While KDD (Knowledge Discovery in Databases) is a powerful approach to extracting useful knowledge and insights from large datasets, there are also some potential disadvantages to consider -
 Requires High-Quality Data - KDD relies on high-quality data to generate accurate and
meaningful insights. If the data is incomplete, inconsistent, or of poor quality, it can lead
to inaccurate, misleading results and flawed conclusions.
 Complexity - KDD is a complex and time-consuming process that requires specialized
skills and knowledge to perform effectively. The complexity can also make interpreting
and communicating the results challenging to non-experts.
 Privacy and Compliance Concerns - KDD can raise ethical concerns related to privacy,
compliance, bias, and discrimination. For example, data mining techniques can extract
sensitive information about individuals without their consent or reinforce existing biases
or stereotypes.
 High Cost - KDD can be expensive, requiring specialized software, hardware, and
skilled professionals to perform the analysis. The cost can be prohibitive for smaller
organizations or those with limited resources.

Difference Between KDD and Data Mining

The difference between KDD and data mining is explained in the below table.

| Factor | KDD Process | Data Mining |
| --- | --- | --- |
| Definition | A comprehensive process that includes multiple steps for extracting useful knowledge and insights from large datasets | A subset of KDD that focuses primarily on finding patterns and relationships in data |
| Steps involved | Includes steps such as data collection, cleaning, integration, selection, transformation, data mining, interpretation, and evaluation | Includes steps such as data preprocessing, modeling, and analysis |
| Focus | Emphasizes the importance of domain expertise in interpreting and validating results | Focuses on the use of computational algorithms to analyze data |
| Techniques used | Data selection, cleaning, transformation, data mining, pattern evaluation, interpretation, knowledge representation, and data visualization | Association rule mining, clustering, regression, classification, and dimensionality reduction |
| Outputs | Knowledge bases, such as rules or models, that help organizations make informed decisions | A set of patterns, relationships, predictions, or insights to support decision-making or business understanding |

Data Mining Applications- The Business Context of Data Mining:

Data is a set of discrete objective facts about an event or a process that have little use by
themselves unless converted into information. We have been collecting vast amounts of data, from
simple numerical measurements and text documents to more complex information such as
spatial data, multimedia channels, and hypertext documents.

Nowadays, large quantities of data are being accumulated. The amount of data collected is said to almost double every year. To extract knowledge from this massive data, data mining techniques are used. Data mining is used in almost all places where
a large amount of data is stored and processed. For example, banks typically use ‘data mining’
to find out their prospective customers who could be interested in credit cards, personal loans,
or insurance as well. Since banks have the transaction details and detailed profiles of their
customers, they analyze all this data and try to find out patterns that help them predict that
certain customers could be interested in personal loans, etc.

Basically, the motive behind mining data, whether commercial or scientific, is the same - the
need to find useful information in data to enable better decision-making or a better
understanding of the world around us.

Technically, data mining is the computational process of analyzing data from different
perspectives, dimensions, angles and categorizing/summarizing it into meaningful
information. Data Mining can be applied to any type of data e.g. Data Warehouses,
Transactional Databases, Relational Databases, Multimedia Databases, Spatial Databases,
Time-series Databases, World Wide Web.
Data mining provides competitive advantages in the knowledge economy. It does this by
providing the maximum knowledge needed to rapidly make valuable business decisions
despite the enormous amounts of available data.

There are many measurable benefits that have been achieved in different application areas
from data mining. So, let's discuss different applications of Data Mining:

Scientific Analysis: Scientific simulations are generating large volumes of data every day. This includes data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analyzing these data. We can now capture and store new data faster than we can analyze the data already accumulated. Examples of scientific analysis:

Sequence analysis in bioinformatics

Classification of astronomical objects


Medical decision support.

Intrusion Detection: A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often involve stealing valuable network resources. Data mining techniques play a vital role in detecting intrusions, network attacks, and anomalies. These techniques help in selecting and refining useful and relevant information from large data sets. Data mining techniques also help classify relevant data for an intrusion detection system, which generates alarms about foreign invasions in the network traffic. For example:

 Detect security violations


 Misuse Detection
 Anomaly Detection

Business Transactions: Every transaction in the business industry is memorized for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations. The effective and timely use of this data for competitive decision-making is the most important problem to solve for businesses that struggle to survive in a highly competitive world. Data mining helps to analyze these business transactions and identify marketing approaches to support decision-making. Example:
 Direct mail targeting
 Stock trading
 Customer segmentation
 Churn prediction (Churn prediction is one of the most popular Big Data use cases in
business)
Market Basket Analysis: Market Basket Analysis is a technique based on the careful study of purchases made by a customer in a supermarket. It identifies patterns of items frequently purchased together by customers. This analysis can help companies promote deals, offers, and sales, and data mining techniques help to achieve this task. Example:
 Data mining concepts are in use for Sales and marketing to provide better customer
service, to improve cross-selling opportunities, to increase direct mail response rates.
 Customer Retention in the form of pattern identification and prediction of likely
defections is possible by Data mining.
 Risk Assessment and Fraud area also use the data-mining concept for identifying
inappropriate or unusual behavior etc.
Education: For analyzing the education sector, data mining uses the Educational Data Mining (EDM) method. This method generates patterns that can be used both by learners and educators. By using EDM we can perform educational tasks such as:
 Predicting students' admission in higher education
 Student profiling
 Predicting student performance
 Evaluating teachers' teaching performance
 Curriculum development
 Predicting student placement opportunities
Research: Data mining techniques can perform prediction, classification, clustering, association, and grouping of data in the research area. Rules generated by data mining help researchers find results. In most technical research in data mining, we create a training model and a testing model. The train/test strategy is a way to measure the precision of the proposed model: the data set is split into two sets, a training data set used to build the model and a testing data set used to evaluate it. Example:
 Classification of uncertain data.
 Information-based clustering.
 Decision support system
 Web Mining
 Domain-driven data mining
 IoT (Internet of Things)and Cybersecurity
 Smart farming IoT(Internet of Things)
Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and figure out which promotional activities will have the greatest effect in the upcoming months. In the insurance sector, data mining can help predict which customers will buy new policies, identify behavior patterns of risky customers, and identify fraudulent behavior of customers.
 Claims analysis, i.e., which medical procedures are claimed together.
 Identify successful medical therapies for different illnesses.
 Characterizes patient behavior to predict office visits.
Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. A large consumer goods organization can apply data mining to improve its sales process to retailers.
 Determine the distribution schedules among outlets.
 Analyze loading patterns.
Financial/Banking Sector: A credit card company can leverage its vast warehouse of
customer transaction data to identify customers most likely to be interested in a new credit
product.
 Credit card fraud detection.
 Identify 'Loyal' customers.
 Extraction of information related to customers.
 Determine credit card spending by customer groups.
Data Mining for Marketing, Benefits of data mining:

Data Mining Basics: A Comprehensive Overview

1. What is Data Mining?

Definition:

Data mining is the process of discovering patterns, correlations, trends, and useful information
from large datasets using techniques from statistics, machine learning, and database systems.

In simpler terms, data mining involves exploring and analyzing vast amounts of data to
extract meaningful insights that can support decision-making, predict future trends, or identify
hidden relationships within the data.

2. Data Mining Defined

Data mining can be defined as:

"The non-trivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data." – Fayyad, Piatetsky-Shapiro, and Smyth (1996)

This definition emphasizes that data mining is not just about collecting or storing data but
rather extracting knowledge from it through analytical processes.
3. The Knowledge Discovery Process (KDD Process)

Data mining is a step within a broader process called Knowledge Discovery in Databases
(KDD) . The KDD process includes several stages:

Steps in the KDD Process:

Data Selection: Choosing relevant data for analysis from the available databases.

Data Cleaning: Removing noise, handling missing values, and correcting inconsistencies in
the data.

Data Transformation/Integration: Converting data into appropriate formats or structures suitable for mining (e.g., normalization, aggregation).

Data Mining: Applying algorithms to extract patterns from the data.

Pattern Evaluation: Assessing the discovered patterns for usefulness, validity, and relevance.

Knowledge Presentation: Visualizing and interpreting the results so they are understandable
and actionable for end-users.

4. Data Mining Applications

Data mining has wide-ranging applications across various domains. Some key areas include:

Customer Segmentation: Grouping customers based on behavior or demographics.

Fraud Detection: Identifying unusual transactions or behaviors that indicate fraud.


Market Basket Analysis: Understanding associations between products frequently bought
together.

Predictive Modeling: Forecasting customer churn, loan defaults, equipment failures, etc.

Healthcare Analytics: Diagnosing diseases, predicting treatment outcomes, drug discovery.

Text Mining: Extracting insights from unstructured text data like social media, emails, or
reviews.

5. The Business Context of Data Mining

In business, data mining plays a crucial role in transforming raw data into strategic decisions.
Companies use data mining to:

Improve marketing strategies

Enhance customer experience and retention

Optimize pricing models

Detect financial fraud

Streamline operations

It helps businesses understand their customers better, anticipate market changes, and stay
ahead of competitors by leveraging data-driven insights.

6. Data Mining for Process Improvement

Data mining helps organizations identify inefficiencies in business processes by analyzing historical data. For example:
Manufacturing: Predictive maintenance using sensor data to avoid machine downtime.

Supply Chain: Optimizing inventory levels and delivery routes.

Customer Service: Analyzing call center logs to improve response times and satisfaction.

By uncovering bottlenecks and trends, data mining enables continuous improvement and
automation of processes.

7. Data Mining as a Research Tool

In academic and scientific research, data mining is used to:

Discover new patterns in biological data (bioinformatics)

Analyze social networks and human behavior

Support climate modeling and environmental studies

Aid in psychological and educational research

Researchers apply clustering, classification, regression, and association rule mining techniques to validate hypotheses and explore complex datasets.

8. Data Mining for Marketing

Marketing is one of the most common applications of data mining. Marketers use it to:

Segment Customers: Tailor messages to specific groups.

Predict Behavior: Anticipate purchases, churn, or responses to campaigns.


Recommend Products: Based on user preferences and past behavior (like Amazon or Netflix
recommendations).

Analyze Campaign Performance: Understand what works and optimize future efforts.

Data mining allows marketers to personalize communication, increase ROI, and improve
customer engagement.

9. Benefits of Data Mining

Here are some major benefits of data mining:

 Improved Decision-Making: Provides actionable insights to guide strategic planning.
 Cost Reduction: Identifies inefficiencies and reduces waste.
 Increased Profitability: Helps target high-value customers and optimize pricing.
 Better Customer Insights: Enables personalized experiences and targeted marketing.
 Risk Management: Assesses credit risk, fraud detection, and operational risks.
 Trend and Pattern Discovery: Uncovers hidden patterns that may lead to innovation or competitive advantage.

Data Mining Techniques



Data Mining is the process of discovering useful patterns and
insights from large amounts of data. It brings together data science,
information technology, and practical know-how to reassemble the
collected information into something valuable. Researchers and
professionals are working to develop newer, faster, cheaper, and
more accurate ways to accomplish this process. Various other
terms are attached to data mining, like "knowledge mining from
data," "knowledge extraction," "data analysis," and "data
dredging," which all simply refer to the same idea.
Data mining is often a synonym for Knowledge Discovery from
Data (KDD). Some people see data mining as a key part of KDD,
where smart methods are used to find patterns in the data. The term
"Knowledge Discovery in Databases" (KDD) was first coined by
Gregory Piatetsky-Shapiro in 1989. However, "data mining" became
more widely used in business and media. Today, both terms are
often used interchangeably.
Steps in Knowledge Discovery from Data (KDD)
Knowledge discovery from data (KDD) is a multi-step process for
extracting useful insights. The following are the key steps involved:
 Data Selection: Identify and select relevant data from various
sources for analysis.
 Data Preprocessing: Clean and transform the data to address
errors and inconsistencies, making it suitable for analysis.
 Data Transformation: Convert the cleaned data into a form that
is suitable for data mining algorithms.
 Data Mining: Apply data mining techniques to identify patterns
and relationships in the data, selecting appropriate algorithms
and models.
 Pattern Evaluation: Evaluate the identified patterns to
determine their usefulness in making predictions or decisions.
 Knowledge Representation: Present the patterns in a way that
is understandable and useful for decision-making.
 Knowledge Refinement: Refine the knowledge obtained to
improve accuracy and usefulness based on feedback.
 Knowledge Dissemination: Share the results in an easily
understandable format to aid decision-making.
We now discuss different types of data mining techniques which are
used to predict the desired output.
Data Mining Techniques
1. Association
Association analysis looks for patterns where certain items or conditions tend to appear together
in a dataset. It's commonly used in market basket analysis to see which products are often bought
together. One method, called associative classification, generates rules from the data and uses
them to build a model for predictions.
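
As a rough illustration of the idea (not a full Apriori implementation), the sketch below counts how often pairs of items appear together across a handful of invented transactions:

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk", "cereal"},
]

# Count how often each pair of items co-occurs across transactions
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs seen in at least 2 baskets are candidates for association rules
for pair, count in pair_counts.most_common():
    if count >= 2:
        print(pair, "support count:", count)
```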
2. Classification
Classification builds models to sort data into different categories. The model is trained on data
with known labels and is then used to predict labels for unknown data. Some examples of
classification models are:
 Decision Tree
 SVM(Support Vector Machine)
 Generalized Linear Models
 Bayesian classification
 Classification by Backpropagation
 K-NN Classifier
 Rule-Based Classification
 Frequent-Pattern Based Classification
 Rough Set Theory
 Fuzzy Logic
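As a small illustration of one of the models listed above, here is a decision-tree classifier trained on scikit-learn's built-in Iris dataset; the dataset, split ratio, and max_depth are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small labeled dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train the model on data with known labels
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Predict labels for unseen data and measure accuracy
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```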
3. Prediction
Prediction is similar to classification, but instead of predicting categories, it predicts continuous
values (like numbers). The goal is to build a model that can estimate the value of a specific
attribute for new data.
4. Clustering
Clustering groups similar data points together without using predefined categories. It helps
discover hidden patterns in the data by organizing objects into clusters where items in each
cluster are more similar to each other than to those in other clusters.
5. Regression
Regression is used to predict continuous values, like prices or temperatures, based on past data.
There are two main types: linear regression, which looks for a straight-line relationship, and
multiple linear regression, which uses more variables to make predictions.
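A brief linear-regression sketch, assuming a made-up relationship between advertising spend and sales (all numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (input) vs. sales (continuous output)
spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 45, 62, 85, 105])

# Fit a straight-line relationship and predict a new, unseen value
model = LinearRegression().fit(spend, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted sales for spend=60:", model.predict([[60]])[0])
```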
6. Artificial Neural Network (ANN) Classifier
An artificial neural network (ANN) is a model inspired by how the human brain works. It learns
from data by adjusting connections between artificial neurons. Neural networks are great for
recognizing complex patterns but require a lot of training and can be hard to interpret.
7. Outlier Detection
Outlier detection identifies data points that are very different from the rest of the data. These
unusual points, called outliers, can be spotted using statistical methods or by checking if they are
far away from other data points.
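A simple statistical sketch of the idea, assuming a small invented sample; points more than two standard deviations from the mean are flagged (the threshold is an illustrative choice):

```python
import numpy as np

# Hypothetical measurements with one unusually large value
values = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 25.0])

# Flag points more than 2 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2]
print("outliers:", outliers)
```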
8. Genetic Algorithm
Genetic algorithms are inspired by natural selection. They solve problems by evolving solutions
over several generations. Each solution is like a "species," and the fittest solutions are kept and
improved over time, simulating "survival of the fittest" to find the best solution to a problem.
Advantages of Data Mining
Data mining is a powerful tool that offers many benefits across a wide range of industries. The
following are some of the advantages of data mining:

| Advantage | Description |
| --- | --- |
| Better Decision Making | Helps extract useful information from large datasets for informed decision making. |
| Improved Marketing | Assists in identifying target markets and developing personalized marketing strategies. |
| Increased Efficiency | Improves operational efficiency by identifying inefficiencies and optimizing processes. |
| Fraud Detection | Detects fraudulent activities by analyzing patterns in financial transactions. |
| Customer Retention | Helps identify customers at risk of leaving and develop strategies to retain them. |
| Competitive Advantage | Provides businesses with insights into new opportunities and emerging trends. |
| Improved Healthcare | Improves healthcare outcomes by identifying risk factors and enabling early diagnosis. |

Disadvantages Of Data Mining


While data mining offers many benefits, there are also some disadvantages and challenges
associated with the process. The following are some of the main disadvantages of data mining:

| Disadvantage | Description |
| --- | --- |
| Data Quality | Results can be unreliable if the data is incomplete, inaccurate, or inconsistent. |
| Data Privacy and Security | Sensitive data could be misused if it falls into the wrong hands, risking privacy and security. |
| Ethical Considerations | Raises ethical concerns about privacy, surveillance, and discrimination. |
| Technical Complexity | Requires expertise in statistics, computer science, and domain knowledge. |
| Cost | Can be expensive, especially when large datasets need to be analyzed. |
| Interpretation of Results | Generated results can be difficult to interpret and find meaningful patterns in. |
| Dependence on Technology | Relies heavily on technology, and technical failures can lead to data loss or corruption. |

UNIT - IV

Cluster detection:

Introduction to Clustering

- Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.

- It is an unsupervised learning technique with no predefined classes.

- Goal: Maximize intra-cluster similarity and minimize inter-cluster similarity.

Why Cluster Detection?

- Useful in exploratory data analysis.

- Applications include:

- Pattern recognition

- Image processing

- Market research

- Spatial data analysis

- Anomaly detection
Types of Clustering

1. Partitioning Methods

- Divide data into non-overlapping subsets (e.g., k-means, k-medoids).

- Each object belongs to exactly one cluster.

2. Hierarchical Methods

- Produces nested clusters organized as a tree (dendrogram).

- Two types:

- Agglomerative: bottom-up approach

- Divisive: top-down approach

3. Density-Based Methods

- Clusters are dense regions separated by low-density regions.

- Example: DBSCAN
- Can find arbitrary shaped clusters and handle noise.

4. Grid-Based Methods

- Quantize the space into finite number of cells.

- Fast processing time.

- Example: STING

5. Model-Based Methods

- Hypothesize a model for each cluster and find best fit.

- Uses statistical models or neural networks.

Major Clustering Approaches (Summary)

| Method | Description | Strengths | Weaknesses |
| --- | --- | --- | --- |
| k-means | Partitioning method; minimizes distance to centroid | Simple, fast | Sensitive to outliers, assumes spherical clusters |
| DBSCAN | Density-based; finds arbitrary shapes | Handles noise well | Requires density parameters |
| Hierarchical | Tree-like structure of clusters | Visual interpretation | Computationally expensive |
| EM (Expectation-Maximization) | Model-based; probabilistic assignment | Soft clustering | Assumes Gaussian distributions |
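
For the density-based approach in the table, here is a brief scikit-learn DBSCAN sketch; the two small point groups and the eps/min_samples values are invented for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated point (noise)
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # group A
    [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # group B
    [4.5, 0.5],                            # isolated point
])

# eps = neighborhood radius, min_samples = points needed to form a dense region
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)   # noise points are labeled -1
```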

K-Means Clustering – Introduction

K-Means Clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into different clusters. It is used to organize data into groups based on their similarity.
Understanding K-means Clustering

Explanation of the K-Means Clustering Process

K-Means clustering is a popular partitioning method used in unsupervised learning. Below is a detailed explanation of the steps involved:
Step 1: Unlabelled Data

Input: The dataset consists of unlabelled data points.

Objective: Group these data points into clusters based on their similarity.

Key Point: At this stage, there are no predefined labels or clusters.

Step 2: Applying K-Means Algorithm

K-Means Objective: Minimize the within-cluster sum of squares (WCSS), which measures the distance between each point and its cluster centroid.

Steps Involved:

Initialization: Choose the number of clusters k (e.g., k=3 in this example). Randomly initialize k centroids.

Assignment Step: Assign each data point to the nearest centroid based on a distance metric (e.g., Euclidean distance).

Update Step: Recalculate the centroids as the mean of all points assigned to each cluster.

Iteration: Repeat the assignment and update steps until convergence (i.e., centroids do not change significantly or a maximum number of iterations is reached).

Step 3: Labelled Clusters

Output: The algorithm assigns labels to the data points, grouping them into distinct clusters.

Centroids: Each cluster has a centroid that represents the center of the cluster.

Clusters: The data points are now grouped into distinct clusters, with each cluster having its own centroid.
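
A minimal scikit-learn sketch of these steps, assuming a small invented 2-D dataset and k = 3 (all values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points (three loose groups, values are illustrative)
X = np.array([
    [1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
    [8.0, 8.0], [8.5, 8.2], [7.8, 7.9],
    [0.5, 9.0], [1.0, 9.5], [0.8, 8.8],
])

# k = 3: initialize centroids, assign points, update centroids, repeat
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

print("labels:   ", kmeans.labels_)           # cluster assigned to each point
print("centroids:", kmeans.cluster_centers_)  # final centroid of each cluster
```

Here `labels_` gives the cluster assigned to each point and `cluster_centers_` the final centroids, matching the output described in Step 3.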

Key Observations

Unlabelled Data: The initial dataset is a collection of unstructured points without any predefined groupings. This represents real-world scenarios where data is unlabeled and patterns need to be discovered.

Labelled Clusters: After applying K-Means, the data points are grouped into distinct clusters, each with its own centroid.

Centroids: Centroids are the representative points of each cluster. They are recalculated iteratively to minimize the distance between points and their respective centroids.

Visualization: The transition from unlabelled data to labelled clusters demonstrates how K-Means organizes data into meaningful groups.
Advantages of K-Means

Simple and computationally efficient.

Works well with spherical clusters.

Provides clear visual interpretation through centroids.

Limitations of K-Means

Requires specifying the number of clusters k in advance.

Sensitive to initial centroid placement.

Assumes clusters are of similar size and density.

Performs poorly with non-spherical or irregularly shaped clusters.

memory-based reasoning:

Overview of the Memory-Based Reasoning Node


The Memory-Based Reasoning node belongs to the Model category in the
SAS data mining process of Sample, Explore, Modify, Model, Assess
(SEMMA). Memory-based reasoning is a process that identifies similar cases
and applies the information that is obtained from these cases to a new record.
In Enterprise Miner, the Memory-Based Reasoning (MBR) node is a modeling
tool that uses a k-nearest neighbor algorithm to categorize or predict
observations.
The k-nearest neighbor algorithm takes a data set and a probe, where each
observation in the data set is composed of a set of variables and the probe
has one value for each variable. The distance between an observation and
the probe is calculated. The k observations that have the smallest distances to
the probe are the k-nearest neighbor to that probe.
In Enterprise Miner, the k-nearest neighbors are determined by the Euclidean
distance between an observation and the probe. Based on the target values of
the k-nearest neighbors, each of the k-nearest neighbors votes on the target
value for a probe. The votes are the posterior probabilities for the binary or
nominal target variable.
The following example shows the voting approach of these neighbors for a
binary target variable when different values of k are specified. In this example,
observations 7, 12, 35, 108, and 334 are the five closest observations to the
probe. Observations 108 and 35 have the shortest and the longest distances
to the probe, respectively.
The k-nearest neighbors are the first k observations that have the closest
distances to the probe. If the value of k is set to 3, then the target values of
the first three nearest neighbors (108, 12, and 7) are used. The target values
for these three neighbors are Y, N, and Y. Therefore, the posterior probability
for the probe to have the target value Y is 2/3 (67%).
If the target is an interval variable, then the average of the target values of the
k-nearest neighbors is calculated as the prediction for the probe observation.
See the Predictive Modeling section for information that applies to all of the
predictive modeling nodes.
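
The voting scheme described above can be sketched generically in a few lines of Python; this is an illustration of k-nearest-neighbor voting, not Enterprise Miner code, and the observations, targets, and probe values are invented:

```python
import numpy as np
from collections import Counter

# Invented observations: each row is (x1, x2), with a binary Y/N target value
observations = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.2], [0.9, 1.1]])
targets      = np.array(["Y",        "N",        "N",        "N",        "Y"])

probe = np.array([1.1, 1.0])   # the new record to be scored
k = 3

# Euclidean distance from the probe to every observation
distances = np.linalg.norm(observations - probe, axis=1)

# The k observations with the smallest distances are the k-nearest neighbors
nearest = np.argsort(distances)[:k]
votes = Counter(targets[nearest])

# Posterior probability for each target value = share of the k votes
for value, count in votes.items():
    print(f"P(target={value}) = {count}/{k} = {count / k:.2f}")
```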

Data Set Requirements of the Memory-Based Reasoning Node
The Memory-Based Reasoning node requires exactly one target variable. If
more than one target variable is defined in a predecessor node of the process
flow diagram, you must select one variable as the target in order to run the
node. The target variables can be binary, nominal, or interval. You cannot
model an ordinal target variable, but ordinal target variables can be modeled
as interval targets.
The Memory-Based Reasoning node assumes that the variables with a model
role of "input" are numeric, orthogonal to each other, and standardized. You
may use the Princomp node to generate numeric, orthogonal, and
standardized variables that can be used as inputs for the Memory-Based
Reasoning node.

Memory-Based Reasoning Node and Missing Values


When the Memory-Based Reasoning node encounters input data that has
missing variable values, the missing variable values are replaced with the
mean of the variable, which is stored in the DMDB catalog.

Link analysis
Link analysis examines the relationships between entities in data. In the policing domain, these data points are usually people, locations, vehicles,
property, organizations, and firearms. When visualizing this information, each
individual entity becomes a “node” on the chart tied to other nodes by way of their
relationships or “links” to each other. By leveraging link analysis, investigators are
better able to establish meaningful relationships between people, locations, property,
and bank accounts, and oftentimes expedite cases in the process.
We’ve all seen law enforcement TV shows where the detectives have a big board
behind them with strings and photos attached to it. In a nutshell, that’s what link
analysis is. It’s a quick way to see people, places, strings, and things.

Mining Association Rules in Large Databases

What are association rules in data mining?


Association rules are if-then statements that show the probability of
relationships between data items within large data sets in various types of
databases. At a basic level, association rule mining involves the use
of machine learning models to analyze data for patterns, called co-
occurrences, in a database. It identifies frequent if-then associations, which
themselves are the association rules.

For example, if 75% of people who buy cereal also buy milk, then there is
a discernible pattern in transactional data that customers who buy cereal
often buy milk. An association rule is that there is an association between
buying cereal and milk.

Different algorithms use association rules to discover such patterns within data sets. These algorithms are capable of analyzing big data sets to
discover patterns. Artificial intelligence (AI) and machine learning are
being used to enable algorithms and their related association rules to keep
up with the large volumes of data being generated today.

Why are association rules important?


Various vertical markets use these algorithms in different ways. The
fundamental patterns and associations between data points discovered
using association rules shape how businesses operate. For example,
association rule mining is used to help discover correlations between
suspicious and normal transactions in transactional data or disease and
healthy patterns in medical data sets.


These rules and the algorithms they apply expedite and simplify large-
scale analyses that are impossible for people to accomplish without
sacrificing productivity. They affect the work of nontechnical
professionals, as well as technical ones. A marketing team could use
association rules on customer purchase history data to better understand
which customers are most likely to repurchase. A cybersecurity
professional might use association rules on an algorithm used to detect
fraud and cyberattacks on IT infrastructures.

How do association rules work?


An association rule has two parts: an antecedent (if) and a consequent
(then). An antecedent is an item found within the data. A consequent is an
item found in combination with the antecedent. The if-then statements
form itemsets, which are the basis for calculating association rules made
up of two or more items in a data set.

Data pros search data for frequently occurring if-then statements. They
then look for support for the statement in terms of how often it appears and
confidence in it from the number of times it's found to be true.
Association rules are typically created from itemsets that include many
items and are well represented in data sets. However, if rules are built from
analyzing all possible itemsets or too many sets of items, too many rules
result, and they have little meaning.

Once established, data scientists and others in fields requiring data analyses apply association rules to uncover important patterns.

What is support and confidence in data mining?


Association rules are created by searching data for frequent if-then patterns
and using the criteria support and confidence to identify the most important
relationships. Support indicates how frequently an item appears in the data.
Confidence indicates the number of times the if-then statement is found to
be true. A third metric, called lift, can be used to compare observed
confidence with expected confidence, or how many times an if-then
statement is expected to be found true.

Two steps are involved in generating association rules. Support and confidence play a crucial role in these steps:

1. Identify items that commonly appear in a given data set. Given how
frequently certain items appear, set minimum support thresholds to
indicate how many times items must appear to undergo step two.
2. Look at each itemset that includes the items meeting certain minimum
support thresholds. Calculate confidence thresholds that indicate the
frequency an association between two items actually occurs. For
instance, if two items are matched more than half the time they
appear in a data set, that could constitute a simple confidence
threshold.
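
A small worked sketch of support, confidence, and lift over hypothetical transactions, mirroring the cereal-and-milk example above (the baskets are invented so that the confidence works out to 75%):

```python
# Hypothetical transactions (each basket is a set of purchased items)
baskets = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"cereal", "bread"},
    {"milk", "bread"},
    {"cereal", "milk"},
]
n = len(baskets)

# Support: how frequently an itemset appears in the data
support_cereal      = sum("cereal" in b for b in baskets) / n
support_milk        = sum("milk" in b for b in baskets) / n
support_cereal_milk = sum({"cereal", "milk"} <= b for b in baskets) / n

# Confidence of the rule "if cereal then milk"
confidence = support_cereal_milk / support_cereal

# Lift: observed confidence compared with what independence would predict
lift = confidence / support_milk

print(f"support(cereal & milk)     = {support_cereal_milk:.2f}")
print(f"confidence(cereal -> milk) = {confidence:.2f}")
print(f"lift                       = {lift:.2f}")
```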

Genetic Algorithms - Introduction


Genetic Algorithm (GA) is a search-based optimization technique based
on the principles of Genetics and Natural Selection. It is frequently used to
find optimal or near-optimal solutions to difficult problems which
otherwise would take a lifetime to solve. It is frequently used to solve
optimization problems, in research, and in machine learning.

Introduction to Optimization
Optimization is the process of making something better. In any process, we have a set of inputs and a set of outputs.

Optimization refers to finding the values of inputs in such a way that we get the best output values. The definition of best varies from problem to
problem, but in mathematical terms, it refers to maximizing or
minimizing one or more objective functions, by varying the input
parameters.

The set of all possible solutions or values which the inputs can take make
up the search space. In this search space, lies a point or a set of points
which gives the optimal solution. The aim of optimization is to find that
point or set of points in the search space.

What are Genetic Algorithms?


Nature has always been a great source of inspiration to all mankind.
Genetic Algorithms (GAs) are search based algorithms based on the
concepts of natural selection and genetics. GAs are a subset of a much
larger branch of computation known as Evolutionary Computation.
GAs were developed by John Holland and his students and colleagues at
the University of Michigan, most notably David E. Goldberg and has since
been tried on various optimization problems with a high degree of
success.

In GAs, we have a pool or a population of possible solutions to the given problem. These solutions then undergo recombination and mutation (like
in natural genetics), producing new children, and the process is repeated
over various generations. Each individual (or candidate solution) is
assigned a fitness value (based on its objective function value) and the
fitter individuals are given a higher chance to mate and yield more fitter
individuals. This is in line with the Darwinian Theory of Survival of the
Fittest.

In this way we keep evolving better individuals or solutions over generations, till we reach a stopping criterion.

Genetic Algorithms are sufficiently randomized in nature, but they perform much better than random local search (in which we just try various random solutions, keeping track of the best so far), as they exploit historical information as well.
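
A compact, illustrative sketch of this loop for a toy problem (maximizing f(x) = -(x - 7)^2 over real-valued x); the selection, crossover, and mutation choices and all parameter values here are simplifications, not a prescribed GA design:

```python
import random

def fitness(x):
    # Toy objective: the closer x is to 7, the fitter the individual
    return -(x - 7) ** 2

# Initial population of candidate solutions
population = [random.uniform(-10, 10) for _ in range(20)]

for generation in range(50):
    # Selection: keep the fitter half of the population as parents
    population.sort(key=fitness, reverse=True)
    parents = population[:10]

    # Recombination + mutation: children mix two parents, then get a small tweak
    children = []
    while len(children) < 10:
        a, b = random.sample(parents, 2)
        child = (a + b) / 2                 # arithmetic crossover
        child += random.gauss(0, 0.5)       # mutation
        children.append(child)

    population = parents + children         # next generation

best = max(population, key=fitness)
print("best solution:", round(best, 3), "fitness:", round(fitness(best), 3))
```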

Advantages of GAs
GAs have various advantages which have made them immensely popular.
These include −

 Does not require any derivative information (which may not be available
for many real-world problems).
 Is faster and more efficient as compared to the traditional methods.
 Has very good parallel capabilities.
 Optimizes both continuous and discrete functions and also multi-objective
problems.
 Provides a list of good solutions and not just a single solution.
 Always gets an answer to the problem, which gets better over time.
 Useful when the search space is very large and there are a large number
of parameters involved.

Limitations of GAs
Like any technique, GAs also suffer from a few limitations. These include

 GAs are not suited for all problems, especially problems which are simple
and for which derivative information is available.
 Fitness value is calculated repeatedly which might be computationally
expensive for some problems.
 Being stochastic, there are no guarantees on the optimality or the quality
of the solution.
 If not implemented properly, the GA may not converge to the optimal
solution.

Genetic Algorithms have the ability to deliver a good-enough solution fast enough. This makes genetic algorithms attractive for use in solving optimization problems. The reasons why GAs are needed are as follows −

Solving Difficult Problems


In computer science, there is a large set of problems, which are NP-Hard.
What this essentially means is that, even the most powerful computing
systems take a very long time (even years!) to solve that problem. In
such a scenario, GAs prove to be an efficient tool to provide usable near-
optimal solutions in a short amount of time.

Failure of Gradient Based Methods


Traditional calculus-based methods work by starting at a random point and moving in the direction of the gradient, till we reach the top of the hill. This technique is efficient and works very well for single-peaked objective functions like the cost function in linear regression. But in most real-world situations, we have very complex problems whose landscapes are made of many peaks and many valleys, which cause such methods to fail, as they suffer from an inherent tendency of getting stuck at local optima.
Current Challenges with Link Analysis
Most systems only aggregate data and display link charts of that aggregation.
This limits the use of link charts to only displaying how entities are related
within a single document. CrimeTracer™ is different in that
it consolidates data from across the country, providing a comprehensive set of
relationships between entities that extend across jurisdictions.

Furthermore, while other systems can aggregate documents from various agencies across the country into a single repository, they do not consolidate
the entities (people, places, things) into unified records that can be
investigated from a single point of reference. So, an investigator will still have
to read all the individual reports to find the associated people, vehicles,
locations, and more that are needed to build a case.

Ultimately, the problem facing analysts and detectives currently is that there is
no figurative “board” that is big enough to connect all the “strings” in any
meaningful way. Too many “strings” on the board obscure what is pertinent
and what is irrelevant. Ultimately, link analysis solves this problem.
How CrimeTracer Differentiates Itself in the World of Link Analysis
CrimeTracer is a powerful law enforcement search engine and information
platform that enables law enforcement to search over 1.3 billion records from
agencies across the U.S.

Furthermore, CrimeTracer recognizes that valuable data is not just limited to arrests or bookings or incident reports: relevant data can be found in sources
as disparate as court records, LPR, ShotSpotter incidents, pawn tickets, and
jail records. CrimeTracer’s ability to consolidate data from over 40 different
document types provides the most robust searchable database for subjects of
interest available today.

What is a Neural Network?



Neural networks are machine learning models that mimic the
complex functions of the human brain. These models consist of
interconnected nodes or neurons that process data, learn patterns,
and enable tasks such as pattern recognition and decision-making.
In this article, we will explore the fundamentals of neural networks,
their architecture, how they work, and their applications in various
fields. Understanding neural networks is essential for anyone
interested in the advancements of artificial intelligence.
Understanding Neural Networks in Deep
Learning
Neural networks are capable of learning and identifying patterns
directly from data without pre-defined rules. These networks are
built from several key components:
1. Neurons: The basic units that receive inputs; each neuron is governed by a threshold and an activation function.
2. Connections: Links between neurons that carry information, regulated by weights and biases.
3. Weights and Biases: These parameters determine the strength and influence of connections.
4. Propagation Functions: Mechanisms that help process and transfer data across layers of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve accuracy.
Learning in neural networks follows a structured, three-stage
process:
1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network generates an output.
3. Iterative Refinement: The network refines its output by adjusting weights and biases, gradually improving its performance on diverse tasks.
In an adaptive learning environment:
 The neural network is exposed to a simulated scenario or
dataset.
 Parameters such as weights and biases are updated in response
to new data or conditions.
 With each adjustment, the network’s response evolves, allowing
it to adapt effectively to different tasks or environments.
There is a close analogy between a biological neuron and an artificial neuron: in both systems, inputs are received and processed to produce outputs.
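
A tiny numerical sketch of these components (weights, biases, an activation function, and forward propagation) for a single-hidden-layer network; the layer sizes and random values are arbitrary, and no learning rule or training step is shown:

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes values into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Input with 3 features (one example)
x = np.array([0.5, -1.2, 2.0])

# Weights and biases for a 3 -> 4 -> 1 network (values are arbitrary)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

# Forward propagation: each layer computes weighted inputs plus bias, then activates
hidden = sigmoid(W1 @ x + b1)
output = sigmoid(W2 @ hidden + b2)
print("network output:", output)
```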

Importance of Neural Networks


Neural networks are pivotal in identifying complex patterns, solving
intricate challenges, and adapting to dynamic environments. Their
ability to learn from vast amounts of data is transformative,
impacting technologies like natural language processing, self-
driving vehicles, and automated decision-making.
Neural networks streamline processes, increase efficiency, and
support decision-making across various industries. As a backbone
of artificial intelligence, they continue to drive innovation, shaping
the future of technology.
Evolution of Neural Networks
Neural networks have undergone significant evolution since their
inception in the mid-20th century. Here’s a concise timeline of the
major developments in the field:
 1940s-1950s: The concept of neural networks began with
McCulloch and Pitts' introduction of the first mathematical model
for artificial neurons. However, the lack of computational power
during that time posed significant challenges to further
advancements.
 1960s-1970s: Frank Rosenblatt worked on perceptrons. Perceptrons are simple single-layer networks that can solve linearly separable problems but cannot perform complex tasks.
 1980s: The development of backpropagation by Rumelhart,
Hinton, and Williams revolutionized neural networks by enabling
the training of multi-layer networks. This period also saw the rise
of connectionism, emphasizing learning through interconnected
nodes.
 1990s: Neural networks experienced a surge in popularity with
applications across image recognition, finance, and more.
However, this growth was tempered by a period known as
the "AI winter," during which high computational costs and
unrealistic expectations dampened progress.
 2000s: A resurgence was triggered by the availability of larger
datasets, advances in computational power, and innovative
network architectures. Deep learning, utilizing multiple layers,
proved highly effective across various domains.
 2010s: The landscape of machine learning has been dominated
by deep learning with CNNs (Convolutional Neural Networks)
excelling in image classification and RNNs (Recurrent Neural
Networks), LSTMs, and GRUs gaining traction in sequence-
based tasks like language modeling and speech recognition.
 2017: Transformer models, introduced by Vaswani et al. in
"Attention is All You Need," revolutionized NLP by using a self-
attention mechanism for parallel processing, improving
efficiency. Models like BERT, GPT, and T5 set new benchmarks
in machine translation and text generation.
