Generalized Sequential Pattern (GSP) Mining in Data Mining
Last Updated: 02 Feb, 2023
GSP (Generalized Sequential Pattern) is a widely used algorithm for mining sequential patterns from large databases. Like many sequence mining algorithms, it is based on the Apriori algorithm and follows a level-wise paradigm. In the first pass over the database, GSP counts the support of every individual item and keeps the frequent ones, which are called frequent 1-sequences. These are used to generate the candidate 2-sequences. The next pass counts the support of those candidates and keeps the frequent 2-sequences, which in turn are used to generate the candidate 3-sequences, and so on. In each pass, GSP discards every candidate whose support is below a user-defined threshold called the minimum support. Only candidates whose support is at least the minimum support are kept. The algorithm repeats, scanning the database once per level, until no more frequent sequences are found.
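The first pass described above can be sketched in a few lines. This is only an illustration under assumptions: the tiny `database`, the `min_support` value and the function name `frequent_items` are hypothetical, and each sequence is represented as a list of `frozenset` elements, which is one possible encoding and not the only one.

```python
from collections import Counter

def frequent_items(database, min_support):
    """Return the items that occur in at least `min_support` sequences."""
    counts = Counter()
    for sequence in database:
        # Flatten the elements; an item counts once per sequence,
        # no matter how many times it appears inside that sequence.
        items = set().union(*sequence)
        counts.update(items)
    return {item for item, n in counts.items() if n >= min_support}

# Hypothetical database of three sequences (not taken from the article).
database = [
    [frozenset("a"), frozenset("bc")],
    [frozenset("ab"), frozenset("d")],
    [frozenset("b"), frozenset("a")],
]
min_support = 2  # assumed threshold

L1 = frequent_items(database, min_support)
print(sorted(L1))  # ['a', 'b']
```

Here c and d each occur in only one sequence, so they are dropped before any 2-length candidates are generated.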
Basics of Sequential Pattern (GSP) Mining:
- Sequence: A sequence is formally defined as an ordered list of items {s1, s2, s3, …, sn}. As the name suggests, the items occur in a particular order. For example, the series of purchases made by one customer over time forms a sequence.
- Subsequence: A sequence obtained by deleting zero or more items from another sequence, without changing the order of the remaining items, is called a subsequence. Suppose {a, b, g, q, y, e, c} is a sequence. A subsequence of this can be {a, b, c} or {y, e}. Observe that the items of a subsequence do not have to be consecutive in the original sequence, but they must appear in the same order. GSP examines the subsequences of the sequences in the database to find the sequence patterns.
- Sequence pattern: A subsequence is called a pattern when it is found in multiple sequences. More precisely, a subsequence is a sequence pattern when its frequency is equal to or greater than the "support" threshold. The goal of the GSP algorithm is to mine the sequence patterns from a large database of sequences. For example: given the sequences {b, x, c, a}, {a, b, q}, and {a, u, b}, the pattern <a, b> occurs in the second and third sequences (in the first one, a comes after b), so it is a sequence pattern if the support threshold is 2.
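The subsequence test and the pattern example above can be checked with a short sketch. It assumes the simplified case where every sequence is a flat list of single items; the function name `is_subsequence` is illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order.

    Gaps are allowed: the matched items need not be consecutive.
    """
    remaining = iter(sequence)
    # `item in remaining` advances the iterator until the item is found,
    # so each later item is searched for only after the previous match.
    return all(item in remaining for item in pattern)

# Subsequences of {a, b, g, q, y, e, c}
full = ["a", "b", "g", "q", "y", "e", "c"]
print(is_subsequence(["a", "b", "c"], full))  # True
print(is_subsequence(["y", "e"], full))       # True
print(is_subsequence(["e", "y"], full))       # False, order is wrong

# Support of the pattern <a, b> in the three example sequences
database = [
    ["b", "x", "c", "a"],
    ["a", "b", "q"],
    ["a", "u", "b"],
]
support = sum(is_subsequence(["a", "b"], seq) for seq in database)
print(support)  # 2
```

The first sequence contains both a and b, but b comes before a, so it does not support the pattern.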
Sequential Pattern (GSP) Mining uses:
Sequential pattern mining is a technique used to identify patterns in sequential data, and GSP (Generalized Sequential Pattern) is one of the algorithms used to perform it. The goal is to discover patterns in data that occur over time, such as customer buying habits, website navigation patterns, or sensor data.
Some of the main uses of GSP mining include:
- Market basket analysis: GSP mining can be used to analyze customer buying habits and identify products that are frequently purchased together. This can help businesses to optimize their product placement and marketing strategies.
- Fraud detection: GSP mining can be used to identify patterns of behavior that are indicative of fraud, such as unusual patterns of transactions or access to sensitive data.
- Website navigation: GSP mining can be used to analyze website navigation patterns, such as the sequence of pages visited by users, and identify areas of the website that are frequently accessed or ignored.
- Sensor data analysis: GSP mining can be used to analyze sensor data, such as data from IoT devices, and identify patterns in the data that are indicative of certain conditions or states.
- Social media analysis: GSP mining can be used to analyze social media data, such as posts and comments, and identify patterns in the data that indicate trends, sentiment, or other insights.
- Medical data analysis: GSP mining can be used to analyze medical data, such as patient records, and identify patterns in the data that are indicative of certain health conditions or trends.
Methods for Sequential Pattern Mining:
- Apriori-based Approaches
- Pattern-Growth-based Approaches
Sequence Database: A database that consists of ordered elements or events is called a sequence database. Example of a sequence database:
S.No. | SID | Sequence
1. | 100 | <a(ab)(ac)d(cef)> or <a{ab}{ac}d{cef}>
2. | 200 | <(ad)c(bcd)(abe)>
3. | 300 | <(ef)(ab)(def)cb>
4. | 400 | <eg(adf)cbc>
Transaction: A sequence consists of elements, which are also called transactions.
For example, <a(ab)(ac)d(cef)> is a sequence, and (a), (ab), (ac), (d) and (cef) are its elements.
An element may contain a set of items. Items within an element are unordered, and by convention we list them alphabetically.
For example, (cef) is an element that consists of the 3 items c, e and f. Since all three items belong to the same element, their order does not matter.
The order of the elements within a sequence does matter, unlike the order of the items within the same element.
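One way to hold such sequences in memory is a list of sets, one set per element. The parser below is a minimal sketch that assumes every item is a single character, as in the examples here; `parse_sequence` is an illustrative name.

```python
def parse_sequence(text):
    """Parse '<a(ab)(ac)d(cef)>' into a list of frozensets, one per element.

    Assumes single-character items; a bare letter is an element of size 1.
    """
    body = text.strip().strip("<>")
    elements = []
    i = 0
    while i < len(body):
        if body[i] == "(":
            close = body.index(")", i)
            elements.append(frozenset(body[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements

seq = parse_sequence("<a(ab)(ac)d(cef)>")
print(len(seq))                  # 5 elements
print(sorted(seq[4]))            # ['c', 'e', 'f']

# Item order inside an element is irrelevant, element order is not.
print(parse_sequence("<(cb)>") == parse_sequence("<(bc)>"))  # True
print(parse_sequence("<bc>") == parse_sequence("<cb>"))      # False
```

Using a `frozenset` for each element makes (bc) and (cb) compare equal, while the enclosing list keeps the order of the elements.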
k-length Sequence:
The length of a sequence is the number of items it contains, denoted by k. A sequence with 2 items is called a 2-length sequence. This term is used when generating the 2-length candidate sequences. Examples of 2-length sequences are {ab}, {(ab)}, {bc} and {(bc)}.
- {bc} denotes a 2-length sequence where b and c belong to two different transactions. This can also be written as {(b)(c)}.
- {(bc)} denotes a 2-length sequence where b and c are items belonging to the same transaction, and are therefore enclosed in the same parentheses. This can also be written as {(cb)}, because the order of items in the same transaction does not matter.
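The difference between {bc} and {(bc)} can be made concrete with a small sketch. It assumes the list-of-frozensets encoding (one set per transaction); the names are illustrative.

```python
def sequence_length(elements):
    """Length of a sequence = total number of items over all its elements."""
    return sum(len(element) for element in elements)

b_then_c = [frozenset("b"), frozenset("c")]  # {bc}, i.e. {(b)(c)}
b_with_c = [frozenset("bc")]                 # {(bc)}
c_with_b = [frozenset("cb")]                 # {(cb)}

print(sequence_length(b_then_c))  # 2
print(sequence_length(b_with_c))  # 2
print(b_then_c == b_with_c)       # False: different transactions vs. the same one
print(b_with_c == c_with_b)       # True: item order inside a transaction is ignored
```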
Support in k-length Sequence:
Support means frequency. The support of a k-length sequence is the number of sequences in the sequence database that contain it as a subsequence. Each database sequence is counted at most once, and the order of the elements is taken into account when checking for containment.
Illustration:
Suppose we have 2 sequences in the database.
s1: <a(bc)b(cd)>
s2: <b(ab)abc(de)>
We need to find the support of {ab} and {(bc)}
Finding support of {ab}:
Since a and b belong to different elements, their order matters: a must occur in an element that comes before the element containing b.
s1: <a(bc)b(cd)>, a is the first element and b occurs later, in (bc). So {ab} is present.
s2: <b(ab)abc(de)>, a occurs in the second element (ab) and b occurs later as the fourth element. So {ab} is present here too. The a and b inside (ab) alone would not count, because they belong to the same element.
Hence, support of {ab} is 2.
Finding support of {(bc)}:
Since b and c must be present in the same element, their order does not matter.
s1: <a(bc)b(cd)>, contains the element (bc), so this counts.
s2: <b(ab)abc(de)>, looks like a match but is not: b and c are in different elements here. So, we don't count it.
Hence, support of {(bc)} is 1.
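The two support counts for this illustration can be reproduced with a short sketch. It assumes no time or gap constraints and encodes each sequence as a list of `frozenset` elements; `contains` and `support` are illustrative names. Note that s2 does contain a (inside (ab)) followed later by a separate b, so it is counted for {ab}.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`.

    Each pattern element must be a subset of some sequence element, and the
    matched sequence elements must appear in the same order as in the pattern.
    Matching each pattern element as early as possible is sufficient here.
    """
    pos = 0
    for wanted in pattern:
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True

def support(database, pattern):
    """Number of database sequences that contain `pattern`."""
    return sum(contains(sequence, pattern) for sequence in database)

s1 = [frozenset("a"), frozenset("bc"), frozenset("b"), frozenset("cd")]        # <a(bc)b(cd)>
s2 = [frozenset("b"), frozenset("ab"), frozenset("a"), frozenset("b"),
      frozenset("c"), frozenset("de")]                                         # <b(ab)abc(de)>
database = [s1, s2]

print(support(database, [frozenset("a"), frozenset("b")]))  # {ab}   -> 2
print(support(database, [frozenset("bc")]))                 # {(bc)} -> 1
```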
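How to join L2 and L2 to give C3?
L2 is the final set of 2-length sequences after pruning. After pruning, all the entries left in the set have support of at least the threshold. Two sequences s1 and s2 can be joined if the sequence obtained by removing the first item of s1 is the same as the sequence obtained by removing the last item of s2.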
Case 1: Join {ab} and {ac}
s1: {ab}, s2: {ac}
After removing a (the first item) from s1 and c (the last item) from s2:
s1′={b}, s2′={a}
s1′ and s2′ are not the same, so s1 and s2 can't be joined.
Case 2: Join {ab} and {be}
s1: {ab}, s2: {be}
After removing a from s1 and e from s2:
s1′={b}, s2′={b}
s1′ and s2′ are exactly the same, so s1 and s2 can be joined.
s1 + s2 = {abe}
Case 3: Join {(ab)} and {be}
s1: {(ab)}, s2: {be}
After removing a from s1 and e from s2:
s1′={(b)}, s2′={(b)}
s1′ and s2′ are exactly the same, so s1 and s2 can be joined.
s1 + s2 = {(ab)e}
s1 and s2 are joined in such a way that the items stay in their correct elements or transactions.
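The three cases above can be replayed with a small sketch of the join test. It is an illustration under assumptions: sequences are tuples of alphabetically sorted item tuples, the helper names are hypothetical, and the special handling GSP needs when joining 1-length sequences is not covered.

```python
def drop_first(seq):
    """Remove the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Remove the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def join(s1, s2):
    """Return the joined candidate, or None if s1 and s2 cannot be joined."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last item of s2 was its own element: append it as a new element.
        return s1 + ((last_item,),)
    # Otherwise it shared an element: merge it into the last element of s1.
    return s1[:-1] + (tuple(sorted(s1[-1] + (last_item,))),)

ab = (("a",), ("b",))      # {ab}
ac = (("a",), ("c",))      # {ac}
be = (("b",), ("e",))      # {be}
ab_same = (("a", "b"),)    # {(ab)}

print(join(ab, ac))       # None                      (case 1)
print(join(ab, be))       # (('a',), ('b',), ('e',))  (case 2: {abe})
print(join(ab_same, be))  # (('a', 'b'), ('e',))      (case 3: {(ab)e})
```

Tuples are used instead of sets here because the join needs a well-defined "first" and "last" item inside each element.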
Pruning Phase: While building Ck (the set of k-length candidates), we delete any candidate sequence that has a contiguous (k-1)-subsequence whose support count is less than the minimum support (threshold). When no maximum-gap constraint is used, we can go further and delete any candidate sequence that has any subsequence without minimum support.
{abg} is a candidate sequence of C3.
To check whether {abg} is a proper candidate, before counting its own support, we check the support of its subsequences.
The subsequences of a 3-length sequence are 1-length and 2-length sequences, whose support is already known because we build the candidate sets incrementally: 1-length, then 2-length, and so on.
The 2-length subsequences of {abg} are: {ab}, {bg} and {ag}.
Check the support of all three subsequences. If any of them has support less than the minimum support, delete the sequence {abg} from the set C3; otherwise keep it.
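The check for {abg} can be sketched as follows. The code assumes the simple case without gap constraints, encodes sequences as tuples of item tuples, and uses a hypothetical set `L2` of frequent 2-length sequences; the function names are illustrative.

```python
def drop_one_item(seq):
    """Yield every (k-1)-length subsequence obtained by deleting one item."""
    for i, element in enumerate(seq):
        for j in range(len(element)):
            rest = element[:j] + element[j + 1:]
            # An element that becomes empty disappears from the sequence.
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]

def survives_pruning(candidate, frequent_prev):
    """Keep a candidate only if all its (k-1)-subsequences are frequent."""
    return all(sub in frequent_prev for sub in drop_one_item(candidate))

abg = (("a",), ("b",), ("g",))  # candidate {abg} in C3

# Hypothetical L2 in which {ab}, {bg} and {ag} are all frequent.
L2 = {(("a",), ("b",)), (("b",), ("g",)), (("a",), ("g",))}
print(survives_pruning(abg, L2))  # True, {abg} stays in C3

# Hypothetical L2 in which {ag} is not frequent.
L2_without_ag = {(("a",), ("b",)), (("b",), ("g",))}
print(survives_pruning(abg, L2_without_ag))  # False, {abg} is deleted
```

Because the subsequence supports were already computed in the previous level, this test needs no extra pass over the database.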
Challenges in Generalized Sequential Pattern Data Mining
GSP scans the database once per level, so the database is read many times. A large number of candidate sequences may have to be generated and counted, which makes mining frequent patterns computationally expensive. When the sequence database is very large and the patterns to be mined are long, GSP becomes inefficient.