PCY Algorithm in Big Data
Last Updated: 09 Jan, 2024
PCY was developed by Park, Chen, and Yu. It is used for frequent itemset mining when the dataset is very large.
What is the PCY Algorithm?
The PCY algorithm (Park-Chen-Yu algorithm) is a data mining algorithm used to find frequent itemsets in large datasets. It is an improvement over the Apriori algorithm and was first described in 1995 in the paper "An Effective Hash-Based Algorithm for Mining Association Rules" by Jong Soo Park, Ming-Syan Chen, and Philip S. Yu.
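The PCY algorithm uses hashing to count itemset frequencies efficiently and reduce the overall computational cost. The basic idea is to hash each pair of items to a bucket during the first pass over the data and to keep only one count per bucket in a hash table. A pair can be frequent only if the count of its bucket reaches the support threshold, so pairs that fall into infrequent buckets can be discarded before they are counted individually.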
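The hashing idea above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `pcy_first_pass`, its parameters, and the three-transaction `demo` input are assumptions made for this example. The hash function (i*j) mod num_buckets is the one used in the example problem in this article.

```python
from collections import Counter
from itertools import combinations


def pcy_first_pass(transactions, num_buckets, min_support):
    """Count single items and hashed pair buckets in one scan of the data."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        # Hash every pair in the transaction. Only the bucket count is kept,
        # not the pair itself, which is what saves memory.
        for i, j in combinations(items, 2):
            bucket_counts[(i * j) % num_buckets] += 1
    # One bit per bucket: 1 if the bucket count reached the threshold.
    bitmap = [1 if count >= min_support else 0 for count in bucket_counts]
    return item_counts, bucket_counts, bitmap


# Tiny illustrative input (hypothetical, not the article's dataset).
demo = [[1, 2, 3], [2, 3, 4], [1, 3]]
item_counts, bucket_counts, bitmap = pcy_first_pass(
    demo, num_buckets=10, min_support=2
)
print(bucket_counts)
print(bitmap)
```

With this input, bucket 2 reaches the threshold because (1, 2) and (3, 4) both hash to it, although each of those pairs occurs only once. A set bit therefore only marks a pair as a candidate; its exact count still has to be checked.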
Example problem solved using PCY algorithm
Problem:
Apply the PCY algorithm to the following transactions to find the candidate sets (frequent pairs), using a minimum support threshold of 3 and the hash function (i*j) mod 10.
T1 = {1, 2, 3}
T2 = {2, 3, 4}
T3 = {3, 4, 5}
T4 = {4, 5, 6}
T5 = {1, 3, 5}
T6 = {2, 4, 6}
T7 = {1, 3, 4}
T8 = {2, 4, 5}
T9 = {3, 4, 6}
T10 = {1, 2, 4}
T11 = {2, 3, 5}
T12 = {2, 4, 6}
Approach:
Follow these steps to build the candidate table.
Step 1: Find the frequency of each item and remove the items whose frequency is below the threshold.
Step 2: Going through the transactions one by one, create all the possible pairs and write the frequency of each pair next to it. Note: Do not repeat a pair that was already written for an earlier transaction.
Step 3: List all pairs whose frequency is greater than or equal to the threshold and apply the hash function to each of them. The result is the bucket number, which determines the bucket the pair is placed in.
Step 4: This is the last step. Create a table with the following columns -
- Bit vector - 1 if the frequency of the candidate pair is greater than or equal to the threshold, otherwise 0. Because step 3 keeps only the pairs that meet the threshold, the value is 1 for every row.
- Bucket number - found in the previous step.
- Highest support count - the frequency of the candidate pair, found in step 2.
- Pairs - the candidate pair is listed here.
- Candidate set - if the bit vector is 1, the pair is written here.
Solution:
Step 1: Find the frequency of each item and remove the items whose frequency is below the threshold.
Items | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Frequency | 4 | 7 | 7 | 9 | 5 | 4 |
Every item occurs at least 3 times, so no item is removed.
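Step 1 can be reproduced with a few lines of Python. This is a sketch that assumes each transaction is stored as a set; the names `transactions`, `MIN_SUPPORT`, `item_counts`, and `frequent_items` are chosen for this example only.

```python
from collections import Counter

# The twelve transactions T1..T12 from the problem statement.
transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6},
    {1, 3, 5}, {2, 4, 6}, {1, 3, 4}, {2, 4, 5},
    {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]
MIN_SUPPORT = 3

# Count how many transactions contain each item.
item_counts = Counter(item for t in transactions for item in t)

# Keep only the items that meet the support threshold.
frequent_items = sorted(
    item for item, count in item_counts.items() if count >= MIN_SUPPORT
)

for item in sorted(item_counts):
    print(item, item_counts[item])
print("frequent items:", frequent_items)
```

A quick sanity check is that the counts add up to 36, since each of the 12 transactions holds 3 items.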
Step 2: Going through the transactions one by one, create all the possible pairs and write the frequency of each pair next to it. A pair is listed only for the first transaction in which it appears.
Transaction | New pairs | Frequency |
---|---|---|
T1 | {(1, 2), (1, 3), (2, 3)} | 2, 3, 3 |
T2 | {(2, 4), (3, 4)} | 5, 4 |
T3 | {(3, 5), (4, 5)} | 3, 3 |
T4 | {(4, 6), (5, 6)} | 4, 1 |
T5 | {(1, 5)} | 1 |
T6 | {(2, 6)} | 2 |
T7 | {(1, 4)} | 2 |
T8 | {(2, 5)} | 2 |
T9 | {(3, 6)} | 1 |
T10 | - | - |
T11 | - | - |
T12 | - | - |
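The pair counting in step 2 can be checked with the sketch below. It assumes the same set-based representation of the transactions; `pair_counts` and `first_seen` are names introduced for this example, with `first_seen` recording the first transaction in which each pair occurs so that no pair is listed twice.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6},
    {1, 3, 5}, {2, 4, 6}, {1, 3, 4}, {2, 4, 5},
    {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]

pair_counts = Counter()
first_seen = {}
for number, transaction in enumerate(transactions, start=1):
    # Sorting gives each pair a canonical (smaller, larger) form.
    for pair in combinations(sorted(transaction), 2):
        pair_counts[pair] += 1
        # Remember only the first transaction that introduced the pair.
        first_seen.setdefault(pair, f"T{number}")

for pair, label in first_seen.items():
    print(label, pair, pair_counts[pair])
```

Each transaction of three items contributes three pairs, so the pair counts must also add up to 36.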
Step 3: List all pairs whose frequency is greater than or equal to the threshold and apply the hash function to each of them. (This gives the bucket number.)
Hash Function = (i * j) mod 10
(1,3) = (1*3) mod 10 = 3
(2,3) = (2*3) mod 10 = 6
(2,4) = (2*4) mod 10 = 8
(3,4) = (3*4) mod 10 = 2
(3,5) = (3*5) mod 10 = 5
(4,5) = (4*5) mod 10 = 0
(4,6) = (4*6) mod 10 = 4
Bucket no. | Pair |
---|---|
0 | (4,5) |
2 | (3,4) |
3 | (1,3) |
4 | (4,6) |
5 | (3,5) |
6 | (2,3) |
8 | (2,4) |
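The bucket assignment in step 3 is a direct application of the hash function. The sketch below assumes the seven frequent pairs found in step 2; the helper name `bucket_of` and the `buckets` dictionary are illustrative only.

```python
def bucket_of(pair, num_buckets=10):
    """Hash a pair (i, j) to a bucket number with (i * j) mod num_buckets."""
    i, j = pair
    return (i * j) % num_buckets


# Pairs whose frequency is at least the threshold of 3.
frequent_pairs = [(1, 3), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (4, 6)]

# Group the pairs by bucket number.
buckets = {}
for pair in frequent_pairs:
    buckets.setdefault(bucket_of(pair), []).append(pair)

for number in sorted(buckets):
    print(number, buckets[number])
```

Here every frequent pair lands in its own bucket, but the hash function is not collision-free: the infrequent pairs (1,2) and (2,6) also hash to bucket 2, the same bucket as (3,4).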
Step 4: Prepare the candidate set
Bit Vector | Bucket No. | Highest Support Count | Pairs | Candidate Set |
---|---|---|---|---|
1 | 0 | 3 | (4,5) | (4,5) |
1 | 2 | 4 | (3,4) | (3,4) |
1 | 3 | 3 | (1,3) | (1,3) |
1 | 4 | 4 | (4,6) | (4,6) |
1 | 5 | 3 | (3,5) | (3,5) |
1 | 6 | 3 | (2,3) | (2,3) |
1 | 8 | 5 | (2,4) | (2,4) |
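The whole worked example can be tied together in one function. This is a sketch of the simplified procedure used in this article, not a full two-pass PCY implementation; the function name `candidate_table` and its parameters are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def candidate_table(transactions, min_support, num_buckets=10):
    """Return (bit, bucket, support, pair) rows for the frequent pairs."""
    # Step 2: count every pair across all transactions.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rows = []
    for (i, j), support in pair_counts.items():
        # Step 3: keep pairs that meet the threshold and hash them.
        if support >= min_support:
            bucket = (i * j) % num_buckets
            # Step 4: the bit is 1 because the pair met the threshold.
            rows.append((1, bucket, support, (i, j)))
    # Order the rows by bucket number, as in the table.
    return sorted(rows, key=lambda row: row[1])


transactions = [
    {1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6},
    {1, 3, 5}, {2, 4, 6}, {1, 3, 4}, {2, 4, 5},
    {3, 4, 6}, {1, 2, 4}, {2, 3, 5}, {2, 4, 6},
]
rows = candidate_table(transactions, min_support=3)
for bit, bucket, support, pair in rows:
    print(bit, bucket, support, pair)
```

Counting exact pair frequencies up front is affordable for twelve transactions. On a large dataset, the point of PCY is to avoid exactly that by keeping only bucket counts in the first pass.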