Domain Specific Hierarchical Huffman Encoding

[Link], [Link] Manik, [Link], [Link]

Department of Computer Science and Engineering,


Indian Institute of Information Technology, Design and Manufacturing, Kancheepuram, Chennai, India.
{coe09b007,coe09b008,sadagopan,sivaselvanb}@[Link]

Abstract. In this paper, we revisit the classical data compression problem for domain specific texts. It is well known that the classical Huffman algorithm is optimal with respect to prefix encoding and that the compression is done at the character level. Since many data transfers are domain specific, for example, downloading of lecture notes, web-blogs, etc., it is natural to think of data compression in larger dimensions (i.e., at the word level rather than the character level). Our framework employs a two-level compression scheme in which the first level identifies frequent patterns in the text using classical frequent pattern algorithms. The identified patterns are replaced with special strings and, to achieve a better compression ratio, the length of a special string is ensured to be shorter than the length of the corresponding pattern. After this transformation, we employ the classical Huffman data compression algorithm on the resultant text. In short, in the first level compression is done at the word level and in the second level at the character level. Interestingly, this two-level compression technique for domain specific text outperforms the classical Huffman technique. To support our claim, we present both theoretical and simulation results for domain specific texts.

1 Introduction

Data transfer between a pair of nodes is a fundamental problem in computer networks. Data compression is a technique that speeds up data transfer by compressing the data at the sender; the original data is recovered at the receiver by employing a decompression technique. Data compression (decompression) is a classical problem in computer science; it has attracted many researchers in the past, and the most popular technique is due to Huffman. The Huffman data compression technique performs character-level compression and assumes nothing about the underlying domain. Huffman's approach is the following: assign a shorter code to a character that occurs more often in the text to be compressed. Interestingly, this approach is optimal with respect to prefix encoding. With the advent of data mining, and in particular its data compression perspective, one looks at the text from a larger dimension and focuses on identifying patterns (words) that occur frequently in the text. This line of research was initiated in [9,10]. In both approaches there is little assumption about the input text, and the patterns to be searched are drawn precisely from standard dictionary words. However, many data transfer operations are domain specific, for example, downloading of lecture notes, web-blogs, etc. Moreover, we noticed that the data available on web servers (academic servers) are tagged or classified according to the domain from which the text is derived; for example, the blog websites or news aggregators running on the web servers display posts (news) tagged with the domain of the text. A technical blog on programming (computer science) containing posts related to computer science can be better compressed and sent to the readers. These observations motivate us to look at the data compression perspective for domain specific text. Moreover, the existing approaches identify patterns using a dictionary as the reference, which is not efficient enough for domain specific text, as most of the domain specific keywords do not appear in the dictionary. This calls for a different data compression approach for domain specific text. Therefore, the combinatorial problem at hand is to compress a domain specific text based on the frequency of patterns generated from the text, and the objective is to maximize the compression ratio by minimizing the size of the file to be transferred between the sender and the receiver.
Related Work: The study of data compression was initiated by Huffman [3]. His technique focuses on character-level compression using the frequency count of the characters in the text. Huffman's result is very well known in the literature and is in use even today. In [5], the possibility of data compression by replacing certain characters in words with a special character '*' and retaining a few characters so that the word is still retrievable unambiguously was discussed. It is important to note that any compression technique must be lossless; towards this end, Witten et al. [6] proposed a language model for representing the data efficiently through the identification of new tokens and tokens in the context of the text under consideration. A slightly different approach, inspired by Pitman Shorthand, was adopted in [7]. The approach is to encode a group of 2-3 successive text characters into a single code. In [7], it is also shown that further application of Huffman coding on the generated codes is possible and is expected to result in greater compression. As far as pattern learning and discovery algorithms are concerned, apart from classical frequent pattern mining algorithms [2,1], the use of genetic algorithms to arrive at rules or hypotheses for pattern learning in text compression is presented in [4]. Searching for fixed-length patterns in a text can be done very efficiently using the 'TARA' algorithm proposed in [8]. A two-level dictionary-based text compression scheme is proposed in [10]. It involves the transformation of the original text with a dictionary of fixed frequent English words. The disadvantage is the need for a huge dictionary to be present at both the compressor and the de-compressor.
Our work: In this paper, we present a data compression algorithm for domain specific text. As mentioned before, many data transfers are domain specific, and it is natural to think of domain specific data compression algorithms. We propose a framework for compression that works for any domain in general. For our discussion purposes, we work with the computer science domain. Our framework employs a two-level compression scheme in which the first level identifies frequent patterns in the text using classical frequent pattern algorithms. The identified patterns are replaced with special strings and, to ensure a better compression ratio, the length of a special string is shorter than the length of the corresponding pattern. After this transformation, we employ the classical Huffman data compression algorithm on the resultant text. In short, in the first level compression is done at the word level and in the second level at the character level. Interestingly, this two-level compression technique for domain specific text outperforms the classical Huffman technique. To support our claim, we present both theoretical and simulation results for domain specific texts.
Road map: In Section 2, we present the theoretical results of our framework. Simulation results of our proposed algorithm are presented in Section 3. We conclude with a flowchart describing how our proposed approach is employed in the compression and decompression stages during data transfer.

2 Hierarchical Huffman Encoding: Theory and Simulation

In this section, we present a theoretical study of Hierarchical Huffman encoding followed by simulation results. We first discuss our approach to finding frequent patterns, following which we present a polynomial-time algorithm for Hierarchical Huffman encoding; it is polynomial in the size of the input text. In the subsequent sections, we present an implementation using a Hash table data structure, followed by a run-time analysis of the algorithm. Further, we show that Hierarchical Huffman encoding achieves a better compression ratio than classical Huffman for domain specific text. We support our theoretical study by a thorough simulation of the Hierarchical Huffman algorithm for various domain specific text inputs. Our findings, both theoretical and experimental, reveal that the proposed two-level compression outperforms the classical Huffman encoding algorithm.

2.1 Identifying Frequent Patterns from the Text

We mine the input text to identify frequent patterns. We make use of the Python Natural Language Processing Toolkit (NLTK).

– The NLTK toolkit supports reading the training data into the so-called Corpus Reader, a specialized object class that enables faster access to large text files stored in secondary memory.
– The training data are loaded into the Corpus Reader in NLTK along with the parameters Length and Frequency. The clustering is done based on these parameter values: the words extracted from the text are clustered into four clusters.

– The four clusters are infrequent short, infrequent long, frequent short, and frequent long. The NLTK-based clustering algorithm is designed to select only the frequent long cluster. The other three clusters are avoided because their contribution to the improvement of the compression ratio is minimal; i.e., the infrequent long and infrequent short clusters are not replaced with special strings and are simply passed to the next stage of the algorithm. Similarly, the frequent short patterns are also not replaced with special strings, as the overhead of replacing a short pattern with a special string exceeds the benefit of leaving the pattern as such in the text.
– Since our approach is for domain specific text, we also append the keywords (frequent patterns) from standard textbooks to the corpus generated by the NLTK algorithm. The complete set of frequent patterns is then sorted in the order of their length and written to a file for further processing. A sketch of this clustering step is given below.
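To make this concrete, the following Python sketch illustrates the clustering step. It is a minimal sketch, not the exact NLTK pipeline: the thresholds MIN_LEN and MIN_FREQ and all function names are our assumptions (the paper fixes neither parameter), and a plain word_tokenize pass stands in for the Corpus Reader.

# A minimal sketch of the clustering step; MIN_LEN and MIN_FREQ are assumed
# illustrative values, not parameters prescribed by the paper.
from collections import Counter
from nltk.tokenize import word_tokenize  # may require nltk.download('punkt')

MIN_LEN = 3    # assumed Length threshold: "long" means at least 3 characters
MIN_FREQ = 10  # assumed Frequency threshold: "frequent" means >= 10 occurrences

def cluster_words(text):
    """Bucket the words of the text into the four clusters."""
    counts = Counter(word_tokenize(text))
    clusters = {"infrequent_short": [], "infrequent_long": [],
                "frequent_short": [], "frequent_long": []}
    for word, freq in counts.items():
        key = ("frequent_" if freq >= MIN_FREQ else "infrequent_") + \
              ("long" if len(word) >= MIN_LEN else "short")
        clusters[key].append(word)
    return clusters

def frequent_long_patterns(text, domain_keywords=()):
    """Keep only the frequent long cluster, append the domain keywords,
    and sort by length, as described above."""
    patterns = set(cluster_words(text)["frequent_long"])
    patterns.update(k for k in domain_keywords if len(k) >= MIN_LEN)
    return sorted(patterns, key=len)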

Fig. 1. Flowchart of the Mining Algorithm to Find Clusters

2.2 Hierarchical Huffman Algorithm


The Hierarchical Huffman algorithm for domain specific texts is presented in Algorithm 1.

Algorithm 1 Hierarchical Huffman Encoding: Hierarchical-Huffman(Text T)
1: Get patterns of length at least 3 using a Frequent Pattern Algorithm to populate PATTERN-DATABASE
2: Also, get keywords of length at least 3 from standard texts (specific to the domain) to populate PATTERN-DATABASE
3: Let P = {P_1, P_2, ..., P_k} denote the set of patterns such that |P_i| ≥ 3 and let R = {r_1, ..., r_k} denote the set of replacement strings for encoding
4: while text T is not exhausted do
5:   for each word w of T, check whether w is a pattern in P do
6:     if w is P_i for some i then
7:       Replace P_i by r_i in T
8:     else
9:       Retain w as is
10:    end if
11:  end for
12: end while
13: Let T_level1 be the updated text of T
14: Call Classical-Huffman(T_level1)
15: Let T_H be the Huffman tree corresponding to T_level1 and C_H be the corresponding codes for encoding
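For concreteness, the sketch below gives a runnable Python rendering of Algorithm 1. It is a minimal sketch under stated assumptions: the function names are ours, the word-level pass splits on single spaces, the second level is a standard heapq-based Huffman construction, and the encode table (pattern P_i to string r_i) is assumed to be built as in Section 2.3.

# A minimal runnable sketch of Algorithm 1 (our naming; the replacement pass
# and the heapq-based Huffman step stand in for the paper's implementation).
import heapq
from collections import Counter

def level1(text, encode):
    """Level 1: replace every frequent pattern P_i by its special string r_i."""
    return " ".join(encode.get(w, w) for w in text.split(" "))

def classical_huffman(text):
    """Level 2: classical Huffman; returns a code table char -> bitstring."""
    heap = [(freq, i, {ch: ""})
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)   # two least-frequent subtrees
        f1, _, c1 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c0.items()}
        merged.update({ch: "1" + code for ch, code in c1.items()})
        heapq.heappush(heap, (f0 + f1, i, merged))
        i += 1
    return heap[0][2]

def hierarchical_huffman(text, encode):
    """Both levels: pattern replacement followed by classical Huffman."""
    t1 = level1(text, encode)
    codes = classical_huffman(t1)
    return "".join(codes[ch] for ch in t1), codes

# Hypothetical usage: one pattern, 'algorithm', replaced by the code '~a'.
bits, codes = hierarchical_huffman("the algorithm runs the algorithm",
                                   {"algorithm": "~a"})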

2.3 Implementation of Hierarchical-Huffman


– We maintain a Hash table to store the set P of patterns obtained. Against each pattern P_i, we store the replacement string r_i. For a given P_i, the location in the Hash table is identified using the function MAP-CONTAINER() available in UNIX systems.
– At the decoding stage, for a given r_i, we can uniquely retrieve P_i using a bijective function.
– To implement classical Huffman, we make use of a Min-heap data structure. A sketch of the pattern table appears below.
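The following Python sketch shows one way to realize the pattern table. It is an illustration under assumptions: a Python dict plays the role of the Hash table (Python's built-in hashing stands in for MAP-CONTAINER()), and the special strings are '~'-prefixed base-62 codes, assuming the character '~' does not occur in the input text.

# A minimal sketch of the pattern table; the dict and the '~'-prefixed codes
# are our assumptions, not the paper's exact data layout.
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols for short codes

def make_code(i):
    """Encode index i in base 62; code i is never longer than code j for i < j."""
    digits = ""
    while True:
        i, rem = divmod(i, len(ALPHABET))
        digits = ALPHABET[rem] + digits
        if i == 0:
            return "~" + digits  # '~' marks a replacement string

def build_tables(patterns):
    """patterns: frequent long patterns, pre-sorted by increasing length,
    so that shorter patterns receive the shorter codes."""
    encode = {}  # pattern P_i -> replacement string r_i
    for i, p in enumerate(patterns):
        r = make_code(i)
        if len(r) < len(p):          # keep the invariant |r_i| < |P_i|
            encode[p] = r
    decode = {r: p for p, r in encode.items()}  # the bijective inverse
    return encode, decode

Since the patterns arrive sorted by increasing length and the codes grow monotonically with the index, the invariants |r_i| < |P_i| and |r_i| ≤ |r_j| for i < j hold by construction.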

2.4 Run-time Analysis of Hierarchical-Huffman


Let the size of the text T be n, where n is the number of characters in T. Clearly, |P| ≤ n. On average, the Hash table operations Insert and Search take O(1) time. The operations supported by the Min-heap for Classical
Huffman can be implemented in O(n log n) time in total. Therefore, the overall time complexity is O(n log n). In the worst case, a Hash table operation incurs O(n) time, n being the size of the Hash table; the overall run-time is still O(n log n). A B-tree can also be used to maintain the set of patterns instead of the Hash table. For B-trees, the dictionary operations incur O(log n) time; note that the overall run-time is still O(n log n).

2.5 Hierarchical Huffman Outperforms Classical Huffman

In this section, we show that Hierarchical Huffman achieves a better compression ratio than classical Huffman
for domain specific input text.
Theorem 1. Hierarchical Huffman Outperforms Classical Huffman.

Proof. Consider the set P = {P_1, P_2, ..., P_k} of patterns such that |P_i| ≥ 3. Since each P_i occurs at random in the text T, we denote the number of occurrences of each P_i by a random variable: let X_i denote the number of occurrences of P_i in T. In T_level1, each P_i is replaced with a special string r_i such that |r_i| ≤ |P_i|. This is an invariant maintained by our algorithm during each scan of the text T. Moreover, patterns of length at most two are discarded and not stored in the Hash table. The set P is organized in such a way that for each P_i and P_j, i < j, the replacement strings r_i and r_j satisfy |r_i| ≤ |r_j|. Therefore, the size of the text resulting from the first-level compression is as follows.

– |T_level1| = |r_1| · X_1 + . . . + |r_k| · X_k + |T_infrequent| + |T_shortpattern|, where T_shortpattern denotes the patterns of length at most 2 and T_infrequent denotes the patterns that are infrequent.
– The compressed text T_level1 contains replacement strings for patterns which are frequent (say, for example, occurring at least 10 times in T), and each such pattern is of length at least 3 in T.
– Since |r_i| ≤ |P_i|, we have |T_level1| ≤ |T|, where |T| is the original text size; that is, the size of the text resulting from the first-level compression (replacement of patterns by special strings) is at most the size of the input text.
– In the second-level compression, T_level1 is given as input to Classical Huffman. It is a well-known fact that Classical Huffman is optimal with respect to prefix encoding. Since the first level never increases the size of the text, the optimal prefix encoding of T_level1 is at most the size of the optimal prefix encoding of T, and Hierarchical Huffman is therefore at least as good as Classical Huffman. Hence, the claim follows. ⊓⊔

Inference: Note that if |r_i| < |P_i| for many P_i's and the frequency count of each P_i is also high, then the size of the compressed text satisfies |T_level1| ≪ |T|.
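As a hypothetical numerical illustration, suppose a single pattern P_1 with |P_1| = 10 occurs X_1 = 200 times and is replaced by r_1 with |r_1| = 2. The replaced portion then shrinks from 10 · 200 = 2000 characters to 2 · 200 = 400 characters, a saving of 1600 characters before the second-level Huffman pass is even applied.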

Compression Ratio: Compression Ratio = (Performance of Hierarchical Huffman) / (Performance of Classical Huffman), where
Performance of Hierarchical Huffman = (Two-level compression on T) / (Input text T), and
Performance of Classical Huffman = (Classical Huffman on T) / (Input text T).
Essentially, Compression Ratio = (Two-level compression on T) / (Classical Huffman on T).
Two-level compression on T refers to the replacement of the P_i's by the r_i's followed by Classical Huffman. Clearly, the lower the ratio, the better the two-level compression.
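For example, with hypothetical numbers: if the two-level scheme compresses a 1000 kB text to 300 kB while Classical Huffman alone yields 400 kB, the performances are 300/1000 = 0.3 and 400/1000 = 0.4 respectively, and the compression ratio is 0.3/0.4 = 0.75 < 1, i.e., the two-level scheme wins.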

3 Simulation of Hierarchical-Huffman Algorithm


In this section, we validate our theoretical study by a thorough simulation of Hierarchical Huffman for different input texts. We have considered input texts from the computer science domain. Keywords from standard computer science texts are also considered as patterns for the study. Our simulation includes various text files whose sizes range from 500 kB to 2 MB. Various plots and findings from the simulation are given below:
– Figure 2 illustrates the performance of Hierarchical Huffman and Classical Huffman for different text inputs. From the plot we infer that, for large input texts, the performance of Hierarchical Huffman is much better than that of Classical Huffman. This is because, for large input texts, the frequency of patterns increases, which leads to a reduction in the size of the text output by the two-level compression routine.

Fig. 2. Input Text vs Performance

– In Figure 3, we show the plot of compression ratio against various input texts. Recall that our theoretical study inferred that the larger the input text, the better the compression ratio. This is precisely evident in Figure 3 as well.
– Theoretical analysis reveals that, to get a good compression ratio, the input text must contain smaller patterns with large frequencies. For input texts with large patterns, Hierarchical Huffman appears to be only about as good as Classical Huffman; this observation is clearly evident from our simulation study and is illustrated in Figure 4.

Fig. 3. Input Text vs Compression Ratio

Fig. 4. Comparison of Two Techniques for Large Patterns

– Note that during data transmission, the Hash table of reference strings for the patterns is sent along with the input text. Since the downloads are domain specific, it is sufficient to send the Hash table exactly once. Although this is an overhead, the gain becomes visible when the number of downloads is large. This is precisely illustrated in Figure 5. Due to this overhead, if the number of downloads is small, then Classical Huffman performs better than Hierarchical Huffman. There is a critical point, which denotes the minimum number of downloads beyond which Hierarchical Huffman outperforms Classical Huffman.
– As mentioned before, shorter patterns with a good frequency count give a better compression ratio than larger patterns with the same frequency count. This observation is validated for different texts and is illustrated in Figure 6. The plot is obtained by taking the average of the compression ratios for different input texts against the vector (length of the pattern, frequency of the pattern).

Fig. 5. Identifying Critical Point for the two Compression Schemes

Fig. 6. Effects of Shorter vs Larger Patterns on the Compression Ratio

3.1 Conclusions and Further Research

In this paper, we have proposed a framework for data compression of domain specific input text. Our framework involves two-level compression in which the first level operates at the word level and the second at the character level. We have also shown that this two-level approach outperforms classical Huffman for domain specific input text. All our claims are supported by theoretical results and validated with a thorough simulation. An interesting problem for further research is to analyse our two-level scheme for video compression.

References
1. Naren Ramakrishnan, Ananth Grama: Data Mining: From Serendipity to Science. IEEE Computer 32(8): 34-37 (1999).
2. Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000, ISBN 1-55860-489-8.
3. Alfred V. Aho, John E. Hopcroft: Data Structures and Algorithms. Academic Press, 1990.
4. David E. Goldberg: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989, ISBN 0-201-15767-5.
5. Robert Franceschini, Amar Mukherjee: Data Compression Using Encrypted Text. ADL 1996: 130-138.
6. Ian H. Witten, Zane Bray, Malika Mahoui, W. J. Teahan: Text Mining: A New Frontier for Lossless Compression. Data Compression Conference 1999: 198-207.
7. P. Nagabhushan, S. Murali: Pitman Shorthand Inspired Model for Plain Text Compression. ICDAR 2001: 132.
8. TARA: An Algorithm for Fast Searching of Multiple Patterns on Text Files. Technical Report, Turkish Army Gendarme Headquarters, Bestepe, Ankara, Turkey, 2007.
9. Weifeng Sun, Nan Zhang, Amar Mukherjee: A Dictionary-Based Multi-Corpora Text Compression System. DCC 2003: 448.
10. Md. Ziaul Karim Zia, Dewan Md. Fayzur Rahman, Chowdhury Mofizur Rahman: Two-Level Dictionary Based Text Compression Scheme. Proceedings of the 11th International Conference on Computer and Information Technology (ICCIT 2008), 2008, pp. 13-18.
