A Compression-Boosting Transform for Two-Dimensional Data

* This project was supported in part by NSF CAREER IIS-0447773 and NSF DBI-0321756. AM was supported in part by the Paul Ivanier Center for Robotics Research and Production Management. A one-page abstract about this work appeared in the Proceedings of the Data Compression Conference, Snowbird, Utah, 2005.
1 Introduction
Every day massive quantities of two-dimensional data are produced, stored and
transmitted. Digital images are the most prominent type of data in this cat-
egory. However, matrices over finite alphabets are used to represent all sorts
of information, like graphs, database tables, geometric objects, etc. From the
compression standpoint, two-dimensional data has to be treated differently than
one-dimensional data. In order to obtain good compression of 2D data, one has
to exploit the dependencies (or equivalently, expose the redundancies) both be-
tween the rows and the columns of the matrix.
Lossless compression algorithms are typically composed of a pipeline of independent stages, usually ending with a statistical encoder. For example, the celebrated bzip2 employs a pipeline composed of the Burrows-Wheeler transform (BWT), a move-to-front encoder, and finally a Huffman compressor. Each step reorders the data so that redundancies get exposed and removed by the subsequent stages. The objective of the BWT is precisely to elucidate the dependencies between adjacent symbols in the original text string.
In this paper we propose an invertible transform for two-dimensional data over an alphabet Σ. For simplicity, we assume Σ = {0, 1}. The extension to larger alphabets is immediate and is not pursued here. The goal of the transform is to boost the compression achieved by later stages. The transform is described by a simple recursive algorithm, which can be outlined as follows.
Given the matrix to be transformed, first search for the largest columnwise-constant (resp., rowwise-constant) submatrix, that is, a submatrix identified by a subset of the rows and a subset of the columns (neither necessarily contiguous) whose columns (resp., rows) are constant (i.e., either all 0s or all 1s). Reorder the rows and the columns so that the columnwise-constant (or rowwise-constant) submatrix is moved to the upper-left corner of the matrix. Recursively apply the transform to the rest of the matrix. Stop the recursion when the partition produces a matrix smaller than a predetermined threshold (see Figure 3 for an illustration of this process).
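To make the outline concrete, here is a minimal Python sketch of the recursion; the helper name find_block, the threshold handling, and the particular split used below are illustrative placeholders, not the authors' implementation.

import numpy as np

def transform(X, r_min, c_min, find_block):
    """Recursively decompose a 0/1 matrix X into constant blocks.

    find_block(X) is any routine returning (rows, cols) of a large
    columnwise- or rowwise-constant submatrix, or None if no block of
    size at least r_min x c_min exists; the randomized search outlined
    later plays this role.  Index bookkeeping relative to the original
    matrix is omitted for brevity.
    """
    blocks, leftovers = [], []
    if X.shape[0] < r_min or X.shape[1] < c_min:
        return blocks, [X]              # too small: non-decomposable
    found = find_block(X)
    if found is None:
        return blocks, [X]              # no block above the threshold
    rows, cols = found
    blocks.append((rows, cols))
    rest_rows = np.setdiff1d(np.arange(X.shape[0]), rows)
    rest_cols = np.setdiff1d(np.arange(X.shape[1]), cols)
    # One possible (a+c, b)-style split of what remains after moving the
    # block to the upper-left corner.
    for piece in (X[np.ix_(rest_rows, np.arange(X.shape[1]))],
                  X[np.ix_(rows, rest_cols)]):
        b, l = transform(piece, r_min, c_min, find_block)
        blocks.extend(b)
        leftovers.extend(l)
    return blocks, leftovers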
The intriguing question is whether this somewhat simple matrix transformation helps compression. Arguments can be made both for and against. On the one
hand, each columnwise-constant (or rowwise-constant) submatrix can be rep-
resented compactly in a canonical form (first all the 0-columns, then all the
1-columns) by the list of its rows and columns. If a matrix can be decomposed
into a small number of large constant submatrices, one would expect an im-
provement in compressibility. On the other hand, while the transform groups
together portions of the matrix which are similar, the reordering can also break
the local dependencies that exist in the original matrix between adjacent rows
and columns. Breaking these dependencies could increase the entropy and have
a negative impact on the compressibility.
The contribution of this paper is twofold. First, we present a novel invertible transform for 2D data. The design of the transform went through a series of refinements, and here we present the result of that process (Section 5). We also studied the computational cost of the transform, which turns out to be unbalanced: the inverse transform is extremely fast and simple, whereas the forward transform is very expensive. Our compression-boosting phase is therefore suited to applications in which the data is compressed once and decompressed many times, such as large repositories of digital images in which images are stored and rarely modified.
The computational cost of the forward transform depends on the complexity
of finding the largest columnwise-constant/rowwise-constant submatrix. In [1] we
studied theoretically the general version of this problem. Although the problem
turns out to be NP-hard, a relatively simple randomized algorithm that has good
performance in practice was introduced in [1]. For completeness of presentation,
we will briefly outline the algorithm in Section 4. The interested reader can refer
to the original paper for more details.
Second, we study empirically the effects of the transform on the compressibil-
ity of two-dimensional data by comparing the performance of gzip and bzip2
before and after the application of the transform on synthetic data and digital
images. The preliminary results in Section 6 show that the transform boosts
compression.
In closing, we want to point out that since our transform simply changes the representation of the data and does not deal with the encoding problem (i.e., assigning bits to symbols), we are not proposing a complete data compression tool here. Also, since our transform is not optimized for digital images, it is not an image compression tool either. As said above, the primary use of our transform is as a preprocessing step that reorders the data so that the downstream compression with a standard lossless encoder is more efficient.
2 Related works
Since we are not proposing a new compression method, we will not delve into the vast literature on lossless image compression. There are, however, a few related works on the idea of reordering the columns and/or the rows of a matrix with the objective of reducing the storage space or the access time to the elements of the matrix.
In [3-5], the main concern is to compress database tables by exploiting the dependencies between the columns. In [3], Buchsbaum et al. observe that partitioning the table into blocks of columns and compressing each block of columns individually can improve compression. The key problem is to find the optimal partition of the columns. In the follow-up paper [4], the authors add the possibility of rearranging the columns. Their tool, called pzip, outperforms gzip by a factor of two on database tables. Along the same line of research, the authors of [5] introduce the k-transform, which captures the dependencies between k + 1 columns of a matrix. Although the problem of computing the k-transform for k >= 2 is NP-hard, the proposed polynomial-time heuristic for the 2-transform performs remarkably well compared to pzip, bzip2 and gzip.
The task of compressing boolean (sparse) matrices is also addressed in [6-10]. For example, in [9] the objective is to reorder the columns of a matrix so that the 1s in each row appear consecutively. Again, the problem of finding a reordering that minimizes the number of runs of 1s is NP-hard. The problem reduces to a traveling salesman problem, for which the authors propose a heuristic algorithm. In [10] the objective is to find a reordering of both the rows and the columns of a boolean matrix so that the matrix can be broken into homogeneous rectangles and the description complexity of those rectangles (called cross-association) is minimized. The problem is defined in an information-theoretic framework and a two-stage heuristic algorithm is proposed.
Example. Given the 6 x 6 matrix over the alphabet Σ = {0, 1}

X =
0 0 1 0 1 1
1 0 1 1 0 1
1 0 0 1 1 0
1 0 1 0 0 1
1 1 1 1 0 1
1 1 0 1 1 0

the selection (R, C) = ({2, 4, 5}, {1, 3, 5, 6}) results in the columnwise-constant submatrix

X[R,C] =
1 1 0 1
1 1 0 1
1 1 0 1

X[R,C] is the largest-area columnwise-constant submatrix in X.
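The example is easy to check; a small numpy snippet (the indices in the text are 1-based, the arrays below are 0-based):

import numpy as np

X = np.array([[0, 0, 1, 0, 1, 1],
              [1, 0, 1, 1, 0, 1],
              [1, 0, 0, 1, 1, 0],
              [1, 0, 1, 0, 0, 1],
              [1, 1, 1, 1, 0, 1],
              [1, 1, 0, 1, 1, 0]])

R = [1, 3, 4]        # rows {2, 4, 5} of the example, 0-based
C = [0, 2, 4, 5]     # columns {1, 3, 5, 6} of the example, 0-based

sub = X[np.ix_(R, C)]
print(sub)                              # every row reads 1 1 0 1
print(bool((sub == sub[0]).all()))      # True: all columns are constant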
The main computational problem is the following. Given a matrix X in {0, 1}^(n x m), find a constant submatrix with the largest area. This problem is strongly related to the Maximum Edge Biclique problem, since an n x m binary matrix can also be interpreted as the adjacency matrix of a bipartite graph. The biclique problem asks for the biclique with the maximum number of edges, which corresponds to the largest constant submatrix composed only of 1s. This problem was proved to be NP-hard in [11] by a reduction from 3Sat. The weighted version of the problem was shown to be NP-hard by Dawande et al. [12]. A 2-approximation algorithm based on LP-relaxation was given in [13].
Given that the problem of finding the largest constant submatrix of 1s is NP-hard, it is unlikely that a polynomial-time algorithm can be found. In [1] we introduced a randomized algorithm that finds the optimal solution with probability 1 - ε, where 0 < ε < 1.

Fig. 1. An example run of the randomized search: a random selection S of columns, a subset U of S, the strings induced by U with their counts, and the resulting rows R and constant columns C*.

For completeness of presentation, next we give a brief outline of the algorithm for the largest columnwise-constant submatrix. Rowwise-constant submatrices can be found along the same lines.
Recall that we are given a matrix X in {0, 1}^(n x m), and the objective of the algorithm is to discover a columnwise-constant submatrix X[R*,C*]. Let us assume that the submatrix X[R*,C*] is maximal. To simplify the notation, let r* = |R*| and c* = |C*|.
The key idea is the following. Observe that if we knew R*, then C* could be determined by selecting the constant columns with respect to R*. If instead we knew C*, then R* could be obtained by taking the maximal set of rows which read the same symbols on the columns C*. Unfortunately, neither R* nor C* is known. Our approach is to sample the matrix by randomly selecting subsets of columns (or rows), expecting that eventually one of the subsets will overlap with the solution (R*, C*).
In the following we describe how to retrieve the solution by sampling columns (one also has the choice to sample rows). First, select a subset S of size k uniformly at random from the set of columns {1, 2, . . . , m}. Assume for the time being that S ∩ C* is non-empty. If we knew S ∩ C*, then (R*, C*) could be determined by the following three steps: (1) select the string(s) w that appear exactly r* times in the rows of X[1:n, S ∩ C*], (2) set R* to be the set of rows in which w appears, and (3) set C* to be the set of constant columns corresponding to R*. An example is illustrated in Figure 1.
The algorithm would work, but there are a few problems that need to be solved. First, the set S ∩ C* could be empty. The solution is to try several different sets S, relying on the argument that the probability that S ∩ C* is non-empty at least once approaches one as more and more selections are made. The second problem is that we do not really know S ∩ C*. But certainly S ∩ C* is a subset of S, so our approach is to check all possible subsets of S. The final problem is that we assumed that we knew r*, but we do not. The solution is to introduce a row threshold parameter, called r, that replaces r*.
As it turns out, we need another parameter to avoid producing submatrices with a small area, which could potentially degrade the compressibility at later stages; this is the role of the column threshold c used in the pseudocode below.
Largest Columnwise Constant Submatrix(X, t, k, r, c)
Input: X is an n x m matrix over {0, 1}
       t is the number of iterations
       k is the selection size
       r, c are the thresholds on the number of rows and columns, resp.
1  repeat t times
2      select randomly a subset S of columns such that |S| = k
3      for all subsets U of S do
4          D <- all strings composed of 0s and 1s induced by X[1:n, U] that appear at least r times
5          for each string w in D
6              V <- the rows corresponding to w
7              Z <- all constant columns corresponding to V
8              if |Z| >= c then save (V, Z)
9  return the (V, Z) that maximizes |V| x |Z|
A lower bound on the number of iterations t that guarantees finding the largest columnwise-constant submatrix with probability at least 1 - ε is derived in [1]; the bound is a function of n, m, r, c, the selection size k, and ε.
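A direct, unoptimized Python rendering of the pseudocode above (a sketch for illustration; the authors' implementation and the exact analysis of t are in [1]):

import random
from itertools import combinations
from collections import defaultdict
import numpy as np

def largest_columnwise_constant_submatrix(X, t, k, r, c):
    """Randomized search on a 0/1 numpy matrix X.
    t: iterations, k: selection size, r, c: row/column thresholds.
    Returns (rows, cols) of the best block found, or None."""
    n, m = X.shape
    best, best_area = None, 0
    for _ in range(t):
        S = random.sample(range(m), k)
        for size in range(1, k + 1):
            for U in combinations(S, size):
                # Group the rows by the string they induce on the columns U.
                groups = defaultdict(list)
                for i in range(n):
                    groups[tuple(X[i, list(U)])].append(i)
                for w, V in groups.items():
                    if len(V) < r:
                        continue
                    # Columns that are constant with respect to the rows V.
                    rows_V = X[V, :]
                    Z = np.flatnonzero((rows_V == rows_V[0]).all(axis=0))
                    if len(Z) >= c and len(V) * len(Z) > best_area:
                        best, best_area = (V, Z.tolist()), len(V) * len(Z)
    return best

On the 6 x 6 example of the previous section, a call such as largest_columnwise_constant_submatrix(X, t=100, k=3, r=3, c=4) typically recovers the 3 x 4 block, i.e., rows [1, 3, 4] and columns [0, 2, 4, 5] in 0-based indexing.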
5 Forward transform
As mentioned in the introduction, our strategy to boost the compressibility of
two-dimensional data is to recursively decompose the input matrix based on the
presence of large columnwise-constant or rowwise-constant submatrices found by
the randomized search described above. The input to the recursive decomposition
algorithm is the original matrix X along with user-defined thresholds (r and c)
and the number of iterations t. If one fixes ε, then t can be chosen according to the bound derived in [1].
The recursive decomposition is carried out as follows. First, the procedure
Largest Columnwise Constant Submatrix (and possibly also the proce-
dure Largest Rowwise Constant Submatrix) is run on X. If a constant
submatrix is found, the rest of the matrix is partitioned into two submatrices
depending on the size of the constant submatrices discovered at the next recur-
sion level, as illustrated in Figure 3.
The decision whether to choose the partition (a+c, b) or the partition (a, b+c) depends on the sizes of the constant submatrices found in the resulting matrices a+c, b, a, and b+c. Let A1, A2, B1, B2 denote the areas of the constant submatrices found in a+c, b, a, and b+c, respectively. Based on the values of A1, A2, B1, and B2 we studied three distinct criteria to determine the partition. The first is based on the condition A1 + A2 > B1 + B2 (hereafter called sum). The second and the third tests are max{A1, A2} > max{B1, B2} (called max) and A1 > B2 (called indiv; note that in this latter case we do not need to search in b and a). In all cases, if the test is true the algorithm chooses the partition (a+c, b); otherwise, it chooses the partition (a, b+c). A discussion of how the test type affects the final compressed size is reported in Section 6.
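As a sketch, the three tests can be written as a small decision function (the function and argument names are illustrative):

def choose_partition(A1, A2, B1, B2, test="indiv"):
    """Decide between the (a+c, b) and (a, b+c) decompositions.

    A1, A2 are the areas of the constant submatrices found in a+c and b;
    B1, B2 are those found in a and b+c (see Figure 3)."""
    if test == "sum":
        keep_first = A1 + A2 > B1 + B2
    elif test == "max":
        keep_first = max(A1, A2) > max(B1, B2)
    else:                      # "indiv": only a+c and b+c are searched
        keep_first = A1 > B2
    return "(a+c, b)" if keep_first else "(a, b+c)"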
Once the partition is determined, the randomized search is performed re-
cursively on the newly formed matrices in the same manner. The recursion
stops when the matrix becomes non-decomposable. We say that a matrix is
non-decomposable if it has fewer than r rows or fewer than c columns, or if the largest constant submatrix contained in it is smaller than r x c.
The reason behind our choice of splitting into (a+c, b) or (a, b+c), instead of (a, b, c), is the following. Each time the algorithm partitions the matrix, we risk splitting large constant submatrices that could have been found later. The smaller the number of pieces we split the matrix into, the higher the chances of finding large constant submatrices. Experimental results (not shown) confirmed our choice.
It should be noted that the user-defined thresholds (r and c) play an im-
portant role in the transform. If the thresholds are too low, there is a danger
of having a deep recursion tree and potentially finding a large number of tiny
constant submatrices. If the thresholds are too high, there will be just a few con-
stant submatrices. Both cases will have a negative impact on the compression.
An experimental study regarding the choice of these thresholds is reported in
Section 6.
Fig. 3. Illustration of one step of the forward transform. Depending on the sizes of the constant submatrices in a+c, b, a, and b+c, either the decomposition (a+c, b) or (a, b+c) is chosen.

We now describe how the transformed data is represented. Clearly, each constant submatrix can be represented very succinctly. The column indices of columnwise-constant submatrices are reordered so that each row reads 00...0011...11. Thus,
each constant submatrix can be represented by its lists of row and column indices, together with the position where the transition from 0 to 1 takes place. Non-decomposable
submatrices are saved contiguously in row-major order. The content of non-
decomposable matrices is saved in a file called string.
Row and column indices of constant and non-decomposable submatrices are
saved in another file called index. For each set of row and column indices, the
first index is saved as it is, while the rest is saved as differences between adjacent
indices. The length file is used to record the number of rows and the number of
columns for constant and non-decomposable submatrices, along with a binary
flag to indicate whether the submatrix is constant or non-decomposable.
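For instance, the delta encoding of an index list and its inverse are one-liners; a minimal sketch (the actual on-disk format of index is not specified beyond this description):

def delta_encode(indices):
    """First index as-is, then differences between adjacent indices."""
    return [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

print(delta_encode([2, 4, 5]))      # [2, 2, 1]
print(delta_encode([1, 3, 5, 6]))   # [1, 2, 2, 1]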
The information contained in the files string, index and length allows one
to invert the transform and reconstruct the original matrix. The inverse trans-
form is simple and extremely fast. Basically, the matrix is reconstructed element
by element in the order of the indices stored in index. The inverse transform was
implemented and tested to make sure that we had all the information necessary
to recover the original matrix. The time complexity of the inverse transform is
linear in the size of the input.
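To illustrate why the inverse is so simple, the following hypothetical snippet rebuilds one columnwise-constant block from its stored description (row indices, column indices in canonical order, and the 0-to-1 transition position); it is a sketch, not the authors' decoder.

import numpy as np

def place_constant_block(X, rows, cols, transition):
    """Write a columnwise-constant block back into X: in canonical order,
    the first `transition` columns are all 0 and the rest are all 1."""
    X[np.ix_(rows, cols[:transition])] = 0
    X[np.ix_(rows, cols[transition:])] = 1

X = np.zeros((6, 6), dtype=int)
place_constant_block(X, rows=[1, 3, 4], cols=[4, 0, 2, 5], transition=1)
print(X[np.ix_([1, 3, 4], [0, 2, 4, 5])])   # each row reads 1 1 0 1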
6 Experimental results

In order to determine whether the transform improves compression, we compared the size of the file obtained by compressing the original matrix against the
overall size of the files string, index and length compressed with the same pro-
gram. We employed two popular lossless compression algorithms, namely gzip
and bzip2.
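The comparison itself amounts to a few lines; here is a sketch using Python's standard gzip and bz2 bindings (the file names are illustrative):

import bz2
import gzip

def compressed_size(path, codec):
    """Size in bytes of the file at `path` after compression with `codec`."""
    with open(path, "rb") as f:
        return len(codec.compress(f.read()))

for codec in (gzip, bz2):
    before = compressed_size("original_matrix", codec)
    after = sum(compressed_size(name, codec)
                for name in ("string", "index", "length"))
    print(codec.__name__, "before:", before, "after:", after)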
We tested the three criteria (sum, max, indiv) discussed in Section 5 on sev-
eral images and simulated data. The result on the image bird is reported in
Figure 4 for different choices of the thresholds.

Fig. 4. The performance of the algorithm on the image bird for different strategies (sum, max, indiv) in choosing how to partition the matrix.

Table 1. Results on 256 x 256 synthetic data. File matrix_i contains i embedded columnwise-constant submatrices of size 64 x 64. Parameters: r = 10, c = 10, t = 10,000.

In the majority of our experiments, the strategy indiv appeared to be the best. Therefore, all experimental
tests that follow employ the indiv test.
Table 2. Comparing the compressibility of 256 x 256 binary images before and after the transform. Threshold parameters r = 60, c = 60. Iterations t = 10,000.
Fig. 5. Total number and average area of the columnwise-constant submatrices found, as a function of the threshold r = c.

Fig. 6. Proportion of the matrix covered by columnwise-constant submatrices and final compressed size, as a function of the threshold r = c.
We ran the transform on the image bird with several different parameter choices. We computed the total number and the average area of the columnwise-constant submatrices found (Figure 5), for several choices of r = c and for two values of t. We also recorded the total proportion of the matrix covered by columnwise-constant submatrices (the rest is non-decomposable), and the final size of the files after compression (Figure 6). Observe that when the thresholds are low, the proportion of the matrix covered by columnwise-constant submatrices is quite high. However, with low thresholds the transform finds a large number of columnwise-constant submatrices whose average area is low (Figure 5), which in turn results in large file sizes for index and length. Compared to the file string, the files index and length are considerably harder to compress. Therefore, the consequence of choosing the thresholds too low is poor compression boosting. Good compression relies on finding a balance between the gain of representing a portion of the matrix with a single bit and the cost of adding the extra information necessary to reconstruct the original matrix.
The optimal value of the thresholds r = c for the image bird is around 40, but other values in the range 40 to 70 achieve very similar results. We carried out the same analysis on other 256 x 256 images, and the same general considerations apply. With respect to the final compression, in most cases the larger the number of iterations t, the better the compression.
References
1. Lonardi, S., Szpankowski, W., Yang, Q.: Finding biclusters by random projections. In: Proceedings of the Symposium on Combinatorial Pattern Matching (CPM'04). Volume 3109 of LNCS, Istanbul, Turkey, Springer (2004) 74-88
2. Storer, J.A., Helfgott, H.: Lossless image compression by block matching. Comput. J. 40(2/3) (1997) 137-145
3. Buchsbaum, A.L., Caldwell, D.F., Church, K.W., Fowler, G.S., Muthukrishnan, S.: Engineering the compression of massive tables: an experimental approach. In: Proceedings of the ACM-SIAM Annual Symposium on Discrete Algorithms, San Francisco, CA (2000) 213-222
4. Buchsbaum, A.L., Fowler, G.S., Giancarlo, R.: Improving table compression with combinatorial optimization. In: Proceedings of the ACM-SIAM Annual Symposium on Discrete Algorithms, San Francisco, CA (2002) 175-184
5. Vo, B.D., Vo, K.P.: Using column dependency to compress tables. In Storer, J.A., Cohn, M., eds.: Data Compression Conference, Snowbird, Utah, IEEE Computer Society Press (2004) 92-101
6. Galli, N., Seybold, B., Simon, K.: Compression of sparse matrices: Achieving almost minimal table sizes. In: Proceedings of the Conference on Algorithms and Experiments (ALEX'98), Trento, Italy (1998) 27-33
7. Bell, T., McKenzie, B.: Compression of sparse matrices by arithmetic coding. In Storer, J.A., Cohn, M., eds.: Data Compression Conference, Snowbird, Utah, IEEE Computer Society Press (1998) 23-32
8. McKenzie, B., Bell, T.: Compression of sparse matrices by blocked Rice coding. IEEE Trans. Inf. Theory 47(3) (2001) 1223-1230
9. Johnson, D.S., Krishnan, S., Chhugani, J., Kumar, S., Venkatasubramanian, S.: Compressing large boolean matrices using reordering techniques. In: Proceedings of the International Conference on Very Large Data Bases (VLDB 2004), Toronto, Canada (2004)
10. Chakrabarti, D., Papadimitriou, S., Modha, D., Faloutsos, C.: Fully automatic cross-associations. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-04), ACM Press (2004) 89-98
11. Peeters, R.: The maximum-edge biclique problem is NP-complete. Technical Report 789, Tilburg University, Faculty of Economics and Business Administration (2000)
12. Dawande, M., Keskinocak, P., Swaminathan, J.M., Tayur, S.: On bipartite and multipartite clique problems. Journal of Algorithms 41 (2001) 388-403
13. Hochbaum, D.S.: Approximating clique and biclique problems. Journal of Algorithms 29(1) (1998) 174-200