
An Empirical Study To Detect The Collision Rate in Similarity Hashing Algorithm Using MD5

This document presents an empirical study to detect collision rates in the SimHash algorithm when using MD5 as the internal hashing function. The analysis was performed on bit sequences of varying lengths from 2 to 32 bits. Collision detection is desirable for applications using cryptographic hashing like digital signatures and proof-of-work systems. The study revealed a collision rate of 0% to 0.048% when using SimHash with MD5, varying based on the length of the bit sequence. Parallelizing the collision detection process using distributed computing improved the speed and efficiency of the analysis.


An Empirical Study to Detect the Collision Rate in Similarity Hashing Algorithm Using MD5

Tushaar Gangavarapu
Worldwide Deals, Automated Advertising
Amazon.com, Inc.
Bangalore, India
[email protected]

Jaidhar C.D.
Department of Information Technology
National Institute of Technology Karnataka
Mangalore, India
[email protected]

Abstract: Similarity Hashing (SimHash) is a widely used locality-sensitive hashing algorithm employed to detect similarity in large-scale data processing, including plagiarism detection and near-duplicate web document detection. Collision resistance is a crucial property of cryptographic hash algorithms that are used to verify message integrity in internet security applications. A hash function is said to be collision-resistant if it is hard to find two different inputs that hash to the same output. In this paper, we present an empirical study to facilitate the detection of the collision rate when SimHash is employed to check the integrity of a message. The analysis was performed using bit sequences with length varying from 2 to 32 and Message Digest 5 (MD5) as the internal hash function. Furthermore, to enable faster collision detection with more significant speedup and efficient space utilization, we parallelized the process using a distributed data-parallel approach with synchronous computation and optimum load balancing. Collision detection is desirable, owing to its applicability in digital signature systems, proof-of-work systems, and distributed content systems. Our empirical study revealed a collision rate of 0% to 0.048% in SimHash (with MD5), varying with the length of the bit sequence.

Index Terms: Collision Rate, Collision Search, Integrity, MD5, SimHash

978-1-7281-2087-4/19/$31.00 ©2019 IEEE

I. INTRODUCTION

In today's world of open communication and computing, providing a way to check the integrity of messages that are stored or transmitted through an unreliable medium is of vital importance [1]. The integrity of a message guarantees that the message is not tampered with in transit, and it is usually achieved by utilizing hash functions. It is evident from the pigeonhole principle that every hash function with fewer outputs than inputs maps some of the inputs to the same output, i.e., hash collisions are plausible with most hashing schemes [2], [3]. Collision resistance is a vital property of a cryptographic hash function, which ensures the difficulty of finding two distinct inputs that hash to the same output value. While collision resistance is desirable, it does not imply the non-existence of collisions. Cryptographic hash functions are customarily designed to ensure collision resistance. The birthday paradox gives a definitive upper bound on the collision resistance, i.e., if an attacker computes $\sqrt{2^N}$ hash operations (for a hash digest of $N$-bit size) on random input, then it is likely that matching outputs exist [4]. Most hash functions, including Message Digest 5 (MD5) [5] and Secure Hash Algorithm 1 (SHA-1) [6], that were estimated to be collision-resistant were later broken [7], [8], [9]. The impact of collisions is essentially application-dependent, and determining the collision rate can help estimate the collision resistance of a hash function. The use of hash functions in the security of digital signature schemes, proof-of-work systems, distributed content systems, data integrity schemes, e-cash, group signature, and a multitude of other cryptographic protocols makes it almost mandatory to determine the collision rate of the underlying hash functions.

In 1994, MD4 [10] was broken by attacking the last two rounds [4], [11], [12]. MD5 was broken in 1998, using the modular differential attack, and the collisions can be generated in about 15 minutes to an hour [13], [8], which is estimated by exploiting the weakness in the internal structure of MD5. Other hash functions including MD4, RIPEMD [14], and HAVAL-128 [15] can also be broken using a modular differential attack [8], [7]. SHA-1 is not broken yet, but a collision was found with the complexity of less than $2^{69}$ hash operations [9], [4]. It can be noted that the existing literature does not provide any collision information or collision search strategies for collision detection in SimHash [16], which is a locality-sensitive hashing scheme.

For currently unbroken cryptographic hashing schemes, there do not exist any known internal structural weaknesses, thus implying that collision rate detection is the only guaranteed way of proving their collision resistance. SimHash, developed by Moses Charikar, is widely used in detecting similarity in large-scale data processing applications. When SimHash is used to check message integrity, detecting its collision rate becomes vital. In this paper, we present an efficient empirical analysis of the collision rates in the SimHash algorithm through a distributed data-parallel dictionary-update approach, with optimal load balancing and synchronous computation. This study employs MD5 as the internal hashing scheme, and the analysis is presented for bit sequences ranging from 2 to 32 bits in length. Furthermore, the execution time taken to measure the collision rate is also detailed, to give an overall estimate of the time taken to identify collisions in SimHash.
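The birthday bound mentioned in the Introduction can be illustrated numerically. The following is a minimal sketch, assuming digests that are uniformly random over $2^N$ values and using the standard approximation $1 - e^{-k(k-1)/(2 \cdot 2^N)}$ for $k$ hash evaluations; the function name `birthday_collision_probability` is hypothetical and is not part of the study.

```python
import math


def birthday_collision_probability(num_hashes: int, digest_bits: int) -> float:
    """Approximate probability that at least two of `num_hashes` uniformly
    random digests of `digest_bits` bits coincide (an idealized model)."""
    space = 2 ** digest_bits
    # Standard birthday approximation: 1 - exp(-k(k-1) / (2 * 2^N)).
    exponent = -num_hashes * (num_hashes - 1) / (2 * space)
    return 1.0 - math.exp(exponent)


# About sqrt(2^N) = 2^(N/2) evaluations of a 128-bit digest (the MD5 size).
p_128 = birthday_collision_probability(2 ** 64, 128)
# The same ratio for a toy 16-bit digest.
p_16 = birthday_collision_probability(2 ** 8, 16)
```

Under this idealized model, roughly $2^{N/2}$ evaluations already give a collision probability of about 39%, which is why the bound is usually quoted at that order of magnitude.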

The rest of the paper is structured as follows: Section II presents a brief overview of the SimHash algorithm. Section III reviews the existing literature and work previously carried out in this domain. The proposed methodology to compute the collision rates for SimHash is presented in detail in Section IV. Section V presents the obtained experimental results, followed by conclusions and a discussion of future research possibilities in Section VI.

II. BACKGROUND: REVIEW OF SIMHASH

While most hash algorithms, including MD5, SHA-256, and HAVAL-128, hash different inputs (even with the slightest of variations) to entirely different hash digests, SimHash hashes similar inputs (in terms of the Hamming distance) to similar (closer) hash digests. Consider the following example:

    phrase1 = "magic is all within you"
    phrase2 = "magic is all in you"

    phrase1.MD2 = 923ce24b045b25ad82341c2a8ac65f65
    phrase2.MD2 = c0d972488d0c98763ab1f596a63e35f3
    hammingDistance(phrase1.MD2, phrase2.MD2) = 65

    phrase1.SimHash = 2da266b7f30b82d9
    phrase2.SimHash = 2da366b773a382fd
    hammingDistance(phrase1.SimHash, phrase2.SimHash) = 7

The SimHash algorithm uses an internal hashing algorithm to hash the shingles, or n-grams, obtained from a given phrase. The hash digest of each n-gram is then used to arrive at the final similarity hash digest. Algorithm 1 details the entire procedure employed to obtain the SimHash digest for a given input phrase, the specified value of n in n-grams (shingle size), and the defined internal hashing scheme.

Algorithm 1: SimHash Algorithm
Input: input phrase, shingle size, hash algorithm
Output: SimHash digest

    n-grams ← inputPhrase.shingles(shingleSize)
    hashDigests ← []
    for shingle ∈ n-grams do
        hashDigest ← binary(hashAlgorithm(shingle))
        hashDigests.append(hashDigest)
    SimHashBits ← [0] * len(hashDigests[0])
    for hashDigest ∈ hashDigests do
        for idx ← 0 to len(hashDigest) − 1 do
            if hashDigest[idx] = 1 then
                SimHashBits[idx] ← SimHashBits[idx] + 1
            else
                SimHashBits[idx] ← SimHashBits[idx] − 1
    SimHashDigest ← string.empty
    for idx ← 0 to len(SimHashBits) − 1 do
        if SimHashBits[idx] > 0 then
            SimHashDigest.append('1')
        else
            SimHashDigest.append('0')
    return SimHashDigest

The hash digests obtained through SimHash for similar input phrases often have low Hamming distance and higher Jaccard similarity. This property of SimHash is extremely practical in near-duplicate detection [17], [18]. By using SimHash for near-duplicate detection, we can reduce the time complexity from $O(N^2)$ for pair-wise comparison to $O(N)$.

III. RELATED WORK

In the past, many cryptographic algorithms, including MD4, MD5, SHA-0, and SHA-1, were broken by exploiting the structural weaknesses of the underlying hashing schemes [8], [9], [7]. Most of the studies concerning SimHash in the existing literature aim at evaluating the applicability of this locality-sensitive algorithm to near-duplicate detection in data processing applications, including plagiarism checking [19] and email spam detection [20].

Sood and Loguinov [21] proposed a significantly faster and more space-efficient approach to detect similar document pairs in large-scale data collections. Their bit-flipping algorithm resulted in certain performance overhead. Fu et al. [22] presented a document-based query searchable encryption scheme over encrypted cloud documents, based on similarity hashing and trie-based indexing. Jiang and Sun [23] proposed a semi-supervised SimHash algorithm to search high-dimensional data. Their algorithm learned the optimal feature weights from prior knowledge, to relocate the data, ensuring that similar data inputs have similar hash digests. Ho et al. [20] employed the SimHash algorithm with a parallel processing framework and meet-in-the-middle attack to detect spam emails.

The existing research only presents the applications of the SimHash algorithm without any reference to its collision rate (and thus, the collision resistance). Hence, we conclude that there exist no state-of-the-art studies concerning the determination of the collision rates (resistance) for the SimHash algorithm.

IV. METHODOLOGY

Collision detection aims at finding two distinct inputs (here, bit sequences) that hash to the same digest. Firstly, the $2^n$ distinct bit sequences of length $n$ are generated, for $n$ varying from 2 to 32. Then, the SimHashes for the generated bit sequences are computed using the procedure in Algorithm 1, with MD5 as the internal hashing scheme and a shingle size of two. All the hash digests are stored in a hash map (dictionary) to ensure a constant lookup complexity ($O(1)$). In the hash map, we store the obtained hash digest as the key and the count of its occurrences as the corresponding value.

The computation for lower-order sequences (up to 16 bits in length) is manageable and does not require any parallel considerations. However, for higher-order bit sequences, the computational cost of collision detection is very high, especially in terms of the time taken. Thus, the need to parallelize the entire process of collision detection becomes more relevant when dealing with higher-order bit sequences. In this study, we employ a distributed data-parallel approach using OpenMP [24], [25], MPI [26], and multiprocessing (in Python) to reduce the time taken by the SimHash collision detection process.
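The procedure of Algorithm 1 in Section II can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the study: shingles are taken as overlapping character-level n-grams (the paper does not specify character or word shingles), the digest is returned as a bit string of the full internal hash width, and the helper names `shingles`, `simhash`, and `hamming_distance` are hypothetical.

```python
import hashlib


def shingles(text: str, size: int) -> list:
    """Overlapping character n-grams (assumed; the shingling unit is not
    specified in the paper). Inputs shorter than `size` yield one shingle."""
    if len(text) < size:
        return [text]
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def simhash(text: str, shingle_size: int = 2, hash_name: str = "md5") -> str:
    """Return a SimHash digest as a bit string, following Algorithm 1."""
    width = hashlib.new(hash_name).digest_size * 8
    votes = [0] * width
    for shingle in shingles(text, shingle_size):
        raw = hashlib.new(hash_name, shingle.encode("utf-8")).digest()
        digest = int.from_bytes(raw, "big")
        for idx in range(width):
            # Bit idx, counted from the most significant bit of the digest.
            bit = (digest >> (width - 1 - idx)) & 1
            votes[idx] += 1 if bit else -1
    # A position is 1 only when strictly more shingles voted 1 than 0.
    return "".join("1" if vote > 0 else "0" for vote in votes)


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


d1 = simhash("magic is all within you")
d2 = simhash("magic is all in you")
distance = hamming_distance(d1, d2)
```

Because the shingling unit and the digest width are assumptions, the digests produced here need not match the 64-bit values shown in the example of Section II.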

2019 Fifth International Conference on Data Science and Engineering (ICDSE)


[Figure 1: the Master computes blockSize = 2^numBits / numProcesses and forks Worker_1 ... Worker_n. Each worker runs threads T_1 ... T_m over the stages bit sequences, shingles, MD5 hashes, and SimHashes, and updates its own collision dictionary. The dictionaries are placed in an output queue and joined into the final collisions.]
Fig. 1. Distributed data-parallel approach with optimal load balancing and synchronization to detect the collisions in SimHash (with MD5).

TABLE I
COLLISION RATE IN SIMHASH USING MD5 AS THE INTERNAL HASHING ALGORITHM.

#Bits   #Collisions   Collision rate (%)   #Processes   Time (s)
2       0             0                    1            0.00007000
4       0             0                    1            0.00032000
8       0             0                    1            0.00613700
16      0             0                    1            29.3268820
20      120           0.01144409180        4            464.700325
24      8,128         0.04844665527        4            6235.20520
28      41,523        0.01546852291        16           104092.872
32      1,438,275     0.03348744940        64           225431.219
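The collision-rate column of Table I is consistent with the number of collisions divided by $2^{numBits}$, expressed as a percentage. A minimal check of that arithmetic is sketched below; the function name `collision_rate_percent` is hypothetical, and the counts are copied from Table I.

```python
def collision_rate_percent(num_collisions: int, num_bits: int) -> float:
    """Collisions as a percentage of the 2^num_bits generated bit sequences."""
    return 100.0 * num_collisions / 2 ** num_bits


# (#Bits, #Collisions) pairs taken from Table I.
table_counts = {20: 120, 24: 8128, 28: 41523, 32: 1438275}
rates = {bits: collision_rate_percent(count, bits)
         for bits, count in table_counts.items()}

for bits, rate in rates.items():
    print(f"{bits:>2} bits: {rate:.11f} %")
```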

The entire distributed data-parallel workflow employed in the detection of SimHash (with internal MD5 hashing scheme) collisions is depicted in Fig. 1. In our distributed data-parallel approach, the master process computes the block size as $\frac{2^{numBits}}{numProcesses}$. The master process (denoted as Master in Fig. 1) then divides the task among several worker processes (denoted by $Worker_i$, $i \in [1, n]$, in Fig. 1), each of which computes the workload corresponding to the predetermined block size. Different worker processes are run on multiple machines with identical computing power. Each worker process maintains a collision dictionary into which it updates the SimHashes and their counts. The workload per worker process involves the generation of bit sequences, n-grams (shingles), and MD5 hashes, along with SimHash computations for all the shingles. Each process spawns several threads (denoted by $T_i$, $i \in [1, m]$, in Fig. 1) to distribute the computation of MD5 hashes and SimHashes, thus ensuring further parallelization. Synchronization among the threads during the update of the collision dictionary is ensured through locks. Once a worker process completes its workload, it enqueues its collision dictionary into a process output queue maintained by the master process. All the collision dictionaries from the process output queue are then merged by adding the values (counts) for the same keys (SimHash digests) across the dictionaries.

Furthermore, we recorded the execution times to measure the overall time taken to identify the collision rate (and thus, the collision resistance) for a given bit sequence length. The execution time for every bit sequence length (2 to 32) was collected eight times to reduce the bias caused by other system processes that are not under the control of the experimenter. Moreover, with every run of the experiment, the order of experimentation for a specific bit length was shuffled to ensure an unbiased measurement of the time taken. The individual measurements were then averaged to obtain the overall time taken to identify collisions.
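The master/worker partitioning and dictionary merging described in this section can be sketched on a single machine as follows. This is a simplified illustration, not the distributed setup of the study: it substitutes Python's `multiprocessing.Pool` for the OpenMP/MPI workers and the output queue, omits the per-worker threads and locks, assumes character-level shingles, and counts a collision as every occurrence of a digest beyond its first (the paper does not state how the count is derived from the dictionary). All function names are hypothetical, and the resulting counts need not match Table I.

```python
import hashlib
from collections import Counter
from multiprocessing import Pool


def simhash(bits: str, shingle_size: int = 2) -> str:
    """SimHash of a bit string with MD5 as the internal hash (see Algorithm 1)."""
    width = 128
    votes = [0] * width
    for i in range(max(len(bits) - shingle_size + 1, 1)):
        shingle = bits[i:i + shingle_size].encode("utf-8")
        digest = int.from_bytes(hashlib.md5(shingle).digest(), "big")
        for idx in range(width):
            votes[idx] += 1 if (digest >> (width - 1 - idx)) & 1 else -1
    return "".join("1" if vote > 0 else "0" for vote in votes)


def block_ranges(num_bits: int, num_processes: int) -> list:
    """Split [0, 2^num_bits) into blocks of size 2^num_bits / num_processes."""
    total = 2 ** num_bits
    block_size = -(-total // num_processes)  # ceiling division
    return [(start, min(start + block_size, total))
            for start in range(0, total, block_size)]


def count_block(task: tuple) -> Counter:
    """Worker: build the collision dictionary (digest -> count) for one block."""
    start, stop, num_bits = task
    counts = Counter()
    for value in range(start, stop):
        counts[simhash(format(value, f"0{num_bits}b"))] += 1
    return counts


def merge_counts(partials) -> Counter:
    """Master: add the counts of identical digests across all dictionaries."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged


def num_collisions(counts: Counter) -> int:
    """Occurrences beyond the first for each digest (assumed definition)."""
    return sum(count - 1 for count in counts.values() if count > 1)


def run_parallel(num_bits: int, num_processes: int) -> Counter:
    """Fork the workers, wait for all of them, and merge their dictionaries."""
    tasks = [(start, stop, num_bits)
             for start, stop in block_ranges(num_bits, num_processes)]
    with Pool(processes=num_processes) as pool:
        partials = pool.map(count_block, tasks)  # blocks until all finish
    return merge_counts(partials)
```

A call such as `run_parallel(16, 4)`, placed under an `if __name__ == "__main__":` guard, would return the merged dictionary, from which `num_collisions` gives the collision count. Equal-sized blocks are a reasonable reading of the load balancing described above, since every bit sequence of a given length costs about the same to hash.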



[Figure 2: plot of the collision rate (%), on a vertical axis running from 0 to 5 ×10^-2, against the number of bits (2, 4, 8, 16, 20, 24, 28, 32).]

Fig. 2. A graph depicting the variation in the collision rate with the increasing number of bits.

V. EXPERIMENTAL RESULTS

All the results presented in this study were obtained using multiple nearly identical machines, each with an i5 7200U processor at 4× 3.1 GHz, 8 GB of DDR3 memory at 1333 MHz, and a 10/100/1000 Gigabit LAN network.

The collision rates are computed as $\frac{numCollisions}{2^{numBits}}$, expressed as a percentage. The variation in the collision rate (%) plotted against the variation in the number of bits is presented in Fig. 2. It can be observed that the collision rate for lower-order bit sequences (2 to 16 bits) is 0%. However, a maximum collision rate of 0.048% can be observed for 24-bit sequences (marked with a red dotted line in Fig. 2). Table I tabulates the experimental results obtained for bit sequences with length varying from 2 to 32 bits.

It is evident from Fig. 2 that the collision rate rises above zero for higher-order bit sequences (20 bits and longer), with a maximum value at 24 bits. It can also be observed from Table I that the time taken to determine the collision rate increases exponentially with the increase in the number of bits. Approximately a day and five hours (104,092.872 s) for 28-bit sequences, and two days and 14 hours (225,431.219 s) for 32-bit sequences, were required to determine their respective collision rates. Distributed data parallelization with synchronization and optimal load balancing resulted in a greater speedup and more efficient storage utilization than the sequential counterparts.

VI. CONCLUSIONS

Evaluating the collision resistance of a cryptographic hashing algorithm plays a pivotal role in applications requiring integrity, such as digital signature schemes, e-cash, and proof-of-work systems. SimHash is a widely used locality-sensitive algorithm used in many large-scale data processing applications. In this paper, we presented a distributed data-parallel framework with synchronization and optimal load balancing, to detect the collision rates of the SimHash algorithm with a more significant speedup and efficient storage utilization. We presented our analysis using bit sequences with length varying from 2 to 32 bits. It was observed that the time taken to detect the collisions increases exponentially with the increase in the number of bits. As a part of the future work, we aim at analyzing the bit patterns of the SimHash digests in great detail, to try and exploit any internal structural weaknesses.

REFERENCES

[1] H. Krawczyk, M. Bellare, and R. Canetti, “HMAC: Keyed-hashing for message authentication,” Tech. Rep., 1997.
[2] S. Goldwasser and M. Bellare, “Lecture notes on cryptography,” Summer course “Cryptography and computer security” at MIT, vol. 1999, p. 1999, 1996.
[3] J. Floyd, “What do Hash Collisions Really Mean?” Jul 2008, [Online; accessed 21. Dec. 2018] URL: https://permabit.wordpress.com/2008/07/18/what-do-hash-collisions-really-mean.
[4] R. Pass, “Lecture 21: Collision-Resistant Hash Functions and General Digital Signature Scheme,” Course on Cryptography at Cornell University, Nov 2009, URL: https://www.cs.cornell.edu/courses/cs6830/2009fa/scribes/lecture21.pdf.
[5] R. Rivest, “The MD5 message-digest algorithm,” Tech. Rep., 1992.
[6] J. H. Burrows, “Secure hash standard,” Department of Commerce, Washington DC, Tech. Rep., 1995.
[7] X. Wang, D. Feng, X. Lai, and H. Yu, “Collisions for hash functions MD4, MD5, HAVAL-128 and RIPEMD,” IACR Cryptology ePrint Archive, vol. 2004, p. 199, 2004.
[8] X. Wang and H. Yu, “How to break MD5 and other hash functions,” in Annual international conference on the theory and applications of cryptographic techniques. Springer, 2005, pp. 19–35.
[9] X. Wang, Y. L. Yin, and H. Yu, “Finding collisions in the full SHA-1,” in Annual international cryptology conference. Springer, 2005, pp. 17–36.
[10] R. Rivest, “The MD4 message-digest algorithm,” Tech. Rep., 1992.
[11] B. Den Boer and A. Bosselaers, “An attack on the last two rounds of MD4,” in Annual International Cryptology Conference. Springer, 1991, pp. 194–203.
[12] H. Dobbertin, “Cryptanalysis of MD4,” in International Workshop on Fast Software Encryption. Springer, 1996, pp. 53–69.
[13] J. Katz, A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, Handbook of applied cryptography. CRC press, 1996.
[14] H. Dobbertin, “RIPEMD with two-round compress function is not collision-free,” Journal of Cryptology, vol. 10, no. 1, pp. 51–69, 1997.
[15] Y. Zheng, J. Pieprzyk, and J. Seberry, “HAVAL: a one-way hashing algorithm with variable length of output,” 1993.
[16] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM, 2002, pp. 380–388.
[17] G. S. Manku, A. Jain, and A. Das Sarma, “Detecting near-duplicates for web crawling,” in Proceedings of the 16th international conference on World Wide Web. ACM, 2007, pp. 141–150.
[18] S. Buyrukbilen and S. Bakiras, “Secure similar document detection with simhash,” in Workshop on Secure Data Management. Springer, 2013, pp. 61–75.
[19] C. Sadowski and G. Levin, “Simhash: Hash-based similarity detection,” 2007.
[20] P.-T. Ho, H.-S. Kim, and S.-R. Kim, “Application of sim-hash algorithm and big data analysis in spam email detection system,” in Proceedings of the 2014 Conference on Research in Adaptive and Convergent Systems. ACM, 2014, pp. 242–246.
[21] S. Sood and D. Loguinov, “Probabilistic near-duplicate detection using simhash,” in Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 2011, pp. 1117–1126.
[22] Z.-J. Fu, J.-G. Shu, J. Wang, Y.-L. Liu, and S.-Y. Lee, “Privacy-preserving smart similarity search based on simhash over encrypted data in cloud computing,” ŁŁ, vol. 16, no. 3, pp. 453–460, 2015.
[23] Q. Jiang and M. Sun, “Semi-supervised simhash for efficient document similarity search,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 2011, pp. 93–101.
[24] B. Chanduka, T. Gangavarapu, and C. D. Jaidhar, “A single program multiple data algorithm for feature selection,” in Intelligent Systems Design and Applications. Cham: Springer International Publishing, 2020, pp. 662–672.
[25] T. Gangavarapu, H. Pal, P. Prakash, S. Hegde, and V. Geetha, “Parallel OpenMP and CUDA implementations of the n-body problem,” in Computational Science and Its Applications – ICCSA 2019. Cham: Springer International Publishing, 2019, pp. 193–208.
[26] M. Snir, S. Otto, S. Huss-Lederman, J. Dongarra, and D. Walker, MPI: The Complete Reference: The MPI core. MIT press, 1998, vol. 1.

