Identifying Forensically Uninteresting Files Using a Large Corpus
Neil C. Rowe
1 Introduction
As digital forensics has grown, larger and larger corpora of drive data are available.
To speed subsequent processing, it is essential in the triage process for a drive to first
eliminate from consideration those files that are clearly unrelated to an investigation [8].
This can be done either by directly eliminating files to create a smaller corpus or by
removing their indexing. We define as “uninteresting” those files whose contents do not
provide forensically useful information about users of a drive. These are operating-system
and applications-software files that do not contain user-created information, and also
include common Internet-document downloads that do not provide user-discriminating
information. (Metadata on uninteresting files may still be interesting, however, as in
indicating time usage patterns.) This definition applies to most criminal investigations and data
mining tasks but not to malware investigations. We can confirm that files are uninteresting
by opening them and inspecting them for user-created and user-discriminating data.
Additional files may also be uninteresting depending on the type of investigation, such as
presentation files in an investigation of accounting fraud. Uninteresting files usually
comprise most of a drive, so eliminating them significantly reduces the size of the investigation.
Unfortunately, uninteresting files occur in many places on a drive, and some software
directories do contain interesting user files, so finding the uninteresting is not always
straightforward.
Most decisions about interestingness can be made from file-directory metadata
without examining the file contents. That is important because directory metadata
requires roughly 0.1 % of the storage of file contents. Directory metadata can provide
the name of a file, its path, its times, and its size, and this can give us a good idea of
the nature of a file [1]. One additional type of data is also very useful, a hash value
computed on the contents of the file, which enables recognition of identical content
under different file names. Forensic tools like SleuthKit routinely extract directory
metadata from drive images.
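As a minimal sketch (not the extraction pipeline actually used here), the per-file metadata in
Fiwalk's DFXML output can be read with a short Python script like the following; the element
names (fileobject, filename, filesize, crtime, hashdigest) reflect common DFXML usage and should
be checked against the output of the Fiwalk version in use.

import xml.etree.ElementTree as ET

def local(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit('}', 1)[-1]

def read_dfxml(path):
    """Collect name, size, creation time, and SHA-1 for each fileobject."""
    records = []
    for _, elem in ET.iterparse(path, events=("end",)):
        if local(elem.tag) != "fileobject":
            continue
        rec = {"name": None, "size": None, "crtime": None, "sha1": None}
        for child in elem:
            t = local(child.tag)
            if t == "filename":
                rec["name"] = child.text
            elif t == "filesize":
                rec["size"] = int(child.text or 0)
            elif t == "crtime":
                rec["crtime"] = child.text
            elif t == "hashdigest" and child.get("type", "").lower() == "sha1":
                rec["sha1"] = (child.text or "").lower()
        records.append(rec)
        elem.clear()  # keep memory bounded on large drive images
    return records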
We can eliminate files whose hash values match those in published sets [5]. This
has the side benefit of detecting modified files since their hash values are different [9].
However, published hash values miss many kinds of software files [11], especially files
created dynamically. This paper will discuss methods for improving on this
performance, in particular by correlating files on drives and across drives in a corpus. This
provides both a new set of hash values and new methods for finding them.
2 Previous Work
A standard approach is to eliminate files whose hash values match those in the National
Software Reference Library (NSRL) from the U.S. National Institute of Standards and
Technology (NIST). The quality of the data provided in the NSRL is high [6].
However, our tests [11] found that it did not provide much coverage. Only around one file in four
in our international corpus appeared in the NSRL, and there were surprising gaps in
coverage of well-known software. In part this is due to NIST’s usual approach of
purchasing software, installing it, and finding hash values for the files left on a drive.
This will not find files created only during software execution, most Internet
downloads, and user-specific uninteresting files like software configuration files.
Furthermore, the fraction of files recognized by NSRL on a typical drive is decreasing as
storage capacity increases. To fill the gap, commercial vendors like bit9.com and
hashsets.com sell additional hash values beyond NSRL.
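For illustration, matching against a published hash set can be as simple as loading its SHA-1
column into a set; the CSV layout assumed below (SHA-1 as the first column of NSRLFile.txt)
should be verified against the RDS release actually used.

import csv

def load_nsrl_sha1(path="NSRLFile.txt"):
    """Load the SHA-1 column of an NSRL RDS file into a set (layout assumed)."""
    hashes = set()
    with open(path, newline="", encoding="latin-1") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            if row:
                hashes.add(row[0].strip().upper())
    return hashes

def in_published_set(sha1, nsrl_hashes):
    """True if a file's SHA-1 hex string appears in the loaded NSRL set."""
    return sha1.strip().upper() in nsrl_hashes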
Chawathe [2] investigates the problem of recognizing uninteresting files (which he
calls “whitelisting”) and suggests that pieces of files need to be hashed separately, a
technique that considerably increases the workload. Tomazic et al. [14] detail efficient
methods for indexing and matching hash values found on files. Many of the issues are
similar to the important problem of file deduplication for which file hashes are useful [7].
Ruback et al. [13] is the closest work to ours. They investigated methods for
improving a hash set of uninteresting files by using locality and time of origin to rule
out portions of the hash values in the NSRL, and their experiments showed they could
reduce the size of the hash set by 51.8 % without significantly impacting performance.
They also identified as uninteresting those files occurring on multiple drives, similarly
to [11]. Their experiments were based on less than one million files, a weakness since
files in cyberspace are highly varied. A more serious weakness is that they used human
expertise to provide guidance in indicating uninteresting files, and then trained a model.
This seems risky because it may miss forensic evidence that is atypical or
unanticipated. Legal requirements also often dictate that forensic evidence be complete, in
which case elimination of forensic evidence must be done by better-justified methods
than corpus-specific heuristic ones.
3 Experimental Setup
The experiments reported here were done with the Real Data Corpus [3], which at the
time had 3471 drives purchased as used equipment in 28 non-U.S. countries,
supplemented with additional drives from our previous research. There were 83,787,499 files
on these drives with 21,021,187 distinct hash values. We extracted directory metadata
with SleuthKit and the Fiwalk tool. As these drives were obtained from ordinary users,
we saw very little concealment or camouflage on them. Thus the hash values we
derived from them should accurately represent file contents, an issue important in some
forensic applications [4].
We also obtained the December 2012 version of the National Software Reference
Library Reference Data Set (NSRL-RDS, www.nsrl.nist.gov), the June 2012 version of
the database of hashsets.com, and an April 2013 download of the database of the Bit9
Cyber Forensics Service (www.bit9.com). Because hash values in Bit9 are encoded, we
were only able to test hashes in our corpus that were also in Bit9. Basic data is given in
Table 1.
This work used SHA-1 hash values. They are widely available and serve as the
primary key for the NSRL. MD5 hash values are also widely available, but 128 bits as
opposed to 160 does not provide a sufficiently low probability, in our view, of hash
collisions.
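For reference, a SHA-1 value of the kind used here can be computed with Python's standard
library; this is a generic sketch, not the tool used in the experiments.

import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 of a file's contents, reading in chunks so large
    files need not fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()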
As explained earlier, “uninteresting” files will be defined as those that do not contain
user-created or user-discriminating data. Nine methods for identifying such files, and
thereby their hash values, were investigated, as summarized in Table 2. Thresholds used by these
methods were set by the experiments reported in Sect. 6.
• PA, frequent paths: Files whose full path occurred frequently in the corpus are unlikely to be
interesting since they are likely due to mass distribution. Such paths include different
versions of the same file, such as configuration files for different users or
successive versions of an updated executable. An example from our corpus was
restore/WINDOWS/inf/fltmgr.PNF, which occurred on six drives, and none of its
hash values were in NSRL. We set a threshold of at least 20 occurrences, including
deleted files, for hash values based on our experiments. (A minimal sketch of this kind of
frequency thresholding appears after this list.)
• BD, frequent bottom-level directory-filename pairs: Files whose pair of the file
name and the immediate directory above it occurred especially frequently. This will
catch versions of the same file in different directories under different versions of an
operating system or software. Examples from our corpus were WINDOWS/
$NtUninstallWIC$ with 60 occurrences in our corpus and Config.Msi/4a621.rbf
with 20 occurrences; neither of them had any hash values in NSRL. This will also
catch many hidden files in common directories like the defaults “.” and “..”. We set
a threshold of at least five undeleted occurrences based on our experiments.
• TM, clustered creation times: Files whose creation times fall within a short period
shared by many other files on the same drive. Such time clusters suggest automated
copying from an external source, particularly if the rate of creation exceeded human
limits. An example from our corpus was seven files created on one drive within
one second under directory Program Files/Adobe/Adobe Flash CS3/adobe_epic/
personalization: pl_PL, pl_PL/., pl_PL/.., pt_BR, pt_BR/., pt_BR/.., and pt_PT. All
were 56 bytes, and two hash values were not in NSRL. Creation times are more
useful than access and modification times because they indicate installation. We set
a threshold of at least 50 files created within the same minute based on our
experiments.
• WK, busy weeks: Files created unusually frequently in particular weeks across
drives, which suggest software updates. A period of a week is appropriate since it
takes several days for most users to connect to a site and get an update. Figure 1
shows a typical distribution of times by week in our corpus, showing some sharp
peaks. We count file creations per week and find “busy” weeks having at least five
times the average number of creations, of which there were 116 in our corpus. Then
we find “busy” directories (full path minus the file name) in the busy weeks, those
whose frequency of creation was at least 100 times greater than their average
creation rate per week. Again, thresholds were set by experiments; the factor of 100 was
necessary to discriminate against user copying of file directories. We then collect the
hash values for those busy directories in those busy weeks as proposed uninteresting
file content.
• SZ, frequent sizes: Files with the same size and extension. This enables recognizing
fixed-size formats with different contents, such as certain kinds of log records. However,
to reduce false matches we apply the additional criterion that the extension must
occur unusually often in all files of that size. Examples from our corpus were all 31
files of size 512 in directory System/Mail/00100000_S, and 8 files of size 2585 in
directory Program Files/Sun/JavaDB/docs/html/tools/ctoolsijcomref with extension
html, none of which had hash values in published hash sets. Based on experiments,
we set a threshold of occurrences on at least five drives, where the occurrence rate of
the file extension among all files of that size was at least ten standard deviations
above its average rate in the corpus.
• Small files (included in SZ): Files less than a minimum size are unlikely to contain
forensically useful information. For example, there were 4,223,667 files of zero
length in the corpus. We set a minimum of 5 bytes from experiments.
• CD, contextually uninteresting files: Directories in which more than a certain
fraction of files were already identified as uninteresting by other methods suggest
that the rest are also uninteresting by “contagion”. An example from our corpus is
directory Program Files/Yahoo!/Messenger/skins/Purple/images and previously
unclassified files “.”, “..”, bg_hover.png, _GLH0307.TMP, and bg_selected.png.
We set a threshold of 50 % known hash values in directories based on experiments.
This method can be used to bootstrap better performance on each run on a corpus.
• TD, files in known uninteresting top-level or mid-level directories: We manually
compiled lists from study of the files remaining after filtering on the other criteria
mentioned here, and obtained 5749 top-level and 337 mid-level directories.
Example top-level directories were APPS/Symantec AntiVirus, Documents and
Settings/Admin/Templates, FOUND.029, Program Files (x86)/AutoCAD 2008,
WINDOWS, and system/install; example mid-level directories were /Help/ and /Default
User/. It is important not to exclude all applications directories (Program Files,
Applications, etc.) because some software keeps user files there.
• DR, directory records: Default directory records such as “WINNT/Cursors/.” as a
path. These were excluded only in the final phase of processing (Sect. 7) since they
can help establish directory and time trends.
• EX, files with known uninteresting extensions: Some extensions are exclusively
associated with operating systems and software, such as exe, mui, log, dat, bin, and
config. We used a classification of 5802 file extensions that we are developing [11]
that maps extensions to 45 categories. The categories we labeled as nonuser were used to select extensions for this method.
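To make the flavor of these methods concrete, the following Python sketch implements two of them
(PA and TM) over metadata records of the kind described in Sect. 3; the record fields (drive,
path, crtime, sha1) and the ISO timestamp format are assumptions for illustration, while the
thresholds are the ones reported above.

from collections import Counter, defaultdict

def pa_hashes(records, min_occurrences=20):
    """PA: hashes of files whose full path occurs at least 20 times,
    deleted files included."""
    path_counts = Counter(r["path"] for r in records)
    return {r["sha1"] for r in records
            if r["sha1"] and path_counts[r["path"]] >= min_occurrences}

def tm_hashes(records, min_per_minute=50):
    """TM: hashes of files created in a burst of at least 50 creations
    within the same minute on the same drive."""
    per_minute = defaultdict(list)
    for r in records:
        if r.get("crtime"):
            minute = r["crtime"][:16]  # e.g. '2009-03-14T09:26' (assumed format)
            per_minute[(r["drive"], minute)].append(r)
    hashes = set()
    for group in per_minute.values():
        if len(group) >= min_per_minute:
            hashes.update(r["sha1"] for r in group if r["sha1"])
    return hashes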
We inspected a sample of hash values returned for each method and concluded that
the NS hashes (NSRL RDS) and HA hashes (occurring on at least five drives in our
corpus) were highly reliable since, in samples of size 200, we saw no interesting files
incorrectly included. This makes sense for the NSRL hashes because the collection
technique of NSRL (buying the software and inspecting its files) is highly reliable at
identifying uninteresting files. Similarly, the diversity of our corpus means that any file
that occurs on at least five drives is highly likely to be a distributed sharable resource
and thus uninteresting. So we recalculated our counts excluding NS and HA as shown
in Table 4. Excluding NSRL hashes reduced the number of distinct hashes from the
corpus from 21,021,187 to 19,735,599 (a reduction of 6.1 %) and the number of files
from 78,240,703 to 62,774,546 (a reduction of 19.8 %). Excluding hash values
occurring on at least five different drives in the corpus reduced the number of distinct
hashes further to 19,343,552 (a further reduction of 1.6 %) and the number of files to
43,909,093 (a significant further reduction of 30.1 %). The remaining hash sets had
significant amounts of overlap, supporting their validity for file exclusion.
Table 3. Number of files in the RDC corpus in March 2013 (in millions) whose hashes appear in
both hash sets of each pair (row and column).
NS HA PA BD TM WK SZ CD TD EX
NS 21.0 18.9 18.4 17.8 20.2 15.4 3.8 20.7 20.0 17.2
HA 18.9 33.1 30.0 29.0 31.5 25.1 5.6 31.9 31.4 26.8
PA 18.4 30.0 35.4 33.1 33.5 26.5 6.6 33.8 34.2 26.8
BD 17.8 29.0 33.1 35.8 33.5 25.3 7.0 33.2 34.1 26.4
TM 20.2 31.5 33.5 33.5 47.3 27.5 9.2 38.3 41.0 32.4
WK 15.4 25.1 26.5 25.3 27.5 27.9 5.1 27.0 27.2 21.8
SZ 3.8 5.6 6.6 7.0 9.2 5.1 10.5 7.7 8.9 6.2
CD 20.7 31.9 33.8 33.2 38.3 27.0 7.7 40.9 38.1 30.2
TD 20.0 31.4 34.2 34.1 41.0 27.2 8.9 38.1 45.6 32.2
EX 17.2 26.8 26.8 26.4 32.4 21.8 6.2 30.2 32.2 35.7
Table 4. Number of files in the RDC corpus in March 2013 (in millions) whose hashes appear in
both hash sets of each pair, after excluding NS (the National Software Reference Library) and HA
(hashes occurring on at least five drives). HS = hashsets.com, B9 = Bit9.
HS B9 PA BD TM WK SZ CD TD EX
HS 0.9 0.0 0.3 0.0 0.7 0.2 0.1 0.9 0.9 0.1
B9 0.0 0.5 0.0 0.0 0.3 0.0 0.0 0.2 0.3 0.0
PA 0.3 0.0 5.0 4.3 4.0 1.5 1.6 4.7 5.0 1.5
BD 0.0 0.0 4.3 6.2 4.8 1.3 2.2 5.8 6.1 1.7
TM 0.7 0.3 4.0 4.8 14.0 2.3 3.5 12.1 12.1 5.1
WK 0.2 0.0 1.5 1.3 2.3 2.5 0.8 2.3 2.4 0.5
SZ 0.1 0.0 1.6 2.2 3.5 0.8 4.5 3.9 4.0 1.2
CD 0.9 0.2 4.7 5.8 12.1 2.3 3.9 15.5 14.7 6.2
TD 0.9 0.3 5.0 6.1 12.1 2.4 4.0 14.7 16.6 6.2
EX 0.1 0.0 1.5 1.7 5.1 0.5 1.2 6.2 6.2 7.1
To compare the coverage of the published hash sets and our corpus, we classified
the file extensions of their associated files using our aforementioned taxonomy
(Table 5). We did this for the four primary sources we used; the last column will be
explained in Sect. 7. We used a sample of 30 million records of the Bit9 Cyber
Forensics Service, not just those matching our corpus. These counts are on files and not
hashes, so the same hash was weighted by how often it occurred. “None” means files
with no extension and “miscellaneous” includes ten lesser categories as well as
extensions occurring less than 200 times in our corpus. The statistics confirm that all
these hash sets had broad coverage of a variety of file types, not just executables. That
suggests that their coverage gaps are due to difficulties in covering all software rather
than in covering all file types.
Table 5. (Continued)

Type of extension   NSRL RDS   Hashsets.com   Bit9 sample   Real data corpus   Final RDC filtering
Copies              0.09 %     0.04 %         0.28 %        0.40 %             0.33 %
Integers            1.03 %     1.80 %         0.83 %        2.17 %             4.59 %
Configuration       5.32 %     5.10 %         3.66 %        5.14 %             2.35 %
Update              0.06 %     0.01 %         0.07 %        0.16 %             0.00 %
Security            0.22 %     0.20 %         0.13 %        0.33 %             0.26 %
Malicious           0.01 %     0.01 %         0.00 %        0.02 %             0.00 %
Games               2.44 %     1.70 %         1.64 %        3.24 %             0.00 %
Sci. and eng.       0.85 %     0.03 %         0.51 %        0.21 %             0.35 %
Virtual machine     0.12 %     0.04 %         0.09 %        0.08 %             0.08 %
Multipurpose        2.79 %     3.76 %         3.48 %        2.46 %             5.64 %
Miscellaneous       9.89 %     5.32 %         7.55 %        2.41 %             2.21 %
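A rough sketch of how such a breakdown could be tabulated follows: map each file's extension to
an extension category and compute per-category percentages for one hash source. The small mapping
shown is a hypothetical fragment, not the 45-category taxonomy of [11].

from collections import Counter

EXTENSION_CATEGORY = {  # hypothetical fragment of the full taxonomy
    "jpg": "Camera images", "doc": "Word processing", "xls": "Spreadsheets",
    "exe": "Executables", "log": "Logs", "": "None",
}

def category_percentages(file_names):
    """Percentage of files per extension category for one hash source."""
    counts = Counter()
    for name in file_names:
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        counts[EXTENSION_CATEGORY.get(ext, "Miscellaneous")] += 1
    total = sum(counts.values()) or 1
    return {cat: 100.0 * n / total for cat, n in counts.items()}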
These methods can err in identifying interesting hash values as uninteresting for several
reasons:
• PA, frequent paths: Some configuration files can give clues about a user although
they appear frequently in the corpus.
• BD, frequent bottom-level directories: A directory-filename pair may occur
frequently if its words are common, such as pics/portrait.jpg, yet still be interesting.
• TM, clustered creation times: During an automated software download, the user
may be working on their own files so that creation times are interspersed.
• WK, busy weeks: Users may also be working during busy weeks for software
updates.
• SZ, frequent sizes: A user file may accidentally be the same size as a standard size
used by an application. Also, the standard size may reflect a format imposed on a
user, as with camera formats.
• CD, contextually uninteresting files: Users may put files in a directory that is mostly
created by software.
• TD, files with known uninteresting top-level or mid-level directories: Users may
copy executables to their own directories for backup.
• EX, files with known uninteresting extensions: Users may assign their own
extensions to files.
To investigate how often these conditions occurred, we took random samples of
200 files produced by each method, 1600 in all. We then inspected the full paths, and
did Web research as necessary, to determine which files were user-related or otherwise
potentially interesting in an investigation (Table 6). A few files were unclear in function
so we counted them at half weight.
Table 7. Number of files and hashes remaining after filtering out files having a given number of
matches to our eight new hash sets.

Number of matches to hash sets required:              0     1     2     3     4     5     6     7     8
Logarithm of number of files matched:                 18.1  17.5  17.0  16.8  16.5  16.3  16.0  15.5  13.5
Logarithm of number of distinct hash values matched:  16.8  16.1  15.9  15.5  15.0  14.6  14.0  12.2  9.5
We mapped the libmagic strings generated for our corpus to our set of 45 file-extension groups, so for instance “GIF
Image Data” was mapped to “graphics extension”. We compared the extension groups
for the files in our corpus with the assigned libmagic extension groups. Of the
17,814,041 file records having libmagic values (since it was only added to Fiwalk
recently), only 27.8 % of the groups matched between extensions and libmagic. A significant
number of files do not have extensions, libmagic strings are insufficiently detailed about
configuration files, and some records indicated outright errors in either extensions or
libmagic. We conclude that libmagic classifications are insufficiently reliable as a way
to detect camouflage. However, this subject needs further investigation.
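A sketch of such a cross-check, assuming the python-magic package and a simple keyword mapping
from libmagic description strings to extension groups (the rules shown are illustrative, not the
classification actually used):

import magic  # python-magic package (assumed available)

MAGIC_KEYWORDS = {  # illustrative keyword rules only
    "image data": "graphics",
    "ASCII text": "plain text",
    "executable": "executables",
    "Zip archive": "archives",
}

def magic_group(path):
    """Map a libmagic description such as 'GIF image data, ...' to a group."""
    description = magic.from_file(path)
    for keyword, group in MAGIC_KEYWORDS.items():
        if keyword.lower() in description.lower():
            return group
    return "unknown"

def extension_matches_magic(path, extension_group):
    """Compare the group implied by the file extension with the libmagic group."""
    return magic_group(path) == extension_group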
Files for which SleuthKit did not find hash values, 16.2 % of our remaining files, can
also be eliminated if they match some of our criteria. Missing hash values occur for
deleted files with incomplete metadata. For these we applied the BD (bottom-directory),
TD (top-directory), and EX (extension) criteria to individual file records, and eliminated
files matching at least two criteria. 2.5 million files matched BD, 10.0 million matched
TD, 1109 matched EX, and 5.9 million matched two of the three criteria. For deleted
files having just a file name, we inferred a path when it was unambiguous in 0.28 million
cases, correcting underscores for missing characters if necessary. For instance, file name
REXXUTIL.DL_ was inferred to have path OS2/DLL/REXXUTIL.DLL since only one
path in our corpus had that file name after correcting the underscore.
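A minimal sketch of this two-of-three test for records lacking hash values follows; the field and
parameter names are assumptions, with bottom_dirs, top_dirs, and bad_extensions standing for the
sets built by the BD, TD, and EX methods.

def eliminate_hashless(record, bottom_dirs, top_dirs, bad_extensions):
    """True if a record without a hash meets at least two of BD, TD, EX."""
    path = record["path"]
    parts = path.split("/")
    bottom_pair = "/".join(parts[-2:]) if len(parts) >= 2 else path
    ext = parts[-1].rsplit(".", 1)[-1].lower() if "." in parts[-1] else ""
    criteria_met = sum([
        bottom_pair in bottom_dirs,                 # BD: bottom directory-filename pair
        any(path.startswith(d) for d in top_dirs),  # TD: known uninteresting directory
        ext in bad_extensions,                      # EX: known uninteresting extension
    ])
    return criteria_met >= 2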
We also eliminated in the final processing 2.9 million records of default directory
files (the DR criterion), 4.1 million files smaller than 6 bytes, and files with
extensions strongly suggesting uninterestingness: executables, operating-system
support, installations, updates, help, hardware-related files, and games.
This reduced the number of potentially interesting files to 16,923,937, 20.2 % of
the original set of files. (If we allow matching to only one of the three criteria and not
just to files without hash values, we reduce the number of files to 11.1 % of the
original set, but increase the error rate as estimated in the last section.) 83.9 % of the
remaining files were deleted (they are ignored by most methods) and many could also
be eliminated in investigations. To assess correctness, we took a random sample of
1000 files excluded in all the phases and found that only one file was possibly
incorrectly excluded, a flowchart that had been downloaded with a cluster of other files.
Thus we estimate the false negative rate at 0.1 %. We also took a random sample of
1000 of the files identified as interesting and found that 19.0 % of them were
uninteresting false positives, not an excessive rate.
The last column of Table 5 gives the distribution of extension types for the
remaining “interesting” files. Filtering increased the fraction of camera images,
temporaries, Word documents, presentations, spreadsheets, and integer extensions as
expected, but did not exclude all non-user files due to missing metadata.
As mentioned earlier, we also tested two commercial hashsets, the full database of
Hashsets.com and a sample of the database of Bit9. Hashsets.com matched
5,906 of the remaining hashes, an unimpressive 0.06 %. We confirmed matches to the
Bit9 database for 79,067 of the remaining hashes; we had difficulty identifying which
actually matched, but we estimate Bit9 had hash codes for about 154,000 or 1.56 % of
the remainder, 0.73 % of the original number of hash values. (Bit9 stores hashes in a
proprietary encoding, so to guess which hashes it matched we used the file name and
size returned by Bit9 and matched them to our own data, then extrapolated the number.)
So Bit9 is also not much help in reducing files beyond our own methods, something
important to know since it is costly.
A criticism made of some hash-value collections is that their values will rarely occur
again. So an important test for our proposed new “uninteresting” hash values is to
compare those acquired from different drives. For this we split the corpus into two
pieces C1 and C2, roughly drives processed before 2012 and those processed in 2012.
We extracted uninteresting hash values using our methods for both separately, and then
compared them. Table 8 shows the results.
This table provides two kinds of indicators of the generality of a hash-obtaining
method. One is the average of the second and third columns, which indicates the
overlap between the hash sets. On this SZ is the weakest method and WK is second
weakest, which makes sense because these methods seek clustered downloads and
some clusters are forensically interesting. Another indicator is the ratios of column 7 to
column 4 and column 6 to column 5, which indicate the degree to which the hash
values found generalize from a training set to a test set. The average for the above data
was 0.789 for the average of column 7 to column 4, and 0.938 for the average of
column 6 to column 5, which indicates a good degree of generality. No methods were
unusually poor on the second indicator.
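As a sketch, the two indicators could be computed as follows, given the hash set a method produced
from each half (c1_hashes, c2_hashes) and the sets of all hash values occurring in each half
(all_c1, all_c2); the column numbers refer to Table 8.

def generality_indicators(c1_hashes, c2_hashes, all_c1, all_c2):
    """Compute the overlap and train-to-test coverage indicators."""
    overlap_c1 = len(c1_hashes & c2_hashes) / len(c1_hashes)  # column 2
    overlap_c2 = len(c1_hashes & c2_hashes) / len(c2_hashes)  # column 3
    cover_c2_from_c1 = len(c1_hashes & all_c2) / len(all_c2)  # column 6
    cover_c1_from_c2 = len(c2_hashes & all_c1) / len(all_c1)  # column 7
    return (overlap_c1 + overlap_c2) / 2, cover_c2_from_c1, cover_c1_from_c2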
Table 8. Statistical comparison of hashes derived from a partition of our corpus into two halves
C1 and C2, where HA = hashes, PA = paths, TM = time, WK = weeks, SZ = size, BD = bottom-level
directory, CD = directory context, TD = top-level directory, EX = extension. Columns: (2) fraction
of values found for C1 also found for C2; (3) fraction of values found for C2 also found for C1;
(4) fraction of all hash values for C1 identified by the method using C1; (5) fraction of all hash
values for C2 identified by the method using C2; (6) fraction of all hash values for C2 identified
by the method using C1; (7) fraction of all hash values for C1 identified by the method using C2.

Method   (2)    (3)    (4)    (5)    (6)    (7)
HA       .717   .597   .438   .355   .344   .389
PA       .661   .458   .399   .291   .299   .312
TM       .529   .365   .668   .493   .430   .477
WK       .457   .234   .445   .303   .358   .323
SZ       .339   .246   .196   .187   .157   .145
BD       .583   .484   .474   .384   .350   .406
CD       .640   .486   .558   .447   .411   .467
TD       .553   .436   .627   .485   .419   .474
EX       .603   .439   .497   .373   .338   .397
We suggest then the following protocol for eliminating uninteresting files from a
corpus:
1. Run methods HA (hashes), PA (paths), WK (weeks), SZ (sizes), and BD (bottom-
level directories) to generate hash sets of candidate uninteresting files on the full
corpus to see trends.
2. Eliminate all files whose hash values are in NSRL, our list at digitalcorpora.org, and
any other confirmed “uninteresting” lists available.
3. On the remaining files, run methods TM (times), CD (directory context), TD (top-
level directories), and EX (extensions) to generate further hash sets of candidate
uninteresting files.
4. Find hash values that occur in at least two candidate hash sets, and eliminate files
from the corpus with those hash values.
5. Eliminate files without hash values in the remainder from the corpus that match on
two of the three criteria BD, TD, and EX, are directories, are small files, and have
strongly-uninteresting extensions.
6. Save the list of eliminated hash codes for bootstrapping with future drives in step 2.
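A simplified Python sketch of this protocol appears below; it assumes the per-method hash-set
builders from the earlier sketches (e.g. pa_hashes, tm_hashes) and omits the step-5 handling of
hashless files, so it is an outline rather than the full procedure.

from collections import Counter

def eliminate_uninteresting(records, nsrl_hashes, known_hashes):
    """Return surviving records and an updated uninteresting-hash set."""
    # Steps 1 and 3: build candidate hash sets with the nine methods
    # (only two builders shown; the others would be added here).
    candidate_sets = [pa_hashes(records), tm_hashes(records)]
    # Step 2: drop files already matched by NSRL or by prior runs.
    survivors = [r for r in records
                 if r["sha1"] not in nsrl_hashes
                 and r["sha1"] not in known_hashes]
    # Step 4: require at least two candidate sets to agree on a hash value.
    votes = Counter()
    for hash_set in candidate_sets:
        votes.update(hash_set)
    confirmed = {h for h, n in votes.items() if n >= 2}
    survivors = [r for r in survivors if r["sha1"] not in confirmed]
    # Step 6: save the confirmed hashes for bootstrapping future corpora.
    return survivors, confirmed | set(known_hashes)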
For the final run on our own corpus with additional new data, eliminating NSRL
hashes reduced the number of files by 25.1 %, eliminating hashes found by the nine
methods of this paper reduced the number by an additional 37.1 %, and eliminating
files by the remaining criteria reduced the number by an additional 21.6 %, resulting in
20.2 % of the original number of files. The disadvantage of this approach is that it
requires additional computation on a corpus before starting to investigate, not just
matching to hash lists; the advantage is that it finds considerably more files to
eliminate.
10 Conclusions
Acknowledgements. Riqui Schwamm assisted with the experiments, and Simson Garfinkel
provided the corpus. The views expressed are those of the author and do not represent those of
the U.S. Government.
References
1. Agrawal, N., Bolosky, W., Douceur, J., Lorch, J.: A five-year study of file-system metadata.
ACM Trans. Storage 3(3), 9:1–9:32 (2007)
2. Chawathe, S.: Fast fingerprinting for file-system forensics. In: Proceedings of the IEEE
Conference on Technologies for Homeland Security, pp. 585–590 (2012)
3. Garfinkel, S., Farrell, P., Roussev, V., Dinolt, G.: Bringing science to digital forensics with
standardized forensic corpora. Digit. Invest. 6, S2–S11 (2009)
4. Ke, H.-J., Wang, S.-J., Liu, J., Goyal, D.: Hash-algorithms output for digital evidence in
computer forensics. In: Proceedings of the International Conference on Broadband and
Wireless Computing, Communication and Applications (2011)
5. Kornblum, J.: Auditing hash sets: lessons learned from jurassic park. J. Digit. Forensic Pract.
2(3), 108–112 (2008)
6. Mead, S.: Unique file identification in the national software reference library. Digit. Invest.
3(3), 138–150 (2006)
7. Panse, F., Van Keulen, M., Ritter, N.: Indeterministic handling of uncertain decision in
deduplication. ACM J. Data Inf. Qual. 4(2), 9 (2013)
8. Pearson, S.: Digital Triage Forensics: Processing the Digital Crime Scene. Syngress, New
York (2010)
9. Pennington, A., Linwood, J., Bucy, J., Strunk, J., Ganger, G.: Storage-based intrusion
detection. ACM Trans. Inf. Syst. Secur. 13(4), 30 (2010)
10. Roussev, V.: Managing terabyte-scale investigations with similarity digests. In: Advances in
Digital Forensics VIII, IFIP Advances in Information and Communication Technology vol.
383, pp. 19–34. Pretoria SA (2012)
11. Rowe, N.: Testing the national software reference library. Digit. Invest. 9S, S131–S138
(2012). (Proc. Digital Forensics Research Workshop 2012, Washington, DC, August)
12. Rowe, N., Garfinkel, S.: Finding suspicious activity on computer systems. In: Proceedings of
the 11th European Conference on Information Warfare and Security. Laval, France (2012)
13. Ruback, M., Hoelz, B., Ralha, C.: A new approach to creating forensic hashsets. In:
Advances in Digital Forensics VIII, IFIP Advances in Information and Communication
Technology vol. 383, pp. 83–97. Pretoria SA (2012)
14. Tomazic, S., Pavlovic, V., Milovanovic, J., Sodnik, J., Kos, A., Stancin, S., Milutinovic, V.:
Fast file existence checking in archiving systems. ACM Trans. Storage 7(1), 2 (2011)