BIOINFORMATICS APPLICATIONS NOTE
Vol. 26 no. 17 2010, pages 2204–2207
doi:10.1093/bioinformatics/btq351
Data and text mining
Advance Access publication July 17, 2010
BigWig and BigBed: enabling browsing of large distributed
datasets
W. J. Kent, A. S. Zweig∗, G. Barber, A. S. Hinrichs and D. Karolchik
Center for Biomolecular Science and Engineering, School of Engineering, University of California, Santa Cruz (UCSC),
Santa Cruz, CA 95064, USA
Associate Editor: Jonathan Wren
Downloaded from http://bioinformatics.oxfordjournals.org/ at North Dakota State University on October 28, 2014
ABSTRACT
Summary: BigWig and BigBed files are compressed binary indexed files containing data at several resolutions that allow the high-performance display of next-generation sequencing experiment results in the UCSC Genome Browser. The visualization is implemented using a multi-layered software approach that takes advantage of specific capabilities of web-based protocols and Linux and UNIX operating systems files, R trees and various indexing and compression tricks. As a result, only the data needed to support the current browser view is transmitted rather than the entire file, enabling fast remote access to large distributed data sets.
Availability and implementation: Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/. Source code for the creation and visualization software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. The UCSC Genome Browser is available at http://genome.ucsc.edu
Contact: [email protected]
Supplementary information: Supplementary byte-level details of the BigWig and BigBed file formats are available at Bioinformatics online. For an in-depth description of UCSC data file formats and custom tracks, see http://genome.ucsc.edu/FAQ/FAQformat.html and http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html
Received on February 18, 2010; revised on June 10, 2010; accepted on June 28, 2010
∗ To whom correspondence should be addressed.
© The Author(s) 2010. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 INTRODUCTION
Recent improvements in sequencing technologies have made it possible for labs to generate terabyte-sized genomic data sets. Visualization of these data sets is a key to scientific interpretation. Typically, loading the data into a visualization tool such as the Genome Browser provided by the University of California, Santa Cruz (UCSC) (Kent et al., 2002; Rhead et al., 2010) has been difficult. The data can be loaded as a ‘custom annotation track’, but for very large data sets the upload form times out before the data transfer finishes. To work around this limitation, some labs with access to Solexa and later-generation sequencing machines have installed a local copy of the Genome Browser, but this requires a significant initial time investment by systems administrators and other informatics professionals, as well as continuing efforts to keep the data in the local browser installation current.
Though visualization of results is just one of the many informatics challenges of next-generation sequencing, it is one that we are well positioned to address at UCSC. We have developed two new data formats, BigWig and BigBed, that make it practical to view the results of next-generation sequencing experiments as tracks in the UCSC Genome Browser. The BigWig and BigBed files are compressed binary indexed files that contain the data at several resolutions. Rather than transmitting the entire file, only the data needed to support the current view in the Genome Browser are transmitted. Collectively, BigWig and BigBed are referred to as Big Binary Indexed (BBI) files.
2 SYSTEM AND METHODS
BigBed files are generated from Browser Extensible Data (BED) files. Like the BED format, the BigBed format is used for data tables with a varying number of fields. BED files consist of a simple text format: each line contains the fields for one record, separated by white space. The first three fields are required, and must contain the chromosome name, start position and end position. The standard BED format defines nine additional, optional fields, which (if present) must appear in the predefined order (Supplementary Table 1). Alternatively, BED files may depart from the standard format after the third field, continuing with fields specific to the application and data set. BigBed files that contain custom fields, unlike those of simple BED format, must also contain the field name and a sentence describing the custom field. To help others understand custom BED fields, an autoSql (.as) (Kent and Brumbaugh, 2002) declaration of the table format can be included in the BigBed file (Supplementary Table 2).
BigWig files are derived from text-formatted wiggle plot (wig) or bedGraph files. They associate a floating point number with each base in the genome, and can accommodate missing data points. In the UCSC Genome Browser, these files are used to create graphs in which the horizontal axis is the position along a chromosome and the vertical axis is the floating point data (Fig. 1). Typically, these graphs are represented by a wiggly line, hence the name ‘wiggle’. Three text formats can be used to describe wiggle data at varying levels of conciseness and flexibility. Values may be specified for every base or for regularly spaced fixed-sized windows using the ‘fixedStep’ format. The ‘variableStep’ format encodes fixed-sized windows that are variably spaced. The ‘bedGraph’ format encodes windows that are both variably sized and variably spaced.
Data files of fixedStep format are divided into sections, each of which starts with a line of the form:
fixedStep chrom=chrN start=position step=N span=N
where ‘chrom’ is the chromosome name, ‘start’ is the start position on the chromosome, ‘step’ is the number of bases between items and ‘span’ shows the number of bases covered by each item. Step and span default to 1 if they are not defined. This section line is followed by a line containing a single floating point number for each item in the section.
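The fixedStep layout can be made concrete with a short sketch. The Python below is only an illustration and is not taken from the UCSC C source: the function name parse_fixed_step and the output tuple layout are hypothetical. It assumes that fixedStep start positions are one-based and converts each item to a zero-based, half-open interval of the kind BED uses.

```python
def parse_fixed_step(lines):
    """Yield (chrom, start0, end0, value) for each item of a fixedStep text.

    Illustrative sketch only; the name and tuple layout are assumptions.
    start0/end0 are zero-based, half-open coordinates.
    """
    chrom = None
    pos = step = span = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Section line: key=value pairs after the format keyword.
            fields = dict(tok.split("=", 1) for tok in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])        # one-based start of the first item
            step = int(fields.get("step", 1))  # step defaults to 1
            span = int(fields.get("span", 1))  # span defaults to 1
        else:
            if chrom is None:
                raise ValueError("data line before any fixedStep section line")
            value = float(line)                # one floating point number per item
            yield (chrom, pos - 1, pos - 1 + span, value)
            pos += step


example = [
    "fixedStep chrom=chr1 start=11 step=5 span=2",
    "0.5",
    "1.5",
]
for row in parse_fixed_step(example):
    print(row)
```

Each data line advances the position by ‘step’, so consecutive items never need to repeat their coordinates; this is what makes fixedStep the most concise of the three text formats.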
Fig. 1. Genome Browser image of BigWig annotation tracks. The top track is displayed as a bar graph, the bottom track as a point graph. Shading is used to
distinguish the mean (dark), one standard deviation above the mean (medium) and the maximum (light). Peaks with clipped tops are colored magenta.
The variableStep format is similar, but the section starts with a line of the format:
variableStep chrom=chrN span=N
and each item line contains two fields: the chromosome start position, and the floating point value associated with each base.
The bedGraph format is a BED variant in which the fourth column is a floating point value that is associated with all the bases between the chromStart and chromEnd positions. Unlike the zero-based BED and bedGraph, for compatibility reasons the chromosome start positions in variableStep and fixedStep are one-based.
To create a BigBed or BigWig file, one first creates a text file in BED, fixedStep, variableStep or bedGraph format and then uses the bedToBigBed, wigToBigWig or bedGraphToBigWig command-line utility to convert the file to indexed binary format. In addition to the text file and (in the case of BigBed) the optional .as file, the conversion utilities require a chrom.sizes input file that describes the chromosome (or contig) sizes in a two-column format (chromosome name, chromosome size). The fetchChromSizes program may be used to obtain the chrom.sizes file for any genome hosted at UCSC. All of the command-line utilities can be run without options to display a usage summary.
The wigToBigWig program accepts fixedStep, variableStep or bedGraph input. The bedGraphToBigWig program accepts only bedGraph files, but has the advantage of using much less memory. The wigToBigWig program can take up to 1.5 times as much memory as the wig file it is encoding, while bedGraphToBigWig and bedToBigBed use only about one-quarter as much memory as the size of the input file.
Once a BigBed or BigWig file is created, it can be viewed in the UCSC Genome Browser by using the custom track mechanism (Supplementary Material). In brief the indexed file is put on a website accessible via HTTP, HTTPS or FTP, and a line describing the file type and data location in the form:
track type=bigBed bigDataUrl=http://srvr/myData.bb
is entered in the custom track section of the browser. Additional settings in var=value format can be used to control the name, color, and other attributes of the track. When the custom track is loaded and displayed, the Genome Browser fetches only the data it needs to display at the resolution appropriate for the size of the region being viewed. While it may take a few minutes to convert the input text file to the indexed format, once this is done there is no need to upload the entire file, and the response time on the browser is nearly as fast as if the file resided on the local UCSC server.
Because the BigWig and BigBed files are binary, we have created additional tools that parse the files and describe the contents. The bigWigSummary and bigBedSummary programs can quickly compute summaries of large sections of the files corresponding to zoomed-out views in the Genome Browser. The bigWigInfo and bigBedInfo can be used to quickly check the version numbers, compression status and data ranges stored in a file. The bigBedToBed, bigWigToWig and bigWigToBedGraph programs can convert all or just a portion of files back to text format.
3 IMPLEMENTATION
The BigBed and BigWig readers and writers are written in portable C; other programs that can interface with C libraries can make use of the code directly. For those working in languages that do not interface well with C, the Supplemental Information describes the file format in sufficient detail to reimplement it in another language. Several layers of software are involved in enabling the remote access of the BigBed and BigWig files. This section describes the software architecture, algorithms and data structures at a high level, and should be useful to anyone trying to understand the code enough to usefully modify it or to implement similar file formats that work well in a distributed data environment.
3.1 Data transfer layer
Though BigBed and BigWig can be used locally, the primary design goal for this format was to enable efficient remote access. This is done using existing web-based protocols that are generally already available at most sites. Unlike typical web use, bigBed and bigWig files require random access. At the lowest layer, we take advantage of the byte-range protocols of HTTP and HTTPS, and the protocols associated with resuming interrupted FTP transfers, to achieve random access to binary files over the web. Web servers supporting HTTP/1.1 accept byte-ranges when the data is non-volatile. OpenSSL provides SSL support for HTTPS via the BIO protocol. FTP uses the resume command and simply closes the connection when sufficient data has been read.
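As a rough illustration of the byte-range mechanism in the data transfer layer, the sketch below builds an HTTP request for a slice of a remote file using Python's standard urllib. It is not the UCSC C implementation. The helper name byte_range_request is hypothetical, the URL is the placeholder from the custom track example, and the offset and length are arbitrary. The request is only constructed, not sent.

```python
import urllib.request


def byte_range_request(url, offset, size):
    """Build (but do not send) an HTTP request for `size` bytes at `offset`.

    Hypothetical helper for illustration only.
    """
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    last = offset + size - 1  # HTTP byte ranges are inclusive at both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Placeholder URL and arbitrary offset/length, for illustration only.
req = byte_range_request("http://srvr/myData.bb", 8192, 8192)
print(req.get_header("Range"))  # bytes=8192-16383
```

Passing such a request to urllib.request.urlopen against a server that honors byte ranges would normally return only the requested slice with a 206 Partial Content status; a server that ignores the header returns the whole file, so a careful client checks the status before trusting the payload length.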
3.2 URL data cache layer
Since remote access is still slow compared to local access, and data files typically are viewed many times without changing, we implemented a cache layer on top of the data transfer layer. Data are fetched in blocks of 8 Kb, and each block is kept in a cache. The cache is implemented using two files for each file that is cached: a bitmap file that has a bit set for each file block in cache and a data file that contains the actual blocks of data. The data file is implemented very simply using the sparse file feature of Linux and most other UNIX-like operating systems. The cache software simply seeks to the position in the file where the block belongs and writes it. The operating system allocates disk space only for the parts of the file that are actually written.
The cache layer is critical to performance. Parts of the file, including the file header and the root block of the index, are accessed no matter what part of the genome is being viewed. These parts need be transmitted only once. In addition if multiple users view the same region of the genome, later users will benefit from the cache, as will a single user looking at the same region multiple times.
Though a cache can help convert remote access to local access, a minimum of one remote access (to check whether the file has changed at the remote site) is required even on a completely cached file. Minimizing the number of cache checks is one of the motivations for keeping the index and the zoomed data in the same file as the primary data. Even though a change check involves little in the way of data transfer, it does require a round trip on the network, which can take from 10 to 1000 ms depending on the network connectivity. For similar reasons, though data are always fetched at least one full block at a time, the system will combine multiple blocks into a single fetch operation whenever possible.
3.3 Indexing
The next layer handles the indexing. It is based on a single dimensional version of the R tree that is commonly used for indexing geographical data. The index size is typically less than 1% of the size of the data itself.
A BigBed file can contain overlapping intervals. Overlapping intervals are not as easy to index as strings, points or non-overlapping intervals, but several effective techniques do exist, including binning schemes (Kent et al., 2002), nested containment lists (Alekseyenko and Lee, 2007) and R trees (Guttman, 1984). R trees have several properties that make them attractive for this application. They perform well for data at a variety of scales in contrast to binning schemes that typically have a ‘sweet spot’ at a particular scale of data close to the smallest bin size. R trees also minimize the number of seeks (and hence network roundtrips) compared to nested containment lists, another popular genomics indexing scheme.
The basic idea behind an R tree is fairly simple. Each node of the tree can point to multiple child nodes. The area spanned by a child node is stored alongside the child pointer. The reader starts with the root node, and descends into all nodes that overlap the query window. Since most child nodes do not overlap, only a few branches of the tree need to be explored for a typical query.
Though a separate R tree for each chromosome would have been simpler to implement, we elected to use a single tree in which the comparison operator includes both the chromosome and the position. This allows better performance on roughly assembled genomes with hundreds or thousands of scaffolds, and also lets the files be applied to RNA as well as DNA databases. To improve the efficiency of the single R tree, we store the chromosome ID as an integer rather than a name, and include a B+ tree to associate chromosome names and IDs in the file. In the source code, the combined B+ tree and R tree index is referred to as a cirTree.
One additional indexing trick is used. Because the stored data are sorted by chromosome and start position, not every item in the file must be indexed; in fact by default only every 512th item is indexed. The software finds the closest indexed item preceding the query, and then scans through the data, discarding some of the initial items if necessary. This may seem wasteful, since hundreds of thousands of bytes may be transferred in the same time that it takes to seek to a new position on disk, but in practice little time is lost and as a benefit the index is less than 1% of the size of the data.
3.4 Compression
The data regions of the file (but not the index) are compressed using the same deflate techniques that are used in gzip as implemented in the zlib library, a very widespread, stable and fast library built into most Linux and UNIX installations. The compression would not be very efficient if each item was compressed separately, and it would not support random access if the entire data area were compressed all at once. Instead the regions between indexed items (containing 512 items by default) are individually compressed. This maintains the same degree of random accessibility that was enabled by the sparse R tree index while still achieving nearly the same level of compression as compressing the entire file would.
The final layer of software is responsible for fetching and decoding blocks specified by the index. It is only this final layer that differs between BigWig and BigBed.
4 RESULTS AND DISCUSSION
The BigBed and BigWig files succeed in overcoming browser upload timeout limits. By deferring the bulk of the data transfer to be on demand, the upload phase of BigWig and BigBed files now takes less than a second even on home and remote networks, well within the 300-s upload time limit at UCSC. The on-demand connectivity requirements are modest, adding 0.5–1.0 s of data transfer time overhead depending on where the Big file is hosted (Supplementary Table 3).
BigBed and BigWig files are similar in many ways to BAM files (Li et al., 2009), which are commonly used to store mappings of short reads to the genome. BAM files are also binary, compressed, indexed versions of an existing text format, SAM. The samtools C library associated with SAM and BAM (http://samtools.sourceforge.net/) caches the BAM index, though not the data files. Samtools also can fetch data from the internet via FTP and HTTP, but not HTTPS. BAM files are not designed for wiggle graphs, and are more complex than BED files, but they do store alignment, sequence and sequence quality information very efficiently. While this capability theoretically could be added as an extension to BigBed, we have adopted BAM for short read mapping to avoid a proliferation of formats. BAM files are supported as custom tracks at UCSC, and we have added HTTPS support to BAM using the data transfer and data cache layers developed for BigBed and BigWig.
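For readers who want to see how the sparse index (Section 3.3) and the per-block compression (Section 3.4) fit together, the following Python sketch may help. It is a deliberately simplified stand-in, not the cirTree layout of real BBI files. It keeps a flat list of block spans instead of a tree, scans that list linearly instead of descending from a root node, and serializes items as JSON rather than in the binary encoding described in the Supplementary Information. The names build_blocks and query are hypothetical.

```python
import json
import zlib


def build_blocks(items, items_per_block=512):
    """Split sorted (chrom_id, start, end, value) items into compressed blocks.

    Returns (index, blobs): index[i] is the (low_key, high_key) span of
    blobs[i], where keys are (chrom_id, position) tuples, so one index can
    cover all chromosomes.
    """
    index, blobs = [], []
    for i in range(0, len(items), items_per_block):
        chunk = items[i:i + items_per_block]
        low = (chunk[0][0], chunk[0][1])              # items are sorted by (chrom_id, start)
        high = max((c, e) for c, _s, e, _v in chunk)  # furthest (chrom_id, end) in the block
        index.append((low, high))
        # Each block is compressed on its own so it can be fetched independently.
        blobs.append(zlib.compress(json.dumps(chunk).encode("utf-8")))
    return index, blobs


def query(index, blobs, chrom_id, start, end):
    """Return items overlapping [start, end) on chrom_id.

    Only blocks whose span overlaps the query are decompressed; items in a
    candidate block that do not overlap are discarded during the scan.
    """
    hits = []
    for (low, high), blob in zip(index, blobs):
        if low < (chrom_id, end) and high > (chrom_id, start):
            for c, s, e, v in json.loads(zlib.decompress(blob)):
                if c == chrom_id and s < end and e > start:
                    hits.append((c, s, e, v))
    return hits


items = [(0, 0, 10, 1.0), (0, 5, 50, 2.0), (0, 20, 30, 3.0),
         (1, 0, 5, 4.0), (1, 100, 200, 5.0)]
index, blobs = build_blocks(items, items_per_block=2)
print(query(index, blobs, 0, 25, 40))
```

Storing the furthest end position of each block, rather than only its first start, is what lets the sketch cope with overlapping intervals: a long item in an early block is still found by a query that starts well after that block's first item.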
BigBed and BigWig files have been in use at genome.ucsc.edu since June 2009, and have proven to be popular. As of February 2010, we have displayed data from nearly 1300 files using these formats. The broader bioinformatics community has started to support these files as well, with Perl bindings available at http://search.cpan.org/~lds/Bio-BigFile/ and a Java implementation in progress (Martin Deacutis, personal communication) for use in the Integrative Genome Viewer (http://www.broadinstitute.org/igv/).
Though the use of BigBed and BigWig requires access to the command line creation tools needed to create the files and a website or FTP site on which to place them, this is not an undue burden in the context of the informatics demands of a modern sequencing pipeline, and is clearly preferable to the long and uncertain uploads of large custom tracks in text formats.
ACKNOWLEDGEMENTS
We would like to acknowledge James Taylor, Heng Li and Martin Deacutis for their testing and feedback on these formats, and Lincoln Stein for developing the Perl bindings.
Funding: This work was supported by the National Human Genome Research Institute (5P41HG002371-09, 5U41HG004568-02). The open access charge was funded by the Howard Hughes Medical Institute.
Conflict of Interest: none declared.
REFERENCES
Alekseyenko,A.V. and Lee,C.J. (2007) Nested containment list (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases. Bioinformatics, 23, 1386–1393.
Guttman,A. (1984) R-Trees: a dynamic index structure for spatial searching. In Proceedings of 1984 ACM SIGMOD International Conference on Management of Data, pp. 47–57.
Kent,W.J. and Brumbaugh,H. (2002) autoSql and autoXml: code generators from the Genome Project. Linux J., 99, 68–77.
Kent,W.J. et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996–1006.
Li,H. et al. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map (SAM) Format and SAMtools. Bioinformatics, 25, 2078–2079.
Rhead,B. et al. (2010) The UCSC Genome Browser database: update 2010. Nucleic Acids Res., 38, D613–D619.