nrg2814 1
[Figure 1: flow diagram. Recoverable panel labels: "Identify smaller-scale repeated blocks using statistical models"; step 5, "Further analysis: Building regulatory networks".]

Figure 1 | Annotation process for non-coding regions: an overview. The annotation process includes two parallel pipelines for comparative sequence analysis (comparative analysis) and functional genomics analysis (functional analysis) of experimental data. Comparative analysis includes analysis of repeated sequences in the reference human genome, structural variation across the human population and sequence elements conserved across multiple species. The annotation process for functional genomics data involves smoothing the raw signal (step 1), thresholding and segmentation of the smoothed signal (step 2), clustering of discrete segments (step 3), functional annotation of the clusters (step 4) and connecting clusters into networks (step 5).
Nature Reviews | Genetics

Structural variants
Chromosomal rearrangements (deletions, duplications, novel sequence insertions or inversions) that are inherited and polymorphic across the human population. Structural variants are by definition longer than SNPs and can be hundreds of thousands of base pairs long.

Copy-number variants
Structural variants that arise from deletion or duplication and thus lead to a change in copy number of the underlying region of the genome.

Segmental duplication
The operational definition of a segmental duplication rests on finding two regions in the same genome ranging in length from a thousand to several million nucleotides with at least 90% sequence identity. Segmental duplications are inherited but not necessarily polymorphic across the human population.

Pseudogenes
Copies of protein-coding genes with mutations that disrupt their coding sequence and demolish their original protein-coding function.

Syntenic blocks
Segments that align between genome sequences from two species and that are believed to define an orthologous relationship.

Here, we describe how a similar process can be applied to genome annotation: we can do large-scale similarity comparisons on the genome sequence, taking note of repeated regions at different scales, and then look for function in the genome by mapping the 'read-out' from experiments onto sequence elements. The process for annotating the human genome can be separated into the two broad categories of comparative sequence analysis (comparative analysis) and functional genomics analysis (functional analysis), which correspond to analysing DNA sequences and analysing the output from functional genomics experiments, respectively (Fig. 1). In this Review we will focus mainly on functional analysis and will provide only a brief overview of comparative analysis simply as a framework for showing how it can be integrated with functional analysis.

Comparative analysis
The field of DNA sequence analysis is in the middle of a paradigm shift caused by the exponential reduction in the cost of obtaining genome sequence data. The traditional scope of comparative genomics is the comparison of reference genome sequences from different species; however, the recent explosion in sequencing has made it possible to sequence populations of a species and, in cancer genomics, the genomes of normal and diseased cells within an individual. Similar concepts and sequence analysis tools apply whether one is comparing one human genome sequence to itself, to that of another human or to that of another species. Here, we use 'comparative analysis' to encompass all of these activities, as a more specific term is lacking in the field.

Repeated sequences that can be identified in the reference human genome include segmental duplications (also known as low-copy-number repeats (LCRs)), simple and tandem repeats, transposons and pseudogenes. We emphasize that here the term 'repeated sequence' refers to a wider set of elements than the term 'repeat element', which typically refers to a short, highly repetitive sequence. Structural variants are revealed by comparing genome sequences across the human population, and conserved NCEs, large syntenic blocks and orthologous genes are revealed by comparing the human genome to those of other species (Box 1; Table 1). There are two main methods for discovering repeated sequences: first, scanning for sequence similarity, which involves grouping together sequences that fall above a minimum threshold of sequence conservation over some length scale; and second, model-based discovery, in which curated sets of known elements form the basis of statistical models which are then used to scan the genome for additional elements that fit the model25. Note that these two approaches can be used to compare the sequences of different organisms (in the traditional sense of comparative genomics), to compare the sequences of organisms within a population (in the sense of personal genomics) or to compare sequences within an individual (when comparing cancerous to normal cells).
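As an illustration of the second, model-based approach named above, the following minimal Python sketch builds a statistical model from a curated set of known elements and then scans a sequence for further elements that fit it. The model here is a simple log-odds position weight matrix, which is one common choice and not necessarily the model used in the cited work. The function names, the pseudocount, the uniform background frequency, the score cut-off and the toy sequences are all assumptions made for illustration.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pwm

def scan(sequence, pwm, min_score):
    """Return (start, score) for every window whose score is at least min_score."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= min_score:
            hits.append((start, score))
    return hits

# Hypothetical curated sites and a toy sequence (not real data).
known_sites = ["TTCCGGAA", "TTCCAGAA", "TTCCGGAA", "TTCTGGAA"]
pwm = build_pwm(known_sites)
toy_sequence = "ACGTTTCCGGAATGCATTCCTGAAGG"
hits = scan(toy_sequence, pwm, min_score=6.0)
for start, score in hits:
    print(start, toy_sequence[start:start + len(pwm)], round(score, 2))
```

The second hit in the toy sequence is not identical to any of the curated sites, which is the point of a model-based scan: it can recover diverged members that exact matching would miss, at the cost of having to choose a score cut-off.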
Non-coding elements are found at all scales within the genome: short repeats, regulatory factor binding regions and small RNAs at small scales; broad histone marks, transcripts, transposable elements and pseudogenes at medium scale; and regulatory forests and deserts, segmental duplications and structural variants at larger scales (right to left in the figure). Insertions and deletions in the human genome range in scale from SNPs to chromosome-scale abnormalities.

Simple and tandem repeats
Simple repeats are duplications 1–5 bp in length that are probably generated by polymerase slippage errors100. Tandem repeats are 100–200 bp duplications and are often found at the centromeres and telomeres of chromosomes, where they have a structural role101; variation in their number in gene promoters can affect nucleosome positioning and gene expression102.

Transposable elements
Transposable elements are divided into DNA-based transposons and RNA-based retrotransposons. Some are still active in genomes today, whereas others have become inactive103. Long interspersed elements (LINEs) are retrotransposons that themselves encode reverse transcriptase. Active LINEs increase genome size by copying themselves into new locations. Short interspersed repeats (SINEs) like the human Alu element are fragments of RNA polymerase III-transcribed genes that rely on LINE elements for propagation. Long terminal repeat (LTR) retrotransposons are flanked on both ends by direct LTRs. They become inactive when homologous recombination between the LTRs deletes the intervening gag and pol genes8,9.

Pseudogenes
Several categories of pseudogenes have been annotated, including duplicated pseudogenes, processed pseudogenes and unitary pseudogenes73,104.

Segmental duplications
About 45% of human segmental duplications occur in tandem runs spaced less than 1 Mb apart on the same chromosome71.

Structural variants
These can be generated by insertion, deletion, reciprocal translocation or inversion17,18. Duplications and deletions cause copy-number variation across the population.

Conserved and ultraconserved non-coding elements
Comparative genomics has found non-coding elements (NCEs) that are conserved to varying degrees across mammalian or vertebrate genomes, which suggests some function conserved by natural selection94,12,13. Lengths of conserved NCEs range from one to thousands of base pairs. Lack of function for some ultraconserved elements casts doubt on the assumption that sequence conservation implies function11,12,78.

Functional non-coding RNAs
Recent years have seen a revolution in our understanding of the role of small regulatory RNAs10; new classes continue to be discovered. MicroRNAs (miRNAs) are 22 nucleotide (nt) RNAs that bind predominantly to the 3′ UTRs of mRNA, causing gene silencing105–107. Small interfering RNAs (siRNAs) are 21 nt long and also function in the degradation of complementary mRNAs10. Piwi-interacting RNAs (piRNAs) are 27–28 nt RNAs that repress transcription of transposons in the germ line of fruitflies108 and vertebrates109. Large intergenic non-coding RNAs (lincRNAs) are spliced like protein-coding genes that function in several central cellular processes81,82. Small RNAs, such as small nucleolar RNAs (snoRNAs), generated by RNA polymerases I and III help to synthesize the translational apparatus and make up 90% of the RNA in the cell.

Regulatory elements
The human genome contains 1,700–1,900 transcription factors110. Binding sites of some 100 transcription factors have been characterized at genome scale by chromatin immunoprecipitation followed by microarray (ChIP–chip)20,33,34 or by sequencing (ChIP–seq)36,37. Classes of regulatory elements to which transcription factors bind include promoters, enhancers, silencers, insulators and locus-control regions (LCRs)20,111. Promoters are regulatory sites that alter the expression of the nearest gene, whereas the other elements act on more distant genomic locations.
DNA-based transposons
Transposable DNA elements that rely on a transposase enzyme to excise themselves from one region of the genome and insert themselves into a different region, without increasing in copy number.

Scanning for sequence similarity. Looking for regions of sequence similarity consists of grouping sequences that share some minimum sequence similarity over a specified minimum length. The main problem with this approach is that it invariably leaves out related regions that have degraded over time, so their similarity is below the threshold. Moreover, the thresholds chosen to group elements together often have no connection to evolutionary history and the underlying mechanisms of formation. For example, the operational definition of segmental duplications (Box 1; Table 1) excludes ancient duplications that were formed by the same mechanisms long ago but that have since degraded below 90% sequence identity.
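The threshold problem described above can be made concrete with a small Python sketch. It assumes the two sequences are already aligned without gaps and have equal length, which real similarity scans do not require; the function names and the toy sequences are hypothetical, and the minimum length is shrunk to 20 nucleotides so that the example stays readable.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def grouped_as_duplicates(a, b, min_identity=90.0, min_length=1000):
    """Apply an operational rule: long enough and similar enough."""
    return len(a) >= min_length and percent_identity(a, b) >= min_identity

# Toy sequences (assumed for illustration): one recent and one ancient copy.
reference = "ACGTTGCAAGCTTAGGCTAA"
recent_copy = "ACGTTACAAGCTTAGGCTAA"   # 1 substitution in 20 nt
ancient_copy = "ACATTGCCAGCTGAGGCCAA"  # 4 substitutions in 20 nt

for name, copy in [("recent", recent_copy), ("ancient", ancient_copy)]:
    identity = percent_identity(reference, copy)
    kept = grouped_as_duplicates(reference, copy, min_identity=90.0, min_length=20)
    print(name, identity, kept)
```

Both copies arose by the same hypothetical duplication mechanism, but only the recent one passes the 90% cut-off; the ancient copy is silently dropped, which is the limitation the text describes.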
Table 1 | Length, number and genome coverage of a representative collection of non-coding features

Classification           Property          Avg. length (nt)  Longest (nt)  Number of items  Genome coverage (Mb)  Genome coverage (%)

From comparative analysis
Short and tandem repeats Simple repeat     63                2,961         415,917          26.1                  0.84
                         Satellite         1,444             160,602       8,997            13.0                  0.42
                         Low complexity    46                2,023         370,102          17.0                  0.55
DNA transposons                            215               3,625         459,524          98.6                  3.17
Retrotransposons         LINEs             426               8,505         1,490,241        634.6                 20.4
                         Alu SINE element  261               614           1,186,885        309.7                 9.97
Pseudogenes              Duplicated        6,607             181,882       2,413            15.9                  0.51
                         Processed         723               15,732        8,303            6.0                   0.19
Segmental duplications                     5,740             630 kb        26,469           151.9                 4.89
Structural variants                        8,761             3.3 Mb        96,874           848.8                 27.3

From functional analysis
Punctate binding sites   STAT1             446               9,079         ~2,300           1.0                   0.03
                         CTCF              1,181             79,200        ~35,000          41.4                  1.33
                         H3K4me3           1,759             71,025        ~62,000          110.2                 3.55
Broad binding sites      H3K36me3          4,518             380,076       ~130,000         589                   19.0
MicroRNA                                   89                150           718              0.063                 0.00
TARs                                       72                1,854         644,200          46.7                  1.50
Regulatory forests                         3,890             35,165        68,900           268                   8.62
Regulatory deserts                         27,107            203,691       72,500           1,970                 63.4

Pseudogene counts are taken from build 53 at Pseudogene.org29. MicroRNA counts are from miRBase121. Counts of structural variants are from the Database of Genomic Variants122. Data on transcriptionally active regions (TARs) and regulatory forests and deserts are extrapolated to whole-genome scale from the 1% of the genome covered by the ENCODE pilot project20. The extrapolation is biased by the high fraction of genic regions in the ENCODE pilot regions. All other data were collected from the University of California-Santa Cruz (UCSC) Table Browser45 using the March 2006 build of the human genome (UCSC hg18, NCBI build 36). CTCF, CCCTC-binding factor; H3K4me3, histone 3 lysine 4 trimethylation; H3K36me3, histone 3 lysine 36 trimethylation; LINE, long interspersed element; SINE, short interspersed element; STAT1, signal transducer and activator of transcription 1; TAR, transcriptionally active region.

RNA-based retrotransposons
Transposable elements generated when reverse transcriptase enzymes copy RNA elements into DNA and insert the DNA copies back into the genome.

Duplicated pseudogenes
Pseudogenes that result from whole-genome or segmental duplications, in which one copy maintains its ancestral function and the other copy degrades into a pseudogene.

Processed pseudogenes
Pseudogenes that arise when the mRNA of a parent gene is retrotranscribed back into DNA and inserted into the genome.

Unitary pseudogenes
A rare class of pseudogene in which a single-copy parent gene becomes non-functional.
Model-based discovery of non-coding elements. Some classes of element can be identified by using more sensitive, model-based comparison techniques. In particular, in situations in which more detailed information about the structure or mechanism of formation of a specific element is available, we can use it to discover more diverged class members25. We can also search for elements based on their tendency to fold into stable structures26.

Transposable elements and pseudogenes are examples of non-coding sequences that can be identified by using models based on the descent of these sequences from protein-coding elements. For instance, the same powerful tools used to identify protein-coding genes can be used to identify active transposable elements that still code for (retro)transposase enzymes. Inactive transposable elements can be identified by their similarity to active transposable element profiles and by the stereotypical structure of short repeats at their margins that mark excision scars. Likewise, protein sequence similarity to parent genes is the main feature used to detect pseudogenes27–29 and is a much more sensitive indicator than raw nucleotide identity.

Where previous work has identified a set of genes that are all regulated by the same regulatory factor, statistical tools, such as Gibbs sampling30, can reveal subtle motifs that are common to the promoter and enhancer regions of all of the genes to which the regulatory factor binds. Scanning the genome with a model of such a sequence motif can identify a more complete set of binding regions.

This brief overview of comparative analysis is intended simply to provide a context for the following functional analysis section. Readers who wish more detail are referred to several excellent reviews30–32. In the remainder of this Review, we focus on functional analysis and its integration with comparative analysis.

Functional analysis
In functional genomics, experimental techniques that characterize the biological role of genetic sequences are expanded to generate data at genome scale in a high-throughput way. For instance, chromatin immunoprecipitation followed by microarray (ChIP–chip)33–35 or by sequencing (ChIP–seq)36–38 can be used to identify regulatory-factor-binding regions (RFBRs), and transcription tiling arrays39,40 and RNA sequencing (RNA–seq)41–44 can be used to identify transcriptionally active regions (TARs). Here, we give an overview of a standardized signal processing approach to analysing such functional genomics data sets. We do not review the long history of annotation of NCEs on an element-by-element basis (Box 1; Table 1).

Chromatin immunoprecipitation
(ChIP.) A technique for identifying potential regulatory sequences that are bound by the protein of interest. Soluble DNA–chromatin extracts (complexes of DNA and protein) are isolated by using antibodies that recognize specific DNA-binding proteins. In ChIP–chip, the ChIP step is followed by microarray analysis, whereas in ChIP–seq, it is followed by sequencing.

Tiling arrays
A class of microarray in which probes of a specific length and spacing provide uniform coverage of an entire genome or portion of a genome to a desired resolution.

RNA sequencing
The use of high-throughput sequencing of RNA that has been reverse-transcribed into DNA to characterize the set of RNA transcripts produced by a cell.

[Figure 2: signal tracks. Recoverable panel labels: a, ChIP–seq and ChIP–chip tracks; b, tiling array signal over XIAP and STAG2, with called TARs at a relaxed threshold and at a stringent threshold.]

Figure 2 | Signal resolution and signal thresholding. a | Comparison of signal tracks obtained from chromatin immunoprecipitation followed by sequencing (ChIP–seq) and ChIP followed by microarray (ChIP–chip). The example shown focuses on the binding of the transcription factor signal transducer and activator of transcription 1 (STAT1) to the promoters of genes in the interleukin receptor cluster on chromosome 21. It is clear that following ChIP with short read DNA sequencing (ChIP–seq, top) generates a much cleaner signal than using a microarray (ChIP–chip, bottom). The ChIP–seq track clearly identifies three STAT1 binding sites, whereas the ChIP–chip track requires a more complex thresholding step. There is negative signal (red) in the ChIP–chip track because the microarray signal is a ratio of STAT1 binding compared to a control state. Positive binding signals of the same magnitude should be treated as noise. b | Issues with signal thresholding. The genes stromal antigen 2 (STAG2) and X-linked inhibitor of apoptosis (XIAP) have different levels of exonic, intronic and intergenic transcription signals in a tiling array signal track. If the global threshold to differentiate signal from noise is set high (stringent threshold), exons and introns in highly expressed genes (here STAG2) will be correctly segregated, but even exons of weakly expressed genes (here XIAP) will not be flagged as expressed. Conversely, if the threshold is set low enough (relaxed threshold) to differentiate exons from introns in weakly expressed genes (XIAP), then both introns and exons of highly expressed genes (STAG2) will be flagged. These difficulties in thresholding can lead to intronic RNA from precursor mRNA being flagged as expressed transcriptionally active regions (TARs). IFNAR, interferon (alpha, beta and omega) receptor; IL10RB, interleukin 10 receptor, beta.

Smoothing
The process of filtering noise from a signal by removing fine-scale variation.

Thresholding
The process of discretizing a continuous signal by choosing a signal value above which the signal is considered 'on' or 'active' and below which the signal is considered 'off' or 'inactive'.

Segmenting
The result of thresholding in signal processing; that is, segments are those regions defined as 'on' or 'active' after discretization of the signal.

A useful way to conceptualize the analysis of a generic functional genomics experiment is with a signal-processing paradigm. Each experiment generates a raw signal of some kind across the genome that can be analysed by smoothing it and then thresholding and segmenting it into discrete units of initial annotation. In practical terms, the ubiquity of this paradigm is apparent from the fact that the University of California-Santa Cruz (UCSC) Genome Browser, a major clearing-house for genomic information45, treats each experiment as a separate 'signal track'. A signal track usually represents a continuous-valued number across the genome, which we can transform into a set of discrete genomic regions, or 'hits', represented in another track. First, we explain a signal-processing pipeline that transforms raw signal tracks into processed annotation tracks. Later, we highlight how integrative analysis of multiple tracks can lead to larger, derived annotations.

Primary data processing: smoothing the raw signal. The raw signal of a functional genomics experiment gives the read-out of transcription, protein binding or some other biological process at discrete points in the genome. Depending on the technology used, signals are mapped to the reference genome with different resolutions. High-throughput sequencing generates alignments at base-pair resolution, whereas tiling arrays provide resolutions from 5 to 50 bp, depending on probe density39,40 (Fig. 2a). The output is a noisy signal consisting of many piled-up sequence reads or probe values. From this noisy data, the goal is to determine where a transcription factor actually binds36,37, where a particular DNA46 or histone47–49 modification is being made, or what sequence is being transcribed39–42.

Numerous technical issues that influence signal quality need to be addressed. In particular, because arrays rely on hybridization to measure the amount of target DNA present, the signal obtained for each oligonucleotide probe is modulated by its sequence composition. Probes with greater GC content, for example, show higher signal50. Another issue is cross-hybridization, in which regions of the genome with similar sequences bind to multiple probes on the array. Cross-hybridization often gives rise to spikes in the signal, causing problems for measuring the expression of multi-gene families and pseudogenes with tiling arrays. Sequencing technologies do not suffer from cross-hybridization; however, analogous problems occur because short sequence reads can misalign to an incorrect location in the genome owing to sequencing and mapping errors51–54. In general, correct read mapping is one of the main technical challenges in next-generation sequencing (Box 2).

Thresholding and segmenting to generate small initial annotations. After smoothing, it is necessary to set a threshold to differentiate regions with and without signal. Thresholding issues have been most thoroughly explored for ChIP–seq experiments, which we focus on here. We expect that approaches for thresholding RNA–seq signals will evolve along similar lines.

To correctly construct local thresholds, it is important to model or simulate an appropriate null process for the background55,56. Because the background signal can be noisy (Fig. 2b), naive methods of thresholding using the assumption of a uniform background are not successful.

In ChIP–seq experiments, the signal from 'input DNA' is often used as the background. This signal is generated by sequencing genomic DNA without any enrichment step. The most commonly accepted explanation for the non-uniformity of this control signal is that it reflects the chromatin state of the genome. Regions of open chromatin are more likely to shear and generate DNA fragments of an appropriate size to pass a sizing filter and be captured by sequencing57. Using the input DNA signal as background also accounts for the differential 'mappability' of regions of the genome: some regions, most obviously repeats, are underrepresented in the output of the experiment because they are less likely to produce reads that can be mapped uniquely back to the reference genome.

The initial output from thresholding and segmenting an experimental signal is a number of small annotation blocks that are represented as a discrete 'feature' track45. The next step is to assign biological meaning to the blocks. The experimental read-out is interpreted differently depending on whether the experiment involves transcription or immunoprecipitation.

Interpreting the initial annotations: transcriptionally active regions. The result of thresholding RNA–seq or tiling array signals58,59 is a set of TARs (also known as transcription fragments ('transfrags')). Although most TARs stem from protein-coding genes, they can also mark non-coding RNAs. An unexpected result from the ENCODE pilot project was the discovery of pervasive transcription, that is, large numbers of novel TARs in unannotated portions of the genome20,60. There is much debate over whether these and other unannotated transcripts are functional or simply the result of cross-hybridization or transcriptional noise61–63. Although the fraction of transcribed RNA sequences that map to intergenic and intronic regions is fairly low (~5–10%), that set of TARs covers a relatively large fraction of nucleotides in the genome. This finding is consistent with the fact that annotated genic regions are transcribed at higher levels. Moreover, even though a large fraction of the human genome is transcribed as primary transcripts, which include introns, it remains a challenge to distinguish novel processed RNA products from remnants of primary transcription that can be associated with known genes63.

Interpreting the initial annotations: regulatory factor binding. Segmentation of ChIP–chip or ChIP–seq signals generates RFBRs64. This awkward abbreviation was chosen by the ENCODE consortium to refer to both transcription factor binding and histone modification experiments, as both are important for genome annotation. (Other abbreviations considered include CHIRP (chip hit of regulatory potential) and EIGR (experimentally identified genomic region).) Here, we use the simpler term 'binding sites' for RFBRs and highlight several issues with transforming a set of raw binding sites into more developed annotations.

First, binding sites can be divided into two major classes: punctate and broad. For example, some histone modifications cluster in sharp peaks around transcript promoter regions (punctate binding sites), whereas others mark the entire transcribed region with a broad peak (broad binding sites)49. Although punctate sites have, for a long time, been identified by scoring algorithms, methods for identifying signals across broad regions of the genome have been less thoroughly developed65,66.

Second, binding sites differ in the degree to which they have a clear sequence motif connecting them to their associated transcription factor. A transcription factor with a weak motif may be present at very high concentration in a given tissue, binding more promiscuously and activating more genes than in another tissue in which it is present at lower concentration67. Even for transcription factors with a strong motif, chromatin accessibility often modulates binding in a cell-type-specific manner47,48. A strong binding event may require both open chromatin and a matching motif. A complete picture of transcription factor binding thus needs to incorporate both sequence information about transcription-factor-binding-site motifs and functional genomic information about chromatin state and transcription factor expression level.
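To make the signal-processing steps described earlier in this section concrete (smoothing, thresholding against a background, and segmenting), here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular peak caller: the moving-average smoother, the fold-over-background rule, the function names and the toy numbers are all hypothetical. The `control` list stands in for an input-DNA track, and its elevated stretch stands in for a region of open chromatin.

```python
def smooth(signal, half_width):
    """Moving-average smoothing; windows are truncated at the ends of the signal."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed

def threshold(signal, background, fold):
    """Mark positions whose signal exceeds `fold` times the local background."""
    return [s > fold * b for s, b in zip(signal, background)]

def segment(mask):
    """Convert a boolean mask into half-open (start, end) intervals of 'on' positions."""
    segments, start = [], None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments

# Toy per-position read-out (assumed values) with two apparent peaks.
raw = [0, 1, 0, 0, 8, 9, 10, 9, 8, 0, 1, 0, 0, 0, 6, 7, 6, 0, 0, 0]
# Toy control track: background is elevated under the second peak.
control = [1.0] * 13 + [4.0] * 5 + [1.0] * 2
uniform = [1.0] * len(raw)

smoothed = smooth(raw, half_width=1)
calls_uniform = segment(threshold(smoothed, uniform, fold=2.0))
calls_local = segment(threshold(smoothed, control, fold=2.0))
print("uniform background:", calls_uniform)
print("local background:  ", calls_local)
```

With a uniform background both peaks are called, whereas the local control explains the second peak away. This is the behaviour the text attributes to using the input-DNA signal as background.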
One difficulty connected with determining transcrip- groups of co-regulated TARs, one can build a matrix in
tion factor binding motifs is the loss of information on which the rows correspond to different cell lines or tis-
binding specificity owing to crosslinking. Transcription sues with transcriptional information and the columns
initiation complexes often consist of a DNA element correspond to transcribed regions (including exons) that
bound by multiple interacting transcription factors, have been identified in any of the experiments. We can
some bound to distal enhancer regions that are adja- compute correlations of the column vectors of expression
cent in three-dimensional (3D) space because of chro- signals between novel TARs and nearby known exons.
mosomal looping. Thus, immunoprecipitation of one Novel TARs co-expressed with exons of neighbouring
transcription factor in such a complex may elute DNA genes are likely to be part of the same larger transcrip-
to which it binds only indirectly. The resulting set of tional unit 70. In addition, novel TARs distant from any
target sequences will identify poorly the sequence motif known gene can be clustered into groups with strongly
to which the transcription factor actually binds68. correlated expression signals, which can help in piecing
Third, determining the relationship between binding together larger non-coding transcript structures. (This
sites and their target genes is crucial to gaining a global operation can be compared with the connectivity pro-
picture of transcriptional regulation, including epigenetic vided by paired-end reads, which are described below.)
mechanisms. moreover, this information is the starting Clustering at a higher level naturally gives rise to
point in building regulatory networks that connect tran- networks of transcripts that are co-expressed across cell
scription factors with their targets. In compact genomes, lines or other conditions. The same kind of column clus-
such as those of yeast and C. elegans, associating binding tering applied to transcription factor binding sites, for
regions with downstream targets is fairly straightforward. example, would form a network of co-regulated target
However, in the vast expanse of the human genome, this genes (fIG. 3Ab).
determination is less straightforward. Sites that are thou-
sands of bases apart are often brought into proximity by Co-clustering approaches: biplot. The next step is to
complex chromatin structures, including looping. examine simultaneous clustering of columns and rows.
For example, we can recognize pairs of factors that often
Integrating information bind together by the high correlation of their row vec-
The types of data presented above can be displayed as a tors in the genomic matrix. Computing the correlation
single track in a genome browser. This track can show of each factor against all others gives rise to another
either a continuous signal across the genome or a set of matrix, called the correlation matrix (fIG. 3Ac). likewise,
discrete ‘hit’ regions, such as binding sites. The next step we can also cluster regions of the genome (matrix col-
is to group the information from a single track or from umns) together based on which factors bind them,
multiple tracks into larger annotation structures, such which generates a second correlation matrix for regions
as entire transcripts, that have more biological meaning. (fIG. 3Ab). Principal components analyses of these two cor-
Eventually, multiple classes of functional elements that relation matrices give rise to ‘eigen-factors’ (which repre-
are not proximally located on the genome can be wired sent the typical behaviour of factors across the genome)
together into networks. and ‘eigen-regions’ (which represent the typical modes
of binding of many factors across the genome).
Grouping small annotation units into larger structures with a genomic matrix. Track integration begins by generating a ‘genomic matrix’ in which each row corresponds to a different experiment and each column to a different genomic region (Fig. 3). Then each matrix cell represents the aggregated read-out of a particular experiment within a specific region: for example, the average transcriptional signal within a specific 1 kb region of the genome in the HeLa cell line or the number of nucleotides in that region bound by a specific transcription factor. The genome can be decomposed naturally into regions of different scales (Box 1; Table 1); correspondingly, different matrices can bin genomic regions at different resolutions.
Simple statistical operations on genomic matrices can then provide useful information. In particular, for a set of tracks of experimental features binned at a fine resolution (for example, factor binding sites collected in small 1 kb bins), one can find larger blocks (for example, 150 kb) that are statistically enriched or depleted in these features compared with a randomized null distribution. Enriched and depleted regions have been termed ‘regulatory forests’ and ‘regulatory deserts’69 (Fig. 3Aa).
Correlated genomic regions can be identified by clustering columns of the matrix70. For example, to identify
Given a data matrix consisting of the number of times each factor binds each region, we can cluster regions with regions and factors with factors, and cross-correlate regions with factors. All three of these types of linkage can be visualized in a biplot69, which shows region and factor clustering simultaneously (Fig. 3Ad). Effectively, the biplot performs principal components analysis on each of the correlation matrices and shows the natural interrelationship of eigen-factors and eigen-regions.

Aggregation and saturation plots. Another type of analysis that can be done with the genomic matrix is the saturation plot. Here, one looks at the cumulative fraction of the genome ‘covered’ by adding more rows to the matrix to see how many different assays one needs to achieve ‘saturation’ of a particular type of element. This type of plot has been used extensively in the ENCODE and modENCODE projects to measure overall progress in annotating the genome. One complication with saturation plots is that the slope of the increase in cumulative fraction depends on the order in which the assays are chosen. To get around this problem, one can shuffle the assays and show a box plot resulting from all different possible orderings (Fig. 3B).

Specificity
A measure of the proportion of true negatives correctly identified as such (for example, the percentage of healthy people who are identified as not having a disease).

Regulatory forests
Regions of the genome that are enriched with binding sites for regulatory factors, such as transcription factors.

Principal components analysis
A statistical method used to simplify data sets by transforming a series of correlated variables into a smaller number of uncorrelated factors.
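The search for regulatory forests and deserts described above can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the procedure of ref. 69: it takes a single track of per-bin counts, uses a plain permutation of bins as the randomized null, and the function names (`block_sums`, `classify_blocks`) and parameters (`bins_per_block`, `n_shuffles`, `alpha`) are hypothetical.

```python
import random


def block_sums(bin_counts, bins_per_block):
    """Sum fine-resolution bin counts into consecutive larger blocks.

    Trailing bins that do not fill a whole block are ignored.
    """
    n_blocks = len(bin_counts) // bins_per_block
    return [
        sum(bin_counts[i * bins_per_block:(i + 1) * bins_per_block])
        for i in range(n_blocks)
    ]


def classify_blocks(bin_counts, bins_per_block, n_shuffles=1000,
                    alpha=0.05, seed=0):
    """Label each block 'forest', 'desert' or 'neutral'.

    Assumption: the null distribution is obtained by shuffling the
    fine-resolution bins, which ignores any local genome structure.
    """
    rng = random.Random(seed)
    observed = block_sums(bin_counts, bins_per_block)
    n_blocks = len(observed)
    at_least = [0] * n_blocks  # shuffles with block sum >= observed
    at_most = [0] * n_blocks   # shuffles with block sum <= observed

    shuffled = list(bin_counts)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null = block_sums(shuffled, bins_per_block)
        for b in range(n_blocks):
            at_least[b] += null[b] >= observed[b]
            at_most[b] += null[b] <= observed[b]

    labels = []
    for b in range(n_blocks):
        # Empirical one-sided p-values with a +1 pseudocount.
        p_enriched = (at_least[b] + 1) / (n_shuffles + 1)
        p_depleted = (at_most[b] + 1) / (n_shuffles + 1)
        if p_enriched < alpha:
            labels.append("forest")
        elif p_depleted < alpha:
            labels.append("desert")
        else:
            labels.append("neutral")
    return labels
```

With 1 kb bins, `bins_per_block=150` would correspond to the 150 kb blocks mentioned in the text. A real analysis would also need to correct for testing many blocks at once.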
[Figure 3. Panel A, Matrix operations: (Aa) matrix of sites A, B and C, with rows for factors and chromatin modifications (different tissues) and RNA (different tissues), and forest and desert regions marked; (Ab) correlation of columns identifies networks of co-regulated and co-expressed genome sites; (Ac) correlation of rows identifies related tissues and co-regulating factors; (Ad) correlation of rows and columns shown as biplots of co-regulating factors (Factor A, Factor B, Factor C) and their co-regulated sites. Panel B, Saturation analysis. Panel C, Aggregation analysis: signal track and anchor track.]
Figure 3 | Matrix showing how to correlate genomic elements. A | Simple matrix operations. Each row in the matrix
corresponds to a different experiment and each column to a different genomic region (Aa). The numerical value of each
matrix element corresponds to an aggregated read-out of that experiment in that specific region of the genome. Simple
statistical operations on the matrix can provide useful information. For example, correlations between columns (Ab)
identify networks of co-regulated and co-expressed genome sites, whereas correlations between rows (Ac) identify
related tissues and co-regulating factors. Simultaneous correlation of rows and columns (Ad) can associate
co-regulating factors with the sites they regulate. Grouping columns into regions enriched or depleted for regulatory
sites compared with the genome average identifies regulatory forests and deserts. B | This schematic saturation plot
shows how genome coverage increases as related signal tracks are joined together. Here, signal track 3 covers a larger
fraction of the genome than any other single track; signal tracks 2 and 3 together cover more of the genome than any
other pair of tracks, and so on. Saturation plots present genomic summary statistics in a useful visual framework. C | This
schematic aggregation plot shows how a class of genomic features from one annotation track can be used as anchor
points to sum up the values in a related set of signal tracks. This plot could represent, for example, the average profile of
short RNA sequencing reads around all transcription start sites in the genome.
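The saturation and aggregation analyses of panels B and C can be sketched in Python as follows. This is an illustrative sketch only. It assumes each track is represented as a set of covered bin indices, it enumerates every ordering of the tracks (feasible only for a handful of tracks; a random sample of orderings would be needed otherwise), and it reports the average rather than the sum of the signal around anchors. The function names are hypothetical.

```python
import itertools
import statistics


def saturation_curve(tracks, genome_bins):
    """Cumulative fraction of bins covered as tracks are added in order."""
    covered = set()
    curve = []
    for track in tracks:
        covered |= track
        curve.append(len(covered) / genome_bins)
    return curve


def saturation_summary(tracks, genome_bins):
    """(min, median, max) coverage at each step over all track orderings.

    These are the kind of per-step statistics a box plot would display.
    """
    curves = [
        saturation_curve(order, genome_bins)
        for order in itertools.permutations(tracks)
    ]
    return [
        (min(step), statistics.median(step), max(step))
        for step in zip(*curves)
    ]


def aggregate_signal(signal, anchors, flank):
    """Average signal at each offset from -flank to +flank around anchors.

    Offsets that fall outside the signal track are skipped.
    """
    profile = []
    for offset in range(-flank, flank + 1):
        values = [
            signal[a + offset]
            for a in anchors
            if 0 <= a + offset < len(signal)
        ]
        profile.append(sum(values) / len(values) if values else 0.0)
    return profile
```

In the aggregation function, `anchors` plays the role of the anchor track (for example, transcription start sites) and `signal` the role of a binned signal track.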
In addition to measuring saturation across tracks, one can analyse the statistical distribution within a single track using an aggregation plot. Here, one sums the signal within a set distance around all instances of a set of genomic anchor points, for example, around transcription start sites or within exons (Fig. 3C). In a sense, each aggregation plot builds a special coordinate system for the matrix in which matrix elements (or bins) are placed at predefined distances from each of the genomic anchor points, which can be expressed as another track.

Analysis of sequence features in a similar framework. The type of integrated analysis done for different classes of functional genomics tracks can also be done for tracks defined by sequence analysis. For example, we might expect some correlation between the occurrence of segmental duplications and short repeat elements, as one of the formation mechanisms of segmental duplications is non-allelic homologous recombination (NAHR)71, and the presence of short repeats increases the likelihood of non-allelic crossing-over. This correlation uses the same genomic matrix approach described above, except bins represent the number of segmental duplications or short repeats (for example, Alus) in a genomic interval. In fact, one study that related segmental duplications to short repeats showed that segmental duplications tend to be associated with Alus and that the change in this association over time highlights the effect of the Alu burst ~40 million years ago72.

Integrating comparative and functional tracks
In the previous section, we looked for correlations between regions of the genome that shared either sequence-based or experimental features. Here, we combine information from both comparative and functional analysis. We can do this in two ways. First, we can measure the overlap between the two sets of features in terms of the number of base pairs. Using the genomic matrix framework, we can compute a correlation between the rows of the matrix that represent sequence features and the rows of the matrix that represent functional features. Second, we can calculate a ‘sequence metric’, such as the degree of conservation or variability, for each functionally annotated feature or a ‘functional metric’, such as the amount of transcription, for each sequence feature.
When calculating such metrics, it is important to assess their values relative to an appropriate genomic null. For example, is a functional element more or less conserved than one would expect it to be by chance? One can determine this expectation trivially by randomly shuffling elements in the genome. However, there are a number of better ways to construct an appropriate null, such as by using the genome structure correction (GSC) statistic20.
Below, we highlight these approaches to interrelating comparative and functional analysis using a number of representative case studies.

Detecting transcribed pseudogenes. An example that combines comparative and functional evidence is the annotation of transcribed pseudogenes. Pseudogenes identified by sequence similarity to parent genes (fundamentally a result of comparative analysis) can be examined for transcriptional activity by comparison with functional tracks derived from RNA–seq or tiling-array data. In fact, evidence from the ENCODE pilot project suggests that at least 20% of human pseudogenes are transcribed73. It has been suggested that some transcribed pseudogenes have been recruited into the RNA-interference pathway to control transcription of their parent genes. In these cases, the antisense transcript from the pseudogene binds to the mRNA of its parent, generating a natural endogenous small interfering RNA74,75. These observations suggest that pseudogenes can play a significant part in gene regulation76. However, in the ENCODE pilot no obvious sequence signature for transcribed pseudogenes could be found; that is, they were no more conserved and had no fewer SNPs than other pseudogenes73.

Sequence conservation versus function. This ambiguous finding about transcribed pseudogenes is an example of the broader result from the pilot ENCODE project that conserved elements identified by comparative analysis are not always functional, and vice versa20,77,78. Before the project, it was expected that, to some degree, all conserved blocks would have some function mapped to them. Somewhat surprisingly, many blocks were found to have no experimental evidence of function and, conversely, many experimentally identified functional elements were not conserved. We highlight the case of ultraconserved elements. Some of these elements are tissue-specific enhancers11,13, whereas others have a role in mRNA degradation79. However, deletion of several ultraconserved elements in mice was not lethal and caused no problems in growth, longevity, fertility or metabolism77. If some function is not found for conserved non-coding regions even after an exhaustive array of functional assays, evolutionary models may need to be revised78,80.

Annotating lincRNAs. A final example of integrating comparative and functional analysis comes from the study of large intergenic non-coding RNAs (lincRNAs; also known as large intervening non-coding RNAs). ‘K4-K36 domains’ are histone signatures that mark actively transcribed elements with a punctate histone 3 lysine 4 trimethylation (H3K4me3) mark at the transcription start site and a broad H3K36me3 mark across the transcribed region81. A large set of K4-K36 domains identified from functional genomics experiments in mice was screened against known protein-coding genes and regulatory RNAs to find domains without any known annotation81. A custom tiling array designed to map a subset of those unannotated regions showed transcription in most of them. Sequence analysis showed that almost none of the transcribed elements was protein-coding, so they represented a set of candidate lincRNAs. When clustered with protein-coding genes of known function by their shared expression level across several tissues, groups of lincRNAs involved in the DNA damage response, immune signalling and maintenance of stem cell pluripotency were identified. Recent work has extended this analysis into humans82.

Non-allelic homologous recombination
Recombination between segmental duplications that leads to local duplication, deletion or inversion of genome sequence.

Ultraconserved elements
Operationally defined as non-coding elements that are hundreds of base pairs long and 100% identical across human, mouse and rat genomes.
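The first of the two integration strategies described above, measuring base-pair overlap and comparing it with a null expectation, can be sketched in Python. This sketch implements only the trivial null that the text mentions (random relocation of elements); it is not the GSC statistic. It assumes features are half-open `(start, end)` intervals on a single linear sequence, and all function and parameter names are hypothetical.

```python
import random


def covered_positions(intervals):
    """Set of base positions covered by half-open (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def overlap_bp(features_a, features_b):
    """Number of base pairs covered by both feature sets."""
    return len(covered_positions(features_a) & covered_positions(features_b))


def shuffle_intervals(intervals, genome_length, rng):
    """Relocate each interval uniformly at random, keeping its length."""
    shuffled = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def overlap_enrichment(functional, conserved, genome_length,
                       n_shuffles=1000, seed=0):
    """Observed overlap, mean overlap under the null, and empirical p-value.

    Assumption: the null relocates the functional elements uniformly,
    ignoring the regional heterogeneity of real genomes.
    """
    rng = random.Random(seed)
    observed = overlap_bp(functional, conserved)
    null = [
        overlap_bp(shuffle_intervals(functional, genome_length, rng), conserved)
        for _ in range(n_shuffles)
    ]
    expected = sum(null) / n_shuffles
    p_value = (sum(n >= observed for n in null) + 1) / (n_shuffles + 1)
    return observed, expected, p_value
```

Uniform shuffling of this kind treats the genome as homogeneous, which is one reason the text points to better nulls such as the GSC statistic.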
Discussion and future directions
We have provided an overview of the annotation process for non-coding regions of the genome. For a long time it has been a mystery why more than 98% of the genomic text seems to have no meaning, with less than 2% consisting of protein-coding exons. The realization that much of the non-coding DNA in the genome is transcribed at low levels into RNA has compounded the mystery20. In the future the annotation effort will continue to evolve, driven by rapid improvement in sequencing technologies. Although second-generation platforms sacrifice read length to reduce cost and increase coverage, some third-generation platforms promise to generate very long reads that would span repetitive regions and make assembly much easier83,84. We highlight here two directions of future work. One area is a chronic problem that requires attention, and the other is a direction in which the field of functional genomics is already moving.

Validation. The chronic problem area is validation. Validation of predictions from genome-scale experiments using more established molecular biology techniques is crucial. The goal of ENCODE is to make predictions with high accuracy (for example, 5% error rate and 95% sensitivity). A benefit of ENCODE is that researchers can compare different scoring algorithms applied to the same large data sets. However, if each algorithm makes 10,000 binding site predictions for an experiment, perhaps only 60 of them might be targeted for validation by high-quality, low-throughput methods. That number is simply not high enough to readily calibrate the error rates, so it is important that regions for validation are selected systematically to maximize statistical power85. On the experimental side, new medium-throughput techniques are under development to increase the number of predictions that can be validated (for example, the NanoString nCounter86).

Annotation of connectivity between elements. Most genome-scale data sets discussed in this Review can be displayed in a browser as a single one-dimensional track, whereas new classes of functional genomic data cannot be as easily represented. The key difference is that techniques such as paired-end sequencing18 and chromosome conformation capture87 generate connectivity maps that link widely spaced regions of the genome. Additional research is needed to find the most intuitive ways of analysing and visualizing this type of data88.
In particular, paired-end tag sequencing is beginning to replace conventional single-end sequencing owing to the additional information provided. For RNA–seq, paired-end reads add information about the connectivity of spliced transcripts, including trans-splicing events43,44. For mapping structural variation, paired reads enable the detection of inversions and translocations in addition to copy-number variation18 (Box 2).
Chromosome conformation capture provides information on interactions between DNA elements that are adjacent in the 3D space of the nucleus but located on different chromosomes or widely spaced on the same chromosome87. This method involves crosslinking of chromatin, then shearing and ligation to create a library of fused DNA fragments from two distant genomic locations. High-throughput methods, including chromatin interaction analysis by paired-end tag sequencing (ChIA-PET)89, carbon-copy chromosome conformation capture90 and Hi-C91, use tiling arrays or deep sequencing to map these fusion products onto the genome. These techniques enable the systematic identification of distant targets of regulatory elements, such as enhancers, and the mapping of the 3D structure of chromatin in the nucleus91,92.
A paradox of the genomic era has been that the number of protein-coding genes is no higher in humans than in apparently simpler organisms93. Human complexity may stem more from differences in regulation than from differences in protein-coding sequences94. The ENCODE pilot project showed that non-coding DNA tends to be functionalized mainly in new cell types: transcribed regions conserved across many cell lines were almost exclusively exonic, whereas intronic and intergenic TARs were mainly restricted to a single cell type20. So part of the increase in organismal complexity may be caused by the proliferation of cell types. Smaller organisms with fewer cell types have comparatively less non-coding DNA (Table 2), although some deviate substantially from this trend95. Yeast has three cell types96; the nematode C. elegans has nearly 1,000 cells of about 20 cell types97, and humans have perhaps thousands of cell types, not all of them enumerated98,99.

Sensitivity
A measure of the proportion of true positives that are correctly identified as such (for example, the percentage of sick people who are identified as having a disease).

Paired-end sequencing
Determination of the sequence at both ends of a fragment of DNA of known size.

Chromosome conformation capture
A technique used to study the long-distance interactions between genomic regions, which in turn can be used to study the three-dimensional architecture of chromosomes within a cell nucleus.
Future work in understanding the evolutionary trajectory of the human genome should focus on providing functional genomic data in a comprehensive array of human cell and tissue types, a goal that is being pursued by US National Institutes of Health (NIH) projects such as the Genotype-Tissue Expression (GTEx) project and the Transcriptional Atlas of Human Brain Development. Methods currently being developed to annotate NCEs will be critical for the success of such projects.
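The point made under Validation, that about 60 validated predictions are too few to calibrate an error rate near 5%, can be checked with a back-of-the-envelope calculation. The Python sketch below is an illustration under stated assumptions: validated predictions are treated as a simple random sample, the confidence interval uses the normal approximation to the binomial (crude at this sample size), and the function names are hypothetical.

```python
import math


def error_rate_halfwidth(error_rate, n_validated, z=1.96):
    """Approximate half-width of the 95% confidence interval on an error rate.

    Uses the normal approximation to the binomial distribution.
    """
    return z * math.sqrt(error_rate * (1 - error_rate) / n_validated)


def validations_needed(error_rate, halfwidth, z=1.96):
    """Number of validated predictions needed to reach a target half-width."""
    return math.ceil(z ** 2 * error_rate * (1 - error_rate) / halfwidth ** 2)
```

Under these assumptions, `error_rate_halfwidth(0.05, 60)` is about 0.055, so a true 5% error rate could not be told apart from one of roughly 10%, whereas `validations_needed(0.05, 0.01)` shows that pinning the rate down to within one percentage point would take on the order of 1,800 validations.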
1. Britten, R. J. & Kohne, D. E. Repeated sequences in 28. Zhang, Z. L. et al. PseudoPipe: an automated 53. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L.
DNA. Science 161, 529–540 (1968). pseudogene identification pipeline. Bioinformatics 22, Ultrafast and memory-efficient alignment of short
2. Ohno, S. So much ‘junk’ DNA in our genome. 1437–1439 (2006). DNA sequences to the human genome. Genome Biol.
Brookhaven Symp. Biol. 23, 366–370 (1972). 29. Karro, J. E. et al. Pseudogene.org: a comprehensive 10, R25 (2009).
3. Lewin, R. Proposal to sequence the human genome database and comparison platform for pseudogene 54. Li, H. & Durbin, R. Fast and accurate short read
stirs debate. Science 232, 1598–1600 (1986). annotation. Nucleic Acids Res. 35, D55–D60 (2007). alignment with Burrows–Wheeler transform.
4. Robertson, M. The proper study of mankind. 30. Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Bioinformatics 25, 1754–1760 (2009).
Nature 322, 11 (1986). Biological Sequence Analysis: Probabilistic Models of 55. Zhang, Z. D., Rozowsky, J., Snyder, M., Chang, J. &
5. Choi, M. et al. Genetic diagnosis by whole exome Proteins and Nucleic Acids (Cambridge Univ. Press, Gerstein, M. Modeling ChIP sequencing in silico with
capture and massively parallel DNA sequencing. 1998). applications. PLoS Comput. Biol. 4, e1000158
Proc. Natl Acad. Sci. USA 106, 19096–19101 (2009). 31. Miller, W., Makova, K. D., Nekrutenko, A. & (2008).
6. Gnirke, A. et al. Solution hybrid selection with ultra- Hardison, R. C. Comparative genomics. Annu. Rev. 56. Rozowsky, J. et al. PeakSeq enables systematic scoring
long oligonucleotides for massively parallel targeted Genomics Hum. Genet. 5, 15–56 (2004). of ChIP–seq experiments relative to controls. Nature
sequencing. Nature Biotech. 27, 182–189 (2009). 32. Margulies, E. H. & Birney, E. Approaches to Biotech. 27, 66–75 (2009).
7. Ng, S. B. et al. Targeted capture and massively comparative sequence analysis: towards a functional 57. Auerbach, R. K. et al. Mapping accessible chromatin
parallel sequencing of 12 human exomes. Nature 461, view of vertebrate genomes. Nature Rev. Genet. 9, regions using Sono-Seq. Proc. Natl Acad. Sci. USA
272–276 (2009). 303–313 (2008). 106, 14926–14931 (2009).
8. International Human Genome Sequencing Consortium. 33. Ren, B. et al. Genome-wide location and function of 58. Kapranov, P. et al. Large-scale transcriptional activity
Initial sequencing and analysis of the human genome. DNA binding proteins. Science 290, 2306–2309 in chromosomes 21 and 22. Science 296, 916–919
Nature 409, 860–921 (2001). (2000). (2002).
9. Venter, J. C. et al. The sequence of the human genome. 34. Iyer, V. R. et al. Genomic binding sites of the yeast 59. Rinn, J. L. et al. The transcriptional activity of human
Science 291, 1304–1351 (2001). cell-cycle transcription factors SBF and MBF. Nature Chromosome 22. Genes Dev. 17, 529–540 (2003).
10. Ghildiyal, M. & Zamore, P. D. Small silencing RNAs: 409, 533–538 (2001). 60. Kapranov, P. et al. RNA maps reveal new RNA classes
an expanding universe. Nature Rev. Genet. 10, 35. Lee, T. I., Johnstone, S. E. & Young, R. A. and a possible function for pervasive transcription.
94–108 (2009). Chromatin immunoprecipitation and microarray- Science 316, 1484–1488 (2007).
11. Bejerano, G. et al. Ultraconserved elements in the based analysis of protein location. Nature Protoc. 1, 61. Ponjavic, J., Ponting, C. P. & Lunter, G. Functionality or
human genome. Science 304, 1321–1325 (2004). 729–748 (2006). transcriptional noise? Evidence for selection within
12. Siepel, A. et al. Evolutionarily conserved elements 36. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. long noncoding RNAs. Genome Res. 17, 556–565
in vertebrate, insect, worm, and yeast genomes. Genome-wide mapping of in vivo protein–DNA (2007).
Genome Res. 15, 1034–1050 (2005). interactions. Science 316, 1497–1502 (2007). 62. Struhl, K. Transcriptional noise and the fidelity of
13. Pennacchio, L. A. et al. In vivo enhancer analysis of 37. Robertson, G. et al. Genome-wide profiles of STAT1 initiation by RNA polymerase II. Nature Struct. Mol.
human conserved non-coding sequences. Nature 444, DNA association using chromatin immunoprecipitation Biol. 14, 103–105 (2007).
499–502 (2006). and massively parallel sequencing. Nature Methods 4, 63. van Bakel, H., Nislow, C., Blencowe, B. J. &
14. Kleinjan, D. A. & van Heyningen, V. Long-range control 651–657 (2007). Hughes, T. R. Most dark matter transcripts are
of gene expression: emerging mechanisms and disruption 38. Park, P. J. ChIP–seq: advantages and challenges associated with known genes. PLoS Biol. 8,
in disease. Am. J. Hum. Genet. 76, 8–32 (2005). of a maturing technology. Nature Rev. Genet. 10, e1000371 (2010).
15. Yeager, M. et al. Comprehensive resequence analysis 669–680 (2009). A recent reappraisal, based on RNA–seq and
of a 136 kb region of human chromosome 8q24 39. Bertone, P. et al. Global identification of human tiling-array data, of the degree of pervasive
associated with prostate and colon cancers. transcribed sequences with genome tiling arrays. transcription in the human genome.
Hum. Genet. 124, 161–170 (2008). Science 306, 2242–2246 (2004). 64. Farnham, P. J. Insights from genomic profiling of
16. Visel, A., Rubin, E. M. & Pennacchio, L. A. 40. Cheng, J. et al. Transcriptional maps of 10 human transcription factors. Nature Rev. Genet. 10, 605–616
Genomic views of distant-acting enhancers. Nature chromosomes at 5-nucleotide resolution. Science (2009).
461, 199–205 (2009). 308, 1149–1154 (2005). 65. Pinkel, D. et al. High resolution analysis of DNA copy
17. Lupski, J. R. Genomic disorders: structural features 41. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. number variation using comparative genomic
of the genome can lead to DNA rearrangements and & Wold, B. Mapping and quantifying mammalian hybridization to microarrays. Nature Genetics 20,
human disease traits. Trends Genet. 14, 417–422 transcriptomes by RNA–seq. Nature Methods 5, 207–211 (1998).
(1998). 621–628 (2008). 66. Gokcumen, O. & Lee, C. Copy number variants (CNVs)
A prescient exposition of the important link 42. Nagalakshmi, U. et al. The transcriptional landscape in primate species using array-based comparative
between disease and structural variation in the of the yeast genome defined by RNA sequencing. genomic hybridization. Methods 49, 18–25 (2009).
human genome. Science 320, 1344–1349 (2008). 67. Stathopoulos, A., Van Drenth, M., Erives, A.,
18. Kidd, J. M. et al. Mapping and sequencing of 43. Sultan, M. et al. A global view of gene activity and Markstein, M. & Levine, M. Whole-genome analysis of
structural variation from eight human genomes. alternative splicing by deep sequencing of the human dorsal-ventral patterning in the Drosophila embryo.
Nature 453, 56–64 (2008). transcriptome. Science 321, 956–960 (2008). Cell 111, 687–701 (2002).
The first high-resolution sequence map of human 44. Wang, Z., Gerstein, M. & Snyder, M. An elegant study of the effect of transcription
structural variation. RNA–seq: a revolutionary tool for transcriptomics. factor concentration on the arrangement of
19. Lupski, J. R. & Stankiewicz, P. Genomic disorders: Nature Rev. Genet. 10, 57–63 (2009). cis-regulatory elements at target genes.
molecular mechanisms for rearrangements and 45. Karolchik, D. et al. The UCSC Genome Browser 68. Tantin, D., Gemberling, M., Callister, C. &
conveyed phenotypes. PLoS Genet. 1, e49 (2005). Database. Nucleic Acids Res. 31, 51–54 (2003). Fairbrother, W. High-throughput biochemical analysis
20. The ENCODE Project Consortium. Identification and 46. Lister, R. et al. Human DNA methylomes at base of in vivo location data reveals novel distinct classes of
analysis of functional elements in 1% of the human resolution show widespread epigenomic differences. POU5F1(Oct4)/DNA complexes. Genome Res. 18,
genome by the ENCODE pilot project. Nature 447, Nature 462, 315–322 (2009). 631–639 (2008).
799–816 (2007). 47. Bernstein, B. E. et al. A bivalent chromatin structure 69. Zhang, Z. D. D. et al. Statistical analysis of the
A comprehensive overview of what was learned marks key developmental genes in embryonic stem genomic distribution and correlation of regulatory
during the ENCODE pilot project. cells. Cell 125, 315–326 (2006). elements in the ENCODE regions. Genome Res. 17,
21. Celniker, S. E. et al. Unlocking the secrets of the 48. Barski, A. et al. High-resolution profiling of histone 787–797 (2007).
genome. Nature 459, 927–930 (2009). methylations in the human genome. Cell 129, 70. Rozowsky, J. S. et al. The DART classification of
22. Searls, D. B. The language of genes. Nature 420, 823–837 (2007). unannotated transcription within the ENCODE
211–217 (2002). 49. Mikkelsen, T. S. et al. Genome-wide maps of chromatin regions: associating transcription with known and
23. Whitfield, J. Across the curious parallel of language state in pluripotent and lineage-committed cells. novel loci. Genome Res. 17, 732–745 (2007).
and species evolution. PLoS Biol. 6, e186 (2008). Nature 448, 553–560 (2007). 71. Bailey, J. A. & Eichler, E. E. Primate segmental
24. Pagel, M. Human language as a culturally transmitted 50. Royce, T. E., Rozowsky, J. S. & Gerstein, M. B. duplications: crucibles of evolution, diversity and
replicator. Nature Rev. Genet. 10, 405–415 (2009). Assessing the need for sequence-based normalization disease. Nature Rev. Genet. 7, 552–564 (2006).
25. Saha, S., Bridges, S., Magbanua, Z. V. & Peterson, D. G. in tiling microarray experiments. Bioinformatics 23, 72. Kim, P. M. et al. Analysis of copy number variants
Empirical comparison of ab initio repeat finding 988–997 (2007). and segmental duplications in the human genome:
programs. Nucleic Acids Res. 36, 2284–2294 (2008). 51. Li, H., Ruan, J. & Durbin, R. Mapping short DNA evidence for a change in the process of formation
26. Washietl, S. et al. Structured RNAs in the ENCODE sequencing reads and calling variants using mapping in recent evolutionary history. Genome Res. 18,
selected regions of the human genome. Genome Res. quality scores. Genome Res. 18, 1851–1858 (2008). 1865–1874 (2008).
17, 852–864 (2007). 52. Li, R. Q., Li, Y. R., Kristiansen, K. & Wang, J. SOAP: 73. Zheng, D. et al. Pseudogenes in the ENCODE regions:
27. Harrow, J. et al. GENCODE: producing a reference short oligonucleotide alignment program. consensus annotation, analysis of transcription, and
annotation for ENCODE. Genome Biol. 7, S4 (2006). Bioinformatics 24, 713–714 (2008). evolution. Genome Res. 17, 839–851 (2007).
74. Tam, O. H. et al. Pseudogene-derived small interfering 94. King, M. C. & Wilson, A. C. Evolution at two levels in 113. Kaiser, J. A plan to capture human diversity in 1000
RNAs regulate gene expression in mouse oocytes. humans and chimpanzees. Science 188, 107–116 genomes. Science 319, 395–395 (2008).
Nature 453, 534–538 (2008). (1975). 114. Levy, S. et al. The diploid genome sequence of an
75. Watanabe, T. et al. Endogenous siRNAs from naturally 95. Gregory, T. R. Synergy between sequence and size in individual human. PLoS Biol. 5, 2113–2144 (2007).
formed dsRNAs regulate transcripts in mouse oocytes. large-scale genomics. Nature Rev. Genet. 6, 699–708 115. Chen, K. et al. BreakDancer: an algorithm for
Nature 453, 539–543 (2008). (2005). high-resolution mapping of genomic structural
76. Sasidharan, R. & Gerstein, M. Protein fossils live on as 96. Galgoczy, D. J. et al. Genomic dissection of the variation. Nature Methods 6, 677–681 (2009).
RNA. Nature 453, 729–731 (2008). cell-type-specification circuit in Saccharomyces 116. Hormozdiari, F., Alkan, C., Eichler, E. E. &
77. Ahituv, N. et al. Deletion of ultraconserved elements cerevisiae. Proc. Natl Acad. Sci. USA 101, Sahinalp, S. C. Combinatorial algorithms for
yields viable mice. PLoS Biol. 5, e234 (2007). 18069–18074 (2004). structural variation detection in high-throughput
78. Monroe, D. Genomic clues to DNA treasure sometimes 97. Sulston, J. E., Schierenberg, E., White, J. G. & sequenced genomes. Genome Res. 19, 1270–1278
lead nowhere. Science 325, 142–143 (2009). Thomson, J. N. The embryonic-cell lineage of the (2009).
79. Lareau, L. F., Inada, M., Green, R. E., Wengrod, J. C. & nematode Caenorhabditis elegans. Dev. Biol. 100, 117. Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M.
Brenner, S. E. Unproductive splicing of SR genes 64–119 (1983). MoDIL: detecting small indels from clone-end
associated with highly conserved and ultraconserved 98. Vickaryous, M. K. & Hall, B. K. Human cell type diversity, sequencing with mixtures of distributions.
DNA elements. Nature 446, 926–929 (2007). evolution, development, and classification with special Nature Methods 6, 473–474 (2009).
80. Baer, C. F., Miyamoto, M. M. & Denver, D. R. reference to cells derived from the neural crest. 118. Kidd, J. M. et al. Characterization of missing human
Mutation rate variation in multicellular eukaryotes: Biol. Rev. Camb. Philos. Soc. 81, 425–455 (2006). genome sequences and copy-number polymorphic
causes and consequences. Nature Rev. Genet. 8, 99. Arendt, D. The evolution of cell types in animals: insertions. Nature Methods 7, 365–371 (2010).
619–631 (2007). emerging principles from molecular studies. The authors report the characterization of new
81. Guttman, M. et al. Chromatin signature reveals over a Nature Rev. Genet. 9, 868–882 (2008). insertion sequences relative to the human
thousand highly conserved large non-coding RNAs in 100. Schlotterer, C. & Tautz, D. Slippage synthesis of simple reference genome; this study is a useful addition to
mammals. Nature 458, 223–227 (2009). sequence DNA. Nucleic Acids Res. 20, 211–215 the field as it moves towards a series of reference
A good example of the benefits of integrating (1992). genomes for sub-populations.
comparative and functional analysis, which in this 101. Amor, D. J. & Choo, K. H. A. Neocentromeres: role in 119. Lam, H. Y. K. et al. Nucleotide-resolution analysis of
case led to the discovery of a new class of human disease, evolution, and centromere study. structural variants using BreakSeq and a breakpoint
functional NCEs. Am. J. Hum. Genet. 71, 695–714 (2002). library. Nature Biotech. 28, 47–55 (2010).
82. Khalil, A. M. et al. Many human large intergenic 102. Vinces, M. D., Legendre, M., Caldara, M., Hagihara, M. 120. Li, R. Q. et al. Building the sequence map of the
noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc. Natl Acad. Sci. USA 106, 11667–11672 (2009).
& Verstrepen, K. J. Unstable tandem repeats in promoters confer transcriptional evolvability. Science 324, 1213–1216 (2009).
human pan-genome. Nature Biotech. 28, 57–63 (2010).
83. Clarke, J. et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nature Nanotechnol. 4, 265–270 (2009).
84. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).
85. Du, J. et al. A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP–chip experiments: systematically incorporating validated biological knowledge. Bioinformatics 22, 3016–3024 (2006).
86. Geiss, G. K. et al. Direct multiplexed measurement of gene expression with color-coded probe pairs. Nature Biotech. 26, 317–325 (2008).
87. Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome conformation. Science 295, 1306–1311 (2002).
88. Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
89. Fullwood, M. J. et al. An oestrogen-receptor-α-bound human chromatin interactome. Nature 462, 58–64 (2009).
90. Dostie, J. et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 16, 1299–1309 (2006).
91. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
92. Duan, Z. et al. A three-dimensional model of the yeast genome. Nature 465, 363–367 (2010).
References 91 and 92 are two examples of the power of using long-distance connectivity data in the genome to map genome structure.
93. Clamp, M. et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl Acad. Sci. USA 104, 19428–19433 (2007).
103. Mills, R. E., Bennett, E. A., Iskow, R. C. & Devine, S. E. Which transposable elements are active in the human genome? Trends Genet. 23, 183–191 (2007).
104. Zhang, Z., Frankish, A., Hunt, T., Harrow, J. & Gerstein, M. Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates. Genome Biol. 11, R26 (2010).
105. Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. Identification of novel genes coding for small expressed RNAs. Science 294, 853–858 (2001).
106. Lau, N. C., Lim, L. P., Weinstein, E. G. & Bartel, D. P. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. Science 294, 858–862 (2001).
107. Lee, R. C. & Ambros, V. An extensive class of small RNAs in Caenorhabditis elegans. Science 294, 862–864 (2001).
108. Brennecke, J. et al. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. Cell 128, 1089–1103 (2007).
109. Carmell, M. A. et al. MIWI2 is essential for spermatogenesis and repression of transposons in the mouse male germline. Dev. Cell 12, 503–514 (2007).
110. Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. & Luscombe, N. M. A census of human transcription factors: function, expression and evolution. Nature Rev. Genet. 10, 252–263 (2009).
A useful synthesis of the current state of knowledge about human transcription factors.
111. Maston, G. A., Evans, S. K. & Green, M. R. Transcriptional regulatory elements in the human genome. Annu. Rev. Genomics Hum. Genet. 7, 29–59 (2006).
112. Bovee, D. et al. Closing gaps in the human genome with fosmid resources generated from multiple individuals. Nature Genet. 40, 96–101 (2008).
121. Griffiths-Jones, S., Saini, H. K., van Dongen, S. & Enright, A. J. miRBase: tools for microRNA genomics. Nucleic Acids Res. 36, D154–D158 (2008).
122. Iafrate, A. J. et al. Detection of large-scale variation in the human genome. Nature Genet. 36, 949–951 (2004).

Acknowledgements
The authors thank members of the Gerstein laboratory for helpful discussions and careful reading of the manuscript. We acknowledge support from the US NIH and from the Albert L. Williams Professorship funds.

Competing interests statement
The authors declare no competing financial interests.

FURTHER INFORMATION
Mark B. Gerstein’s homepage: http://genometech.gersteinlab.org
1000 Genomes Project: http://www.1000genomes.org
Berkeley Drosophila Genome Project: http://www.fruitfly.org
Database of Genomic Variants: http://projects.tcag.ca/variation
FlyBase: http://flybase.org
GTEx project: http://nihroadmap.nih.gov/GTEx
The ENCODE Project: http://www.genome.gov/10005107
Human Genome Structural Variation Project: http://humanparalogy.gs.washington.edu/structuralvariation
miRBase: http://www.mirbase.org
The modENCODE Project: http://www.modencode.org
Pseudogene.org: http://www.pseudogene.org
Saccharomyces Genome Database: http://www.yeastgenome.org
UCSC Genome Browser: http://genome.ucsc.edu
WormBase: http://www.wormbase.org