Computing for Comparative Microbial Genomics
Bioinformatics for Microbiologists
Preface
Overview and Goals
This book describes how to visualize and compare bacterial genomes. Sequencing
technologies are becoming so inexpensive that soon going for a cup of coffee will be
more expensive than sequencing a bacterial genome. Thus, there is a very real and
pressing need for high-throughput computational methods to compare hundreds and
thousands of bacterial genomes.
It is a long road from molecular biology to systems biology, and in a sense this
text can be thought of as a path bridging these fields. The goal of this book is to pro-
vide a coherent set of tools and a methodological framework for starting with raw
DNA sequences and producing fully annotated genome sequences, and then using
these to build up and test models about groups of interacting organisms within an
environment or ecological niche.
Organization and Features
The text is divided into four main parts: Introduction, Comparative Genomics,
Transcriptomics and Proteomics, and finally Microbial Communities. The first five
chapters are introductions of various sorts. Each of these chapters represents an
introduction to a specific scientific field, to bring all readers up to the same basic
level before proceeding on to the methods of comparing genomes. First, a brief
overview of molecular biology and of the concept of sequences as biological
information is given. The equivalent in the post-genomics era of the ‘Central Dogma’
of molecular biology (DNA makes RNA makes protein) is that the genome makes
the transcriptome, which makes the proteome. Before going on to the details of this,
a historical background is provided that sets the scene for the origins of molecular
biology and biological sequences. After this introduction, Chapter 2 describes
sequence alignment, the most common procedure used to compare biological
sequences. Instead of going into technical details of how exactly these alignments
are calculated, the text focuses on their practical use. Chapter 3 introduces bacterial
genomes and Chapter 4 deals with the most important databases, whilst Chapter 5 is
an introduction to the computational background of the tools necessary to analyze
all of this information.
The second part, on Comparative Genomics (Chapters 6–8), describes some
basic methods of comparing genomes. This section introduces various atlases
building up to the ‘Genome Atlas,’ which is our standard visualization tool for rep-
resenting the DNA sequence of a chromosome in a single figure, mapping the most
relevant DNA properties along the chromosome. We have found such atlases very
useful for mapping newly sequenced genomes and quickly visualizing regions
of potential interest. The value of atlas projections is illustrated by the examples
provided.
Part three (Chapters 9–11) takes the reader from genome sequences to RNA
sequences (transcriptomics) to proteins (proteomics) and regulation of gene expres-
sion. An important overview of experimental results can be obtained by mapping
back and visualizing the transcriptomic and proteomic data onto physical chro-
mosomal maps. Examples illustrate how important chromosome location is, and
which features can be predicted by careful analysis of genes and their surrounding
sequences.
The final part (Chapters 12–14) deals with microbial communities. In a sense
this can be thought of as ‘population genomics’ (as opposed to the more traditional
‘population biology’ which often focuses on only one or a few genes). First the
concepts of ‘pan-genome’ and ‘core genome’ are introduced (Chapter 12), followed by
metagenomics (Chapter 13), and then evolution of microbial communities (Chapter
14). From a larger perspective, population genomics can provide a framework for
modeling ecosystems in terms of interacting biological systems.
Target Audiences and Required Background Knowledge
The reader should have basic knowledge about computers and be able to use web
interfaces. For programmers, some general knowledge of microbiology is assumed,
but it is our hope that both programmers and more ‘biology-oriented’ readers will
find this book helpful. Details on programming were deliberately left out; instead,
the text concentrates on the use and interpretation of publicly available web tools.
This book has grown out of lectures for the course in Comparative Microbial
Genomics (http://www.cbs.dtu.dk/dtucourse/programme27444.php), which DWU has
taught since 2001 as a full-semester course at the Technical University of Denmark,
and as one-week workshops given in Bangkok, Thailand; in Petropolis, Brazil; and
in Oslo, Norway.
This book in a sense merges different scientific languages. The three authors
have different scientific and national backgrounds. DWU is from the U.S., studied
biochemistry, worked in molecular biology, and for the last 10 years has led a group
in bioinformatics and genomics. SB is from Italy and studied quantum chemistry with
a focus on scientific programming, data standardization, and software integration;
whereas TMW studied biochemistry and worked in molecular biology and later as a
consultant in microbiology. These different backgrounds actually helped to develop
a common language in science. The subject area of this textbook is extremely inter-
disciplinary, covering (bio)chemistry, physics, biology, microbiology, mathemat-
ics, and computational science, and by the introduction of concepts (and some
jargon) from these various disciplines, the different languages used by specialists
are bridged.
This book is meant mainly for people studying bacterial genomes, although
of course nearly all of the methods described in the text would work for viral,
Archaeal, or Eukaryotic genomes as well. There are two main target audiences.
The first is the microbiologist who wants to get the most out of a bacterial genome
sequence. This could be a university student, or an experienced laboratory
microbiologist who enters the field of genomics. This book enables one to get a handle
on how to use high-throughput computational methods to compare a few, or
hundreds of, sequenced genomes. The second audience comprises the computer
programmers who assist these microbiologists in actually carrying out the analy-
ses. From experience we know there can be communication problems between
the experimental bacteriologist who is more laboratory-oriented, and the com-
puter scientist who wants to do everything on computers. Both disciplines are
essential in present-day research. This book aims to explain to the computational
scientist why and how we want to study bacterial genomes, and what questions
we hope to answer. At the same time, it explains to the biologist some of the
basics behind the bioinformatic tools that are necessary for research in the field.
Bringing these two worlds, scientific interests, and languages together is our
ultimate goal.
Notes to the Instructor
There are no exercises or questions at the end of the chapters, although at the end
of most chapters textboxes present descriptions of essential methods used. From
experience we can say that giving small groups of students a project in which they
can choose a recently sequenced bacterial genome and compare it to other similar
genomes can produce surprisingly successful results. It is very motivating to work
with recently published data (new genome sequence papers are being published on
an almost daily basis now), and sometimes the students produce important
observations that the authors of the scientific papers had missed! On some occasions,
such activities have resulted in a real scientific publication by the students,
illustrating how ‘easy’ it is to do these kinds of analyses, as long as one asks relevant
questions.
Supplemental Resources
A number of web links are mentioned in the book. Since web addresses are
not always stable, a dedicated web page lists all the web pages presented in the
book and updates them as necessary. It can be found at
http://comparativemicrobial.com.
Lyngby, Denmark David Ussery
Zurich, Switzerland Stefano Borini
Zotzenheim, Germany Trudy Wassenaar
Acknowledgements
This book is based on input from many people, including our research team and
external collaborators. We are grateful for all the advice, assistance, and help we
received throughout this project. We thank all current and former members of the
Comparative Microbial Genomics group at CBS: in particular, Peter F. Hallin for
his excellent programming skills and help with development of many of the pro-
grams mentioned in this book; Flemming Hansen for his work on bacterial replica-
tion and his vast knowledge of E. coli; Henrik J. Nielsen for his help with E. coli
genomics; Kristoffer Kiil for his help with phylogeny and work on protein function;
and Carsten Friis for his assistance with various analyses and for keeping the group
running whilst we were writing.
We thank former group members whose work also contributed to this book,
including Tim T. Binnewies for his work with Vibrio genomes and secretion systems,
and Hanni Willenbrock for her work with developing pan-genome microarrays.
We are grateful to external collaborators, notably Thomas Quinn from Denver
University for his work on phylogenetic trees whilst on sabbatical in our group;
Karin Lagesen from CMBN, Institute of Medical Microbiology, Rikshospitalet
University Hospital in Oslo, who has helped with the rRNA and tRNA searches;
and Jon Bohlin from the Norwegian School of Veterinary Science, who has helped
with analysis of oligonucleotide usage patterns in bacterial genomes.
We would also like to acknowledge help from the many people at CBS, which is
currently one of the largest bioinformatics groups in Europe. In particular, we thank
Hans Henrik Stærfeldt from the CBS systems administration group, who wrote the
original code for the GeneWiz program that is used to construct the atlas plots,
for his help and support over the past 10 years in updating and maintaining
GeneWiz. Jannick D. Bendtsen helped us with the secretome, and Thomas Blicher
kindly provided wonderful pictures of protein structures. Finally, Søren Brunak,
center director for CBS, has established a wonderful place to work (including an
excellent coffee machine!) and has been supportive of our group since it was formed
in 1998.
David would like to thank his students over the years in his Comparative Micro-
bial Genomics course for their many helpful suggestions and comments. He would
also like to thank his wife for helpful editorial comments and for her support during
the writing of this book.
Stefano would like to thank his parents, Paola Marani and Walter Padovani, for
their constant support and trust, and his dear friends Paolo Soriani and Ruggero
Paratelli for their life-long support and understanding.
Trudy would like to thank her son Martijn for inventing the analogy of a road to
explain DNA strand direction, both of her sons for their understanding and patience,
and her husband for his constant support.
Much of the work described in this textbook has been funded by grants over the
past decade from the Danish National Research Foundation (Danmarks Grundfor-
skningsfond), Danish Research Councils, and the EU. Many of the calculations pre-
sented in this book have been made on our large computer system at CBS, funded
in part by money from the Danish Center for Scientific Computing.
Contents
Preface
Acknowledgements
Part I Introductions
1 Sequences as Biological Information: Cells Obey the Laws of Chemistry and Physics
Why Study Microbes?
What is Biological Information and Where Does It Come From?
How DNA Sequences Code for Information
From DNA to Protein: Transcription and Translation
DNA Sequences: More than Protein-Coding Genes
From DNA to DNA: Replication
Proteins: Structure and Function
2 Bioinformatics for Microbiologists: An Introduction
Identifying Similarities: Sequence Comparison by Means of Alignments
From Alignments to Phylogenic Relationships
Genome Annotation: the Challenge to Get It Right
Information Beyond the Single Genome
3 Microbial Genome Sequences: A New Era in Microbiology
The First Completely Sequenced Microbial Genome
The Importance of Visualization
Genome Atlases to Visualize Chromosomes
A Race Against the Clock: The Speed of Sequencing
The First Completely Sequenced Bacterial Genome
Comparative Bacterial Genomics
The Microbial Genome: Not All Bacteria Are Like E. coli
4 An Overview of Genome Databases
What is a Database?
Three Databases Storing Sequences and a Lot More
Data Files and Formats
RNA Databases
Protein Databases
5 The Challenges of Programming: a Brief Introduction
Part 1: A Brief Overview of Computer Science Concepts
A Look at the Most Common Bioinformatic Procedures
Achieving Better Automation
Part 2: Some Technical Details and Future Directions
Programming Languages
Markup Languages
Service Oriented Architecture
Specific Tools for Bioinformatic Use
Part II Comparative Genomics
6 Methods to Compare Genomes: the First Examples
Genomic Comparisons: The Size of a Genome
Pairwise Alignment of Genomes
Comparing Gene Content and Annotation Quality
RNA Comparisons: A Look at rRNAs
Proteome Comparisons: What Makes a Family?
7 Genomic Properties: Length, Base Composition and DNA Structures
Length of Genomes: the ‘C-Value Paradox’
Genome Average Base Composition: The Percentage of AT
GC Skew: Bias Towards the Replication Leading Strand
Global Chromosomal Bias of AT Content
DNA Structures
The Structure Atlas
Bias in Purines: A-DNA Atlases
More on Structure Atlases
8 Word Frequencies and Repeats
Analyzing Word Frequencies in a Genome
DNA Repeats Within a Chromosome
Introduction to the DNA Repeat Atlas
Local DNA Repeats are Related to Chromosomal AT Content
DNA Structures Related to Repeats in Sequences
The Genome Atlas: Our Standard Method for Visualization
Part III Transcriptomics and Proteomics
9 Transcriptomics: Translated and Untranslated RNA
Counting rRNA and tRNA Genes
A Closer Look at Ribosomal RNA
Genes Encoding Transfer RNA
Genes Coding mRNA: Comparing Codon Usage Between Bacteria
Other Non-coding RNA: tmRNA
10 Expression of Genes and Proteins
Comparing Gene Expression and Protein Expression
Part 1: Regulation of Transcription
Part 2: Regulation of Translation
Part 3: Protein Modification and Cellular Localization
Antigen and Epitope Prediction
11 Of Proteins, Genomes, and Proteomes
Part 1: Analysis of Individual Protein-Coding Genes
Part 2: How to Annotate a Complete Genome
Part 3: Proteome Comparisons
Part IV Microbial Communities
12 Microbial Communities: Core and Pan-Genomics
Defining Pan-Genomes and Core Genomes
Current Data Available for Pan- and Core Genome Analysis
The Pan- and Core Genome of Streptococcus
The Current Bacillus Pan- and Core Genome
An Overview of Some Proteobacterial Pan- and Core Genomes
The Burkholderia Pan- and Core Genome
13 Metagenomics of Microbial Communities
Metagenomics Based on 16S rRNA Analysis
Metagenomics Based on Complete DNA Sequencing
Environmental Influences on Base Composition
Visualization of Environmental Metagenomic Data
Marine Metagenomics
Other Metagenomics Applications
14 Evolution of Microbial Communities; or, On the Origins of Bacterial Species
Where Does Diversity Come From?
Evolution Takes Time
Evidence of Evolution in a Single Genome
Genome Islands
Evolution on a Chip
Species and Speciation: Vibrio cholerae
Can We Predict Evolution? Escherichia coli Genome Reduction
Abbreviations
Index
Part I
Introductions
Chapter 1
Sequences as Biological Information: Cells Obey
the Laws of Chemistry and Physics
Outline: Molecular biology has revolutionized our understanding of life. Biological
information is organized in a way that resembles text, meaning that biological
information is based on the specific order of components (a sequence of building
blocks) forming biological polymers. The building blocks are monomeric subunits
of nucleic acids or amino acids, which form long polymers (DNA, RNA, and pro-
teins). Their sequence determines their shape, and it is this shape (structure) that
determines function. The Central Dogma of biology states that the flow of biological
information is from DNA to RNA to proteins. The development of high-throughput
methods to determine DNA sequences is revolutionizing our approach to the study
of life. In the age of sequenced genomes, the flow of scientific information can now
frequently be read from the genome to the transcriptome, proteome, and cellular
components. Cellular processes obey the laws of chemistry and physics, and we
can use information from biological sequences to model structures and extrapolate
their functions, without the need to resort to unexplainable ‘vital life forces’ for the
molecular basis of life.
Why Study Microbes?
Aristotle divided all life into three basic kingdoms: Plants, Animals, and Minerals.
Minerals are no longer considered a kingdom of life, although diatoms living in the
ocean, such as Thalassiosira pseudonana, have characteristics of plants, animals,
and minerals (Armbrust et al. 2004). Three kingdoms are still recognized today,
as shown in Fig. 1.1, although for historical reasons these are often referred to as
superkingdoms. Two of these, Bacteria and Archaea, are based entirely on unicel-
lular organisms that do not have a nucleus and are too small to see without the aid
of a microscope. They are jointly referred to as prokaryotes. The third kingdom,
Eucarya (also spelled Eukarya), is characterized by cells that contain a nucleus; it
includes all plants and animals, but compared to microbes these represent only a
tiny fraction of the overall diversity.
Though not generally appreciated, even the kingdom of eukaryotes is dominated
by microscopic life. Most life is microscopic, in terms of the physical number of
organisms, the number of species present in the environment, and, despite their
small size, the biomass they represent on a worldwide scale. Even inside an animal,
microbes are abundant: only one out of every 10 cells in a human body is actually
human, whilst the other nine cells are prokaryotic.
[Figure 1.1: phylogenetic tree. Recoverable labels: BACTERIA (Planctomycetes, Actinobacteria, Flavobacterium, Proteobacteria, Chlamydiae, Cyanobacteria, Chlorobi, Bacteroidetes, Spirochaetes, Clostridium, Firmicutes, Bacillus, Chloroflexi, and others); ARCHAEA (Euryarchaeota, Crenarchaeota); EUCARYA (unicellular eukaryotes: Giardia, Slime mold, Protozoans, Saccharomyces, Babesia, Trypanosoma; macro-organisms: Animals, Plants)]
Fig. 1.1 A phylogenetic tree displaying the genetic distances between members of the three
superkingdoms of life: Bacteria, Archaea, and Eucarya. The represented bacterial genera will appear in
examples throughout the book. The distance between bacterial genera is much larger than that of
plants and animals, drawn on the same scale of genetic distance
D.W. Ussery et al., Computing for Comparative Microbial Genomics, Computational
Biology 8, DOI 10.1007/978-1-84800-255-5_1, © Springer-Verlag London Limited 2009
From an evolutionary perspective, Bacteria and Archaea have been around for
more than 3 billion years; plants and animals are relatively recent ‘newcomers’ on
the scene, arriving less than half a billion years ago. Since Bacteria and Archaea
can divide rather quickly and have had much more time to evolve, their diversity
by far exceeds that of eukaryotes (the members of Eucarya). Our human perception
is that plants and animals are completely unlike each other, and so are, say, insects
and mammals, as they are strikingly different even at first sight. The diversity of
microbes, however, cannot be judged from their looks. Only when zooming in at
their genetic material do we appreciate their diversity. In a phylogenetic tree, which
depicts genetic lineages, the microbial world is dominant and so diverse that, on
the drawn scale of diversity as shown in Fig. 1.1, plants and animals actually group
very close together.
What is Biological Information and Where Does It Come From?
It is obvious that children often resemble their parents, and for thousands of years,
humans have wondered about how hereditary traits are passed from one generation
to the next. Several hundred years ago, it was thought that sperm cells contained
‘little people’ inside of them, which then somehow enlarged to become children.
Although this concept proved to be incorrect, the subsequently proven nature of
heredity builds on the underlying concept of an organism inheriting from its par-
ents a complete ‘blueprint’ that determines its outcome. The ongoing debate about
‘nature vs. nurture’ demonstrates that the environment too, to some degree, deter-
mines the outcome of reproduction. Original models of genetic blueprints did not
allow for such environmental effects, and although the impact of environmental
factors acting on embryonic development is still being investigated, it is clear that
life is a dance of interactions between the genetic material and the environment.
The Physical Basis of Heredity
In 1866, the Czech monk Gregor Mendel proposed that there were physical units
of inheritance. These were responsible for attributes such as how tall or short an
organism was, as well as other characteristics. Mendel proposed a theoretical unit of
inheritance, which was later called a gene. The chemical structure of the four bases present in DNA
was determined a few years later, by Albrecht Kossel, although at the time there was
no link between DNA and these so-called genes. It took many decades of experimental
detective work to determine that the physical basis of the genes causing Mendel’s
traits was the activity of specific proteins, which were in turn encoded by
DNA. Mendel’s ideas were ignored and largely forgotten, and the search for the basis
of heredity took a few detours before the impact of his observations was realized.
The first clue that genes were something real, rather than just a theoretical unit,
came from studies of fruit flies. In the early 1900s, photographs of cells from Droso-
phila showed some densely staining structures, called chromosomes, when the cells
were getting ready to divide. Careful analysis of these structures showed a correla-
tion with different characteristics of the fruit flies, and eventually it was proposed
that chromosomes contained the hereditary information. How this information was
stored in the chromosomes was not known, but for the next few decades, it was
thought that proteins were responsible for this storage of information. They were
a more likely candidate than DNA, since proteins were known to contain many
more different building blocks (i.e., 20 different amino acids) than DNA, which for
many years was considered to be a boring polymer of repeats of the four different
nucleotide subunits.
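The building-block argument can be made concrete with a little arithmetic: a polymer of length n built from an alphabet of k symbols has k^n possible sequences, which corresponds to log2(k) bits per position. The following is a minimal Python sketch of that comparison; the function and constant names are illustrative assumptions and do not come from the book.

```python
import math

# Alphabet sizes taken from the text: 4 nucleotides, 20 amino acids.
DNA_ALPHABET_SIZE = 4
PROTEIN_ALPHABET_SIZE = 20


def bits_per_position(alphabet_size: int) -> float:
    """Information capacity of one position, in bits."""
    return math.log2(alphabet_size)


def possible_sequences(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


if __name__ == "__main__":
    print("bits per nucleotide:", bits_per_position(DNA_ALPHABET_SIZE))
    print("bits per amino acid:", round(bits_per_position(PROTEIN_ALPHABET_SIZE), 2))
    for n in (1, 3, 10):
        print(n,
              possible_sequences(DNA_ALPHABET_SIZE, n),
              possible_sequences(PROTEIN_ALPHABET_SIZE, n))
```

Per position, a protein can indeed carry more information than DNA, which is why proteins looked like the better candidate. Three nucleotides together, however, already give 4^3 = 64 combinations, more than enough to specify 20 amino acids.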
What is the Genetic Material?
In 1941, George Beadle and Edward Tatum proposed the ‘one gene, one
enzyme’ hypothesis, stating that each gene somehow corresponds to an enzyme. At the
time enzymes were known to consist of proteins, and proteins were still assumed to
be the basis of genetic material. However, a few years later, Oswald Avery, Maclyn
McCarty, and Colin MacLeod demonstrated experimentally that DNA was the
material of inheritance. The initial reaction to this was skepticism, although this
work was inspirational for James Watson and Francis Crick, who in 1953 published
a model of the DNA double helix. Francis Crick was also inspired by what the
physicist Erwin Schrödinger had written in 1943: ‘We believe a gene—or perhaps
the whole chromosome fibre—to be an aperiodic solid.’ Schrödinger further com-
pared the gene to Morse code, which can encode information by using different
combinations of dots and dashes. From this came the idea that somehow the DNA
sequence was the genetic material inherited from one generation to the next and that
this contained information on how to make proteins.
In the 1950s, Watson proposed the ‘General Idea’ of molecular biology, which
consisted of three parts: the Sequence Hypothesis, describing how the amino acid
sequence in proteins is specified from DNA via RNA sequences; the Central
Dogma, which states that once the information flows from DNA to RNA to protein,
it can’t flow backwards; and the Structure/Function Relationship, which states that
the sequence of DNA (or RNA or protein) determines its shape, and the shape deter-
mines its function. All three parts have proven to be largely correct.
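The Sequence Hypothesis and the direction of information flow described above can be illustrated with a short Python sketch. It assumes the standard genetic code, a coding-strand DNA sequence read 5' to 3' from its first base, and translation that simply stops at the first stop codon; the function names and the example sequence are hypothetical and are not taken from the book.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """DNA -> RNA: the mRNA matches the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """RNA -> protein: read codons in frame until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed on DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCCAAGTAA"  # made-up example: Met-Ala-Lys followed by a stop codon
    mrna = transcribe(dna)
    print(dna, "->", mrna, "->", translate(mrna))
    # prints: ATGGCCAAGTAA -> AUGGCCAAGUAA -> MAK
```

Real gene prediction also has to locate start codons, choose the correct strand and reading frame, and allow for alternative genetic codes; none of that is attempted here.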
Cells Obey the Laws of Chemistry
Despite the protests of the Intelligent Design community, more than 40 years after
the first publication of Watson’s ‘Molecular Biology of the Gene,’ it is still clear
that cells obey the laws of chemistry and physics. We can understand the flow of
biological information in terms of coding sequences, which can be used to model
structures and then functions, with no need to resort to some sort of unexplainable
vital life force for the molecular basis of life. Maybe it is easier to understand or
accept this when dealing with microbes than with complex eukaryotes. However,
the basic biological processes taking place in, say, a mammalian nerve cell are the
same as in a bacterial cell. We can’t yet completely understand complex biological
processes such as thought or memory, but the underlying biology obeys the laws
of chemistry, just as the movement of a microbe towards a food source is ruled by
chemical processes.