Computational Molecular Biology


An Introduction

WILEY SERIES IN MATHEMATICAL AND COMPUTATIONAL BIOLOGY

Editor-in-Chief

Simon Levin
Department of Ecology and Evolutionary Biology, Princeton University, USA

Associate Editors

Zvia Agur, Tel-Aviv University, Israel


Odo Diekmann, University of Utrecht, The Netherlands
Marcus Feldman, Stanford University, USA
Bryan Grenfell, Cambridge University, UK
Philip Maini, Oxford University, UK
Martin Nowak, Oxford University, UK
Karl Sigmund, University of Vienna, Austria

CHAPLAIN/SINGH/MCLACHLAN—On Growth and Form: Spatio-temporal Pattern Formation in Biology


CHRISTIANSEN—Population Genetics of Multiple Loci
CLOTE/BACKOFEN—Computational Molecular Biology: An Introduction

DIEKMANN/HEESTERBEEK—Mathematical Epidemiology of Infectious Diseases: Model Building, Analysis and Interpretation

Reflecting the rapidly growing interest and research in the field of mathematical biology, this outstanding new book series
examines the integration of mathematical and computational methods into biological work. It also encourages the advancement
of theoretical and quantitative approaches to biology, and the development of biological organisation and function.

The scope of the series is broad, ranging from molecular structure and processes to the dynamics of ecosystems and the
biosphere, but unified through evolutionary and physical principles, and the interplay of processes across scales of biological
organisation.

Topics to be covered in the series include:

• Cell and molecular biology

• Functional morphology and physiology

• Neurobiology and higher function

• Immunology

• Epidemiology

• Ecological and evolutionary dynamics of interacting populations

A fundamental research tool, the Wiley Series in Mathematical and Computational Biology provides essential and invaluable
reading for biomathematicians and development biologists, as well as graduate students and researchers in mathematical biology
and epidemiology.

Computational Molecular Biology


An Introduction

Peter Clote

Department of Computer Science and Department of Biology, Boston College, USA

Formerly
Ludwig-Maximilians-Universität München, Germany

Rolf Backofen

Ludwig-Maximilians-Universität München, Germany


Copyright ©2000 John Wiley & Sons Ltd


Baffins Lane, Chichester,
West Sussex, PO19 1UD, England
National 01243 779777
International (+44) 1243 779777

e-mail (for orders and customer service enquiries): [email protected]

Visit our Home Page on http://www.wiley.co.uk or http://www.wiley.com

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or
by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright,
Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency, 90 Tottenham Court
Road, London W1P 9HE, UK, without the permission in writing of the Publisher and the copyright owner, with the exception of
any material supplied specifically for the purpose of being entered and executed on a computer system, for the exclusive use by
the purchaser of the publication.

Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley
& Sons is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the
appropriate companies for more complete information regarding trademarks and registration.

Other Wiley Editorial Offices

John Wiley & Sons, Inc., 605 Third Avenue,
New York, NY 10158–0012, USA

Wiley-VCH Verlag GmbH
Pappelallee 3, D-69469 Weinheim, Germany

Jacaranda Wiley Ltd, 33 Park Road, Milton,
Queensland 4064, Australia

John Wiley & Sons (Asia) Pte Ltd, 2 Clementi Loop #02-01,
Jin Xing Distripark, Singapore 129809

John Wiley & Sons (Canada) Ltd, 22 Worcester Road,
Rexdale, Ontario, M9W 1L1, Canada

Library of Congress Cataloging-in-Publication Data

Clote, Peter.
Computational biology : a self contained approach to bioinformatics
/ Peter Clote, Rolf Backofen
p. cm – (Wiley series in mathematical and computational biology)
Includes bibliographical references (p.)
ISBN 0-471-87251-2 (alk. paper) – ISBN 0-471-87252-0 (pbk.: alk. paper)
1. Genetics—Mathematical Models. 2. Molecular biology—
Mathematical models. I. Backofen, Rolf. II. Title. III. Series.

QH438.4.M3 C565 2000
572.8'01'5118-dc21 00-038169

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library

ISBN 0-471-87251-2
ISBN 0-471-87252-0

Some content in the original version of this book is not available for inclusion in this electronic edition.

Produced from PostScript files supplied by the authors.


Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire.
This book is printed on acid-free paper responsibly manufactured from sustainable forestry
in which at least two trees are planted for each one used for paper production.

To my wife, Marie, and to my son, Nicolas. (P.C.)


To my wife, Doris, and my children, Ina and Lara. (R.B.)

Contents

Series Preface xi

Preface xiii

1 Molecular Biology 1

1.1 Some Organic Chemistry 3

1.2 Small Molecules 4

1.3 Sugars 6

1.4 Nucleic Acids 6

1.4.1 Nucleotides 6

1.4.2 DNA 8

1.4.3 RNA 13

1.5 Proteins 14

1.5.1 Amino Acids 14

1.5.2 Protein Structure 15

1.6 From DNA to Proteins 17

1.6.1 Amino Acids and Proteins 17

1.6.2 Transcription and Translation 19

1.7 Exercises 21

Acknowledgements and References 22

2 Math Primer 23

2.1 Probability 23

2.1.1 Random Variables 25

2.1.2 Some Important Probability Distributions 27

2.1.3 Markov Chains 38

2.1.4 Metropolis–Hastings Algorithm 43


2.1.5 Markov Random Fields and Gibbs Sampler 47

2.1.6 Maximum Likelihood 52

2.2 Combinatorial Optimization 53

2.2.1 Lagrange Multipliers 53

2.2.2 Gradient Descent 54

2.2.3 Heuristics Related to Simulated Annealing 54

2.2.4 Applications of Monte Carlo 55

2.2.5 Genetic Algorithms 60

2.3 Entropy and Applications to Molecular Biology 61

2.3.1 Information Theoretic Entropy 62

2.3.2 Shannon Implies Boltzmann 63


2.3.3 Simple Statistical Genomic Analysis 66

2.3.4 Genomic Segmentation Algorithm 69

2.4 Exercises 72

2.5 Appendix: Modification of Bezout's Lemma 77

Acknowledgements and References 79

3 Sequence Alignment 81

3.1 Motivating Example 83

3.2 Scoring Matrices 84

3.3 Global Pairwise Sequence Alignment 88

3.3.1 Distance Methods 88

3.3.2 Alignment with Tandem Duplication 99

3.3.3 Similarity Methods 110

3.4 Multiple Sequence Alignment 111

3.4.1 Dynamic Programming 112

3.4.2 Gibbs Sampler 112

3.4.3 Maximum-Weight Trace 114

3.4.4 Hidden Markov Models 117

3.4.5 Steiner Sequences 117

3.5 Genomic Rearrangements 118

3.6 Locating Cryptogenes and Guide RNA 120

3.6.1 Anchor and Periodicity Rules 122

3.6.2 Search for Cryptogenes 122

3.7 Expected Length of gRNA in Trypanosomes 123

3.8 Exercises 128

3.9 Appendix: Maximum-Likelihood Estimation for Pair Probabilities 132

Acknowledgements and References 133


4 All about Eve 135

4.1 Introduction 135

4.2 Rate of Evolutionary Change 137

4.2.1 Amino Acid Sequences 137

4.2.2 Nucleotide Sequences 139

4.3 Clustering Methods 144

4.3.1 Ultrametric Trees 147

4.3.2 Additive Metric 152

4.3.3 Estimating Branch Lengths 156

4.4 Maximum Likelihood 157

4.4.1 Likelihood of a Tree 159

4.4.2 Recursive Definition for the Likelihood 160

4.4.3 Optimal Branch Lengths for Fixed Topology 162

4.4.4 Determining the Topology 166

4.5 Quartet Puzzling 166

4.5.1 Quartet Puzzling Step 169

4.5.2 Majority Consensus Tree 170

4.6 Exercises 171


Acknowledgements and References 173

5 Hidden Markov Models 175

5.1 Likelihood and Scoring a Model 177

5.2 Re-estimation of Parameters 180

5.2.1 Baum–Welch Method 181

5.2.2 EM and Justification of the Baum–Welch Method 184

5.2.3 Baldi–Chauvin Gradient Descent 187

5.2.4 Mamitsuka's MA Algorithm 191

5.3 Applications 193

5.3.1 Multiple Sequence Alignment 193

5.3.2 Protein Motifs 194

5.3.3 Eukaryotic DNA Promoter Regions 195

5.4 Exercises 197

Acknowledgements and References 198

6 Structure Prediction 201

6.1 RNA Secondary Structure 202

6.2 DNA Strand Separation 213

6.3 Amino Acid Pair Potentials 223

6.4 Lattice Models of Proteins 228

6.4.1 Monte Carlo and the Heteropolymer Protein Model 231

6.4.2 Genetic Algorithm for Folding in the HP Model 233

6.5 Hart and Istrail's Approximation Algorithm 234

6.5.1 Performance 234

6.5.2 Lower Bound 236

6.5.3 Block Structure, Folding Point, and Balanced Cut 239

6.6 Constraint-Based Structure Prediction 243


6.7 Protein Threading 246

6.7.1 Definition 246

6.7.2 A Branch-and-Bound Algorithm 249

6.7.3 NP-hardness 258

6.8 Exercises 259

Acknowledgements and References 261

Appendix A Mathematical Background 263

A.1 Asymptotic complexity 263

A.2 Units of Measurement 263

A.3 Lagrange Multipliers 264

Appendix B Resources 265

B.1 Web Sites 265

B.2 The PDB Format 266

References 269

Index 281

Index

absolute performance 236

addition law 24

additive metric 152

additive tree metric 147

additivity of alignments 90, 95

adenine 8

adenosine triphosphate (ATP) 4–6

alcohol 5

alignment 88, 90–5, 97–110

    additivity 90, 95

    distance 88, 90, 100

alleles 11–12

amine 5

amino acids 6, 14–15, 17–19

    codes 18

    pair potentials 35, 223–8

    pair probabilities 86–8

    sequences 137–9

    substitution matrix methods 139

amino group 4, 5, 14

aminoacyl-tRNA synthetase 20, 21

Amoeba dubia 12

anchor region rules 122

antiparallel β sheets 16

Archaea 1

archaebacteria 1

asymptotic complexity 263

asymptotic performance 236

Australopithecus 135

Avogadro's number 38

B
back mutations 137, 138

backtrack 210

balanced cut 239–42

balanced state 44

Baldi–Chauvin gradient descent 187–91

Baldi–Chauvin updates 188–91

bases, chemical forms 8

basic U-folds 239

Baum–Welch method 180–8

Baum–Welch parameter 184

Baum–Welch score 178, 179

Bayes' rule 25

Bender's theorem 205

Bernoulli random variable 27, 28

Bernoulli trial 27

β sheet 16

Bezout's Lemma 77–9

binary phylogenetic trees 145–7

binary trees 144, 145, 166

binomial coefficients 24

binomial distribution 27–8

bioinformatics 2

block-respecting codes 56, 57, 59

block structure 239–42

block-structured code 56

BLOSUM matrices 88

Boltzmann distribution 35–8, 45, 46, 181, 221

Boltzmann probability 45, 46

Boltzmann probability distribution 63, 64

Boltzmann's constant 38, 66

Boltzmann's law 63

boolean cellular automaton 74

Box–Muller algorithm 32

branch-and-bound algorithm 249–58

branch lengths 156–7, 162–5

Brookhaven Protein Data Bank (PDB) 266


C

Cantor–Bendixson derivative 207

carbohydrates 4

carboxyl group 4, 5, 14

carboxylic acid 5

Catalan numbers 204

CATH database 266

Cavalli-Sforza–Edwards theorem 145–7

central limit theorem 31

chaperones 21

chloroplast DNA (cpDNA) 140

chromosomal duplication 119

chromosomal rearrangement 119

chromosomes 9, 12, 60, 119, 233–4

clustering methods 144–57

codons 17

combinatorial optimization 53–61

    exercises 73–5

complete maximum-weight trace (CMWT) formalization 114

computational biology 2

conditional likelihood 161–2

conditional probability 25, 49–50, 181

connected neighbors 229

constraint-based structure prediction 243–6

core model 247–8, 259

covalent bond 3

Cro Magnon 135–6

crossover 61

cryptogenes 120–3

cyanobacteria 1

cytosine 8

Dempster et al. theorem 186

deoxyribose 7

dinucleotide entropy 67–8

directed graph 144

discrete Markov model 175

distance matrix 94, 154

disulfide bonds 17

divergence 67

DNA 2, 8–12

DNA replication 21

DNA strand separation 213–23

Drosophila 197

duplication 119

dynamic programming 112

dynamic programming algorithm 107

edit distance 88–90

edit operation 89

energy functions 213

energy matrix computation 210

enthalpy 66

entropy 61–72

    exercises 75–6

    information theoretic 62–3

equilibrium distribution 42, 45, 46

ergodic state 44

error distance 192

Escherichia coli 1

ester 5

Eukarya 2

eukaryotes 1, 20

eukaryotic DNA 214

    promoter regions 195–7

    promoter sequence 196

evolution rates 135–74

    change rate 137–44

    exercises 171–3

expectation maximization 180

expectation maximization algorithm 184–7

expected number of transitions 180, 182

exponential distribution 30, 33–4

extrachromosomal element (ECE) 9

Farris transformed distance method 154


fatty acids 4

Feller theorem 34

fibrinopeptides 140

fission 119

Fitch–Margoliash method 156–7

foldicity 231

folding 233–4

    hydrophobic force 235

folding point 239–42

forward method 178

forward variable, definition 178–9

fusion 119

gap function 111

gap penalty 94–5, 111

Gaussian distribution 30

Geman–Geman theorem 51

gene 11

GENEMARK 47

genetic algorithms 60–1, 233–4


genetic code 18, 19

    fault tolerant 55–60

    optimality 55–60

genome 11

genomic analysis 66–8

genomic rearrangements 118–20

genomic segmentation algorithm 69–72

genomic signature 68

geometric distribution 28–9

Gibbs distribution 47–9, 51

Gibbs free energy 38

Gibbs sampler 47–52, 112

global pairwise sequence alignment 88–111

Gotoh algorithm 82, 100–2

Gotoh theorem 96

gradient descent method 54, 180

GU base pairs 205, 209

guanine 8

guide RNA (gRNA) 13, 20, 120–3, 123–8

Haemophilus influenzae 67, 68

Hamming distance 205

Hart–Istrail approximation algorithm 234–42

heteropolymer protein model 231

hidden Markov models (HMM) 117, 175–99

    applications 193–7

    exercises 197–8

    urn model 176

Homo erectus 135

Homo habilis 135

homologous modeling 201

homologous proteins 83–4

homology testing 81

hydrocarbon molecule 4

hydrogen bonds 3, 9, 17

hydrophilic amino acid 229

hydrophilic molecules 3

hydrophobic amino acid 229

hydrophobic force 4, 17

hydrophobic molecules 4

hydroxyl group 4, 5

hypergeometric distribution 32

information (entropy) 62

information flow 2

information theoretic entropy 62–3

interaction graph 248–9

inter-chromosomal events 119

internal energy 66

intra-chromosomal events 119

inversion 119

Jaccard's index 76

Jensen-Shannon divergence 69–70

Kececioglu, Li, Tromp algorithm 118

Kececioglu theorem 116

Kronecker δ-function 144, 158

L. tarentolae 121

Lagrange multipliers 53–4, 59, 63, 64, 132, 219, 264

lattice connectivity constant 236

lattice models of proteins 228–34

Lawrence, Altschul, Boguski, Liu, Neuwald, Wootton algorithm 113

least common ancestor 154

likelihood 177–80

    recursive definition 160–2

linking number 214

local alignments 111

local move set 231–2


log odds ratios 86

majority consensus tree 170–1

Mamitsuka's MA algorithm 191–3

Mamitsuka's updates 192–3

Markov chain 38–43, 127, 140, 141, 220

    definition 176

    irreducible 39

    reversible 42

    stationary 39, 42

Markov chain Monte Carlo algorithm 43

Markov matrix 141

Markov model 125

    definition 177

    order 176

Markov process 140

Markov property 140–1, 176

Markov random fields 47–51

mathematical concepts 23–79

mathematical models 23

maximal entropy probability distribution 65


maximum entropy 66

maximum likelihood estimation 52–3, 117, 157–66, 184

maximum-likelihood estimation, pair probabilities 132–3

maximum-weight trace 114–17

mean square difference 56

meiosis 12, 21

messenger RNA (mRNA) 13, 20, 120

Methanococcus jannaschii 1, 2, 9, 67–70, 266

methionine 21

metric 147

    definition 90

Metropolis et al. theorem 46

Metropolis–Hastings algorithm 35, 37, 43–7

mitochondrial DNA (mtDNA) 136, 140

mitosis 12, 21

molecular biology

    exercises 21–2

    overview 1–22

molecular fossils 13

Monte Carlo algorithm 43, 220

Monte Carlo applications 55–60

Moore automaton 125, 127

motifs 16

multiloops 207

multinomial coefficients 24

multinomial distribution 28

multiple sequence alignment 111–18, 193

multiregional model 135

multivariate function 186–7

mutations 137, 138

Mycoplasma genitalium 68

Needleman–Wunsch algorithm 107

Needleman–Wunsch edit distance 91–4


neighbor relation 166

neighborhood system 44

net pairwise potential 225

neutral networks 203, 205

neutral substitutions 139

non-covalent bond 3

normal distribution 30–1

normalized specific amino acid distance frequency 225

NP-hardness 258

nuclear magnetic resonance (NMR) studies 226

nucleic acids 6–13

nucleotide entropy 66–8

nucleotide sequences 66, 139–44

nucleotides 4–8

    forms 8

Nussinov–Jacobson matrix 208

odds ratio 86

oligonucleotides 6

open reading frame (ORF) 12

operational taxonomic unit (OTU) 137

ordering constraints 248

organic chemistry 3

overlay matrices 100

pair group method (PGM) 148

pair probabilities, maximum-likelihood estimation 132–3

PAM matrices 86–8, 139, 140

parallel β sheets 16

parallel mutations 137, 138

partition function 43, 48, 65

PDB format 266

peptide bond 14

percent minimization 59

performance, definition 234–6

periodicity rules 122


persistence, definition 39

phosphodiester bond 8

phylogenetic trees 136, 145, 148

pivot moves 232

Poisson distribution 29–30, 34

Poisson process 138

polar requirement 17

polarity index 58

polymer, definition 4

polysaccharides 4

positive transition matrix 42

potential energy function 48

primary structure 17, 202

principle of insufficient reason 63

probability density function 25

probability distributions 27–38

probability function 24

probability theory 23–53

    exercises 72–3

prokaryotes 1, 19, 20

protein 2

protein data bank (PDB) 266

protein folding problem 201; see also folding

protein motifs 194–5

protein structure 15–17

    prediction 201–62

protein threading 202, 246–59

    definition 246–9

proteins 14–19

Protokarya 2

Pulley Principle 162

purines 8

pyrimidines 8

quaternary structure 17

quartet puzzling step 166–70

quartet trees 166–8

Ramachandran plot 15

random boolean cellular automaton 74

random sequence 118

random variables 25–6, 31

reciprocal translocation 119

record-to-record travel algorithm (RRT) 55

recursion equation 92, 95, 104–7

re-estimation of parameters 180

reference amino acid distance frequency 224

relative threading 253

restriction enzymes 81–2

reverse transcriptases 83–4

reversible Markov process 158


ribose 7

ribosomal RNA (rRNA) 13, 21

ribosomes 21

RNA 2, 13

RNA polymerase 19, 195

RNA secondary structure 202–13

root mean square deviation (RMSD) 156

roulette wheel technique 61

Saccharomyces cerevisiae 266

saddlepoint 52

salt bridges 17

SCOP database 266

scoring a model 177–80

scoring function 249, 259

scoring matrices 84–6

scoring subsequence 111

secondary structure 17, 202

    elements 16

segment algorithm 71

segmentation algorithm 32

selenocysteine 56

sequence alignment 81–134

    example 83–4

    exercises 128–32

sequence space 205

Shannon entropy function 64

Shannon's formula 62

shape space 205

shuffle algorithm 61

shuffled-codon codes 56, 58

similarity methods 110–11

simulated annealing 43–4, 46, 220

    heuristics related to 54–5

Sinclair theorem 43

single-molecule DNA sequencing 117


small molecules 4–6

small nuclear RNA (snRNA) 13

Smith–Waterman local sequence alignment 120

spacing constraints 248

specific amino acid distance frequencies 225

standard deviation 26

standard error 31

statistical model 175

statistical significance 69

StatSignificance algorithm 71

Steiner sequences 117–18

Stirling's approximation 146

Stirling's formula 24–5, 62

stochastic matrix 38

Strimmer, von Haeseler algorithm 168

structure prediction 201–62

    constraint-based 243–6

    exercises 259–62

sugar molecule 4

sugar transport proteins 195

sugars 6

sum-of-pairs multiple sequence alignment problem 114

supercoiled DNA 218, 220

supersecondary structures 16

SWISS-PROT 266

synonymous substitutions 139

syntenic distance 119, 120

synteny 119

tandem duplication 99–110

TATA box 12, 19, 195–6

taxon 137

Taylor expansion 29, 143

tertiary structure 17, 201, 202

thermal luminescence 135

threading sets 253

threshold accepting (TA) algorithm 54–5

thymine 8

topological neighbors 229

total free energy 220

total probability formula 25

trace matrix 93, 98

traceback 93, 94, 98, 107, 179, 180

transcription 19–21

transfer RNA (tRNA) 13, 20–1

transition probability functions 141

transitional mutations 140

transitions 110, 127–8

translation 19–21

transposition 119

transversion 110

transversional mutations 140

tree 145

    likelihood 159–60

    topology 166

Trypanosoma brucei 1

trypanosomes 123–8

ultrametric trees 147–52

Unger–Moult hybrid genetic algorithm 233

unit evolutionary time 138

units of measurement 263–4

UPGMA 148–9, 152, 154–5, 157

uracil 8

variance 26

Viterbi algorithm 180

Viterbi score of a model 179

WAC matrix 139

water molecule 4

Waterman, Smith and Beyer theorem 95–6

Watson–Crick base pairs 121, 124, 205, 209

Watson–Crick model 8

Watson–Crick rules 8

web sites 266

WPGMA 151, 156

Wraparound Dynamic Programming 107, 108

wraparound step 101, 102
