Bioinformatics 
Programming 
(Perl Programming) 
2010 
Davide Pisani
Bioinformatics 
• Using computers to store, organise and 
interpret biological data 
• In particular, data from high-throughput 
technologies (-omics)
High-throughput technologies 
• DNA & protein sequences and structure 
(genomics & proteomics) 
• Yeast two-hybrid screens (interactomics) 
• Microarrays (transcriptomics) 
• Metabolic networks (metabolomics)
How much sequence data is 
there? 
• 1371 published complete genomes 
• 188 ongoing archaeal genomes 
• 4941 ongoing bacterial genomes 
• 1599 ongoing eukaryotic genomes 
• 242 metagenomes
How much data in each 
genome? 
ftp://ftp.ncbi.nih.gov/refseq/release/
The human genome 
ftp://ftp.ncbi.nih.gov/refseq/release/
etc. 
(70 base pairs per line, 57 lines per page = 3,990 bases/page) 
Chromosome 1 is (about) 247,249,719 bases long, 
i.e. about 62,000 pages 
Whole genome (3.2 x 10^9 bases) = about 802,000 pages
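
The page estimate above is plain arithmetic, and a short sketch makes it easy to check. Python is used here only for illustration; the function name `pages_needed` is an assumption of this example, and the layout figures (70 bases per line, 57 lines per page) and genome lengths are the ones quoted on the slide.

```python
import math

# Page layout quoted on the slide
BASES_PER_LINE = 70
LINES_PER_PAGE = 57
BASES_PER_PAGE = BASES_PER_LINE * LINES_PER_PAGE  # 70 * 57 = 3990


def pages_needed(n_bases: int) -> int:
    """Printed pages needed for n_bases; the last page may be only partly filled."""
    return math.ceil(n_bases / BASES_PER_PAGE)


chromosome_1 = 247_249_719   # approximate length quoted on the slide
whole_genome = 3_200_000_000  # 3.2 x 10^9 bases

print("Bases per page:      ", BASES_PER_PAGE)
print("Chromosome 1 (pages):", pages_needed(chromosome_1))
print("Whole genome (pages):", pages_needed(whole_genome))
```

Rounded to the nearest thousand, the two results agree with the slide's figures of about 62,000 and 802,000 pages.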
Genome                      Base pairs      No. of genes 
Phi-X 174                   5,386           10 
Nanoarchaeum equitans       490,885         552 
E. coli                     4,639,221       4,377 
Saccharomyces cerevisiae    12,495,682      5,800 
Drosophila melanogaster     122,653,977     13,379 
Homo sapiens                3.2 x 10^9      30,000 
Protopterus aethiopicus     1.3 x 10^9      ? 
Psilotum nudum              2.5 x 10^11     ?20-25,000 
Amoeba dubia                6.7 x 10^11     ?
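
One way to read the table is to divide genome size by gene count, which gives the average number of base pairs per listed gene. The sketch below is only an illustration in Python: the helper name `bases_per_gene` is an assumption of this example, the numbers are copied from the table, and rows whose gene count is given as "?" are left out.

```python
# (genome, base pairs, number of genes) as listed in the table.
# Rows with an unknown ("?") gene count are omitted.
genomes = [
    ("Phi-X 174", 5_386, 10),
    ("Nanoarchaeum equitans", 490_885, 552),
    ("E. coli", 4_639_221, 4_377),
    ("Saccharomyces cerevisiae", 12_495_682, 5_800),
    ("Drosophila melanogaster", 122_653_977, 13_379),
    ("Homo sapiens", 3_200_000_000, 30_000),
]


def bases_per_gene(base_pairs: int, genes: int) -> float:
    """Average genome length per gene (total size / gene count), not gene length."""
    return base_pairs / genes


for name, bp, genes in genomes:
    print(f"{name:<26}{bases_per_gene(bp, genes):>12,.0f} bp per gene")
```

The ratio is an average over the whole genome, so it says nothing about the length of any individual gene. It does show that genome size grows much faster than gene count across these organisms.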
GenBank contains much more 
than just sequence data 
• Information on the organism, the 
gene, where it is expressed, and so 
forth.
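
To make this concrete, the sketch below pulls the organism, a few annotation qualifiers and the sequence out of a record laid out in the style of a GenBank flat file. It is a minimal illustration in Python under stated assumptions: the record `RECORD` is invented and heavily simplified (it is not a real GenBank entry), the function name `summarise` is hypothetical, and a real parser would need to handle many more fields and multi-line values.

```python
import re

# Invented, simplified record in GenBank flat-file style (NOT a real entry).
RECORD = """\
LOCUS       EXAMPLE01   60 bp    DNA     linear
DEFINITION  Hypothetical example record for illustration only.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
     source          1..60
                     /organism="Homo sapiens"
                     /tissue_type="liver"
     gene            1..60
                     /gene="EXAMPLE1"
ORIGIN
        1 atggccattg taatgggccg ctgaaagggt gcccgatagt tgatgacctg atcgatcgta
//
"""


def summarise(record: str) -> dict:
    """Collect the organism, single-line /key="value" qualifiers and the sequence."""
    info = {"organism": None, "qualifiers": {}, "sequence": ""}
    in_origin = False
    for line in record.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines: drop position numbers and spaces, keep letters
            info["sequence"] += re.sub(r"[^a-zA-Z]", "", line)
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.strip().startswith("ORGANISM"):
            info["organism"] = line.split(None, 1)[1].strip()
        else:
            match = re.match(r'\s+/(\w+)="([^"]*)"', line)
            if match:
                # Keep the first value seen for each qualifier name
                info["qualifiers"].setdefault(match.group(1), match.group(2))
    return info


info = summarise(RECORD)
print("Organism:   ", info["organism"])
print("Gene:       ", info["qualifiers"].get("gene"))
print("Tissue type:", info["qualifiers"].get("tissue_type"))
print("Sequence length:", len(info["sequence"]))
```

The annotation (organism, gene name, tissue) sits in the same record as the sequence itself, so a single pass over the text recovers both.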
Protein Structure
PDB: Protein Structure

Lecture 1: Perl for bioinformatics, Davide Pisani & James Cotton
