ESTReMo Code

An evolutionary simulator of transcription regulatory networks

Status: Alpha

Brought to you by: ivanerill, pon2, rforder2

File	Date	Author	Commit
bin	2011-07-23	rforder1	[r7] Copying 0.9.0 back into trunk
doc	2011-07-23	rforder1	[r7] Copying 0.9.0 back into trunk
src	2011-10-09	rforder1	[r20] Branched 0.9.1
CHANGELOG	2011-10-09	rforder1	[r20] Branched 0.9.1
Makefile	2011-10-09	rforder1	[r20] Branched 0.9.1
README	2011-10-09	rforder1	[r20] Branched 0.9.1
example.conf	2011-10-09	rforder1	[r20] Branched 0.9.1
run.slurm	2011-10-09	rforder1	[r20] Branched 0.9.1

Read Me

Filename:   README
Created:    July 23, 2011
Authors:    Ivan Erill <erill@umbc.edu>, Bob Forder <rforder1@umbc.edu>

Estremo 0.9.1 Release Notes

This is a beta (non-public) release.  E-mail me with any bugs, annoying things,
or suggestions.

Contents:

    1. Introduction
    2. Organism Components
    3. The Genetic Algorithm
    4. Using the program

1. Introduction

Estremo is an evolutionary simulation of regulated motifs.  Grunt, the Estremo
executable, uses a genetic algorithm to simulate the co-evolution of
transcription factors and their binding sites.  It uses multi-layer perceptrons,
which are encoded in the genomes of organisms, to simulate the behavior of
transcription factors.  This file is a crash course in the usage and terminology
employed within the model.  More information can be found in the sample
configuration file and in the design document PDF which should accompany this
file.

2. Organism Components

Recognizer

The recognizer reads a sequence of numbers and outputs a score for that
sequence.  Presently, the recognizer is implemented as a multi-layer perceptron.
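
To make the idea concrete, here is a minimal sketch of a recognizer as a
multi-layer perceptron.  The single hidden layer, the sigmoid activation, the
random weights and the names make_recognizer and score are all assumptions made
for illustration; they are not taken from the Estremo source.

```python
import math
import random


def make_recognizer(n_inputs, n_hidden, rng):
    """Build a one-hidden-layer perceptron with random weights (hypothetical layout)."""
    # Each weight list ends with one extra entry that acts as the bias.
    hidden = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden + 1)]
    return hidden, output


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(recognizer, inputs):
    """Return a score in (0, 1) for one sequence of numbers."""
    hidden, output = recognizer
    # zip() stops at the shorter sequence, so the trailing bias is added separately.
    activations = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, inputs)))
                   for w in hidden]
    return sigmoid(output[-1] + sum(wo * a for wo, a in zip(output, activations)))


recognizer = make_recognizer(n_inputs=4, n_hidden=3, rng=random.Random(0))
print(score(recognizer, [0.0, 1.0, 0.0, 1.0]))
```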

Translator

The translator is responsible for reading coding regions of the genome and
constructing recognizers from them.

Sampler

The sampler reads a sequence of bases in the genome (a non-coding region) and
converts it to a sequence of numbers that can be interpreted by a recognizer.
The window size of a sampler is the width of the sequence which it converts.

Background

The background is the portion of the genome which does not encode any
recognizers and is not regulated by any recognizers (that is, it is not a
non-coding region).  The scores which are assigned by a recognizer to background
sequences are of vital importance to the fitness function.  There are two ways
of generating the background.

    i.  One large genomic sequence is generated in advance from a provided GC%
        or CUB.  That sequence is then sampled at random positions.  This is the
        more memory-intensive option, but also the faster one.

    ii. "Online mode": each time the recognizer evaluates a background sequence,
        we generate it on the fly using the provided GC% or CUB.  This uses
        very little extra memory (and ensures that the recognizer cannot
        memorize background sequences) but is slightly slower.
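
The two modes can be sketched as follows.  This covers only the GC% case (CUB
is not handled), and the class and function names are hypothetical rather than
taken from the Estremo source.

```python
import random


def random_sequence(length, gc_fraction, rng):
    """Draw bases independently so that G+C makes up `gc_fraction` on average."""
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


class PregeneratedBackground:
    """Mode i: build one large sequence up front, then sample windows from it."""

    def __init__(self, genome_length, gc_fraction, rng):
        self.genome = random_sequence(genome_length, gc_fraction, rng)
        self.rng = rng

    def draw(self, window_size):
        start = self.rng.randrange(len(self.genome) - window_size + 1)
        return self.genome[start:start + window_size]


class OnlineBackground:
    """Mode ii: generate a fresh window every time one is requested."""

    def __init__(self, gc_fraction, rng):
        self.gc_fraction = gc_fraction
        self.rng = rng

    def draw(self, window_size):
        return random_sequence(window_size, self.gc_fraction, self.rng)


rng = random.Random(1)
print(PregeneratedBackground(1000, 0.5, rng).draw(5))
print(OnlineBackground(0.5, rng).draw(5))
```

Both classes expose the same draw() method, so the code that averages
recognizer scores over the background would not need to know which mode is in
use.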

3. The Genetic Algorithm

Grunt creates a population of organisms and utilizes a genetic algorithm to
simulate their evolution.  It currently uses k-tournament parental selection
with total replacement.  The fitness of each organism is calculated in the
following way:

   i.   Each recognizer produces a score for each position in each non-coding
        region (NCR).  The binding level for that NCR is equal to the highest
        score assigned to any position in that NCR (the maximum score over all
        positions).

   ii.  The average output of each recognizer is determined by calculating the
        average score that the recognizer produces over a large number of
        background sequences.

   iii. The expression level of each NCR is determined by summing over all
        recognizers, the ratio of the binding level of that recognizer to the
        average score of that recognizer.  That is,

        Expression level =
            Sum over all recognizers (Binding level of recognizer / Average)

   iv.  The expression level of each NCR is then compared to a target value.
        This target value represents a desired transcription level of the
        protein associated with the NCR.

        If the expression level is greater than the target value, then
        f = expression level / target.  Otherwise f = target / expression level.
        The fitness of the organism is equal to the sum of f for all NCRs less
        the number of NCRs (to ensure that fitness = 0 is always perfect).
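
Steps i through iv can be sketched as the short functions below.  The function
names are illustrative, and the sketch assumes that all scores, averages and
targets are positive so that the ratios are well defined.

```python
def binding_level(scores_per_position):
    """Step i: the binding level is the maximum score over all positions."""
    return max(scores_per_position)


def expression_level(binding_levels, background_averages):
    """Step iii: sum over recognizers of binding level / background average."""
    return sum(b / avg for b, avg in zip(binding_levels, background_averages))


def fitness(expression_levels, targets):
    """Step iv: sum the ratio f over all NCRs, then subtract the number of NCRs."""
    total = 0.0
    for expression, target in zip(expression_levels, targets):
        if expression > target:
            total += expression / target
        else:
            total += target / expression
    return total - len(expression_levels)


# Two NCRs that hit their targets exactly give the perfect fitness of 0.
print(fitness([1.0, 2.0], [1.0, 2.0]))
# Overshooting and undershooting a target by a factor of two cost the same.
print(fitness([2.0], [1.0]), fitness([0.5], [1.0]))
```

Because f is always at least 1, subtracting the number of NCRs makes 0 the best
possible value, and the penalty is symmetric in the ratio between expression
level and target.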

Larger fitness values indicate poorer regulation.  Thus, the goal of the GA is
to minimize the fitness values of the organisms.
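
The selection scheme named at the start of this section, k-tournament parental
selection with total replacement, can be sketched as follows.  Several details
are assumptions made for illustration: contestants are drawn without
replacement, each child has a single parent, and mutate is a hypothetical
callback standing in for whatever variation operators the program applies.

```python
import random


def tournament_select(population, fitnesses, k, rng):
    """Pick k distinct organisms at random and return the one with the lowest fitness."""
    contestants = rng.sample(range(len(population)), k)
    winner = min(contestants, key=lambda i: fitnesses[i])
    return population[winner]


def next_generation(population, fitnesses, k, mutate, rng):
    """Total replacement: every slot in the new population is filled by a child."""
    return [mutate(tournament_select(population, fitnesses, k, rng))
            for _ in range(len(population))]


population = ["a", "b", "c", "d"]
fitnesses = [3.0, 0.0, 2.0, 1.0]   # lower is better
rng = random.Random(0)
# With k equal to the population size, the fittest organism wins every tournament.
print(next_generation(population, fitnesses, 4, lambda organism: organism, rng))
```

Larger values of k increase selection pressure, since the fittest organisms win
a greater share of tournaments.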

4. Using the program

Input

Grunt is a command line application (no plans of pretty GUIs yet... ).  The
program is fairly simple to execute.  All input is passed to the program via a
configuration file.  An example configuration file is provided along with this
program which contains (some) explanations of the many parameters.  Grunt is
used like:

    grunt <config filename> <output filename>


Which brings us to ...

Output

For each generation, grunt outputs statistics on the fittest organism in the
population.  It writes this information to a user-specified CSV file (optionally
including the binding motif associated with each recognizer).  The order of the
data in each row of the CSV file is as follows:

    IC-1, MI-1, AVG-1, IC-2, MI-2, AVG-2, ..., IC-n, MI-n, AVG-n, Fitness

Where IC-n is the information content (Rsequence) of the binding motif
associated with the n-th recognizer, MI-n is the sum of the pair-wise mutual
information of the binding motif associated with the n-th recognizer, and AVG-n
is the average score assigned by the n-th recognizer over the background.  The
list ends with the fitness of the organism.
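
A row in this layout can be split back into per-recognizer values as sketched
below.  The sketch assumes a purely numeric row with no header and no binding
motif appended, and the example row is made up for illustration rather than
taken from a real run.

```python
def parse_row(row):
    """Split one CSV row into per-recognizer statistics and the final fitness."""
    values = [float(field) for field in row.strip().split(",")]
    # Three numbers per recognizer plus one trailing fitness value.
    if len(values) < 4 or len(values) % 3 != 1:
        raise ValueError("expected IC, MI, AVG triples followed by a fitness")
    recognizers = [
        {"ic": values[i], "mi": values[i + 1], "avg": values[i + 2]}
        for i in range(0, len(values) - 1, 3)
    ]
    return recognizers, values[-1]


# Made-up row for two recognizers.
recognizers, fitness = parse_row("1.5, 0.2, 0.8, 2.0, 0.1, 0.7, 0.05")
print(len(recognizers), fitness)
print(recognizers[1]["avg"])
```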

When the option is specified to output the binding motif of each recognizer as
well, those binding motifs immediately follow the numerical data.  Each site
in the motif is followed by three numbers.  The first is the position of the
binding site in the non-coding region (if the size of the NCR is the same as the
sampler window size, then this will always be zero and all recognizers will
have identical binding motifs).  The second is the binding level of the
recognizer at that site.  The third is the expression level of the NCR which
contains that site.  For example:

Site    NCR Pos.   Binding Lvl NCR Exp. Level
TCGAC   0          0.9529      1.1425
GCGGG   0          0.9493      1.1382
GCGAA   0          0.9828      1.1783
GGTTA   0          0.8070      0.9675
GTTTC   0          0.9322      1.1177
CGATT   0          0.9539      1.1437
ACCAC   0          0.8831      1.0588
ATAGG   0          0.9136      1.0953
GCCAC   0          0.9954      1.1934
CCATG   0          0.9292      1.1141
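
Motif lines like those above can be read back with a few lines of code.  The
sketch assumes the whitespace-separated layout shown in the example; how the
motif is laid out inside the actual CSV file may differ, and the function name
is hypothetical.

```python
def parse_motif(lines):
    """Return (site, position, binding level, expression level) for each site line."""
    sites = []
    for line in lines:
        fields = line.split()
        # Site lines have exactly four fields; the header line and blanks do not.
        if len(fields) != 4:
            continue
        site, position, binding, expression = fields
        sites.append((site, int(position), float(binding), float(expression)))
    return sites


example = [
    "Site    NCR Pos.   Binding Lvl NCR Exp. Level",
    "TCGAC   0          0.9529      1.1425",
    "GCGGG   0          0.9493      1.1382",
]
for site in parse_motif(example):
    print(site)
```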
