Read Me
Filename: README
Created: July 23, 2011
Authors: Ivan Erill <erill@umbc.edu>, Bob Forder <rforder1@umbc.edu>
Estremo 0.9.1 Release Notes
This is a beta (non-public) release. E-mail me with any bugs, annoying things,
or suggestions.
Contents:
1. Introduction
2. Organism Components
3. The Genetic Algorithm
4. Using the program
1. Introduction
Estremo is an evolutionary simulation of regulated motifs. Grunt uses a genetic
algorithm to simulate the co-evolution of transcription factors and their binding
sites. Grunt uses multi-layer perceptrons which are encoded in the genomes of
organisms to simulate the behavior of transcription factors. This file is a crash
course in the usage and terminology employed within the model. More information
can be found in the sample configuration file and in the design document PDF,
which should accompany this file.
2. Organism Components
Recognizer
The recognizer reads a sequence of numbers and outputs a score for that
sequence. Presently, the recognizer is implemented as a multi-layer perceptron.
Translator
The translator is responsible for reading coding regions of the genome and
constructing recognizers from them.
Sampler
The sampler reads a sequence of bases from a non-coding region of the genome and
converts it to a sequence of numbers that can be interpreted by a recognizer.
The window size of a sampler is the width of the sequence that it converts.
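A minimal sketch of how a sampler might slide its window over a non-coding
region. The numeric encoding (BASE_CODES) and the function name are
hypothetical; the actual encoding used by the program is described in the
design document, not here.

```python
# Hypothetical base-to-number encoding, chosen only for illustration.
BASE_CODES = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def sample_windows(ncr, window):
    """Yield (position, numeric window) for every window-sized slice of an NCR."""
    for pos in range(len(ncr) - window + 1):
        yield pos, [BASE_CODES[base] for base in ncr[pos:pos + window]]


# When the NCR is exactly one window wide, the only position is 0.
for pos, numbers in sample_windows("TCGAC", 5):
    print(pos, numbers)
```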
Background
The background is the portion of the genome that neither encodes a recognizer
nor is regulated by one (that is, it is not a non-coding region). The scores
that a recognizer assigns to background sequences are of vital importance to
the fitness function. There are two ways of generating the background:
i. One large genomic sequence is generated in advance from a provided GC%
or CUB. That sequence is then sampled at random positions. This is the
more memory-intensive option, but also the faster one.
ii. "Online mode": each time the recognizer evaluates a background sequence,
the sequence is generated on the fly using the provided GC% or CUB. This
uses very little extra memory (and ensures that the recognizer cannot
memorize background sequences) but is slightly slower.
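The two modes can be sketched roughly as follows. This is an illustration
only: it covers the GC% case (not CUB), and the class and function names are
made up for the example rather than taken from the program.

```python
import random


def random_sequence(length, gc_fraction, rng):
    """Draw bases so that G and C together occur with probability gc_fraction."""
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


class PregeneratedBackground:
    """Mode i: build one large sequence up front, then sample windows from it."""

    def __init__(self, genome_length, gc_fraction, rng):
        self.genome = random_sequence(genome_length, gc_fraction, rng)
        self.rng = rng

    def sample(self, window):
        start = self.rng.randrange(len(self.genome) - window + 1)
        return self.genome[start:start + window]


class OnlineBackground:
    """Mode ii: generate a fresh window every time one is requested."""

    def __init__(self, gc_fraction, rng):
        self.gc_fraction = gc_fraction
        self.rng = rng

    def sample(self, window):
        return random_sequence(window, self.gc_fraction, self.rng)
```

Both classes expose the same sample(window) call, so a recognizer could be
evaluated against either one without knowing which mode is in use.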
3. The Genetic Algorithm
Grunt creates a population of organisms and utilizes a genetic algorithm to
simulate their evolution. It currently uses k-tournament parental selection
with total replacement. The fitness of each organism is calculated in the
following way:
i. Each recognizer produces a score for each position in each non-coding
region (NCR). The binding level for that NCR is equal to the highest
score assigned to any position in that NCR (the maximum score over all
positions).
ii. The average output of each recognizer is determined by calculating the
average score that the recognizer produces over a large number of
background sequences.
iii. The expression level of each NCR is determined by summing, over all
recognizers, the ratio of that recognizer's binding level for the NCR to
its average score over the background. That is,
Expression level =
Sum over all recognizers (Binding level of recognizer / Average)
iv. The expression level of each NCR is then compared to a target value.
This target value represents a desired transcription level of the
protein associated with the NCR.
If the expression level is greater than the target value, then
f = expression level / target. Otherwise f = target / expression level.
The fitness of the organism is equal to the sum of f for all NCRs less
the number of NCRs (to ensure that fitness = 0 is always perfect).
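Steps i through iv can be sketched as follows. The function names and the
layout of the input (a list of per-position scores for each recognizer on
each NCR) are assumptions made for the example, not the program's internals.

```python
def binding_level(position_scores):
    """Step i: the binding level is the maximum score over all positions."""
    return max(position_scores)


def expression_level(binding_levels, background_averages):
    """Step iii: sum of binding level / background average over recognizers."""
    return sum(b / avg for b, avg in zip(binding_levels, background_averages))


def ncr_ratio(expression, target):
    """Step iv: ratio of the larger to the smaller value, so f >= 1."""
    if expression > target:
        return expression / target
    return target / expression


def organism_fitness(scores_by_ncr, background_averages, targets):
    """scores_by_ncr[i][r] holds recognizer r's per-position scores on NCR i.

    background_averages[r] is the step ii average for recognizer r.
    """
    total = 0.0
    for ncr_scores, target in zip(scores_by_ncr, targets):
        bindings = [binding_level(scores) for scores in ncr_scores]
        total += ncr_ratio(expression_level(bindings, background_averages), target)
    # Subtract the number of NCRs so that a perfect organism scores 0.
    return total - len(targets)
```

For instance, a single NCR whose expression level is twice (or half) its
target contributes f = 2, giving a fitness of 1.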
Larger fitness values indicate poorer regulation. Thus, the goal of the GA is
to minimize the fitness values of the organisms.
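Because lower fitness is better, a k-tournament picks the contender with the
smallest fitness value. The sketch below is one common way to write
k-tournament selection with total replacement; drawing contenders without
replacement, using a single parent per offspring, and the reproduce callback
are all assumptions of the example.

```python
import random


def tournament_select(population, fitnesses, k, rng):
    """Pick k distinct organisms at random and return the one with lowest fitness."""
    contenders = rng.sample(range(len(population)), k)
    winner = min(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def next_generation(population, fitnesses, k, reproduce, rng):
    """Total replacement: every slot is filled by the offspring of a tournament winner.

    reproduce is a hypothetical callback that copies and mutates a parent.
    """
    return [
        reproduce(tournament_select(population, fitnesses, k, rng))
        for _ in population
    ]
```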
4. Using the program
Input
Grunt is a command-line application (there are no plans for a GUI yet). The
program is fairly simple to execute. All input is passed to the program via a
configuration file. An example configuration file, which contains (some)
explanations of the many parameters, is provided along with this program.
Grunt is invoked as follows:
grunt <config filename> <output filename>
Which brings us to ...
Output
For each generation, Grunt outputs statistics on the fittest organism in the
population. It writes this information to a user-specified CSV file (optionally
including the binding motif associated with each recognizer). The order of the
data in each row of the CSV file is as follows:
IC-1, MI-1, AVG-1, IC-2, MI-2, AVG-2, ..., IC-n, MI-n, AVG-n, Fitness
Here IC-n is the information content (Rsequence) of the binding motif
associated with the n-th recognizer, MI-n is the sum of the pairwise mutual
information of that motif, and AVG-n is the average score assigned by the n-th
recognizer over the background. The row ends with the fitness of the organism.
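A row in this layout can be split back into per-recognizer statistics with a
few lines of Python. This sketch assumes a row that contains only the
numerical fields (no binding motifs appended); the function name and the
dictionary keys are chosen for the example.

```python
import csv


def parse_stats_row(line):
    """Split one CSV row into a list of (IC, MI, AVG) records plus the fitness."""
    fields = next(csv.reader([line], skipinitialspace=True))
    values = [float(field) for field in fields]
    # Three values per recognizer, plus one trailing fitness value.
    if len(values) % 3 != 1:
        raise ValueError("expected 3 values per recognizer plus a fitness value")
    recognizers = [
        {"ic": values[i], "mi": values[i + 1], "avg": values[i + 2]}
        for i in range(0, len(values) - 1, 3)
    ]
    return recognizers, values[-1]
```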
When the option to output the binding motif of each recognizer is enabled,
those binding motifs immediately follow the numerical data. Each site
in the motif is followed by three numbers. The first is the position of the
binding site in the non-coding region (if the size of the NCR is the same as the
sampler window size, then this will always be zero and all recognizers will
have identical binding motifs). The second is the binding level of the
recognizer at that site. The third is the expression level of the NCR which
contains that site. For example:
Site     NCR Pos.   Binding Lvl   NCR Exp. Level
TCGAC    0          0.9529        1.1425
GCGGG    0          0.9493        1.1382
GCGAA    0          0.9828        1.1783
GGTTA    0          0.8070        0.9675
GTTTC    0          0.9322        1.1177
CGATT    0          0.9539        1.1437
ACCAC    0          0.8831        1.0588
ATAGG    0          0.9136        1.0953
GCCAC    0          0.9954        1.1934
CCATG    0          0.9292        1.1141
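As a quick consistency check on the example above: if it came from a run with
a single recognizer (an assumption; this file does not say so), then the
expression level is simply binding level / background average, so dividing the
binding level by the expression level should give roughly the same background
average for every site. A small sketch, with the rows copied from the example:

```python
EXAMPLE_ROWS = """\
TCGAC 0 0.9529 1.1425
GCGGG 0 0.9493 1.1382
GCGAA 0 0.9828 1.1783
GGTTA 0 0.8070 0.9675
GTTTC 0 0.9322 1.1177
CGATT 0 0.9539 1.1437
ACCAC 0 0.8831 1.0588
ATAGG 0 0.9136 1.0953
GCCAC 0 0.9954 1.1934
CCATG 0 0.9292 1.1141
"""


def implied_averages(text):
    """Return binding level / expression level for each whitespace-separated row."""
    ratios = []
    for row in text.splitlines():
        site, position, binding, expression = row.split()
        ratios.append(float(binding) / float(expression))
    return ratios


for ratio in implied_averages(EXAMPLE_ROWS):
    print(round(ratio, 3))
```

Under the single-recognizer assumption, every row implies a background average
close to 0.834.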