DEPARTMENT OF BIOTECHNOLOGY
KUMARAGURU COLLEGE OF TECHNOLOGY
Each question should be addressed and codes for each section should be marked correctly
P18BTI2203L: Computational Biology Laboratory    Academic Year: 2021-22
1 - Unix Commands & Scripting
Instructor: Dr. Ram K    Scribes: R18-MBT1
Answer all the following [(CO1,5), (K2)]
1. (5 points) A listing of all processes that you are currently running on the machine you are using, sorted by the command
name in reverse alphabetical order. The output should consist only of the processes you are running.
2. (5 points) The number of words in the file /usr/dict/words (*) which contain all of the letters "ass", "bae", "zer". List them individually.[1]
3. (5 points) A "long" listing of the largest 5 files in the /etc directory whose name contains the string ".conf", sorted by decreasing file size.
4. (5 points) Create multiple folders named 20MBTxxx, where xxx runs over (001..018). Copy a file called "sample" into all the folders.
5. (2 points) List all files in the tmp directory owned by root.
6. (3 points) Create a file "detail" which contains the names of those files in the My Documents directory which begin with "a" and which have been modified in the last three days.
7. (5 points) Display the date in the mm/dd/yy format, along with the present time in AM/PM.
8. (10 points) Create a file called places whose sample data is as follows and answer the questions below.[2]
bombay   india     45677  asia
karachi  pakistan  54876  Asia
nairobi  Kenya     32196  africa
(a) List the details for the countries usa, kenya and canada
(b) List the details for the continent asia, ignoring case
(c) Display the list of those countries whose population is between 40000 and 60000
(d) Extract the lines which end with "fa"
Question:  1  2  3  4  5  6  7  8   Total
Points:    5  5  5  5  2  3  5  10  40
Score:
Course Coordinator
****
[1] Note: On some Unix/Linux systems, the dictionary has the filename /usr/share/dict/words
[2] Create as many entries as you require for the solution
Experiment 1-UNIX commands
# Process monitoring
> top -d 5 -b | grep -i "COMMAND" -A 15
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1685 bioinfo   20   0 1739020 140976  60364 S  1.0  3.6   0:44.08 cinnamon
 4722 bioinfo   20   0   41784   3712   3120 R  0.4  0.1   0:00.02 top
 2318 bioinfo   20   0 1321048 249192 117016 S  0.2  6.3   1:29.26 chrome
 2379 bioinfo   20   0  585604 118740  57660 S  0.2  3.0   0:20.54 chrome
 4081 bioinfo   20   0 1026908 242260  75716 S  0.2  6.1   0:34.96 chrome
 4554 root      20   0       0      0      0 S  0.2  0.0   0:00.30 kworker/u16:0
    1 root      20   0  119784   5908   3956 S  0.0  0.1   0:01.40 systemd
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kthreadd
    4 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:0H
    6 root      20   0       0      0      0 S  0.0  0.0   0:00.03 ksoftirqd/0
    7 root      20   0       0      0      0 S  0.0  0.0   0:00.77 rcu_sched
    8 root      20   0       0      0      0 S  0.0  0.0   0:00.00 rcu_bh
    9 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 migration/0
   10 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 lru-add-drain
   11 root      rt   0       0      0      0 S  0.0  0.0   0:00.00 watchdog/0

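The top transcript shows system-wide output, including root-owned processes, ordered by CPU usage. A minimal sketch closer to what question 1 asks for is given here, assuming a ps that accepts -u and -o. The function names sort_by_command_desc and my_processes are hypothetical helpers, not part of the original record.

```shell
# Sort "PID COMMAND" lines by command name in reverse alphabetical order.
# -b ignores the leading blanks that ps uses to right-align its columns.
sort_by_command_desc() {
    sort -b -r -k2,2
}

# List only the invoking user's processes, then sort them by command name.
# "pid=" and "comm=" suppress the header so it is not sorted with the data.
my_processes() {
    ps -u "$(id -un)" -o pid=,comm= | sort_by_command_desc
}
```

Running my_processes at the prompt prints one line per process owned by the current user, with the command names in descending order.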
# Word count
> cd /usr/share/dict/
bioinfo@bioinfo-OptiPlex-380 /usr/share/dict $ ls
american-english british-english cracklib-small [Link]-wordlist words [Link]-dictionaries-common
bioinfo@bioinfo-OptiPlex-380 /usr/share/dict $ grep -i "ass" words | wc -w
710
bioinfo@bioinfo-OptiPlex-380 /usr/share/dict $ grep -i "bae" words | wc -w
8
bioinfo@bioinfo-OptiPlex-380 /usr/share/dict $ grep -i "zer" words | wc -w
144

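Since the dictionary holds one word per line, grep -c can count the matching lines directly, and dropping the count lists the words individually as the question asks. The sketch below assumes this reading of the question (each pattern counted separately); the helper names and the small stand-in word list are hypothetical.

```shell
# count_matches PATTERN FILE: number of lines in FILE containing PATTERN,
# ignoring case. grep -c counts matching lines, so wc is not needed.
count_matches() {
    grep -i -c -- "$1" "$2"
}

# list_matches PATTERN FILE: print the matching words, one per line.
list_matches() {
    grep -i -- "$1" "$2"
}

# Stand-in word list (hypothetical data) so the sketch runs anywhere.
wordlist=$(mktemp)
printf '%s\n' glass Assam brass zero blazer Baedeker apple > "$wordlist"

for pat in ass bae zer; do
    printf '%s: %s\n' "$pat" "$(count_matches "$pat" "$wordlist")"
done

rm -f "$wordlist"
```

On a real system the file argument would be /usr/share/dict/words (or /usr/dict/words), for example list_matches bae /usr/share/dict/words.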
# Long listing
bioinfo@bioinfo-OptiPlex-380 ~ $ cd /etc/
bioinfo@bioinfo-OptiPlex-380 /etc $ ls -l -a -s -s /etc/*.conf | head -5
 4 -rw-r--r-- 1 root root  3028 Nov 24  2017 /etc/[Link]
 4 -rw-r--r-- 1 root root   112 Jan 10  2014 /etc/[Link]
24 -rw-r--r-- 1 root root 23444 Apr 28  2016 /etc/[Link]
 8 -rw-r--r-- 1 root root  6488 Nov 24  2017 /etc/[Link]
 4 -rw-r--r-- 1 root root   429 Nov 24  2017 /etc/casper.conf

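By default ls orders its output by name, so piping it to head does not select the largest files. A minimal sketch that sorts by decreasing size with -S is shown here; the function name largest_conf and its arguments are hypothetical, and the glob *.conf* is an assumption chosen to match names that contain ".conf" rather than only end with it.

```shell
# largest_conf DIR [N]: long listing of the N largest entries in DIR whose
# name contains ".conf", largest first (N defaults to 5).
#   -l  long format
#   -d  list matching directories themselves, not their contents
#   -S  sort by file size, largest first
largest_conf() {
    conf_dir=$1
    conf_n=${2:-5}
    ls -ldS -- "$conf_dir"/*.conf* | head -n "$conf_n"
}
```

For the question as posed, the call would be largest_conf /etc 5.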
# Create multiple files
bioinfo@bioinfo-OptiPlex-380 ~ $ mkdir test01
bioinfo@bioinfo-OptiPlex-380 ~ $ cd test01/
bioinfo@bioinfo-OptiPlex-380 ~/test01 $ ls
bioinfo@bioinfo-OptiPlex-380 ~/test01 $ mkdir 20MBT{001..018}
bioinfo@bioinfo-OptiPlex-380 ~/test01 $ ls
20MBT001 20MBT003 20MBT005 20MBT007 20MBT009 20MBT011 20MBT013 20MBT015 20MBT017
20MBT002 20MBT004 20MBT006 20MBT008 20MBT010 20MBT012 20MBT014 20MBT016 20MBT018

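The transcript creates the folders but does not show the copy step for the file "sample". A minimal sketch that does both is given here. It uses a counting loop with printf instead of brace expansion so that it also runs in shells without {001..018} support; the function name make_folders and its two arguments are hypothetical.

```shell
# make_folders BASE SAMPLE: create BASE/20MBT001 .. BASE/20MBT018 and copy
# the file SAMPLE into each of them.
make_folders() {
    mf_base=$1
    mf_sample=$2
    mf_i=1
    while [ "$mf_i" -le 18 ]; do
        # %03d zero-pads the counter to three digits (001, 002, ...).
        mf_dir=$(printf '%s/20MBT%03d' "$mf_base" "$mf_i")
        mkdir -p "$mf_dir" && cp "$mf_sample" "$mf_dir/"
        mf_i=$((mf_i + 1))
    done
}
```

In the working directory of the transcript this would be called as make_folders . sample.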
# Files owned by root
bioinfo@bioinfo-OptiPlex-380 ~ $ ls -l /tmp/ | grep "root"
drwx------ 3 root root 4096 Feb 11 2016 [Link]-LkiuAp
drwx------ 3 root root 4096 Feb 11 2016 [Link]-Qsm9kT
bioinfo@bioinfo-OptiPlex-380 ~ $

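Filtering ls -l with grep "root" also matches entries whose group or file name contains "root". A minimal sketch that tests the owner field itself with find is shown here, assuming a find that supports -mindepth and -maxdepth (GNU, BSD and BusyBox do); the function name owned_by is hypothetical.

```shell
# owned_by DIR USER: entries directly inside DIR that are owned by USER.
# USER may be a login name or a numeric user ID.
owned_by() {
    find "$1" -mindepth 1 -maxdepth 1 -user "$2"
}
```

For question 5 the call would be owned_by /tmp root.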
# List only files starting with "A"
bioinfo@bioinfo-OptiPlex-380 ~/Documents $ ls -d a* > [Link]
bioinfo@bioinfo-OptiPlex-380 ~/Documents $ cat [Link]
[Link]
[Link]
bioinfo@bioinfo-OptiPlex-380 ~/Documents $

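The glob a* selects by name only; the "modified in the last three days" condition of question 6 needs a time test as well. A minimal sketch using find -mtime is given here, assuming a find with -maxdepth; the function name recent_a_files is hypothetical.

```shell
# recent_a_files DIR: regular files directly inside DIR whose name begins
# with "a" and whose contents were modified less than three days ago.
recent_a_files() {
    find "$1" -maxdepth 1 -type f -name 'a*' -mtime -3
}
```

Redirecting the output creates the requested file, for example recent_a_files ~/Documents > detail.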
P18BTI2203-Computational biology 21MBT011
CBL-001
# date function
bioinfo@bioinfo-OptiPlex-380 ~/Documents $ date +%D%r
08/17/[Link] PM IST

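The format +%D%r prints the date and the time with no separator between them. A minimal sketch with an explicit space and explicit fields is shown here; LC_ALL=C is an assumption made so that %p always yields AM or PM, and the function name stamp is hypothetical.

```shell
# stamp: current date as mm/dd/yy followed by the 12-hour time with AM/PM.
#   %m/%d/%y  month/day/two-digit year
#   %I:%M:%S  hour (01-12), minute, second
#   %p        AM or PM
stamp() {
    LC_ALL=C date '+%m/%d/%y %I:%M:%S %p'
}
```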
# Grep function
bioinfo@bioinfo-OptiPlex-380 ~/Documents $ egrep "usa | Kenya | canada " [Link]
nairobi Kenya 32196 africa
bioinfo@bioinfo-OptiPlex-380 ~/Documents $

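In the egrep pattern the spaces around each alternative are part of the match, so the result depends on the exact spacing in the file. A minimal sketch that compares the country column directly, ignoring case, is given here; it assumes the country is the second whitespace-separated field, as in the sample data, and the function name countries is hypothetical.

```shell
# countries FILE: rows of FILE whose second field (the country) is usa,
# kenya or canada, compared case-insensitively.
countries() {
    awk 'tolower($2) ~ /^(usa|kenya|canada)$/' "$1"
}
```

With the file from question 8 the call would be countries places.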
# Remove case sensitive grep
bioinfo@bioinfo-OptiPlex-380 ~/Documents $ grep -i "asia" [Link]
bombay india 45677 asia
karachi pakistan 54876 Asia
bioinfo@bioinfo-OptiPlex-380 ~/Documents $

# population between 40K to 60K
bioinfo@bioinfo-OptiPlex-380 ~/Documents $ cat [Link] | grep "[4-6][0-9][0-9][0-9][0-9]"
bombay india 45677 asia
karachi pakistan 54876 Asia
bioinfo@bioinfo-OptiPlex-380 ~/Documents $

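The pattern [4-6][0-9][0-9][0-9][0-9] accepts any five-digit run starting with 4, 5 or 6, so values from 60001 to 69999 would also pass. A minimal sketch that compares the population numerically is shown here; it assumes the population is the third field, as in the sample data, and the function name population_between is hypothetical.

```shell
# population_between FILE LOW HIGH: rows of FILE whose third field (the
# population) lies in the closed range LOW..HIGH.
# Adding 0 forces awk to compare the values as numbers, not as strings.
population_between() {
    awk -v lo="$2" -v hi="$3" '$3 + 0 >= lo + 0 && $3 + 0 <= hi + 0' "$1"
}
```

For question 8(c) the call would be population_between places 40000 60000.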
# Search lines with "ia"
bioinfo@bioinfo-OptiPlex-380 ~/Documents $ grep ia [Link]
bombay india 45677 asia
karachi pakistan 54876 Asia
bioinfo@bioinfo-OptiPlex-380 ~/Documents $

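Question 8(d) asks for lines that end with a given string, which needs the end-of-line anchor $; a bare grep ia matches "ia" anywhere in the line. A minimal sketch is given here, with the hypothetical function name ends_with; note that none of the three sample rows ends with "fa", so that search returns nothing for the sample data.

```shell
# ends_with SUFFIX FILE: lines of FILE that end with SUFFIX.
# The trailing $ anchors the match to the end of the line.
ends_with() {
    grep -- "$1\$" "$2"
}
```

For the question as posed the call would be ends_with fa places.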
21MBT011
P18BTI2203 Computational biology CBL-002
P18BTI2203L: Computational Biology Laboratory    Academic Year: 2021-22
8 - Molecular Visualization
Instructor: Dr. Ram K    Scribes: R18-MBT1
Each question should be addressed and codes for each section should be marked correctly
Answer all the following [(CO5), (K2)]
1. (10 points) Perform the following to represent the outer membrane surface protein of Vibrio cholerae [OmpV], which has been structurally solved using X-ray crystallography.
(a) Compare the close homologous structure. Align them and highlight the structural difference
(b) Identify the cofactor/ ligand bound to the protein and represent them
(c) 2D-Label the regions of interaction and represent them
Question:  1   Total
Points:    10  10
Score:
Course Coordinator
****
Experiment 2 – Molecular visualization
A) Compare the close homologue structure. Align them and highlight the structural difference.
Fig 2.1 Comparison of two homologous proteins
B) Identify the cofactors/ligands bound to the protein and represent them
Fig 2.2 Ligand of the protein structure
Fig 2.3 Distance between two HOH molecules
C) 2D label the regions of interaction and represent them.
Fig 2.4 2D labelling of two protein structures along with the completely aligned sequence (green)
Inference:
The outer membrane surface protein of Vibrio cholerae [OmpV] has been structurally solved using X-ray crystallography.
2WK7 and 2WK9 are found to be homologous in structure and sequence; they are evolutionarily very close, as they are found in the same organism.
2WK7 - Structure of the apo form of Vibrio cholerae CqsA
2WK9 - Structure of the PLP-Thr aldimine form of Vibrio cholerae CqsA
In Fig 2.1, 2WK7 is represented in grey and 2WK9 in light blue. The two proteins were aligned.
In Fig 2.2, the ligands are represented in red. They are PLG 600 B c5 and PLP600 A c5.
In Fig 2.3, the distance between two atoms has been calculated. The distance from one HOH to another HOH is 4.304 Å.
In Fig 2.4, 2D labelling has been done for the visualization, along with a representation of the portion of similar sequence in the two proteins. The A-chain of 2WK7 and the B-chain of 2WK9 are found to be nearly identical. In Fig 2.4 the representation is done in green and runs from ASN 47A to PRO 4A.
P18BTI2203-Computational Biology 21MBT011
CBL-003
P18BTI2203L: Computational Biology Laboratory    Academic Year: 2021-22
3 - Sequence Similarity using BLAST Program
Instructor: Dr. Ram K    Scribes: R18-MBT1
Each question should be addressed and codes for each section should be marked correctly
Answer all the following [(CO6,5), (K4)]
1. (10 points) Obtain the human HBA and HBB protein sequences. Perform pairwise alignment at the
NCBI BLAST website.
(a) Use a comparison tool from the EBI website.
(b) Vary the scoring matrix (e.g. try different PAM and BLOSUM matrices) and record the effects
on the score, the number of gaps, the percent identity, and the length of the aligned region.
(c) For the NCBI BLASTP program note that the output of a pairwise alignment includes a dot
matrix view.
Question:  1   Total
Points:    10  10
Score:
Course Coordinator
****
Experiment 3- BLAST
1. Identify all homologous proteins of human retinol-binding protein 4 (RBP4; NP_006735) using BLASTP
Fig 3.1 Identification of homologous sequences using BLASTP
Fig 3.2 Graphical summary for the selected sequence
Fig 3.3 phylogenetic tree and Multiple alignment results for the
BLASTP was run and homologous sequences from different organisms were found. The BLAST tree was also viewed, and a multiple sequence alignment was performed for the top 10 sequences.
2. Identify a distant homolog of the above protein using PSI-BLAST
Fig 3.4 Distant homolog after running 4 iterations using PSI-BLAST
Fig 3.5 Distant homologous protein found with a high score
Fig 3.6 Amino acid sequence of the distant homologous protein
Fig 3.7 Graphical summary for the selected sequence in PSI BLAST
Fig 3.8 Multiple alignment results and Phylogenetic tree for the selected sequence in PSI BLAST
In the WebLogo, the height of each stack indicates the sequence conservation at that position, while the height of each symbol within the stack indicates the relative frequency of that amino acid at that position.
3. Identify the signature and search for related proteins using PHI-BLAST
We found the conserved sequence domain of the RBP4 protein, which is DCRVSSFRVKE. The residues marked in red are hydrophobic amino acids. These hydrophobic amino acids help in stabilizing the structure of the protein.
Fig 3.9 Searching in PHI Blast with the conserved sequence
Fig 3.10 Conserved sequence found in web logo
Inference:
2. PSI-BLAST detects distant relationships for a given protein. A PSSM is used to further search the database for new matches, and is updated in subsequent iterations with the newly detected sequences.
The identified distant homolog of this protein is retinol-binding protein 4 (Phyllostomus discolor). GenBank common name: pale spear-nosed bat
Kingdom: Animalia Phylum: Chordata Class: Mammalia Order: Chiroptera Family: Phyllostomidae Genus: Phyllostomus
Distribution and habitat: The species is found from southern Mexico to northern Peru and Bolivia.
When we query a database, our sequence is compared with every other sequence until the top hits are found and reported in the results with quality metrics.
Some hits may report the same scores, so understanding the different levels of confidence that each parameter describes is necessary to choose sequences for the next phase of analysis.
The results are defined as follows.
Maximum bit score: 398. This is the highest alignment score (bit score) between the query sequence and the database segments. The E-value decreases exponentially as the bit score increases. The higher the bit score, the better the sequence similarity.
Total score: 398. This is the sum of the alignment scores of all segments from the same database sequence.
Percent query coverage: here it is 90% to 100% after three iterations. It describes how much of the query sequence is covered by the aligned segments.
E-value: observed as 2e-146. It is the number of hits of similar quality (score) expected to be found just by chance in a random database of the same size. It is the first quality filter for the BLAST search result, used to obtain only results equal to or better than the number given by the E-value option.
BLAST hits with an E-value smaller than 1e-50 are database matches of very high quality. BLAST hits with an E-value smaller than 0.01 can still be considered good hits for homology matches.
The PSSM captures the conservation pattern in the alignment and stores it as a matrix of scores, in which weakly conserved positions receive scores near zero.
The newly detected sequences from the second round of search that are above the specified score (E-value) threshold are again added to the alignment, and the profile is refined for another round of searching.
This process is continued iteratively until desired or until convergence, which is the state where no new sequence is detected above the defined threshold.
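The link between the bit score and the E-value can be written out explicitly. As a sketch based on the standard BLAST statistics, with m denoting the query length and n the total database length (these symbols are assumptions introduced here, not values taken from the record):

```latex
E = m \, n \, 2^{-S'}
```

Here S' is the bit score. For a fixed search space m n, each additional bit halves the expected number of chance hits, which is why a high bit score goes together with a very small E-value. BLAST applies length corrections to m and n, so reported values follow this relation only approximately.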
P18BTI2203 Computational biology 21MBT011
CBL-004
Experiment 4 - Artificial Neural Network
P18BTI2203L: Computational Biology Laboratory    Academic Year: 2021-22
4 - Artificial Neural Network
Instructor: Dr. Ram K    Scribes: R18-MBT1
Each question should be addressed and codes for each section should be marked correctly
Answer all the following [(CO5), (K2)]
1. (5 points) Construct an artificial neural network for predicting the percentage of adsorption for the biochar used. The training data is given in the table below.
(a) Where X1 is temperature in °C, and X2, X3 and X4 are pH, initial concentration (mg/L) and biochar dose (g)
2. (5 points) Given the seed dataset for miRNA-mRNA interaction, predictions have been made to detect whether the interaction would result in Class 1 (Oncogene) or Class 2 (Tumour suppressor gene). Construct an ANN architecture and report your inference.
Question: 1 2 Total
Points: 5 5 10
Score:
Course Coordinator
****
1. Construct an artificial neural network for predicting the percentage of adsorption for the biochar used. The training data is given in the table below.
a) Where X1 is temperature in °C, and X2, X3 and X4 are pH, initial concentration (mg/L) and biochar dose (g)
construction of artificial neural network:
Fig 4.1 Neural network structure for the inputs
Fig 4.2 best Validation Performance at epoch 3
Fig 4.3 Regression line of Training, Validation, Test and All with respect to Target
Fig 4.4 Final output after simulating the test value
The input data given for training and testing predicts the expected target output, i.e. 68.1348 for the given set of data. The first 70% of the data is taken as training inputs and targets, and the network is tested with the remaining data to predict the expected target. The expected target with minimal error is obtained in the 7th iteration.
Information flows through the constructed neural network, which has input, hidden and output layers. Patterns of information are fed into the network via the input units, which trigger the layers of hidden units, and these in turn arrive at the output units. Each unit takes its inputs, computes the weighted sum of the inputs and adds a bias. This computation is represented in the form of a transfer function. The weighted total is passed as an input to an activation function to produce the output.
In the first iteration the values of R are: Training: 0.81779, Validation: 0.32095, Test: 0.78845, All: 0.85211. Here the validation and test values are nowhere near 0.9, and the output obtained is 57.52, whereas the expected target value should be nearer to 68. Therefore, further iterations were run to get the expected target output value.
In the seventh iteration the values of R are: Training: 0.98651, Validation: 0.93437, Test: 0.99996, All: 0.98932. Here the training, validation and test values are all above 0.9, which indicates minimal error. The final result obtained is 68.1348, which is close to the expected target value of 68.18. As the neural network has arrived at the closest expected target value, the iterations can be stopped here.
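The weighted sum, bias and activation step described above can be sketched for a single unit. This is only an illustration: the record does not state which activation function the network used, so the logistic sigmoid here is an assumption, and the function name neuron and its argument layout are hypothetical.

```shell
# neuron INPUTS WEIGHTS BIAS: output of one unit.
# INPUTS and WEIGHTS are comma-separated lists of equal length.
# Computes z = BIAS + sum(x[i] * w[i]), then applies the logistic sigmoid
# 1 / (1 + exp(-z)) as the (assumed) activation function.
neuron() {
    awk -v xs="$1" -v ws="$2" -v b="$3" 'BEGIN {
        n = split(xs, x, ",")
        split(ws, w, ",")
        z = b
        for (i = 1; i <= n; i++) z += x[i] * w[i]
        printf "%.4f\n", 1 / (1 + exp(-z))
    }'
}
```

For example, neuron 1,2 0.5,-0.25 0 evaluates a unit with two inputs, two weights and zero bias.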
2. Given the seed dataset for miRNA-mRNA interaction, predictions have been made to detect whether the interaction would result in Class 1 (Oncogene) or Class 2 (Tumour suppressor gene). Construct an ANN architecture and report your inference.
Fig 4.5 Neural network structure for the inputs
Fig 4.6 Best Validation Performance at epoch 4
Fig 4.7 Regression line of Training, Validation, Test and All with respect to Target
Fig 4.8 Final output after simulating the test value
For the given set of data, the first 70% is taken as training inputs and targets, and the network is tested with the remaining data (30%) to predict the expected target. The expected target with minimal error is obtained in the 4th iteration.
The interaction is predicted as Class 2 (Tumour suppressor gene). The ANN processes input data by looping over time steps and updating the network state. The network contains information remembered over all previous steps. In each time step of the input sequence, the network learns to predict the value of the next step. The training progress is displayed in the form of a plot. The test data is prepared using the same steps as the training data. For the training/evaluation/test dataset split, the model was trained on the training dataset with enough epochs, evaluated on the evaluation dataset, and finally its performance was tested on the test dataset. Three iterations were done to predict the targeted value.
In the first iteration the values of R are: Training: 0.4783, Validation: 0.4252, Test: 0.3818, All: 0.4410. Here the training, validation and test values are nowhere near 0.9, and the output obtained is 1.18. Therefore, another iteration can be run to get the expected target output value.
In the 4th iteration the values of R are: Training: 0.99033, Validation: 0.8365, Test: 0.8471, All: 0.9486. Here the values are much closer to 0.9, and the output obtained in the 4th iteration is 2. Therefore, the expected target value is obtained. The ANN automatically extracts patterns from canonical and non-canonical pairing between the miRNAs and their targets, and the result is Class 2 (Tumour suppressor gene).
P18BTI2203-Computational biology 21MBT011
CBL-005
Answer all the following [(CO4), (K4)]
1. (10 points) Perform a multiple sequence alignment of beta globins of plant origin and construct a
phylogenetic tree for the same.
Question:  1   Total
Points:    10  10
Score:
Course Coordinator
****
Generate a multiple sequence alignment of beta globin among various species to prove that "Regions around the heme-binding regions are highly conserved".
(a) Identify the conserved domain and hypervariable regions and infer the changes from the alignment
(b) Are there any structural changes among the aligned homologs?
Solution:
Globin is a single structural unit of haemoglobin that is involved in binding gaseous ligands such as O2, NO and CO. There are different types of globin, such as haemoglobin, myoglobin, cytoglobin etc. In this experiment we take 6 different globins from different species to find the conserved domain.
Name                       Source                  ID              Amino acid length
beta-globin                Podocnemis unifilis     BAJ46574.1      147 aa
beta-globin A subunit      Archilochus alexandri   APA23495.1      147 aa
HBB protein                Urocynchramus pylzowi   NWU01539.1      147 aa
hemoglobin subunit beta    Catharus ustulatus      XP_032907297.1  147 aa
hemoglobin beta subunit A  Aegithalos caudatus     AVA16350.1      147 aa
beta-globin A subunit      Schistes geoffroyi      APA23487.1      147 aa
Fig 5.1 Conserved domain for selected globin sequence
Using the MEGA X software tool, the sequences were aligned and the conserved regions were found; conserved positions are marked with (*). We found five conserved regions, which can be shown with the help of the WebLogo tool, where the dominant residue is shown as a large letter. In Fig 5.2 the conserved regions are marked.
In Fig 5.3 the plotcon graph is also shown for the aligned sequences. A particular peak is found which represents the highly conserved domain. It lies at positions 26 to 43.
Fig 5.2: Highly conserved domain R1, R2, R3, R4, R5 in Web logo
Fig 5.3: Plot con graph for aligned sequence
Fig 5.4: Protein structure of beta-globin (Podocnemis unifilis) with their conserved region
Inference:
By doing MSA for the selected sequences with the MEGA X tool and plotting the plotcon graph, it is found that the highly conserved region lies between amino acids 20 and 50. In the WebLogo we also found that there are four other conserved regions. From Fig 5.4 we interpret that in different species the amino acids surrounding the active sites are highly conserved (R1, R2, R3). Some other parts of the sequence are also conserved (R4, R5).
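The (*) marks under an alignment flag the columns where every sequence carries the same residue. A minimal sketch of that check is given here for a file holding one aligned sequence per line, all of equal length. It is not the method MEGA X uses internally; the function name conserved_columns is hypothetical, and treating gap columns ("-") as not conserved is an assumption.

```shell
# conserved_columns FILE: print one line with "*" under each column where
# all aligned sequences in FILE agree (and the residue is not a gap "-"),
# and "." under every other column.
conserved_columns() {
    awk '
    NR == 1 {
        ref = $0; len = length(ref)
        for (i = 1; i <= len; i++) same[i] = 1
        next
    }
    {
        # Any mismatch against the first sequence breaks conservation.
        for (i = 1; i <= len; i++)
            if (substr($0, i, 1) != substr(ref, i, 1)) same[i] = 0
    }
    END {
        line = ""
        for (i = 1; i <= len; i++) {
            c = substr(ref, i, 1)
            line = line ((same[i] && c != "-") ? "*" : ".")
        }
        print line
    }' "$1"
}
```

Runs of consecutive "*" in the output correspond to conserved regions such as those labelled R1 to R5 in the figures.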