Note to COMP 204 students:
We have not covered all the material necessary to write this exam.
I have highlighted in YELLOW the questions you should be able to answer
for our midterm exam, and in PINK the questions I feel are a bit harder but
that you should still be able to answer.
Questions that are not highlighted are not relevant for us.
December 2017
Final Examination
VERSION #2
Computer Tools for Life Sciences
COMP 364 SEC 001
18:30 PM December 12, 2017
EXAMINER: Christopher J.F. Cameron ASSOC. EXAMINER: Carlos G. Oliver
STUDENT NAME: McGILL ID:
INSTRUCTIONS
EXAM: CLOSED BOOK X   OPEN BOOK
SINGLE-SIDED   PRINTED ON BOTH SIDES OF THE PAGE X
MULTIPLE CHOICE X
Note: The Examination Security Monitor Program detects pairs of students with unusually similar answer patterns
on multiple-choice exams. Data generated by this program can be used as admissible evidence, either to initiate
or corroborate an investigation or a charge of cheating under Section 16 of the Code of Student Conduct and
Disciplinary Procedures.
ANSWER IN BOOKLET   EXTRA BOOKLETS PERMITTED: YES   NO X
ANSWER ON EXAM X
SHOULD THE EXAM BE: RETURNED X   KEPT BY STUDENT
CRIB SHEETS: NOT PERMITTED   PERMITTED X (e.g. one 8 1/2 x 11 handwritten double-sided sheet)
Specifications: Single double-sided page, 8.5 inches x 11 inches
DICTIONARIES: TRANSLATION ONLY X   REGULAR   NONE
CALCULATORS: NOT PERMITTED X   PERMITTED (Non-Programmable)
ANY SPECIAL INSTRUCTIONS: If you think that none of the given choices is correct, select the choice that is the closest
to being correct. This exam contains 31 questions on 21 pages.
Course: COMP 364 SEC 001 Computer Tools for Life Sciences Page number: 1 / 21
Multiple choice (40 points)
Please indicate the correct answer by CIRCLING YOUR CHOICE OF A-E.
1. (2 points) What is the output of the following Python code?
import random
print(random.choice([12.321, 32, 65.0, 79.0347, 86.1]))
A. Either 12.321, 32, 65.0, 79.0347, or 86.1
B. TypeError: list must contain items of the same type
C. 32 only
D. Any number other than 12.321, 32, 65.0, 79.0347, or 86.1
E. None of the above
2. (2 points) Which of the following choices is the correct expansion for the list comprehension of
B = [expr(i) for i in A if func(i)]
Assuming A and the functions (expr() and func()) are properly defined beforehand.
A. B = []
   for i in A:
       if func(i):
           B.append(i)
B. for i in A:
       if func(i):
           B.append(expr(i))
C. B = []
   for i in A:
       if func(i):
           B.append(expr(i))
D. A, B = [], []
   B = []
   for i in A:
       if func(i):
           B.append(expr(i))
E. None of the above
3. (2 points) What is the expected output of the following Python code?
word = "revenge"
def func(arg): print(arg)+word
func("revenue")
A. revenue
B. revenge
C. revenge
   TypeError: unsupported operand type(s) for +
D. revenuerevenge
E. revenue
   TypeError: unsupported operand type(s) for +
4. (2 points) What is the output of the Python built-in print() on line 4?
1 string = "comp 364"
2 for i in range(len(string)):
3     string[i].upper()
4 print(string)
A. COMP 364
B. SyntaxError: invalid syntax
C. comp 364
D. ['C', 'O', 'M', 'P', ' ', '3', '6', '4']
E. None of the above
5. (2 points) What is the output of the following Python code?
class A:
    def __init__(self):
        self.x = 0
class B(A):
    def __init__(self):
        A.__init__(self)
        self.y = 1
class_obj = B()
print(class_obj.x, class_obj.y)
A. 0 1
B. None 1
C. 0 0
D. None None
E. AttributeError: 'B' object has no attribute 'x'
6. (2 points) What is the output of the following Python code?
def func():
    try:
        expr(x**4)
    finally:
        print("after expr(), within finally")
    print("after expr(), outside finally")
func()
A. after expr(), within finally
B. after expr(), outside finally
C. after expr(), within finally
   NameError: name 'x' is not defined
D. after expr(), within finally
   NameError: name 'expr' is not defined
E. None of the above
7. (2 points) Assume file_obj has been properly opened in "r" mode. Which of the following methods
will read all characters from a character stream?
A. file_obj.read()
B. file_obj.readall()
C. file_obj.readchar()
D. file_obj.readchars()
E. file_obj.readcharacters()
8. (2 points) Which of the following statements about Python dictionaries is false?
A. More than one key can have the same value
B. Values of a dictionary must be unique
C. The values of the dictionary can be accessed as dictionary_obj[key]
D. Values of a dictionary can be a mixture of different types.
E. Values of a dictionary can be keys of a dictionary.
9. (2 points) What is the output of the Python code shown below?
def func():
    print("Hello")
print(func)
func()
A. <function func at 0x10d71bf28>
   Hello
B. Hello
   <function func at 0x10d71bf28>
C. NameError: name 'func' is not defined
D. Hello
   Hello
E. Hello
10. (2 points) Consider the following Python file mylib.py
def foo(x):
    return x
Assume that we are working in the same directory as mylib.py.
To use the foo() function from the Python console as described below:
>>> foo(5)
Which one of the following import statements is correct?
A. import mylib.py
B. import foo
C. import mylib
D. from mylib import *
E. None of the above
11. (2 points) What is the output of Python’s print() statement on line 6 below?
1 class Foo:
2     def x_set(self, x):
3         self.x = x
4 my_foo = Foo()
5 x = my_foo.x_set(5)
6 print(x)
A. 5
B. None
C. Nothing is printed
D. TypeError: Cannot assign name to NoneType
E. NameError: __init__ function not defined
12. (2 points) Which of the following Boolean expressions is not logically equivalent to the other four?
A. not (-1 < 0 or -1 > 1)
B. -1 >= 0 and -1 <= 1
C. not (-1 < 1 or -1 == 1)
D. not (-1 > 1 or -1 == 1)
E. 1 and 0
13. (2 points) Which one of the following statements is false?
A. All Python classes (except object) are subclasses of object
B. BioPython Seq objects have a SeqRecord attribute
C. Attributes of objects are also objects
D. The built in print() function accesses the __str__ attribute of its arguments
E. Subclasses inherit methods defined in the base class
14. (2 points) What is the output of the following Python code?
class Animal:
    def __str__(self):
        return "BOO"
class Dog(Animal):
    def dog_print(self):
        print("WOOF")
d = Dog()
print(d)
A. TypeError: Dog class __init__ method not defined
B. WOOF
C. BOO
D. <__main__.Dog object at 0x56239da>
E. None
15. (2 points) Which statement best describes the output of the following Python code?
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 1, 1, 1]
plt.plot(x, y)
plt.show()
A. A continuous horizontal line at y = 1
B. TypeError: plot expected at most 1 arguments, got 2
C. A dotted horizontal line along y = 1
D. ValueError: x and y must have same first dimension
E. IndexError: list index out of range
16. (2 points) What shape best describes the output of the following Python code?
width = height = 9
mid = int(height / 2)
char = "#"
for i in range(height):
    line = [" " for _ in range(width)]
    for j in range(width):
        if i <= mid:
            cut = int(width/2)
            line[cut-i] = char
            line[cut+i] = char
        else:
            delta = i - mid
            line[delta] = char
            line[width - 1 - delta] = char
    print(" ".join(line))
A. to D. [Four shape drawings; the images are not reproduced in this text version]
E. IndexError: list assignment index out of range
17. (2 points) What does the Python print() statement on line 3 produce?
1 a = [1, 2, 3, 4]
2 b = [1, 2, 3, 4]
3 print(f"{a is b}, {a == b}")
A. False, True
B. False, False
C. True, True
D. True, False
E. SyntaxError: EOL while scanning string literal
18. (2 points) What is the output of the following Python code?
x = "PYTHON"
print(x[1:-1:2])
A. "YH"
B. "PYTHON"
C. "YTHO"
D. "OHTY"
E. "P"
19. (2 points) Which of the following Python statements stops the current iteration of a loop and skips
to the next iteration?
A. break
B. return
C. continue
D. pass
E. except
20. (2 points) A file containing valid Python code is called a ______.
A. package
B. class
C. object
D. method
E. module
Short answer (20 points)
21. (4 points) In three or four sentences at most, describe what the concept of ‘overfitting’ is in
machine learning and give an example of a way to prevent it.
Overfitting describes the situation where a predictor performs much more accurately on the
data used for its training than on test data unseen during training. This occurs when
the predictor has too much flexibility relative to the amount of training data.
22. (4 points) Binary number systems
a) In one to two sentences at most, define the terms least and most significant bits (LSB/MSB)
Most significant bit: Left-most bit
Least-significant bit: Right-most bit
b) Convert 21 (base 10) to a 6-bit binary representation using the repeated division-by-2 method (show your
work) and circle the LSB & MSB
010101
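The repeated division-by-2 method can be sketched in Python as follows. This is only an illustration, not part of the exam answer; the function name to_binary and its width parameter are assumptions introduced here. Each division by 2 yields one remainder: the first remainder is the LSB and the last one is the MSB.

```python
def to_binary(n, width=6):
    """Convert a non-negative integer to a zero-padded binary string
    using repeated division by 2 (hypothetical helper for illustration)."""
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)  # remainder is the next bit, LSB first
        bits.append(str(remainder))
    # Remainders were collected LSB first, so reverse them to put the MSB on the left.
    binary = "".join(reversed(bits))
    return binary.rjust(width, "0")  # pad on the left with zeros up to `width` bits


print(to_binary(21))  # 010101
```

For 21 the remainders come out as 1, 0, 1, 0, 1, which read in reverse and padded to six bits give 010101.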
23. (2 points) Write a one line Python code to perform a linear search on integers to produce a
sorted Python list of all item indices that have the value 42.
integers = [42, 60, 70, 30, 88, 10, 42, 12, 19, 5, 42, 73, 58]
[i for i in range(len(integers)) if integers[i] == 42]
24. (4 points) Draw a tree that represents the inheritance hierarchy implemented by the following Python
code (make sure to include the object class). Draw arrows from each class pointing to its base
class.
class B:
    pass
class A:
    pass
class C(B):
    pass
class D(C):
    pass
Object -> B -> C -> D
       \-> A
25. (6 points) Write a function called is_gene() that takes as input an RNA sequence as a string
and returns True if the sequence is a valid protein coding gene (False otherwise). is_gene()
should check whether the RNA sequence contains at least one start codon. Here, we will only use
"AUG" as a start codon. This function also checks for the "UAA" stop codon.
A sequence is a valid protein coding gene if it contains at least one start codon and at most one
stop codon after the first start codon (start and stop cannot be overlapping). The number of bases
in the sequence between the first start codon and the first stop codon must be a multiple of 3 since
genes are translated per codon (which are sequences of 3 bases). NOTE: you may not use any
third-party Python packages (i.e., you cannot use BioPython, scikit-learn, etc.)
Expected behaviour:
>>> myseq = "AAAUGGCAGCAUUGUUGUAAGG"
>>> is_gene(myseq)
True
>>> myseq = "AAAUGGCAGCAUGUUGUAAGG"
>>> is_gene(myseq)
False  # not a multiple of 3
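def is_gene(myseq):
    start = myseq.find("AUG")
    if start == -1:  # str.find() returns -1 (not None) when there is no match
        return False
    # search after the start codon; stop is an offset from the end of the start codon
    stop = myseq[start+3:].find("UAA")
    return stop != -1 and stop % 3 == 0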
Long Answer (40 points)
Social networks/graphs (13 marks)
In computer science, a graph is an abstract data type that consists of nodes and edges. A
real world example of a graph is a social network (i.e., Facebook, Twitter, LinkedIn, Google+, etc.),
where nodes are users and edges are the connections (or friendships) between users. Consider the
following simplistic social network graph:
This graph consists of 6 nodes (or users): A, B, C, D, E, and F. We say that A is friends with B
and C because they are connected by a line (a.k.a edge). In Python, the graph displayed above can
be represented as the following set of edges between node names:
edges = set([set(['A', 'B']), set(['A', 'C']), set(['B', 'D']),
             set(['B', 'E']), set(['C', 'F']), set(['E', 'F'])])
where for each set([i, j]), nodes i and j are friends.
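Note that a plain Python set cannot contain other sets, because set objects are unhashable: the nested set([...]) form above raises a TypeError when run. A minimal runnable sketch follows, assuming each edge is stored as a frozenset instead (a substitution introduced here, not the exam's notation). The dictionary name adjacency is also an assumption, used only to show how the edges can be read back as friend lists.

```python
# Each edge is a frozenset, which is hashable and can therefore be stored in a set.
edges = {
    frozenset(["A", "B"]), frozenset(["A", "C"]), frozenset(["B", "D"]),
    frozenset(["B", "E"]), frozenset(["C", "F"]), frozenset(["E", "F"]),
}

# Build a mapping from each node name to the names of its friends.
adjacency = {}
for edge in edges:
    i, j = sorted(edge)  # unpack the two node names in a stable order
    adjacency.setdefault(i, []).append(j)
    adjacency.setdefault(j, []).append(i)

# Sets are unordered, so sort each friend list for reproducible output.
for friends in adjacency.values():
    friends.sort()

print(adjacency["A"])  # ['B', 'C']
```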
26. (4 points) Given the Node and Graph class definitions below, complete the
convert() Graph instance method to convert edges to a Graph object containing a list of
Node objects.
The Node class object has three features:
1. name - a string representing the name of the node
2. num_friends - an integer count of the nodes connected by exactly one edge
3. friends - a list of strings representing the names of nodes connected by exactly one edge
The Graph class object has one feature:
1. nodes - a list of Node objects
class Node:
    def __init__(self, name, friends):
        self.name = name
        self.num_friends = len(friends)
        self.friends = friends
class Graph:
    def __init__(self):
        self.nodes = []
    def convert(self, edges):
        # insert your code here
        for a, b in edges:
            # search for a and b in nodes
            a_node = None
            b_node = None
            for n in self.nodes:
                if n.name == a:
                    a_node = n
                if n.name == b:
                    b_node = n
            if a_node is None:
                a_node = Node(a, [])
                self.nodes.append(a_node)
            if b_node is None:
                b_node = Node(b, [])
                self.nodes.append(b_node)
            a_node.friends.append(b_node)
            a_node.num_friends += 1
            b_node.friends.append(a_node)
            b_node.num_friends += 1
27. (8 points) Complete the following Graph instance method determine_FoF(). This method
takes one string argument (the name of a node) and returns a sorted list of node names that
represent the "Friends of Friends" (or FoF) for the provided node. The FoF of node ‘X’ is the set of
nodes that are not friends of ‘X’ but are friends of friends of ‘X’. For example, the FoF for ‘E’ would
be ‘A’, ‘C’, and ‘D’. The node name provided and immediate friends should not be included in the
returned list.
def determine_FoF(self, node_name):
    # insert your code here
    # search for node_name in nodes
    first_node = None
    for n in self.nodes:
        if n.name == node_name:
            first_node = n
    if first_node is None:  # node_name doesn't exist
        return []
    fof_list = []
    for f in first_node.friends:
        for fof in f.friends:
            # skip the node itself, its immediate friends, and names already collected
            if (fof != first_node and fof not in first_node.friends
                    and fof.name not in fof_list):
                fof_list.append(fof.name)
    return sorted(fof_list)  # sorted() is the built-in; there is no built-in sort()
28. (1 point) What would be the returned FoF list for 'D' given the graph illustration above?
['A', 'E']
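Questions 26 to 28 can be checked together with a self-contained sketch. It rests on two assumptions that are not in the exam text: edges are given as frozenset pairs (plain sets cannot be nested), and friends holds node names as strings, as the Node description states. The helper _get_or_add is hypothetical and exists only to keep convert() short.

```python
class Node:
    def __init__(self, name, friends):
        self.name = name
        self.num_friends = len(friends)
        self.friends = friends  # list of friend names (strings)


class Graph:
    def __init__(self):
        self.nodes = []

    def _get_or_add(self, name):
        """Return the Node called `name`, creating it if needed (hypothetical helper)."""
        for node in self.nodes:
            if node.name == name:
                return node
        node = Node(name, [])
        self.nodes.append(node)
        return node

    def convert(self, edges):
        # Sort each edge so the two names unpack in a stable order.
        for a, b in (sorted(edge) for edge in edges):
            a_node, b_node = self._get_or_add(a), self._get_or_add(b)
            a_node.friends.append(b)
            a_node.num_friends += 1
            b_node.friends.append(a)
            b_node.num_friends += 1

    def determine_FoF(self, node_name):
        by_name = {node.name: node for node in self.nodes}
        if node_name not in by_name:
            return []
        friends = set(by_name[node_name].friends)
        fof = set()
        for friend in friends:
            fof.update(by_name[friend].friends)
        # Exclude the node itself and its immediate friends.
        return sorted(fof - friends - {node_name})


edges = {frozenset(pair) for pair in ["AB", "AC", "BD", "BE", "CF", "EF"]}
graph = Graph()
graph.convert(edges)
print(graph.determine_FoF("E"))  # ['A', 'C', 'D']
print(graph.determine_FoF("D"))  # ['A', 'E']
```

Collecting candidates in a set avoids duplicate names when two friends share a friend, which a plain list with append() would not.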
Sorting algorithms (7 marks)
Consider the following unsorted list:
unsorted_list = [54, 26, 93, 17, 77]
You will implement a new sorting algorithm to sort unsorted_list. This algorithm will begin
by comparing the first two items of the unsorted list and swapping them if they are not in
ascending order. For example, since 54 is greater than 26, your algorithm will swap 26 and 54 to
have the following unsorted list:
unsorted_list = [26, 54, 93, 17, 77]
Your algorithm will then move on to the next pair of items (54 and 93) and sort them appropriately,
then the next pair, and the next, and so on and so on... until the end of the list is reached. Once
the end of the list is reached, the algorithm then begins again at the start of the list and this repeats
until the list is sorted. For example, the first six steps of the proposed sorting algorithm applied to
unsorted_list would be the following:
[26, 54, 93, 17, 77] # step 1: sort 26 and 54
[26, 54, 93, 17, 77] # step 2: sort 54 and 93
[26, 54, 17, 93, 77] # step 3: sort 93 and 17
[26, 54, 17, 77, 93] # step 4: sort 93 and 77
# first pass complete, return to the beginning of the list
[26, 54, 17, 77, 93] # step 5: sort 26 and 54
[26, 17, 54, 77, 93] # step 6: sort 54 and 17
The algorithm will continue comparing pairs of integers until the list has been sorted. The list will
be considered to be sorted if no swaps have been made after making one complete pass on all items
of the list.
29. (2 points) Write out the remaining steps of the proposed sorting algorithm (following the format
above) and finish sorting the list to demonstrate that you understand how the algorithm works. To
save on time (and trees...), you do not have to write steps where no swaps have been made (e.g.,
steps two and five above).
[26, 17, 54, 77, 93] # step 7: sort 54 and 77
[26, 17, 54, 77, 93] # step 8: sort 77 and 93
[17, 26, 54, 77, 93] # step 9: sort 17 and 26
… (no more changes happen)
30. (5 points) Complete the following Python code to implement the proposed sorting algorithm. Your
function should return a sorted list.
def new_sorting_algorithm(unsorted_list):
    # insert your code here
    L = unsorted_list.copy()
    keep_going = True
    while keep_going:
        keep_going = False
        for i in range(len(L) - 1):
            if L[i] > L[i+1]:
                L[i], L[i+1] = L[i+1], L[i]
                keep_going = True
    return L
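The hand trace in question 29 can be checked with a variant of the same algorithm that records the list after every swap. This is a sketch for verification only; the name trace_swaps and its return format are assumptions, not part of the exam.

```python
def trace_swaps(unsorted_list):
    """Sort a copy of the list with adjacent swaps and return
    (sorted_list, list_of_states_after_each_swap)."""
    items = list(unsorted_list)
    states = []
    swapped = True
    while swapped:  # stop after a full pass with no swaps
        swapped = False
        for i in range(len(items) - 1):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                states.append(list(items))  # snapshot after this swap
                swapped = True
    return items, states


result, states = trace_swaps([54, 26, 93, 17, 77])
for state in states:
    print(state)
```

Only steps that change the list are recorded, which matches the question's allowance to skip steps where no swap is made.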
Data Analysis Workflow (20 marks)
You are hired as a consultant for ObeseNoMore Inc., a startup company working on identifying genes
linked to obesity. The company wants to know what your plan is for solving this problem.
The company has access to a Web API endpoint www.api.100genomes.com/seqs.json
that contains annotations and sequence information in FASTA format for any gene in the human
genome for a collection of 1000 individuals. Each FASTA sequence is annotated in its header with
the person’s Body Mass Index (BMI), which is a measure of body fat based on height and weight.
> 001 | H1B492 | 23.3 | 165
AAGGATATATUTUAUAUTUUGGGGGCCCAAGA
> 002 | H1B492 | 21.3 | 131
AAGGATATATTTUAUAUTUUGGGGGCCCAAGA
Here is the API documentation:
Function: GET seqs
Endpoint: www.api.100genomes.com/seqs.json
Example URL: www.api.100genomes.com/seqs.json?q=H1B492
Responds with a string representation of a dictionary in JSON format where key 'seqs' contains
a string in FASTA format of all 1000 sequences for the query gene, in this case H1B492. Below we only
show two sequences.
"{'status': 'OK',
'seqs': '> 001 | H1B492 | 23.3 | 165\nAAGGATATATUTUAUAUTUUGGGGGCCCAAGA\n> 002 | H1B492 | 21.3 | 131\nAAGGATATATTTUAUAUTUUGGGGGCCCAAGA'}"
The FASTA header fields represent respectively: individual ID, Gene ID, BMI, height in centimetres.
More specifically, ObeseNoMore Inc. wants you to look at a couple of features of the data and see
their effect on BMI.
• sequence length
• sequence GC content (percentage of G or C bases)
• The company suspects that the number of times the following sequence patterns occur in the
sequence could have an impact on a person’s BMI:
– AATT
– GCGCGC
31. (20 points) Outline briefly in point form, the steps you would take to:
• Access and store the data as Python objects (5 points)
• Extract the features (5 points)
• Visualize the features (5 points)
• Propose a machine learning technique to make predictions (5 points)
Feel free to make diagrams.
At each step, state which packages (third party, standard library, or built-in are all valid) you would
use, which functions of these packages would be best suited, and which data types you are working
with.
Example:
• Use BioPython’s PDB module to load a structure into a Structure object
• Iterate over all the residues of a structure to collect list of Serine amino acids as Residue objects.
Describe what kinds of plots you would make and what datatypes you would use to make the plots.
NOTE: this question is not asking for working code, just an outline of your approach, what tools
you would use, and how they should be used.
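As one possible starting point for the feature-extraction step, the sketch below parses a FASTA string shaped like the example response and computes the requested features using only built-in types. The function names (parse_fasta, extract_features) and the dictionary keys are assumptions made for illustration. Retrieving the data from the endpoint, plotting, and the learning step are not shown.

```python
def parse_fasta(fasta_text):
    """Parse '> id | gene | bmi | height' headers followed by sequence lines
    into a list of dictionaries (hypothetical record layout)."""
    records = []
    for line in fasta_text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            ind_id, gene_id, bmi, height = [f.strip() for f in line[1:].split("|")]
            records.append({"id": ind_id, "gene": gene_id, "bmi": float(bmi),
                            "height_cm": float(height), "seq": ""})
        elif line and records:
            records[-1]["seq"] += line  # a sequence may span several lines
    return records


def extract_features(seq, patterns=("AATT", "GCGCGC")):
    """Return sequence length, GC percentage, and non-overlapping pattern counts."""
    features = {"length": len(seq)}
    gc = seq.count("G") + seq.count("C")
    features["gc_percent"] = 100 * gc / len(seq) if seq else 0.0
    for pattern in patterns:
        features[pattern] = seq.count(pattern)  # str.count ignores overlapping matches
    return features


fasta = ("> 001 | H1B492 | 23.3 | 165\nAAGGATATATUTUAUAUTUUGGGGGCCCAAGA\n"
         "> 002 | H1B492 | 21.3 | 131\nAAGGATATATTTUAUAUTUUGGGGGCCCAAGA")
for record in parse_fasta(fasta):
    print(record["id"], record["bmi"], extract_features(record["seq"]))
```

Because str.count() skips overlapping matches, a pattern such as GCGCGC inside GCGCGCGC is counted once; whether overlaps should count is a modelling decision the question leaves open.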