Lab 1: Introduction To Python Programming: 1/20/17 Slide Credits: Nicole Rockweiler!
Python Programming
1/20/17
Slide credits:
Nicole Rockweiler!
1
A few preliminary words…
2
Overview
• Schedule
• Logistics
• Getting Started
• Intro to Unix
• Intro to Python
• Assignment 1
3
Getting the most out of this course
1. Start the homework EARLY
2. Collaborate
3. Use your resources – tutors, TAs, professors, labmates, discussion
groups, and most of all, the internet.
4. Think big
4
Logistics
• Register for 4 credits
• Labs are a continuation of the concepts learned from lectures
• Lab material is generally not tested on exams
• Course website: http://genetics.wustl.edu/bio5488/
• Bring your laptop to every lab
5
Where to get help
(a.k.a. how to maintain your sanity)
• Come to office hours
• Mondays after class (11:30am-12:30 pm) in the 4th floor classroom 4515
McKinley/area outside the classroom and by appointment
• Come to tutoring sessions
• Tuesdays 5:30-7pm in 6001B* Scott McKinley Building
• *4/4 will be in 5001B
• FREE FOOD!!
• Use the Google Docs spreadsheet to ask/answer questions:
https://docs.google.com/spreadsheets/d/11KW_lu9mE59LBtF0X8EtrCJfHQZ22fQwz8AC3AMZSs8/edit?usp=sharing
• Email [email protected]
• Work in groups
6
Assignments
• Assignments are posted on the course website Wednesdays at 10am
• Assignments are due the following Wednesday at 10am
• Assignment format
• Given a bioinformatics problem
• Write/complete a Python script
• Analyze data with your script
• Answer biological questions about your results
• Turn in format
• More on this in a bit ☺
8
Schedule
9
Schedule (cont.)
Assignment  Released  Due   Topic
1           1/18      1/27  Introduction
2           1/25      2/1   Sequence Comparison
3           2/1       2/8   Next Gen Sequencing
4           2/8       2/15  Gene Expression
5           2/15      2/22  Epigenomics
6           2/22      3/1   Motif Finding
7           3/1       3/22  Synthetic Gene Assembly
8           3/1       3/22  Metagenomics
9           3/22      3/29  Genetic Variation
10          3/29      4/5   Wright-Fisher Model
11          4/5       4/12  TBD
12          4/12      4/19  Substitution Rates
13          4/19      4/26  Cis Regulatory Evolution
Note: Assignments 7 and 8 are 2 labs over spring break.
10
Assignment policies
• See the Course Information → Assignment policies document on the course website
• There are 13 assignments
• You must turn in all assignments
• All assignments are weighted equally
• Late policy
• 25% penalty for turning in assignment 1 day late
• Assignments that are > 1 day late will be given a 0
• Email us (early) to request an extension
• Auditors
• We’ll give comments on your programs, but won’t grade the short answer questions
• Same late policy applies
• Collaboration
• Group work is encouraged, but plagiarism is unacceptable
• Try to “Google it” first
• Cite your sources
• Work on the assignment before coming to lab
11
Grading
• Each assignment is out of 10 points
• Graded on
• Does the code work?
• It doesn’t have to be the “fastest” or “most efficient” to get full credit
• If it doesn’t work, describe where you had problems
• Is the code well commented and readable? (more on commenting later ☺)
• Are the answers correct?
• Grades will be returned in a file called grades.txt on the class server
• Only you and the TAs will be able to read this file
12
Getting started
13
Remote computers
• We will be doing all of our work on a remote computer with the hostname
genomic.wustl.edu
• This is a Unix-based computer that we can securely connect to through a protocol
called secure shell (SSH).
14
What is the shell?
• The shell is a program that takes commands from the keyboard and
gives them to the operating system to execute
• There are many different shell programs
• We’ll be using the most common shell: the Bourne-Again Shell (bash)
15
How do I access the shell?
• Most of us are familiar with graphical user
interfaces (GUI) to control our computers
• Another way is with command-line
interfaces (CLI)
• A terminal emulator is a program that allows you to interact with the shell through a CLI
• There are many different terminal programs that vary across OSs
• We’ll be using PuTTY (Windows) and Terminal (Mac)
[Figures: a Windows GUI, a PuTTY window, a Terminal window]
16
Why should I learn how to use shells and
terminals?
• CLIs are common in scientific computing → get used to them!
• The shell is a really powerful way of interacting with your computer → become a super user!
17
Bio5488 command convention
• We highly recommend that you type all of the commands/code yourself rather than copying and pasting
• Here's an example of a command-line "snippet"
Template:
$ type_me_exactly <modify_me>
output
• The "$" is called the command prompt. It means, "I'm ready for a command!" Don't type the "$".
• Don't type the "<>".
Example:
$ ls <assignment>
README.txt
18
How to log onto the remote computer
(Windows users)
1. Launch PuTTY
2. In the host name field, enter
genomic.wustl.edu
3. Enter a session nickname, e.g.,
bio5488
4. Click Save
5. Click Open
19
How to log onto the remote computer
(Mac users)
1. Open Terminal (found in /Applications/Utilities)
2. SSH to the remote computer. Type:
ssh <username>@genomic.wustl.edu
where <username> is replaced with your username
20
How to log onto the remote computer
(Mac users)
4. Enter your password - it will not show that you are typing! Hit enter.
21
A couple of notes
• When you log onto the class server you will be located in YOUR home
directory.
• Every command that you run after logging onto a remote computer
will be run on that computer.
22
Sublime Text
• Sublime Text is a text editor for writing and editing scripts
• We’ll use Sublime to edit both local and remote files
• Documentation: http://www.sublimetext.com/support
23
Cyberduck
• Cyberduck is a secure file transfer client and will allow you to transfer
files from your local computer to a remote computer
24
Exercise: setting up Cyberduck
• Create a bookmark
• Launch the Cyberduck application
• Click Bookmark → New Bookmark
• Select SFTP (SSH File Transfer Protocol) from the drop down menu
• Enter a nickname for the bookmark, e.g., bio5488
• Enter genomic.wustl.edu as the server name
• Click the X
• Set the default text editor
• Click Cyberduck/Edit → Preferences → Editor
• Select Sublime Text from the drop down menu. (You may need to browse your computer for the editor)
• Check Always use this application
• Restart Cyberduck
25
Exercise: transferring files with Cyberduck
• To download a file to your local computer
• Drag and drop a file from Cyberduck to your Finder/File Explorer
window
• Or, double-click
• To upload a file to the remote computer
• Drag and drop a file from Finder/File Explorer to Cyberduck
26
Exercise: editing remote files with
Sublime Text and Cyberduck
• New files
• Click File → New file
• Enter a filename
• Click edit
• Sublime Text should now launch
• Add some text to the file
• Click File → Save or ctrl+s
• Existing files
• Select the file by clicking the filename once
• Click the Edit button in the navigation bar
• Edit the file
• Click File → Save or ctrl+s
27
Basic Unix
28
The file system
• The file system is the part of the operating system (OS)
responsible for managing files and folders
• In Unix, folders are called directories.
• Unix keeps files arranged in a hierarchical structure
• The topmost directory is called the root directory
• Each directory can contain
• Files
• Subdirectories
• You will always be “in” a directory
• When you open a terminal you will be in your own home directory.
• Only you can modify things in your home directory
29
Determining where you are
(pwd)
• If you get lost in the file system, you can determine where you are by
typing:
$ pwd
/home/aclemens
• pwd stands for print working directory
• pwd prints the full path of the current working directory
30
Listing directory contents
(ls)
• To list the contents of a directory:
$ ls
assignment1 foo
• ls stands for list directory contents
31
Changing directories
(cd)
• To change to a different directory
$ cd <directory_name>
where
<directory_name> = the path you want to move to
• A path is a location in the file system
• cd stands for change directory
• To get back to your home directory
$ cd ~
• ~ is shorthand for your home directory
32
Changing directories (cont.)
• To move one directory above the current directory
$ cd ../
• To move two directories above the current directory
$ cd ../../
• You can string as many ../ as you need to
33
Making directories
(mkdir)
• To make a directory
$ mkdir <new_directory_name>
where
<new_directory_name> = name of the directory to create
• mkdir stands for make directory
• Do not use spaces or “/” in directory or file names
34
Exercise: create some directories
Try to create this directory structure:
Hints
• Use pwd to determine where you are in the
directory structure
• Use cd to navigate through the directory
structure.
• Use mkdir to create new directories
35
Copying things
(cp)
• To create a copy of a file
$ cp -i <filename> <copy_of_filename>
where
<filename> = file you want to copy
<copy_of_filename> = name of copied file
The -i flag is a safety feature to make sure you do not overwrite a file that already
exists (interactive)
• To create a copy of a directory
$ cp -r <directory> <copy_of_directory>
where
<directory> = directory you want to copy
<copy_of_directory> = name of copied directory
The -r flag is required to copy all of the directory’s files and subdirectories
36
Copying things (cont.)
(cp)
• cp stands for copy files/directories
• To create a copy of a file and keep the name the same
$ cp -i <filename> .
where
<filename> = file you want to copy
• The shortcut is the same for directories, just remember to include the -r flag
37
Exercise: copying things
Copy /home/assignments/assignment1/README.txt to your
work directory. Keep the name the same.
38
Renaming/moving things
(mv)
• To rename/move a file/directory
$ mv -i <original_filename> <new_filename>
where
<original_filename> = name of file/dir you want to rename
<new_filename> = name you want to rename it to
• mv stands for move files/directories
39
Printing contents of files
(cat)
• To print a file
$ cat <filename>
where
<filename> = name of file you want to print
• cat stands for concatenate file and print to the screen
• Other useful commands for printing parts of files:
• more
• less
• head
• tail
40
Exercise: printing contents of files
Print the contents of your README.txt
Experiment with using different commands, e.g., cat, head, and tail.
How do the commands differ?
41
Deleting Things
(rm)
• To delete a file
$ rm <file_to_delete>
where
<file_to_delete> = name of the file you want to delete
TIP: Check that you’re going to delete the correct files by first testing with 'ls' and then committing to 'rm'
• To delete a directory
$ rm -r -i <directory_to_delete>
where
<directory_to_delete> = name of the directory you want to delete
• rm stands for remove files/directories
IMPORTANT: there is no recycle bin/trash folder on Unix!!
Once you delete something, it is gone forever.
Be very careful when you use rm!!
42
Exercise: deleting things
Delete the test directory that you created in a previous exercise.
43
Saving output to files
• Save the output to a file
$ <cmd> > <output_file>
where
<cmd> = command
<output_file> = name of output file
• WARNING: this will overwrite the output file if it already exists!
• Append the output to the end of a file
$ <cmd> >> <output_file>
44
Learning more about a command
(man)
• To view a command’s documentation
$ man <cmd>
where
<cmd> = command
• man stands for manual page
• Use the ↑ and ↓ arrow keys to scroll through the manual page
45
Exercise: reading documentation
Determine what the following command does
$ cal
46
Getting yourself out of trouble
• Abort a command
47
Unix commands cheatsheet--your new bestie
48
https://ubuntudanmark.dk/filer/fwunixref.pdf
Assignment 1
49
How to complete & “turn in” assignments
1. Create a separate directory for each assignment
2. Create “submission” and “work” subdirectories
• Work = scratch work
• Submission = final version
• The TAs will only grade content that is in your submission
directory
3. Copy the starter scripts and README to your work
directory
4. Copy the final version of the files to your submission
directory
• Don’t touch the submission folder again! Timestamps of the
files are used to determine if the assignment was turned in
on time
50
README files
• A README.txt file contains information on how to run your code and answers to any of the
questions in the assignment
• A template will be provided for each assignment
• Copy the template to your work folder
• Replace the text in {} with your answers
• Leave all other lines alone ☺
52
53
Assignment 1 TODOs
• Download chr20 via FTP (here we use wget)
• You will be given a starter script (nuc_count.py) that counts the total
number of A, C, G, T nucleotides
• Modify the script to calculate the nucleotide frequencies
• Modify the script to calculate the dinucleotide frequencies
• Modify a starter script (make_seq.py) to generate a random sequence
given nucleotide frequencies
• Use make_seq.py to generate random sequence with the same
nucleotide frequencies as chr20
• Compare the chr20 di/nucleotide frequencies (observed) with the random
model (expected)
54
Fasta file format
• A standard text-based file format used to represent sequences, e.g., nucleotide sequences
[Figure: Example fasta file]
55
Requirements
• Due next Friday (1/27) at 10am
• Your submission folder should contain:
□ A Python script to count nucleotides (nuc_count.py)
□ A Python script to make a random sequence file
(make_seq.py)
□ An output file with a random sequence
(random_seq_1M.txt)
□ A README.txt file with instructions on how to run your
programs and answers to the questions.
• Remember to comment your script!
56
Python basics
57
What is Python?
• Python is a widely used programming language
• First implemented in 1989 by Guido van Rossum
• Free, open-source software with community-based
development
• Trivia: Python is named after the BBC show “Monty Python’s Flying Circus” and has nothing to do with reptiles
• Van Rossum is known as a "Benevolent Dictator For Life" (BDFL)
Which Python?
• There are 2 widely used versions of Python: Python2.7 and
Python3.x
• We’ll use Python3
• Many help forums still refer to Python2, so make sure
you’re aware which version is being referenced
58
Interacting with Python
There are 2 main ways of interacting with Python:
60
Variables (cont.)
• To save a variable, use =
>>> x = 2
• x is the name of the variable
• 2 is the value of the variable
Cheatsheet
63
Collections of things
• Why is this concept useful?
• We often have collections of things, e.g.,
• A list of genes in a pathway
• A list of gene fusions in a cancer cell line
• A list of probe IDs on a microarray and their intensity value
• We could store each item in a collection in a separate variable, e.g.,
gene1 = 'SUCLA2'
gene2 = 'SDHD'
...
• A better strategy is to put all of the items in one container
• Python has several types of containers
• List (similar to arrays)
• Set
• Dictionary
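A minimal sketch of the three container types. The gene names come from the slide above; the probe IDs and intensity values are made up for illustration.

```python
# A list keeps items in order and allows duplicates
genes = ['SUCLA2', 'SDHD', 'SDHD']

# A set keeps only the unique items (order is not guaranteed)
unique_genes = set(genes)

# A dictionary maps keys to values (hypothetical probe IDs and intensities)
probe_intensity = {'probe_1': 7.2, 'probe_2': 3.9}

print(len(genes))                  # 3
print(len(unique_genes))           # 2
print(probe_intensity['probe_1'])  # 7.2
```

Use a list when order matters, a set when you only care about membership, and a dictionary when you need to look something up by name.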
64
Lists: what are they?
• Lists hold a collection of things in a specified order
• The things do not have to be the same type
• Many methods can be used to manipulate lists.
Index a list
<listname>[<position>] 'SDHD'
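A short sketch of indexing and two common list methods. The extra gene names (BRCA1, TP53) are added only to make the example longer.

```python
genes = ['SUCLA2', 'SDHD', 'BRCA1']

# Positions start at 0, not 1
print(genes[0])    # SUCLA2
print(genes[1])    # SDHD
print(genes[-1])   # BRCA1 (negative positions count from the end)

# append() adds an item to the end of the list
genes.append('TP53')
print(len(genes))  # 4

# sort() reorders the list in place (alphabetically for strings)
genes.sort()
print(genes)       # ['BRCA1', 'SDHD', 'SUCLA2', 'TP53']
```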
65
Lists: where can I learn more?
• Python.org tutorial:
https://docs.python.org/3.4/tutorial/datastructures.html#more-on-lists
• Python.org documentation:
https://docs.python.org/3.4/library/stdtypes.html#list
66
Doing stuff to variables
• There are 3 common tools for manipulating variables
• Operators
• Functions
• Methods
67
Operators
• Operators are a special type of function:
• Operators are symbols that perform some mathematical or logical operation
• Basic mathematical operators:
Operator Description Example
+ Addition >>> 2 + 3
5
- Subtraction >>> 2 - 3
-1
* Multiplication >>> 2 * 3
6
/ Division >>> 2 / 3
0.6666666666666666
68
Operators (cont.)
You can also use operators on strings!
Operator Description Example
+ Combine strings together >>> 'Bio' + '5488'
'Bio5488'
>>> 'Bio' + 5488
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly
* Repeat a string multiple times >>> 'Marsha' * 3
'MarshaMarshaMarsha'
• Is it a bird? Is it a plane? No, it’s a string!
• Strings and ints cannot be combined
69
Relational operators
• Relational operators compare 2 things
• Return a boolean
• == is used to test for equality; = is used to assign a value to a variable
Operator Description Example
< Less than >>> 2 < 3
True
<= Less than or equal to >>> 2 <= 3
True
> Greater than >>> 2 > 3
False
>= Greater than or equal to >>> 2 >= 3
False
== Equal to >>> 2 == 3
False
!= Not equal to >>> 2 != 3
True
70
Logical operators
• Perform a logical function on 2 things
• Return a boolean
Operator Description Example
and Return True if both arguments are true >>> True and True
True
>>> True and False
False
or Return True if either argument is true >>> True or False
True
>>> False or False
False
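Relational and logical operators are often combined. A small sketch, assuming a hypothetical sequence with made-up GC content and length values:

```python
gc_content = 0.44   # made-up value for illustration
length = 1200       # made-up value for illustration

# Each relational operator returns a boolean
is_gc_rich = gc_content > 0.5    # False
is_long = length >= 1000         # True

# Logical operators combine the booleans
print(is_gc_rich and is_long)    # False
print(is_gc_rich or is_long)     # True
```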
71
Functions: what are they?
• Why are functions useful?
• Allow you to reuse the same code
• Programmers are lazy!
• A block of reusable code used to perform a specific task
Take in arguments (optional) → Do something → Return something (optional)
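A minimal sketch of a user-defined function that follows the pattern above: take in an argument, do something, return something. The function name gc_fraction is made up for this example.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    # Uppercase first so that 'g' and 'G' are counted the same way
    sequence = sequence.upper()
    gc_count = sequence.count('G') + sequence.count('C')
    # Note: an empty sequence would cause a division-by-zero error here
    return gc_count / len(sequence)

# The same code is reused for different inputs
print(gc_fraction('ACGT'))   # 0.5
print(gc_fraction('ggcc'))   # 1.0
```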
73
Python functions: where can I learn more?
• Python.org tutorial
• User-defined functions:
https://docs.python.org/3/tutorial/controlflow.html#defining-functions
• Python.org documentation
• Built-in functions: https://docs.python.org/3/library/functions.html
74
Methods: what are they?
• First a preamble...
• Methods are a close cousin of functions
• For this class we’ll treat them as basically the
same
• The syntax for calling a method is different
than for a function
• If you want to learn about the differences,
google object oriented programming (OOP)
• Why are methods useful?
• Allow you to reuse the same code
75
String methods
Syntax Description Example
<str>.upper() • Returns the string with all letters uppercased >>> x = "Genomics"
>>> x.upper()
'GENOMICS'
<str>.lower() • Returns the string with all letters lowercased >>> x.lower()
'genomics'
<str>.find(<pattern>) • Returns the first index of <pattern> in the string >>> x.find('nom')
2
• Returns -1 if <pattern> is not found
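The same methods applied to a DNA string. This is only a sketch; the sequence is made up.

```python
seq = 'acgtTGAT'

# upper() returns a new string; the original is unchanged
upper_seq = seq.upper()
print(upper_seq)               # ACGTTGAT

# find() returns the position of the first match (counting from 0)
print(upper_seq.find('TTG'))   # 3

# find() returns -1 when the pattern is not in the string
print(upper_seq.find('CCC'))   # -1
```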
77
Conditional statement syntax
Indentation matters!!! Indent the lines of code that belong to the same code block. Use 1 tab.
If
if <condition>:
    # Do something
Example output: x is positive
If/else
if <condition>:
    # Do something
else:
    # Do something else
Example output: x is NOT positive
If/else if/else
if <condition1>:
    # Do something
elif <condition2>:
    # Do something else
else:
    # Do something else
Example output: x is negative
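A runnable sketch of the if/elif/else form. The value of x is arbitrary; try changing it. The block is indented with 4 spaces here; 1 tab also works, as long as a single file does not mix tabs and spaces.

```python
x = -3  # arbitrary example value

if x > 0:
    message = 'x is positive'
elif x < 0:
    message = 'x is negative'
else:
    message = 'x is zero'

print(message)  # x is negative
```

Only the first branch whose condition is true is executed; the rest are skipped.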
78
Commenting your code
• Why is this concept useful?
• Makes it easier for you, your future self, TAs ☺, and anyone unfamiliar with your code to understand what your script is doing
• Comments are human readable text. They are ignored by Python.
• Add comments for
The how
• What the script does
• How to run the script
• What a function does
• What a block of code does
The why
• Biological relevance
• Rationale for design and methods
• Alternatives
80
Commenting your code (cont.)
• Commenting is extremely important!
• Points will be deducted if you do not comment your code
81
Comment syntax
Syntax Example
Block comment
# <your_comment>
# <your_comment>
In-line comment
<code> # <your_comment>
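Both comment styles in a short, made-up example:

```python
# Count the G and C bases in a short sequence.
# GC content varies between genomic regions, so it is a common first statistic.
sequence = 'ACGTTGAT'
gc_count = sequence.count('G') + sequence.count('C')  # total number of G and C bases
print(gc_count)  # 3
```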
82
Python modules
• A module is a file containing Python definitions and statements for a
particular purpose, e.g.,
• Generating random numbers
• Plotting
• Modules must be imported at the beginning of the script
• This loads the variables and functions from the module into your script, e.g.,
import sys
import random
• To access a module’s features, type <module>.<feature>, e.g.,
sys.exit()
83
Random module
• Contains functions for generating random numbers for various
distributions
• TIP: will be useful for assignment 1
Function Description
random.choice Return a random element from a list
84
https://docs.python.org/3.4/library/random.html
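A minimal sketch of random.choice. The seed value is arbitrary; setting a seed is optional and only makes the run repeatable.

```python
import random

# Optional: fix the seed so the same "random" result comes out on every run
random.seed(5488)

nucleotides = ['A', 'C', 'G', 'T']

# Each element of the list is equally likely to be picked
base = random.choice(nucleotides)
print(base)
```

Note that random.choice picks every element with equal probability, so on its own it does not model unequal nucleotide frequencies.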
How to repeat yourself
(for loops)
• Why is this useful?
• Often, you want to do the same thing over
and over again
• Calculate the length of each chromosome in a
genome
• Look up the gene expression value for every gene
• Align each RNA-seq read to the genome
• A for loop takes out the monotony of doing
something a bazillion times by executing a
block of code over and over for you
• Remember, programmers are lazy!
• A for loop iterates over a collection of things
• Elements in a list
• A range of integers
• Keys in a dictionary
85
For loop syntax
Indentation matters!!! Indent the lines of code that belong to the same code block. Use 1 tab.
Syntax:
for <counter> in <collection_of_things>:
    # Do something
• The <counter> variable is the value of the current item in the collection of things
• You can ignore it
• You can use its value in the loop
• All code in the for loop’s code block is executed at each iteration
• TIP: If you find yourself repeating something over and over, you can probably convert your code to a for loop!
Example output when the counter is ignored: "Hello!" printed 10 times
Example output when the counter's value is used: the integers 0 through 9, one per line
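A sketch of three for loops: one that ignores the counter, one that uses it, and one that iterates over a list. The gene names are reused from the earlier slides.

```python
# Ignore the counter: print the same thing 10 times
for i in range(10):
    print('Hello!')

# Use the counter: range(10) gives the integers 0 through 9
total = 0
for i in range(10):
    print(i)
    total = total + i
print(total)  # 45

# Iterate over the elements in a list
for gene in ['SUCLA2', 'SDHD']:
    print(gene)
```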
86
A
Which option would
you rather do?
87
How to repeat yourself (cont.)
• For loops have a close cousin called while loops
• The major difference between the 2
• For loops repeat a block of code a predetermined number of times (really, a
collection of things)
• While loops repeat a block of code as long as an expression is true
• e.g., while it’s snowing, repeat this block of code
• While loops can turn into infinite while loops → the expression is never false so the loop never exits. Be careful!
• See http://learnpythonthehardway.org/book/ex33.html for a tutorial on while loops
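A minimal while loop sketch. The variable name countdown is made up for this example.

```python
countdown = 3

# Repeat the block as long as the expression is true
while countdown > 0:
    print(countdown)
    # Without this line the expression would never become false
    # and the loop would run forever
    countdown = countdown - 1

print('Done')
```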
88
Command-line arguments
• Why are they useful?
• Passing command-line arguments to a Python script allows a script to be
customized
• Example
• make_seq.py can create a random sequence of any length
• If the length wasn’t a command-line argument, the length would be hard-
coded
• To make a 10bp sequence, we would have to 1) edit the script, 2) save the script, and 3)
run the script.
• To make a 100bp sequence, we’d have to 1) edit the script, 2) save the script, and 3) run
the script.
• This is tedious & error-prone
• Remember: be a lazy programmer!
89
90
Command-line arguments
• Python stores the command-line arguments as a list called sys.argv
• sys.argv[0] # script name
• sys.argv[1] # 1st command-line argument
• …
• IMPORTANT: arguments are passed as strings!
• If the argument is not a string, convert it, e.g., int(), float()
• sys.argv is a list of variables
• The values of the variables, e.g., the A frequency, are not “plugged in” until the script
is run
• Use the A_freq to stand for the A frequency that was passed as a command-line
argument
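A sketch of converting command-line arguments. The script name example.py, the function parse_args, and the argument order (length, then A frequency) are assumptions for illustration, not the layout of the starter scripts. A hard-coded list stands in for sys.argv so the sketch runs without arguments.

```python
import sys

def parse_args(argv):
    """Convert the string arguments into the types we need."""
    seq_length = int(argv[1])   # e.g., '100' becomes 100
    a_freq = float(argv[2])     # e.g., '0.3' becomes 0.3
    return seq_length, a_freq

# Running "python3 example.py 100 0.3" would give
# sys.argv == ['example.py', '100', '0.3']
example_argv = ['example.py', '100', '0.3']

# In a real script, pass sys.argv instead of example_argv
seq_length, a_freq = parse_args(example_argv)
print(seq_length, a_freq)
```

Because the values arrive as strings, skipping the int() or float() conversion leads to errors as soon as you try to do math with them.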
91
Reading (and writing) to files in Python
Why is this concept useful?
• Often your data is much larger than just a
few numbers:
• Billions of base pairs
• Millions of sequencing reads
• Thousands of genes
• It may not be feasible to write all of this data
in your Python script
• Memory
• Maintenance
How do we solve this problem?
92
Reading (and writing) to files in Python
The solution:
• Store the data in a separate file
• Then, in your Python script
• Read in the data (line by line)
• Analyze the data
• Write the results to a new output file or print them to the terminal
• When the results are written to a file, other scripts can read in the results file to do more analysis
[Diagram: Input file → Python script 1 → Output file 1 → Python script 2 → Output file 2]
93
Reading a file syntax
Syntax Example
with open(<file>, 'r') as <file_handle>:
    for <current_line> in <file_handle>:
        <current_line> = <current_line>.rstrip()
        # Do something
Output
>chr1
ACGTTGAT
ACGTA
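A self-contained sketch of reading a file line by line. It first writes a small file with the same contents as the example output above, into a temporary directory, so that there is something to read; the filename example.fa is made up.

```python
import os
import tempfile

# Create a small example file so the sketch runs on its own
path = os.path.join(tempfile.mkdtemp(), 'example.fa')
with open(path, 'w') as out_file:
    out_file.write('>chr1\nACGTTGAT\nACGTA\n')

# Read the file line by line and join the sequence lines
sequence = ''
with open(path, 'r') as fasta_file:
    for line in fasta_file:
        line = line.rstrip()  # remove the newline at the end of the line
        # Lines starting with '>' are headers, not sequence
        if not line.startswith('>'):
            sequence = sequence + line

print(sequence)       # ACGTTGATACGTA
print(len(sequence))  # 13
```

The with statement closes the file automatically when the block ends.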
94
The anatomy of a (simple) script
• The first line should always be
#!/usr/bin/env python3
• This special line is called a shebang
• The shebang tells the computer
how to run the script
• It is NOT a comment
95
The anatomy of a (simple) script
96
The anatomy of a (simple) script
• This is a comment
• Comments help the reader better
understand the code
• Always comment your code!
97
The anatomy of a (simple) script