Project 1

BIOL/CMPU 353 - Bioinformatics
Smith and Schwarz
Fall 2008

Assigned: Mon, Sep. 15
Due: Mon, Sep. 22

DNA: Playing with Strings

Write a Perl program to print a report about a specific DNA sequence (see note 1 at the end):

cgccatataatgctcgtccgcgcccta

This sequence consists of the first 27 nucleotides of a gene sequence. Your goal is to find the beginning of the protein-coding segment of the gene. Since all protein-coding regions of genes begin with the three-nucleotide code ATG, we can find the start of a gene by searching for ATG. All the sequence to the left (upstream) of ATG is referred to as “upstream sequence”, and the sequence from ATG onward (including ATG itself) is the protein-coding sequence.
Note: the entire gene sequence is not included in this small sequence.

A gene starting with the 3bp (3 base pair) codon ATG:

...upstream sequence......ATGCTCGTCCGCGCCCTA......

Within a gene, each consecutive group of three nucleotides is called a codon. ATG is the initial codon in every gene, but the other codons vary depending on the particular gene. In this example, CTC is the second codon, GTC is the third codon, and so on.

...CGCCATATAATGCTCGTCCGCGCCCTA...
   ^^^^^^^^^                      <-- upstream sequence
...CGCCATATAATGCTCGTCCGCGCCCTA...
            ^^^                   <-- initial codon: ATG
...CGCCATATAATGCTCGTCCGCGCCCTA...
               ^^^                <-- second codon: CTC
...CGCCATATAATGCTCGTCCGCGCCCTA...
                  ^^^             <-- third codon: GTC
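
The codon layout above can be sketched in Perl with index and substr. This is only an illustration on a made-up toy sequence, not the assigned one; the variable names ($toySequence, $startOffset, and so on) are hypothetical. Note that index counts from 0, so 1 is added to report a position in bp.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy sequence (not the assigned one), already in uppercase.
my $toySequence = "GGTATAATGGCCTGA";

# index() returns the 0-based offset of the first match, or -1 if there is none.
my $startOffset = index($toySequence, "ATG");
die "No ATG found\n" if $startOffset == -1;

# Positions in bp are counted from 1, so add 1 to the 0-based offset.
my $startPosition = $startOffset + 1;
print "ATG start codon begins in position (bp) $startPosition\n";

# Each later codon starts 3 nucleotides after the previous one.
my $secondCodon = substr($toySequence, $startOffset + 3, 3);
my $thirdCodon  = substr($toySequence, $startOffset + 6, 3);
print "    followed by codon $secondCodon in position (bp) ", $startPosition + 3, "\n";
print "    followed by codon $thirdCodon in position (bp) ", $startPosition + 6, "\n";
```

For this toy sequence the start codon is reported at position 7, followed by GCC at position 10 and TGA at position 13.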

"Understanding the Biology" Assignment

Provide written answers to the following questions:

  1. Why do you think the coding region of all proteins starts with “ATG”? There are 64 possible codons, so why wouldn’t organisms evolve different possible “Start” codons?
  2. Once you have written and tested your program, speculate on what biological questions your program could help solve.

Programming Assignment

Your task is to locate and report the position of the ATG codon within the starting sequence, and then take substrings (substr) of the entire sequence to form two other strings that hold the upstream and genic sequences. Your program should print the starting sequence and its length; the locations and three-character codes of the first three codons in the gene; the upstream and genic sequences and their respective lengths (see the sample report below); and the reverse complement of the genic sequence.
Note: your program should work not only on this sequence but on any sequence that contains ATG. Therefore, use variables (e.g., $ATGPosition) rather than hard-coded numbers such as 9.
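
One possible way to split a sequence and build a reverse complement is sketched below, again on a made-up toy sequence rather than the assigned one. The variable names are hypothetical, and the use of reverse plus tr is just one common approach, not necessarily the one expected for this project.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy sequence (not the assigned one), already in uppercase.
my $toySequence = "GGTATAATGGCCTGA";

my $startOffset = index($toySequence, "ATG");
die "No ATG found\n" if $startOffset == -1;

# Everything before the ATG: start at offset 0 and take $startOffset characters.
my $upstreamPart = substr($toySequence, 0, $startOffset);

# Everything from the ATG onward: with no length argument, substr runs to the end.
my $genicPart = substr($toySequence, $startOffset);

# Reverse complement: reverse the string, then swap A<->T and C<->G.
my $minusStrand = scalar reverse($genicPart);
$minusStrand =~ tr/ACGT/TGCA/;

print "Upstream sequence is: $upstreamPart (", length($upstreamPart), " bp)\n";
print "Gene + Strand: $genicPart (", length($genicPart), " bp)\n";
print "Gene - Strand: $minusStrand\n";
```

Because every offset is derived from $startOffset, the same code works unchanged on any sequence that contains ATG.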

Finally (once you have all of the above working), print the original sequence with only the ATG start codon in uppercase and the rest of the sequence in lowercase. Also, to get a feel for whether a region is AT-rich, print a count of the number of As and Ts in the upstream region.
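
As a rough sketch of these last two steps (on a made-up toy sequence, with hypothetical variable names), the four-argument form of substr can overwrite part of a string, and tr with an empty replacement list counts matching characters without changing the string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical toy sequence (not the assigned one), already in uppercase.
my $toySequence = "GGTATAATGGCCTGA";

my $startOffset = index($toySequence, "ATG");
die "No ATG found\n" if $startOffset == -1;

# Lowercase a copy of the sequence, then overwrite the three start-codon
# characters with uppercase "ATG" using the 4-argument form of substr.
my $highlighted = lc $toySequence;
substr($highlighted, $startOffset, 3, "ATG");
print "Start codon highlighted: $highlighted\n";

# tr returns the number of characters it matched; with an empty
# replacement list and no /d modifier the string is left unchanged.
my $upstreamPart = substr($toySequence, 0, $startOffset);
my $atCount = ($upstreamPart =~ tr/AT//);
print "Number of As and Ts upstream: $atCount\n";
```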

Sample Output

A sample output is shown below. Your program’s output need not be identical, but you should print the information in the same order, and your answers should agree.

+++++++++++++ Upstream and Genic Report +++++++++++++++++++++++++

Starting sequence is: cgccatataatgctcgtccgcgcccta
Converted to uppercase: CGCCATATAATGCTCGTCCGCGCCCTA

Length of starting sequence is: 27
-----------------------------------------------------------------

ATG start codon begins in position (bp) 10
    followed by codon CTC in position (bp) 13
    followed by codon GTC in position (bp) 16
-----------------------------------------------------------------

Upstream sequence is: CGCCATATA

Upstream length (bp): 9

-----------------------------------------------------------------

Gene sequence is: ATGCTCGTCCGCGCCCTA

Gene length (bp): 18

-----------------------------------------------------------------

Gene + Strand: ATGCTCGTCCGCGCCCTA

Gene - Strand: TAGGGCGCGGACGAGCAT

: you complete the last steps...
-----------------------------------------------------------------

What to Submit:

The written answers to the biological questions and your completed program are due on Mon, Sep. 22. You must both submit your program electronically and hand in a hardcopy in class.

Specifically:

  1. one hardcopy of your answers to the biological questions
  2. one electronic copy of your Perl program
  3. one hardcopy of your Perl program (use a good name, e.g., Smith_a1.pl)
  4. one page that shows the OUTPUT of your program.
  5. You must staple the hardcopy of your Perl program (e.g., Smith_a1.pl) and your output page together.

Starter Code

Here is some help getting started...

#!/usr/bin/perl
use strict;
use warnings;
#================================================================
#
# Summary: This Perl program isolates the upstream and genic
#          regions of a sequence. A report is printed, a sample
#          of which is shown below:
#
#          (you paste a sample of your program's output here)
#
# Programmer: Ima Intergenic
#
# Date Last Modified:
# 09/15/2008 -- started program, finished length of sequence
# 09/20/2008 -- trouble with getting correct location of ATG
# 09/22/2008 -- fixed ATG location, finished program
#
#===============================================================
 
print "+++++++++ Upstream and Genic Report ++++++++++++++++\n\n";
 
my $someSequence; # upstream and start of a gene ...
 
$someSequence = "cgccatataatgctcgtccgcgcccta";
 
print "Starting sequence is: $someSequence \n";
 
# convert all nucleotides to uppercase
:
 
print "Converted to uppercase: $someSequence \n\n";
:
 
print "Length of starting sequence is: $seqLength \n";
 
print "----------------------------------------------------\n\n";
 
# get the position of the start codon "ATG"
my $ATGPosition;
$ATGPosition = index( ...
:
 
print "----------------------------------------------------\n\n";
 
my $upStream;
$upStream = substr( ...
 
print "Upstream sequence is: $upStream \n\n";
:

You should copy/paste the above code into jEdit, then save it via secure FTP on the bioinf cluster:

  • under the Plugins menu, choose FTP, then select “Save to Secure FTP Server” and specify bioinf.cs.vassar.edu and your login id and password, then
  • from the File System Browser window that pops up when you go to save, navigate from your home directory to your course directory (cs353), then
  • create a new directory named project1 (or another well-named directory of your choice), then
  • navigate into your newly-created project directory to save your file, finally
  • save your program and give it a good name.

Even if you originally saved your file under your cs353 directory, you can use jEdit and the above instructions to do a “Save As...” and resave it in your project directory.

Once your program works properly, use the submit353 script to submit your program electronically. As a reminder, here’s what to do:

  • from your ssh connection to the bioinf cluster, change directories to your course directory: cd ~/cs353
  • use the submit353 command to submit your project1 directory (or whatever you named your project directory, if different from project1):

submit353 project1

1) Assignment based on LeBlanc and Dyer, © Fall 2007
courses/cs353-200803/assigns/assign01.txt · Last modified: 2008/09/15 09:53 by mlsmith