Welcome to the MacNN Forums.


programming help (DNA analysis)
Junior Member
Join Date: Apr 2001
Feb 23, 2004, 11:40 AM
 
For all you CS people who wanted to try your hand at bioinformatics, here's a tiny little problem I am working on. I am an inexperienced programmer, but I might know enough to fake my way through. I just need a little help getting started.

I will have about 200 strings, each 23 letters long, and each letter can be A, C, G, or T.

I want to "score" these strings based on certain letters at certain positions.

Example string
nnnnXnnnnnnXnnXnnXXXXXnn

In this example I want to add a point for the letter A in the 5th position, and add a point for a T in the 11th position. I also want to subtract a point for a G in the 13th position. All of the n's can be any of the four letters.

I have a website that will search a longer string and pull these patterns out, but I can't figure out a command for the scoring. Does anyone know a simple command to do something like this? I don't really have a language preference, but I know Perl is regularly used for DNA sequence analysis.

Thanks.

-MS
     
Dedicated MacNNer
Join Date: Apr 2001
Location: Bethesda, MD
Feb 23, 2004, 02:44 PM
 
Well, I'm a geek and I got sick of my work, so I hacked up a Perl script. It should do what you want. The scoring loop is kind of gross. It takes the strings from standard input, one per line.

Code:
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    # take a 23 character string and make a 23 element array, @letters. each
    # element of @letters is a number, the ASCII value of the character.
    my @letters = unpack("c23", $line);
    my $score = 0;
    for (my $i = 0; $i <= $#letters; $i++) {
        if (($i == 4) && ($letters[4] == 65)) {
            # found 'A' in the 5th letter
            $score++;
            print chr($letters[$i]), " ";
        } elsif (($i == 10) && ($letters[10] == 84)) {
            # found 'T' in the 11th letter
            $score++;
            print chr($letters[$i]), " ";
        } elsif (($i == 12) && ($letters[12] == 71)) {
            # found 'G' in the 13th letter
            $score--;
            print chr($letters[$i]), " ";
        } else {
            # print the non-matching characters in lower case
            print lc(chr($letters[$i])), " ";
        }
    }
    print "\n";
    print "score = $score\n";
}
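
For anyone who wants to add more rules later, a table-driven version of the same scoring may be easier to extend than a chain of if/elsif branches. This is only a rough sketch, written in Python rather than Perl; the names `RULES` and `score_sequence` are made up for illustration. It assumes 1-based positions, as in the original question, and upper-case input.

```python
# Hypothetical rule table: 1-based position -> (letter, points).
# The three entries are the rules stated in the original question.
RULES = {
    5: ("A", +1),
    11: ("T", +1),
    13: ("G", -1),
}


def score_sequence(seq, rules=RULES):
    """Return the total points for every rule whose letter matches seq."""
    score = 0
    for position, (letter, points) in rules.items():
        index = position - 1  # convert the 1-based position to a 0-based index
        # Strings shorter than the rule position simply do not match.
        if index < len(seq) and seq[index] == letter:
            score += points
    return score
```

Calling `score_sequence(line.strip().upper())` on each input line would give one score per string; adding a rule is then a one-line change to the table.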
     
Professional Poster
Join Date: Oct 1999
Location: :ИOITAↃO⅃
Feb 23, 2004, 04:28 PM
 
Also, bear in mind that this is a pretty simplistic framework if you're trying to find sequences homologous to some ancestral sequence, or sequences bearing some target sequence of interest. Might I suggest you read the first two chapters of Durbin's Biological Sequence Analysis? You can use Amazon's search-inside feature to read the useful bits.

You might look into either
(a) a simple string edit distance measure
or
(b) scoring with log odds ratios against a background model

as discussed in the book.
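
To make the two options concrete, here is a minimal sketch of each in Python. It is not taken from the book: `edit_distance` is the standard Levenshtein dynamic program, and `log_odds_score` assumes you already have per-position letter probabilities and background frequencies, with no zero entries. The function names and the numbers in the usage note are illustrative, not real data.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-letter
    insertions, deletions, or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def log_odds_score(seq, position_probs, background):
    """Sum over positions of log2(P(letter | position) / P(letter | background)).

    position_probs is a list with one dict per position mapping each letter
    to its probability; background maps each letter to its overall frequency.
    Positions beyond the shorter of seq and position_probs are ignored.
    """
    total = 0.0
    for letter, probs in zip(seq, position_probs):
        total += math.log2(probs[letter] / background[letter])
    return total
```

For example, with a uniform background of 0.25 per letter, a position where A has probability 0.5 contributes +1 to the score when the string has an A there, and a letter with probability 0.125 contributes -1. A positive total means the string looks more like the model than like background sequence.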
     
   