
Help a poor sod grep (or something)
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Apr 3, 2003, 11:49 AM
 
I'm trying to do a little bioinformatics and I'm sure there is a simple way to do what I want using Unix utilities that I'm just not aware of. Perhaps grep or something similar. Basically, I want to take a text sequence (DNA) where some of the letters are variable and try to match a six-letter string to it.

Here's an example:

TG[T C]TG[T C]TG[T C]GG[T C A G]GC[T C A G]ATGCC[T C A G]CA[A G][C A]G[T C A G]

Letters in brackets can be any one of those letters. I'd then like to search for matches on this string using the following strings:

GGATCC
CCTAGG
GAATTC
CTTAAG

I'm sure this is fairly easy using one of the builtin unix string utilities, but I'm at a loss. Can someone help me out?

thanks,
kman
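
A minimal sketch of one way to do this, in Python rather than a Unix one-liner. It assumes the bracket notation from the post (letters inside [ ] separated by spaces) and treats the long sequence as a list of allowed-letter sets, then slides each six-letter string along it. The function names parse_degenerate and find_sites are made up for this illustration.

```python
import re

def parse_degenerate(seq):
    """Turn 'TG[T C]A' into a list of sets: [{'T'}, {'G'}, {'T', 'C'}, {'A'}]."""
    positions = []
    # Each token is either one fixed letter or a bracketed group of alternatives.
    for fixed, group in re.findall(r"([ACGT])|\[([ACGT ]+)\]", seq):
        positions.append({fixed} if fixed else set(group.split()))
    return positions

def find_sites(positions, site):
    """Return every 0-based offset where `site` is compatible with the sequence."""
    hits = []
    for start in range(len(positions) - len(site) + 1):
        window = positions[start:start + len(site)]
        # Compatible means each letter of the site is allowed at its position.
        if all(base in allowed for base, allowed in zip(site, window)):
            hits.append(start)
    return hits

degenerate = ("TG[T C]TG[T C]TG[T C]GG[T C A G]GC[T C A G]"
              "ATGCC[T C A G]CA[A G][C A]G[T C A G]")
positions = parse_degenerate(degenerate)
for site in ("GGATCC", "CCTAGG", "GAATTC", "CTTAAG"):
    print(site, find_sites(positions, site))
```

For the example sequence in the post this prints an empty list for all four strings, i.e. none of them can occur in any version of that sequence. A shorter check such as find_sites(parse_degenerate("A[C G]T"), "GT") returns [1].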
     
kman42  (op)
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Apr 3, 2003, 12:08 PM
 
I tried typing in the long sequence as a regular expression, saving it as a file, and then using grep to search for the short sequences, but that doesn't work. Can you only use regular expressions in the pattern, not the file?

kman
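
On the pattern-versus-file question: regex tools interpret only the pattern as a regular expression, and the text being searched is taken literally. A small Python sketch of that asymmetry, using a shortened piece of the sequence (the strings here are made up for illustration):

```python
import re

degenerate = "GG[T C A G]GC"
pattern = degenerate.replace(" ", "")  # 'GG[TCAG]GC' is a valid character-class regex

# Pattern side: the brackets act as "any one of these letters".
as_pattern = bool(re.search(pattern, "TTGGAGCTT"))

# Text side: the brackets are ordinary characters, so a concrete string
# such as 'GGAGC' is simply not found in the literal text 'GG[T C A G]GC'.
as_text = bool(re.search("GGAGC", degenerate))

print(as_pattern, as_text)
```

This prints True False, which is why saving the bracketed sequence as the file and searching it for the short strings finds nothing.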
     
kman42  (op)
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Apr 3, 2003, 12:21 PM
 
Perhaps Perl's m// match operator?

kman
     
Fresh-Faced Recruit
Join Date: Apr 2003
Status: Offline
Apr 3, 2003, 01:13 PM
 
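This little Perl example may help:

#!/usr/bin/perl
use strict;
use warnings;

# The bracketed letters from the original post are written out in a row here,
# so this is a plain literal string, not the variable sequence itself.
$_ = 'TGTCTGTCTGTCGGTCAGGCTCAGATGCCTCAGCAAGCAGTCAG';

my $pat1 = 'GGATCC';
my $pat2 = 'CCTAGG';
my $pat3 = 'GAATTC';
my $pat4 = 'CTTAAG';

# One capture group around the alternation, so $1 holds whichever pattern matched.
/($pat1|$pat2|$pat3|$pat4)/ && print "matched $1\n";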
     
Mac Elite
Join Date: May 1999
Location: San Jose, CA
Status: Offline
Apr 7, 2003, 11:11 PM
 
If the source is one long DNA string (and not just the 20 or so characters you describe), neither grep nor perl is going to do this for you.

The reason is that both models work at the line level - read a line, check for a match, move on to the next line.

If you have one long DNA sequence, it will read in the entire sequence as one 'line' and report if there's any match - anywhere in the string.

Assuming you can read the entire sequence into RAM, there's likely to be at least one match somewhere in the string, but both models will simply tell you there's a match, not where it is, nor if there's more than one match in the sequence.

You'd have to change the code to something that reads a few bytes at a time, checks for matches, moves along a byte, checks for more matches, and so on to the end of the sequence.

In short, there's a reason why there are commercial gene sequencing programs costing $$$$$s.
Gods don't kill people - people with Gods kill people.
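
The "check, move along one position, check again" scan described above can be sketched in a few lines of Python, assuming the whole sequence fits in memory as a single string. It reports every offset, including overlapping ones, rather than just whether a match exists. The function name find_all is made up for this illustration.

```python
def find_all(sequence, site):
    """Return every 0-based offset of `site` in `sequence`, overlaps included."""
    hits = []
    start = sequence.find(site)
    while start != -1:
        hits.append(start)
        # Resume one position past the last hit so overlapping matches are kept.
        start = sequence.find(site, start + 1)
    return hits

# Hypothetical test sequence containing two copies of one site.
sequence = "AAGGATCCTTGGATCCAA"
for site in ("GGATCC", "CCTAGG", "GAATTC", "CTTAAG"):
    print(site, find_all(sequence, site))
```

The same loop works on a sequence of any length that can be held in memory. Handling the bracketed "any of these letters" positions would need the pattern side to carry the alternatives, which this sketch does not attempt.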
     
   
All contents of these forums © 1995-2011 MacNN. All rights reserved.