
A regular expression question
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Jan 15, 2003, 07:13 PM
 
I'm trying to strip out a bunch of text file names from a file containing lots of other stuff like file size and permissions, etc. Can someone help me with the regular expression so that I can use sed to replace all the miscellaneous garbage with commas? I'd like the final output to look something like this:

seq1.txt,rgrseq2.txt,sequence3.txt,seq4.txt

Not all of the file names have convenient lengths, but they all end in .txt and are all preceded by whitespace.

The starting file looks something like this:

seq1.txt 42k 01012003
rgrseq2.txt 4k 02022003
sequence3.txt 4k 02022003
seq4.txt 16k 01012003


Thanks!
kman
     
Senior User
Join Date: Nov 2001
Location: State of Denial
Status: Offline
Jan 15, 2003, 07:33 PM
 
Find:

(?<=\.txt).+\r?

Replace with:

,

This works for me in BBEdit 7, anyway....

You might need to change \r to \n if you're running it through Perl (and the file has UNIX line-endings)...

It'll add a comma at the very end of the file too, but that's removed by hand easily enough.
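Since the original question asked for sed, here is a minimal sketch of a sed-based route to the same result. sed has no lookbehind, so instead of matching what follows .txt it deletes everything from the first whitespace onward and leaves the joining to paste, which also avoids the trailing comma. The scratch file and the variable names are illustrative, not from the thread.

```shell
# Recreate the sample listing from the question in a scratch file (illustrative path).
inputfile=$(mktemp)
cat > "$inputfile" <<'EOF'
seq1.txt 42k 01012003
rgrseq2.txt 4k 02022003
sequence3.txt 4k 02022003
seq4.txt 16k 01012003
EOF

# First expression: drop any leading whitespace.
# Second expression: delete from the first remaining whitespace to end of line.
# paste -s then joins the surviving lines into one, using "," as the delimiter.
names=$(sed -e 's/^[[:space:]]*//' -e 's/[[:space:]].*$//' "$inputfile" | paste -sd, -)
printf '%s\n' "$names"

rm -f "$inputfile"
```

This assumes the file name is the first whitespace-delimited item on every line, as in the sample data.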
[Wevah setPostCount:[Wevah postCount] + 1];
     
Mac Elite
Join Date: May 1999
Location: San Jose, CA
Status: Offline
Jan 15, 2003, 07:33 PM
 
Assuming the list of files is in a file called 'inputfile', this will do what you want (it leaves a trailing comma after the last name):

awk '{ORS=","; print $1}' inputfile

To dissect the command:

awk <- name of the program used to munge the list

ORS="," = Output Record Separator, i.e. the string printed after each record in the output. Normally a newline, but here replaced with a comma.

print $1 = print the first field (whitespace delimited). If you want another field just change the digit - $1 = first field, $2 = second field, etc.

Note that the awk script is enclosed in curly braces { and }, and is single quoted to avoid problems with the shell parsing it before awk gets it.

inputfile = name of the file to read. Change as appropriate. Alternatively, if working from an ls output, you could pipe the ls output into this awk command and not use any intermediate file.
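As a rough illustration of the dissection above, the sketch below runs the command against the sample data and then shows one possible variant that avoids the trailing comma. The variant, the scratch file, and the variable names are assumptions added for illustration, not part of the original post.

```shell
# Recreate the sample listing in a scratch file (illustrative path).
inputfile=$(mktemp)
printf '%s\n' 'seq1.txt 42k 01012003' 'rgrseq2.txt 4k 02022003' \
  'sequence3.txt 4k 02022003' 'seq4.txt 16k 01012003' > "$inputfile"

# The command from the post: ORS is written after every record, including the last.
with_ors=$(awk '{ORS=","; print $1}' "$inputfile")

# Variant: write the comma before every record except the first, then end the line.
joined=$(awk '{printf "%s%s", (NR == 1 ? "" : ","), $1} END {print ""}' "$inputfile")

printf '%s\n' "$with_ors" "$joined"
rm -f "$inputfile"
```

The first result ends in a comma and has no final newline; the second is a clean comma-separated line.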
Gods don't kill people - people with Gods kill people.
     
Senior User
Join Date: Nov 2001
Location: State of Denial
Status: Offline
Jan 15, 2003, 07:34 PM
 
Or use awk.

;p
[Wevah setPostCount:[Wevah postCount] + 1];
     
kman42  (op)
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Jan 15, 2003, 07:40 PM
 
Thanks! I hadn't heard of awk before.

Now how about some curl help? I'm trying to take my newly parsed list and download some files from a server at work using curl, but I can't seem to get the -o (that's little o) to work.

The syntax I am using is

curl http://workserver.edu/mydirectory/{parsed sequence list here} -o "seq_#1"

It gets the first one okay, but then it just starts dumping the second file to stdout. I thought the #1 variable would get incremented with every new file. At least that's my reading of this section of the man page:

-o/--output <file>
Write output to <file> instead of stdout. If you
are using {} or [] to fetch multiple documents, you
can use '#' followed by a number in the <file>
specifier. That variable will be replaced with the
current string for the URL being fetched. Like in:

curl http://{one,two}.site.com -o "file_#1.txt"

or use several variables like:

curl http://{site,host}.host[1-5].com -o "#1_#2"

You may use this option as many times as you have
number of URLs.


thanks again,
kman
     
kman42  (op)
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Jan 15, 2003, 07:45 PM
 
What if there are no record delimiters in the input file? For example, what if the input file looks like this:

seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003



Then I'm back to using sed, right?

Sorry, I'm new at the whole shell scripting thing and am trying to force myself to do things with the cli that would undoubtedly be easier with afp. That is, until I am proficient at the cli.


kman
     
Senior User
Join Date: Nov 2001
Location: State of Denial
Status: Offline
Jan 15, 2003, 10:09 PM
 
For the curl question, wrap the URL in single quotes. Change

curl http://workserver.edu/mydirectory/{parsed sequence list here} -o "seq_#1"

to

curl 'http://workserver.edu/mydirectory/{parsed sequence list here}' -o "seq_#1"
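A likely reason the quotes matter, assuming a shell that performs brace expansion (bash and tcsh both do): left unquoted, the shell expands the braces itself and hands curl one URL per file name, so curl never sees a {} glob for #1 to refer to and the single -o covers only the first URL. The sketch below builds the quoted URL from a comma-separated list. The list contents and variable names are illustrative, and the curl call is left commented out because the server is the poster's own.

```shell
# Comma-separated list as produced by the awk/sed step (sample values from the thread).
list='seq1.txt,rgrseq2.txt,sequence3.txt,seq4.txt'

# Inside double quotes the shell substitutes $list but leaves the braces alone,
# so curl would receive a single argument that still contains the {a,b,c} glob.
url="http://workserver.edu/mydirectory/{$list}"
printf '%s\n' "$url"

# With the glob intact, curl can substitute #1 with each name in turn:
# curl "$url" -o "seq_#1"
```

Double quotes are used here rather than single quotes only so that $list is substituted; either kind keeps the braces away from the shell.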
[Wevah setPostCount:[Wevah postCount] + 1];
     
kman42  (op)
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Jan 16, 2003, 10:16 AM
 
Thanks! And I think I worked out the other bit on my own.

Is there a good regular expressions tutorial? I find it difficult to go from what I want to do to an actual regular expression to do it. I think it is just a matter of practice so that I get used to thinking in terms of regular expressions. A tutorial would be quite helpful.

kman
     
Dedicated MacNNer
Join Date: Jul 2001
Location: NC
Status: Offline
Jan 16, 2003, 05:09 PM
 
Originally posted by kman42:
What if there are no record delimiters in the input file? For example, what if the input file looks like this:

seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003

Then I'm back to using sed, right?
Actually, no. Awk would be even more useful then. Sed and awk both process input line by line, so with a file containing no newlines each would treat the whole file as a single line. At that point, I would consider the looping and testing capabilities of awk to be the best way to look at each field in turn.

I don't have my references in front of me now, but I would guess that the command would look something like:

awk '{for (i=1; i<=NF; i++) if ($i ~ /\.txt$/) print $i}' inputfile
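A minimal sketch of that field loop run against the one-line sample, extended to produce the comma-separated output from the original question. The sep variable and the shell variable names are illustrative additions, not something from the post.

```shell
# One-line version of the listing, as in the follow-up question.
line='seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003'

# Walk every whitespace-separated field and keep those ending in ".txt".
# sep is empty before the first match and "," afterwards, so no trailing comma appears.
names=$(printf '%s\n' "$line" | awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /\.txt$/) { printf "%s%s", sep, $i; sep = "," }
} END { print "" }')

printf '%s\n' "$names"
```

Because the loop tests each field on its own, the same command works whether the listing is on one line or many.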

Sorry, I'm new at the whole shell scripting thing and am trying to force myself to do things with the cli that would undoubtedly be easier with afp. That is, until I am proficient at the cli.

kman
"until I am proficient" you say? Hah! You have the sound of someone who's close to hooked already! I'll bet you never stop getting better. Fortunately there need be no end to it. The best part is, it gets more and more interesting. (at least when these dumb machines do what you say!) Of course there's the ancient Chinese curse to consider, "May you lead an interesting life."
Gary
A computer scientist is someone who, when told to "Go to Hell", sees the
"go to", rather than the destination, as harmful.
     
Mac Elite
Join Date: May 1999
Location: San Jose, CA
Status: Offline
Jan 17, 2003, 01:48 PM
 
Originally posted by kman42:
What if there are no record delimiters in the input file? For example, what if the input file looks like this:

seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003


Then I'm back to using sed, right?
Not at all. In this case you simply override awk's record delimiter from a return to a space.

Just as ORS in my previous example sets the output record separator, awk's RS variable sets the input record separator (a single character is portable; some awk implementations also accept longer strings).

Then a /string/ pattern restricts the commands that follow it to records that match /string/.

This command breaks the list into separate records based on spaces, then prints only the records that contain "txt" (here, the .txt file names):

awk 'BEGIN {RS=" ";} /txt/ {print $1};' inputfile

The BEGIN {RS=" ";} block tells awk to set the record separator to a space before it starts reading the file; the /txt/ pattern then tells awk to run the following command (print $1) only on records that match /txt/.
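For illustration, here is a small sketch of the RS approach applied to the one-line sample. It uses the stricter pattern /\.txt$/ in place of /txt/ (a substitution made here so that only names ending in .txt match) and adds a paste step to get the comma-separated form. The variable names are illustrative.

```shell
# One-line version of the listing, as in the follow-up question.
line='seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003'

# RS=" " turns each space-separated token into its own record.
# The last record also carries the trailing newline, but it is a date, not a file name.
per_line=$(printf '%s\n' "$line" | awk 'BEGIN {RS=" "} /\.txt$/ {print $1}')

# Same filter, with paste joining the matches into one comma-separated line.
joined=$(printf '%s\n' "$line" | awk 'BEGIN {RS=" "} /\.txt$/ {print $1}' | paste -sd, -)

printf '%s\n' "$per_line" "$joined"
```

Without the paste step the names come out one per line, since ORS is still the default newline.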
Gods don't kill people - people with Gods kill people.
     
kman42  (op)
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Status: Offline
Jan 17, 2003, 02:14 PM
 
Thank you. Your breakdown of the command was very helpful. I tried reading the man page for awk, but the sheer number of options was a little overwhelming.

I've realized that I really need to invest in some sort of tutorial/reference for advanced beginners of UNIX. I know the basics of file system management and a few of the simple commands, but there are so many small utilities that it's almost impossible to know what is available without someone prompting me in the right direction (as to the existence of awk). Any suggestions? An online tutorial would be great, but I wouldn't mind investing in a quality reference manual either.

thanks,
kman
     
Mac Elite
Join Date: May 1999
Location: San Jose, CA
Status: Offline
Jan 17, 2003, 03:24 PM
 
Ahh... the chicken and egg syndrome of the computer world - knowing which utility to use requires knowing what utilities are available, which requires knowing what they do.

As a starting point, consider "Learning UNIX for Mac OS X (2nd edition)" from O'Reilly ( http://www.oreilly.com/catalog/lunixmacosx2/ ) or Mac OS X in a Nutshell ( http://www.oreilly.com/catalog/macosxian/desc.html )

Both are good all-round introductions to some of the utilities that are included in Mac OS X.

Neither is a complete reference, but there's no substitute for practice - and asking questions, of course.
Gods don't kill people - people with Gods kill people.
     
   