A regular expression question
Professional Poster
Join Date: Sep 2000
Location: San Francisco
I'm trying to strip out a bunch of text file names from a file containing lots of other stuff like file sizes and permissions. Can someone help me with the regular expression so that I can use sed to replace all the miscellaneous garbage with commas? I'd like the final output to look something like this:
seq1.txt,rgrseq2.txt,sequence3.txt,seq4.txt
Not all of the file names have convenient lengths, but they all end in .txt and are all preceded by whitespace.
The starting file looks something like this:
seq1.txt 42k 01012003
rgrseq2.txt 4k 02022003
sequence3.txt 4k 02022003
seq4.txt 16k 01012003
Thanks!
kman
Senior User
Join Date: Nov 2001
Location: State of Denial
Find:
(?<=\.txt).+\r?
Replace with:
,
This works for me in BBEdit 7, anyway....
You might need to change \r to \n if you're running it through Perl (and the file has UNIX line-endings)...
It'll add a comma at the very end of the file too, but that's removed by hand easily enough.
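For the sed route the original question asked about, note that sed's regular expressions have no lookbehind, so the BBEdit pattern above won't carry over unchanged. Here is a rough sketch of one alternative, assuming the sample listing from the first post is saved in a scratch file (the mktemp file and the variable names are only for illustration): cut each line off right after .txt with sed, then let paste join the lines with commas.

```shell
# Recreate the sample listing from the first post in a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
seq1.txt 42k 01012003
rgrseq2.txt 4k 02022003
sequence3.txt 4k 02022003
seq4.txt 16k 01012003
EOF

# sed drops everything after ".txt" on each line;
# paste -s joins all lines into one, using "," as the delimiter.
result=$(sed 's/\.txt.*/.txt/' "$tmp" | paste -s -d, -)
echo "$result"
rm -f "$tmp"
```

Because paste only puts the delimiter between lines, this version should not leave a comma at the end to clean up by hand.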
[Wevah setPostCount:[Wevah postCount] + 1];
Mac Elite
Join Date: May 1999
Location: San Jose, CA
Assuming the list of files is in a file called 'inputfile', this will do what you want (it does leave a trailing comma after the last name):
awk '{ORS=","; print $1}' inputfile
To dissect the command:
awk = name of the program used to munge the list
ORS="," = Output Record Separator, i.e. the string awk writes after each record in the output. Normally a newline, but here replaced with a comma.
print $1 = print the first field (whitespace delimited). If you want another field just change the digit: $1 = first field, $2 = second field, etc.
Note that the awk action is enclosed in curly braces { and }, and the whole script is single quoted so that the shell doesn't interpret it before awk gets it.
inputfile = name of the file to read. Change as appropriate. Alternatively, if working from ls output, you could pipe the ls output into this awk command and skip the intermediate file.
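If the trailing comma is a nuisance, one possible variation (a sketch, not the only way to do it) is to print the separator before every name except the first. The scratch file and the sep and result variable names are assumptions made for this example.

```shell
# Recreate the sample listing from the first post in a scratch file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
seq1.txt 42k 01012003
rgrseq2.txt 4k 02022003
sequence3.txt 4k 02022003
seq4.txt 16k 01012003
EOF

# sep is empty for the first record and "," afterwards, so the comma
# only ever appears between names. END prints the final newline.
result=$(awk '{ printf "%s%s", sep, $1; sep = "," } END { print "" }' "$tmp")
echo "$result"
rm -f "$tmp"
```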
Gods don't kill people - people with Gods kill people.
Professional Poster
Join Date: Sep 2000
Location: San Francisco
Thanks! I hadn't heard of awk before.
Now how about some curl help? I'm trying to take my newly parsed list and download some files from a server at work using curl, but I can't seem to get the -o (that's little o) to work.
The syntax I am using is
curl http://workserver.edu/mydirectory/{parsed sequence list here} -o "seq_#1"
It gets the first one okay, but then it just starts dumping the second file to stdout. I thought the #1 variable would get incremented with every new file. At least that's my reading of this section of the man page:
-o/--output <file>
Write output to <file> instead of stdout. If you
are using {} or [] to fetch multiple documents, you
can use '#' followed by a number in the <file>
specifier. That variable will be replaced with the
current string for the URL being fetched. Like in:
curl http://{one,two}.site.com -o "file_#1.txt"
or use several variables like:
curl http://{site,host}.host[1-5].com -o "#1_#2"
You may use this option as many times as you have
number of URLs.
thanks again,
kman
Professional Poster
Join Date: Sep 2000
Location: San Francisco
In reply to Camelot's awk command above:
What if there are no record delimiters in the input file? For example, what if the input file looks like this:
seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003
Then I'm back to using sed, right?
Sorry, I'm new at the whole shell scripting thing and am trying to force myself to do things with the cli that would undoubtedly be easier with afp. That is, until I am proficient at the cli.
kman
Senior User
Join Date: Nov 2001
Location: State of Denial
For the curl question, wrap the URL in single quotes so that the shell hands the braces to curl untouched instead of expanding them itself. Change
curl http://workserver.edu/mydirectory/{parsed sequence list here} -o "seq_#1"
to
curl 'http://workserver.edu/mydirectory/{parsed sequence list here}' -o "seq_#1"
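To see why the quotes matter without touching the network, here is a small sketch that just counts how many arguments a command would receive. It assumes bash is installed (brace expansion is a bash/tcsh feature, not plain POSIX sh); the count_args helper and the two file names in the braces are made up for the demonstration.

```shell
# Feed a tiny script to bash and capture what it prints.
counts=$(bash <<'EOF'
# Hypothetical helper: prints the number of arguments it was given.
count_args() { echo "$#"; }
# Unquoted: the shell expands {a,b} itself, so two separate URLs arrive.
count_args http://workserver.edu/mydirectory/{seq1.txt,seq4.txt}
# Quoted: the braces survive, so one URL arrives and curl can expand it.
count_args 'http://workserver.edu/mydirectory/{seq1.txt,seq4.txt}'
EOF
)
echo "$counts"
```

In the unquoted case curl is handed several plain URLs but only one -o, which would explain why the first file was saved and the rest went to stdout.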
Professional Poster
Join Date: Sep 2000
Location: San Francisco
In reply to Wevah's suggestion to single-quote the curl URL:
Thanks! And I think I worked out the other bit on my own.
Is there a good regular expressions tutorial? I find it difficult to go from what I want to do to an actual regular expression to do it. I think it is just a matter of practice so that I get used to thinking in terms of regular expressions. A tutorial would be quite helpful.
kman
Dedicated MacNNer
Join Date: Jul 2001
Location: NC
In reply to kman42's question about an input file with no line breaks:
Actually, no. Awk would be even more useful then. Sed and awk both process their input line by line. If you created a file with no newlines, then both sed and awk would treat the whole file as a single line. At that point, I would consider the looping and testing capabilities of awk to be the best way to look at each field in turn.
I don't have my references in front of me now, but I would guess that the command would look something like:
awk '{for (i=1; i<=NF; i++) if ($i ~ /\.txt$/) print $i}' inputfile
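As a sketch of how the field loop could also produce the comma-separated list from the original question, the version below collects the matching fields into one string instead of printing them one per line. The scratch file and the out, sep and result names are assumptions for the example.

```shell
# The single-line variant of the listing, saved to a scratch file.
tmp=$(mktemp)
printf '%s\n' 'seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003' > "$tmp"

# Walk every whitespace-delimited field on the line; keep those ending
# in ".txt" and join them with commas (sep is empty for the first hit).
result=$(awk '{
  out = ""; sep = ""
  for (i = 1; i <= NF; i++)
    if ($i ~ /\.txt$/) { out = out sep $i; sep = "," }
  print out
}' "$tmp")
echo "$result"
rm -f "$tmp"
```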
"until I am proficient" you say? Hah! You have the sound of someone who's close to hooked already! I'll bet you never stop getting better. Fortunately there need be no end to it. The best part is, it gets more and more interesting. (at least when these dumb machines do what you say!) Of course there's the ancient Chinese curse to consider, "May you lead an interesting life."
Gary
A computer scientist is someone who, when told to "Go to Hell", sees the
"go to", rather than the destination, as harmful.
Mac Elite
Join Date: May 1999
Location: San Jose, CA
In reply to kman42's question about an input file with no line breaks:
Not at all. In this case you simply change awk's record separator from a newline to a space.
Just as ORS in my previous example sets the output record separator, the RS variable in awk sets the input record separator.
Then a /pattern/ in front of an action restricts that action to records that match the pattern.
This command breaks the list into separate records based on spaces, then prints only the records that contain ".txt":
awk 'BEGIN {RS=" "} /\.txt/ {print $1}' inputfile
The BEGIN {RS=" "} tells awk that before it starts reading the file, the record separator should be set to a space; then the /\.txt/ part tells awk to run the following action (print $1) only on records that match the pattern.
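Putting RS together with ORS from the earlier answer gives the comma-separated list directly. This is only a sketch: the scratch file and the result variable are assumptions for the example, and the trailing comma that ORS leaves behind is trimmed with a small sed step.

```shell
# The single-line variant of the listing, saved to a scratch file.
tmp=$(mktemp)
printf '%s\n' 'seq1.txt 42k 01012003 rgrseq2.txt 4k 02022003 sequence3.txt 4k 02022003 seq4.txt 16k 01012003' > "$tmp"

# RS=" " splits the input into one record per word; ORS="," writes a
# comma after each printed record; sed removes the final comma.
result=$(awk 'BEGIN {RS=" "; ORS=","} /\.txt/ {print $1}' "$tmp" | sed 's/,$//')
echo "$result"
rm -f "$tmp"
```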
Professional Poster
Join Date: Sep 2000
Location: San Francisco
In reply to Camelot's explanation of RS and pattern matching in awk:
Thank you. Your breakdown of the command was very helpful. I tried reading the man page for awk, but the sheer number of options was a little overwhelming.
I've realized that I really need to invest in some sort of tutorial/reference for advanced beginners of UNIX. I know the basics of file system management and a few of the simple commands, but there are so many small utilities that it's almost impossible to know what is available without someone prompting me in the right direction (as to the existence of awk). Any suggestions? An online tutorial would be great, but I wouldn't mind investing in a quality reference manual either.
thanks,
kman
Mac Elite
Join Date: May 1999
Location: San Jose, CA
In reply to kman42's request for a UNIX tutorial or reference:
Ahh... the chicken and egg syndrome of the computer world - knowing which utility to use requires knowing what utilities are available, which requires knowing what they do.
As a starting point, consider "Learning UNIX for Mac OS X (2nd edition)" from O'Reilly ( http://www.oreilly.com/catalog/lunixmacosx2/ ) or Mac OS X in a Nutshell ( http://www.oreilly.com/catalog/macosxian/desc.html )
Both are good all-round introductions to some of the utilities that are included in Mac OS X.
Neither is a complete reference, but there's no substitute for practice - and asking questions, of course.