Best way to bulk remove blank lines in a doc?
Senior User
Join Date: Sep 2002
Location: Canastota, New York
I have a bunch of plain text documents that are practice questions for an exam. We're talking thousands of questions here. After each question, there is a series of 5 choices:
A. Blah Blah Blah
B. Blah Blah Blah
...
E. Blah Blah Blah
Between each pair of choices there is a blank line. What I'm trying to figure out is the easiest way to remove the blank lines between these choices.
I'm guessing my options are AppleScript, Perl, or maybe a shell script.
I can outline (in English) what I want the script to do:
If line begins with "A. ", "B. ", "C. ", "D. ", or "E. " then delete next line
Simple, no? I just have no idea where to start with this.
I'd appreciate any opinions on the best/most logical tool for the job, and perhaps a snippet of code or two. In the meantime, I really should get back to studying :-(
Thanks a bunch
Dedicated MacNNer
Join Date: Dec 2002
Location: someplace
Senior User
Join Date: Sep 2002
Location: Canastota, New York
Unfortunately, it's not that simple. While you are correct that it would remove the empty lines between the answer choices, it would also remove any other blank lines, which messes up the formatting of the questions and answer explanations.
Thanks for the input though
Mac Enthusiast
Join Date: Nov 2001
Location: Adelaide, South Australia
Give this a go:
perl -pi.bak -e '$a=<> if /^[A-Z]\./' filename
where filename is the file containing your set of questions. The original is saved as filename.bak should it all go horribly wrong!
Cheers,
Paul
Addicted to MacNN
Join Date: Jun 1999
Location: Las Vegas, NV, USA
Are you sure you want the line after E. to be gone too?
Just do a simple search and replace as previously suggested, but instead of replacing \r\r with \r, replace \r\rB. with \rB. and do the same for C., D. and E.
It will be harder to delete the line after E. (again, are you sure you want this one gone?) but if each question starts with a number, you can just search for \r\r[0-9].
Chris
Senior User
Join Date: Sep 2002
Location: Canastota, New York
Hey Paul,
Once again you come through. Works like a charm. I'm still using that eBroadcast.com TV guide extraction script you made last year.
Guess it's time to learn those regular expressions or whatever they're called.
Thanks again bubba,
-J
Mac Enthusiast
Join Date: Nov 2001
Location: Adelaide, South Australia
"Guess it's time to learn those regular expressions or whatever they're called."
Learning regexes will always stand you in good stead, whether you end up applying them in grep, awk, perl, python, sed or any of the myriad other apps that now embed the capability to munge text in this way.
Your problem was rather attractive in that it let me use a nice trick that had been waiting for an application. The "$a=<>" piece just throws away the line after the one that matches the regular expression (i.e. after any line beginning with a capital letter and then a literal period). Not very defensive, but given that you'd guaranteed the next line to be blank I didn't think it was worth checking!
(Oh yeah: You're welcome)
Cheers,
Paul
Addicted to MacNN
Join Date: Jun 1999
Location: Las Vegas, NV, USA
How does $a=<> mean "the line after the one I just found"? I think I see that it replaces the line with a null, but what is $a?
Chris
Mac Enthusiast
Join Date: Jun 2000
Location: New Jersey, USA
The -p option tells perl to read an input line (assigning it to the variable $_), then execute the given program, then print the variable $_. Without any alteration, that will simply print the input line. This is repeated for every line of the input files.
So Paul's script basically says:
For each input line, see if it starts with a capital letter and a dot. If so, read the next input line into a garbage variable $a (which will be discarded). Then print the original input line.
If you only wanted to delete the lines between the options (and not the one after E.) you could replace A-Z with A-D.
Addicted to MacNN
Join Date: Jun 1999
Location: Las Vegas, NV, USA
Thanks for the explanation. That was interesting.
Chris