Welcome to the MacNN Forums.

Batch HTML formatting
Fresh-Faced Recruit
Join Date: Apr 1999
Location: Vancouver BC, Canada
Feb 5, 2002, 09:28 AM
 
Hi there, new to the boards. I apologize if there is an obvious solution to my problem that hasn't occurred to me.

I'm writing a Perl script that extracts data from HTML files. I plan to post a few Perl questions in another topic.

The HTML files are already in a standard format, but not one conducive to my planned method of extraction.

OmniWeb's 'reformat' button does a fine job for my purposes, but I need to do the same thing to a batch of files.

E.g., I'd like a tool to take the following:
<p><b>Area:</b>
<br><i>total:</i>
652,000 sq km
<br><i>land:</i>
652,000 sq km
<br><i>water:</i>
0 sq km

and turn it into:
<b>Area:</b> <br>
<i>total:</i> 652,000 sq km <br>
<i>land:</i> 652,000 sq km <br>
<i>water:</i> 0 sq km

This way, I can build arrays of the strings that come before and after each piece of desired information and run something along these lines:
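if ($line =~ s/$searchy[$itr]//) {
# drop the trailing string (plus two more characters) from the end of the line
substr($line, -length($choppy[$itr])-2) = "";
# closing tag uses a forward slash: </tag>
print "<$taggy[$itr]>$line</$taggy[$itr]>\n";
}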
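
For anyone following along, here is a minimal sketch of one way to read the before/after idea above. It is not the script from this post: the tables `@before`, `@after` and `@tag`, their contents, and the helper `extract_field` are all hypothetical, and it uses a single anchored regex instead of the `s///` plus `substr` pair.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lookup tables (not from the original post): the text before
# the value, the text after it, and the tag to wrap the value in.
my @before = ('<i>total:</i> ', '<i>land:</i> ', '<i>water:</i> ');
my @after  = (' <br>',          ' <br>',         '');
my @tag    = ('total',          'land',          'water');

# Returns "<tag>value</tag>" for the first table entry that matches $line,
# or nothing (undef in scalar context) if no entry matches.
sub extract_field {
    my ($line) = @_;
    for my $i (0 .. $#before) {
        my ($pre, $post, $name) = ($before[$i], $after[$i], $tag[$i]);
        # \Q...\E quotes regex metacharacters in the literal strings.
        if ($line =~ /^\Q$pre\E(.*?)\Q$post\E\s*$/) {
            return "<$name>$1</$name>";
        }
    }
    return;
}

for my $line ('<i>total:</i> 652,000 sq km <br>',
              '<i>land:</i> 652,000 sq km <br>',
              '<i>water:</i> 0 sq km') {
    my $field = extract_field($line);
    print "$field\n" if defined $field;
}
```

Capturing the value between the two literal strings avoids having to count characters from the end of the line, which may or may not suit the real data.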

If anyone knows of a CLI (or otherwise for that matter) program that could batch process HTML formatting, I'd be greatly appreciative.

I apologize for being long-winded and appreciate your help.

kdavis@uvic.ca
     
Junior Member
Join Date: Nov 2001
Location: Seattle
Feb 6, 2002, 10:02 AM
 
Profit,

I don't have OmniWeb, so I don't know exactly how it reformats HTML files. However, this sort of task is exactly what Perl was made for, so I would write a Perl script to reformat your files. (Of course, I would probably try to combine the two Perl programs so that your extraction script can simply read the unformatted HTML files.)

Here's a Perl script you could use as a starting point. Good luck!

#!/usr/bin/perl
use strict;
use warnings;

# Reformats HTML files: puts a newline after each <br> and
# before and after each <p>. Usage: "reformat.pl <html files>"

foreach my $file (@ARGV)
{
    open(my $in, '<', $file) or die "Cannot read $file: $!";
    open(my $out, '>', "$file.reformatted") or die "Cannot write $file.reformatted: $!";

    while (my $line = <$in>)
    {
        chomp($line);   # join the original lines; new breaks are added below
        $line =~ s/<p>/\n<p>\n/g;
        $line =~ s/<br>/<br>\n/g;

        print $out $line;
    }
    print $out "\n";
    close($in);
    close($out);

    # Uncomment to replace the original file with the reformatted copy:
    #`mv $file.reformatted $file`;
}

--Juggle5
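
A possible variation, sketched under a few assumptions: to get the exact layout asked for in the first post (each `<br>` at the end of its line, values on the same line as their labels, no `<p>`), the text can be treated as one string rather than line by line. The function name `reformat_html` is hypothetical, and dropping `<p>` entirely is a guess based on the example output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: normalizes a chunk of HTML so that every line
# ends with " <br>" (except the last) and <p> tags are dropped.
sub reformat_html {
    my ($html) = @_;
    $html =~ s/\s+/ /g;                 # collapse all whitespace, newlines included
    $html =~ s/<p>\s*//gi;              # assumption: <p> is not wanted in the output
    $html =~ s/\s*<br>\s*/ <br>\n/gi;   # one space before <br>, newline after it
    $html =~ s/^\s+|\s+$//g;            # trim leading and trailing whitespace
    return $html;
}

# Sample input taken from the first post in this thread.
my $sample = <<'HTML';
<p><b>Area:</b>
<br><i>total:</i>
652,000 sq km
<br><i>land:</i>
652,000 sq km
<br><i>water:</i>
0 sq km
HTML

print reformat_html($sample), "\n";
```

To run this over files instead of the sample string, each file would need to be read in full before calling the function, since the substitutions span the original line breaks.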
All times are GMT -5.