Welcome to the MacNN Forums.


Any good free html parsing code?
Mac Elite
Join Date: Sep 2000
Location: Norfolk, Va
Status: Offline
Jan 29, 2003, 02:51 PM
 
In the app I'm writing, the user supplies a web address, whose contents are then loaded into a string. What I'm trying to do is extract links from this string, that is:

<a href="http://www.google.com">Google</a>

becomes:

Google
http://www.google.com

The problem, which I didn't realize until after I adapted Apple's sample NSScanner code (which works fine for this format of link), is that there are a dozen or more different kinds of links, as near as I can tell (that is, a link almost never looks like the example above).
Yet every web browser displays links the same way, as a segment of underlined text linking to an address, and OmniWeb even has a little utility window which extracts all the links from a page.
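
To make the "different kinds of links" point concrete, here is a rough sketch of why a real tokenizer copes where a hand-rolled scanner pattern does not. It is written in Python with the standard library's html.parser purely as an illustration; it is not the NSScanner sample and not OmniWeb's code. The class name HrefCollector, the helper hrefs_in, and the sample strings are all made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over lowercased tag and attribute names with
        # the quotes already stripped, so one check covers every spelling.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hrefs_in(html):
    """Return every link address found in the given HTML string."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Illustrative variants of the same link (assumed, not taken from a real page).
samples = [
    '<a href="http://www.google.com">Google</a>',
    "<A HREF='http://www.google.com'>Google</A>",
    '<a href=http://www.google.com>Google</a>',
    '<a class="nav" target="_blank"\n   href = "http://www.google.com" >Google</a>',
]

for sample in samples:
    print(hrefs_in(sample))
```

Each variant should yield the same single address, which is the behavior a scanner tuned to one exact format cannot give you.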

My question: before I struggle with reinventing the wheel to deal with each and every kind of link, is there any free HTML scanning code to which I can pass a string and get this information back?

Thanks. I've been looking on CocoaDev and CDC, to no avail.
you are not your signature
     
Mac Elite
Join Date: Feb 2001
Location: Vancouver, WA
Status: Offline
Jan 29, 2003, 06:39 PM
 
All the code we use to do that (the parsing, not the window displaying) is available as open source.
Rick Roe
icons.cx | weblog
     
Gametes  (op)
Mac Elite
Join Date: Sep 2000
Location: Norfolk, Va
Status: Offline
Jan 30, 2003, 10:15 AM
 
Thanks Rick. Actually, I went to your dev site right after I wrote that. I've found OWF, and am currently looking through there for this code. You guys are such philanthropists; this stuff is amazing.
Question: What files are responsible for this activity? Is it even possible for me to extract pieces of the OWF and retain functionality?
They look to be intertwined.

There's little documentation in there, and I'm having trouble working out whether the tokenizer files are what I want to use. Ideally, I'd like to instantiate some object to which I can pass a webpage source string and get back an array of dictionaries, each with a Title string and an Address string.
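
For what it's worth, the interface described above might look roughly like the sketch below. It is Python on the standard library's html.parser, not OWF, and says nothing about how OWF is organized. The names LinkExtractor and extract_links and the base_url parameter are assumptions for illustration; only the "Title" and "Address" keys come from the description above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical object: feed it page source, read back .links."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url  # assumed: used to resolve relative links
        self.links = []           # list of {"Title": ..., "Address": ...}
        self._href = None         # address of the <a> currently open, if any
        self._text = []           # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the visible link text.
            title = " ".join("".join(self._text).split())
            self.links.append({"Title": title, "Address": self._href})
            self._href = None


def extract_links(page_source, base_url=""):
    """Return a list of dictionaries, one per link in page_source."""
    parser = LinkExtractor(base_url)
    parser.feed(page_source)
    parser.close()
    return parser.links


print(extract_links('<a href="http://www.google.com">Google</a>'))
# prints [{'Title': 'Google', 'Address': 'http://www.google.com'}]
```

Passing the page's own address as base_url turns relative links into full addresses; leaving it empty returns them as written. Links with no closing tag and image-only links would need extra handling that this sketch does not attempt.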

I sincerely appreciate your illuminating the OWF's files for me.
you are not your signature
     
   