Web Crawlers and Database-Driven Sites
Mac Elite
Join Date: Nov 2003
Location: Rockville, MD
Jun 22, 2005, 10:31 AM
 
I'm curious about something. In the 1990s, before we got wise to building database-driven websites, we used to code every single HTML page of our site by hand (sometimes with the aid of Dreamweaver templates, or simply by copying, pasting, and modifying). This was laborious and inferior to database-driven design in many ways, but at least one thing could be relied upon: eventually, the web crawlers that gather data on behalf of the search engines would discover the pages on our server and index them.

For example, if I hand-coded an HTML page for every item in my catalog, each item could in theory get indexed by the search engines individually. That's nice if you sell purple rayon T-shirts and somebody types "purple rayon t-shirt" into Google (please, don't send me orders for the ghastly things).

Fast forward to 2003 or so. We're doing the same thing, but instead of dozens of HTML pages we have a single PHP page and a MySQL database from which all the info is drawn. With the data on our items for sale sitting inside the database and pulled out only as the pages are loaded, how do the search engines gain specific knowledge about our products? What shift in strategy is required to make sure people know we sell purple rayon T-shirts and also green canvas sneakers (these are only joke examples, OK?), and so on?
     
Clinically Insane
Join Date: Nov 1999
Jun 22, 2005, 12:12 PM
 
Search engines have never been able to just get onto someone's machine and spider the hard drive. Instead, they rely mostly on two things: links and indices.

I assume you already know about links. A crawler downloads a page, grabs all the links from it, downloads those pages, and repeats the cycle. Indices came about when spiders got a little smarter and started looking for extremely common filenames like "index.html" even when there were no links to such files, or requesting bare directories (http://www.example.com/something/ instead of http://www.example.com/something/page.html). Many Web servers, such as Apache, see a URL like this and (assuming directory listings are enabled and no default page is set for that directory) generate an HTML page containing links to everything in that directory. Spider writers took advantage of this to get a look at pages they might not otherwise have seen.
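
To make the "download a page, grab the links" step concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not how any particular search engine works; the class name, the sample markup, and the example.com URLs are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the target of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real crawler would download this over HTTP.
page = """
<p><a href="shirts.php?id=7">Purple rayon T-shirt</a></p>
<p><a href="/sneakers/">Sneakers</a></p>
<p onclick="location='hidden.php'">Not a link a spider can follow</p>
"""
links = extract_links("http://www.example.com/shop/index.php", page)
```

Note that a database-driven URL such as shirts.php?id=7 is picked up like any other link, while the page reachable only through the onclick handler is never seen.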

Of course, indices like this don't work well with database-driven sites, since there are no directories to search. Therefore, the best strategy is to fall back on links. If you're running a shop, there are three things to keep in mind:
  • Every single product has its own page (in addition to any lists of products that you might have). You probably already do this anyway.
  • Every product's page has a link to it somewhere on the site. It doesn't have to be on the front page, but you need a link somewhere.
  • Someone starting from your front page should be able to get to every single product's page using nothing but links and the Back button.
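
As a rough way to check the three points above, the sketch below walks a site's links from the front page and reports any product page that can't be reached. The link map and URLs are hypothetical stand-ins; in practice you would build the map from your own pages.

```python
from collections import deque

# Hypothetical site map: each page URL maps to the URLs it links to.
SITE_LINKS = {
    "/": ["/catalog.php"],
    "/catalog.php": ["/item.php?id=1", "/item.php?id=2"],
    "/item.php?id=1": ["/"],
    "/item.php?id=2": ["/"],
    "/item.php?id=3": ["/"],  # in the database, but nothing links to it
}


def reachable_from(start, links):
    """Return every page reachable from `start` by following links only."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def unreachable_products(product_urls, links, front_page="/"):
    """List product pages a link-following spider would never find."""
    reachable = reachable_from(front_page, links)
    return sorted(url for url in product_urls if url not in reachable)


products = ["/item.php?id=1", "/item.php?id=2", "/item.php?id=3"]
missing = unreachable_products(products, SITE_LINKS)
# missing == ["/item.php?id=3"]
```
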
The other thing to keep in mind is that spiders can't use JavaScript and they can't see images (they can download images, but they can't see what's in them). So when you test the third point, I recommend that you use iCab and turn off both JavaScript and images. This will let you see the page more or less the way a search engine sees it (not quite the same, but close enough for most purposes): you see all the text a search engine can see, and you can follow all the links a search engine can follow. I suggested iCab because it has a "link toolbar" that works with a special kind of link you put in your page's header (the LINK element); most browsers don't show these, but the search engines do see them. iCab shows them too, so you can use it to check that those links are done properly.

That's pretty much the minimum required to get onto a search engine. If you want a better ranking, there are several techniques you can use.
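
If you'd rather script the "JavaScript off, images off" test than do it by hand in a browser, something like the following Python sketch gives a similar text-only view. It is only an approximation under simple assumptions (script and style contents are ignored, images contribute nothing but their alt text); the class name and sample page are invented for illustration.

```python
from html.parser import HTMLParser


class SpiderView(HTMLParser):
    """Approximates the text a spider without JavaScript or images reads."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            # An image is invisible to the spider except for its alt text.
            alt = dict(attrs).get("alt")
            if alt:
                self.chunks.append(alt)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def spider_text(html):
    parser = SpiderView()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<h1>Green canvas sneakers</h1>
<script>document.write("Sale ends Friday");</script>
<img src="sneaker.jpg" alt="Green canvas sneaker, side view">
<img src="banner.jpg">
"""
text = spider_text(page)
```

Here the text written by the script and the banner image without alt text both vanish from the result, which is the point of the test.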

Semantic markup (marking things up by what they mean, not by what they look like) is a very good idea. Humans can usually figure out what something means by what it looks like, but search engines can't see the page, so they don't know what it looks like, and this means they have to figure stuff out in other ways (which usually means just treating your page as a stream of text). If you use semantic markup, though, then search engines will know more about what the page says. For example, if you use real HTML heading elements (H1 through H6) instead of FONT and B tags, then search engines will know that they're headings, and that the text inside them is probably more important than other text. If you use EM instead of I to emphasize text, then search engines will know you've emphasized that text, and so you must especially want to get that point across. If you use proper lists (UL, OL, and DL) instead of linebreaks and * characters, then search engines will know you've arranged things into lists, and they have ways of interpreting those lists differently (this is particularly true of DL, the Definition List). All of this lets search engines get a better sense of what your page is about, and lets them index it with more confidence. Because search engines present their results according to how confident they are that the page will fit the user's search terms, the extra confidence that comes from semantic markup can translate into higher rankings.
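
To illustrate why this matters to a program reading the page, here is a small Python sketch that pulls out headings and emphasized text by tag name alone. It is a toy, not a model of any real search engine's ranking; the scanner class and both sample snippets are made up.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
EMPHASIS = {"em", "strong"}


class SemanticScanner(HTMLParser):
    """Records text found inside heading and emphasis elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.emphasized = []
        self._stack = []  # currently open heading/emphasis tags

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS or tag in EMPHASIS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        if self._stack[-1] in HEADINGS:
            self.headings.append(text)
        else:
            self.emphasized.append(text)


def scan(html):
    scanner = SemanticScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.headings, scanner.emphasized


presentational = (
    '<font size="5"><b>Purple Rayon T-Shirts</b></font>'
    "<p>Now <i>half price</i>.</p>"
)
semantic = "<h1>Purple Rayon T-Shirts</h1><p>Now <em>half price</em>.</p>"

old_style = scan(presentational)
new_style = scan(semantic)
```

Both snippets look much the same in a browser, but only the second one tells the scanner which text is a heading and which is emphasized.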

Having other sites link to your products is also a good way to boost your rankings. This does not mean that you should go out and spam blogs and message boards with links to your product; search engines are developing ways of detecting that. However, if you can get other sites to review or approve of your product, search engines will take these links into account. Even bad reviews help you in this regard (as long as you still rank ahead of those reviews!).
You are in Soviet Russia. It is dark. Grue is likely to be eaten by YOU!
     
Mac Elite
Join Date: Nov 2003
Location: Rockville, MD
Jun 22, 2005, 03:43 PM
 
Thank you, Millennium, for these very interesting, insightful, and detailed remarks.

I think I have grasped what you're saying about the importance of semantic markup; as Jeffrey Zeldman eloquently put it, the key to effective web design is to separate presentation from content to the greatest extent possible. I have learned to do this by combining valid XHTML and valid CSS, which does indeed mean (as you suggest) that I've replaced <b> with <strong> and <i> with <em>, and met the other requirements of valid XHTML. I think my pages are better organized and more effective as a result.

I wonder, then, if in the case of this website, individual artists and works of art would eventually make it into Google's directories/indices even though there is no page like picasso.htm, but rather just an index.php page that draws data out of a MySQL database. Does that work?
(Last edited by selowitch; Jun 22, 2005 at 04:07 PM. )
     
Clinically Insane
Join Date: Nov 1999
Jun 22, 2005, 04:21 PM
 
Originally Posted by selowitch
I wonder, then, if in the case of this website, individual artists and works of art would eventually make it into Google's directories/indices even though there is no page like picasso.htm, but rather just an index.php page that draws data out of a MySQL database. Does that work?
If every page on the site can be reached from the front page (and it looks like they can), then they'll get into the search engines, but I'm afraid I don't think they'll rank very high as they are now. The images are very nice, but Google can't actually see the images, and without any other data on the page it won't know what to make of it.

That said, I think I see a fairly simple way to improve its search-engine rankings. On the main page, you include a fair bit of data about the image, including its creator, title (where applicable), and so on. If enlarge.php were to include that same data somewhere on the page (possibly under the image), search engines would be better able to understand what the image was for. People looking at the page would also probably like having the data right on the same page. As another possible optimization, use the title of each painting as the image's alt attribute, and something like "(title) by (painter) - Enlarged" as the title of the page instead of just "Selection Enlarged".
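
The title and alt suggestions might look something like the sketch below. The site itself is written in PHP; Python is used here only to illustrate the idea, and the function name, the dictionary keys, and the image path are all assumptions, not the site's real code or schema.

```python
from html import escape


def enlarged_page(row):
    """Build the <title> and <img> markup for one artwork's enlarged view.

    `row` stands in for one database record; its keys are hypothetical.
    """
    heading = f"{row['title']} by {row['artist']} - Enlarged"
    title_tag = f"<title>{escape(heading)}</title>"
    # escape() also converts quotes, so values are safe inside attributes.
    img_tag = '<img src="{src}" alt="{alt}">'.format(
        src=escape(row["image_url"]),
        alt=escape(row["title"]),
    )
    return title_tag, img_tag


row = {
    "artist": "Antonio Zucchi",
    "title": "Virgil Reading the Aeneid to Augustus and Octavia",
    "image_url": "images/zucchi.jpg",  # made-up path for illustration
}
title_tag, img_tag = enlarged_page(row)
```
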

If you really want to get fancy, you could arrange the metadata in a definition list, something like this:
Code:
<dl>
  <dt>Artist</dt>
  <dd>Antonio Zucchi*</dd>
  <dt>Period</dt>
  <dd>It. 1726–1796</dd>
  <dt>Title</dt>
  <dd>Virgil Reading the Aeneid to Augustus and Octavia</dd>
  <dt>Medium</dt>
  <dd>Oil on canvas</dd>
  <dt>Size</dt>
  <dd>37 1/2" × 50 1/2" (95.3 cm × 128.3 cm)</dd>
  <dt>Price</dt>
  <dd>$12,000/18,000</dd>
</dl>
...and then style that with CSS to look nice on the page. This would get you pretty much everything search engines (including image-search engines) look for, except possibly a relevant filename for the image, but I'm guessing that's not possible without some major recoding on your part. Even the first step I mentioned (including the metadata on the enlarged page) should give a boost, though.
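
Since the pages are generated from database rows, a definition list like the one above can be produced in a loop. Here is a minimal sketch of that idea in Python (the real site uses PHP); the function name and the sample field list are assumptions for illustration.

```python
from html import escape


def definition_list(fields):
    """Render (label, value) pairs as an HTML definition list."""
    lines = ["<dl>"]
    for label, value in fields:
        # Escape both parts so characters like " or & can't break the markup.
        lines.append(f"  <dt>{escape(label)}</dt>")
        lines.append(f"  <dd>{escape(value)}</dd>")
    lines.append("</dl>")
    return "\n".join(lines)


# Hypothetical record, loosely based on the example above.
artwork = [
    ("Artist", "Antonio Zucchi"),
    ("Title", "Virgil Reading the Aeneid to Augustus and Octavia"),
    ("Medium", "Oil on canvas"),
    ("Size", '37 1/2" × 50 1/2"'),
]
markup = definition_list(artwork)
```

With its default settings, escape() turns the inch marks in the size into &quot; entities, which browsers display as ordinary quotation marks.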
     
Mac Elite
Join Date: Nov 2003
Location: Rockville, MD
Jun 22, 2005, 05:24 PM
 
Originally Posted by Millennium
I think I see a fairly simple way to improve its search-engine rankings. On the main page, you include a fair bit of data about the image, including its creator, title (where applicable), and so on. If enlarge.php were to include that same data somewhere on the page (possibly under the image), search engines would be better able to understand what the image was for. People looking at the page would also probably like having the data right on the same page. As another possible optimization, use the title of each painting as the image's alt attribute, and something like "(title) by (painter) - Enlarged" as the title of the page instead of just "Selection Enlarged".
This is an excellent set of suggestions. The recoding wouldn't be bad because the layout table is generated via a PHP loop through the rows of a MySQL table, so it's only a few minutes of work. I'm going to try it! Thank you.

If I'm going to go with the definition list, I'll definitely want to figure out how to keep the <dt>s in the code but make them invisible to the site visitor.
(Last edited by selowitch; Jun 22, 2005 at 08:42 PM. )
     
   