Welcome to the MacNN Forums.

How to mirror a web site with curl or wget?
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
Mar 23, 2003, 06:55 PM
 
I'm having trouble downloading an entire web site.

Can someone please post the correct switches for wget and curl? The only way I have been able to download entire web sites is with Anarchy. I have 4.1, and it's fairly old now. I'd rather not buy 6.1 and just use curl instead.

I have tried:

curl www.website.com/~mysite/
curl -o www.website.com/~mysite/
curl -O www.website.com/~mysite/

No go.

Also tried to do it with wget:

wget -m www.website/~mysite/

That downloads index.html and a robots.txt file, and nothing more...

The man pages are of little help, I can't find my mistake...

Thanks in advance.
     
Mac Elite
Join Date: May 1999
Location: San Jose, CA
Mar 23, 2003, 08:34 PM
 
wget -m -p http://www.site.com/

works fine for me here. There are two things that might be catching you out.

First, your examples show 'www.website.com' rather than 'http://www.website.com/', so I'm not sure if the lack of a protocol is affecting you.

Second, and maybe more important, wget downloads a robots.txt file. The robots.txt file gives specific instructions to web spiders (like wget's mirror mode) on how to crawl the site, and it may include instructions NOT to download the site.

I don't know of a wget option to ignore robots.txt, but if the robots.txt is in your own directory, you should be able to change it so that wget can mirror the site.

What does the robots.txt file contain?
     
Avon  (op)
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
Mar 23, 2003, 09:20 PM
 
Originally posted by Camelot:

Second, and maybe more important, wget downloads a robots.txt file. The robots.txt file gives specific instructions to web spiders (like wget's mirror mode) on how to crawl the site, and it may include instructions NOT to download the site.

I don't know of a wget option to ignore robots.txt, but if the robots.txt is in your own directory, you should be able to change it so that wget can mirror the site.

What does the robots.txt file contain?
Yes, this is the problem. When I try to get another site, like macnn.com, it works great.

This is the site I wanted to get before...
http://www.liv.ac.uk/~aicooper/

It downloads the robots.txt and stops.
Here it is:

-----------
User-Agent: htdig-3.1.5
User-Agent: Harvest-NG
User-Agent: Harvest
Disallow: /cgi-bin/
Disallow: /private_html/
Disallow: /Liverpool/Private/

User-Agent: Lycos
User-Agent: Lycos_Spider_(T-Rex)
User-Agent: Gulliver
User-Agent: Scooter
User-Agent: InfoSeek
User-Agent: Ultraseek
User-Agent: fido
User-Agent: NetMechanic
User-Agent: ArchitextSpider
User-Agent: Mirago
User-Agent: Slurp
User-Agent: Googlebot
User-Agent: Google
User-Agent: WebCrawler
User-Agent: ia_archiver
Disallow: /cgi-bin/
Disallow: /private_html/
Disallow: /local_html/
Disallow: /Admin/
Disallow: /harvest/
Disallow: /Harvest/
Disallow: /Liverpool/Private/

User-Agent: BioCrawler
Disallow: /a
Disallow: /b
Disallow: /ca
Disallow: /cb
Disallow: /cc
Disallow: /cd
Disallow: /ce
Disallow: /cf
Disallow: /cg
Disallow: /ch
Disallow: /cia
Disallow: /cib
Disallow: /cic
Disallow: /cid
Disallow: /cie
Disallow: /cif
Disallow: /cig
Disallow: /cih
Disallow: /cii
Disallow: /cij
Disallow: /cik
Disallow: /cim
Disallow: /cin
Disallow: /cio
Disallow: /cip
Disallow: /ciq
Disallow: /cir
Disallow: /cis
Disallow: /cit
Disallow: /ciu
Disallow: /civ
Disallow: /ciw
Disallow: /cix
Disallow: /ciy
Disallow: /ciz
Disallow: /cj
Disallow: /ck
Disallow: /cl
Disallow: /cm
Disallow: /cn
Disallow: /co
Disallow: /cp
Disallow: /cq
Disallow: /cr
Disallow: /cs
Disallow: /ct
Disallow: /cu
Disallow: /cv
Disallow: /cw
Disallow: /cx
Disallow: /cy
Disallow: /cz
Disallow: /d
Disallow: /e
Disallow: /f
Disallow: /g
Disallow: /h
Disallow: /i
Disallow: /j
Disallow: /k
Disallow: /l
Disallow: /m
Disallow: /n
Disallow: /o
Disallow: /p
Disallow: /q
Disallow: /r
Disallow: /s
Disallow: /t
Disallow: /u
Disallow: /v
Disallow: /w
Disallow: /x
Disallow: /y
Disallow: /z
Disallow: /A
Disallow: /B
Disallow: /C
Disallow: /D
Disallow: /E
Disallow: /F
Disallow: /G
Disallow: /H
Disallow: /I
Disallow: /J
Disallow: /K
Disallow: /L
Disallow: /M
Disallow: /N
Disallow: /O
Disallow: /P
Disallow: /Q
Disallow: /R
Disallow: /S
Disallow: /T
Disallow: /U
Disallow: /V
Disallow: /W
Disallow: /X
Disallow: /Y
Disallow: /z
Disallow: /cgi-bin/
Disallow: /private_html/
Disallow: /local_html/
Disallow: /Admin/
Disallow: /harvest/
Disallow: /Harvest/
Disallow: /Liverpool/Private/

User-agent: *
Disallow: /

--------------

I guess that means they don't want me to download the entire directory? But I want to!!! How can I make wget or curl ignore the robots.txt file?
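
For what it's worth, the part of that file that matters here is most likely the last group: "User-agent: *" with "Disallow: /" applies to any robot not named earlier, and wget isn't named. Here's a rough shell sketch that checks a saved robots.txt for that catch-all block. The function name robots_blocks_all is made up for illustration, and this is a simplified reading of the format, not how wget itself parses the file.

```shell
# Hypothetical helper: succeed (exit 0) if the "User-agent: *" group
# in the robots.txt file "$1" contains "Disallow: /".
robots_blocks_all() {
    awk -F': *' '
        { key = tolower($1); val = $2; sub(/[ \t\r]+$/, "", val) }
        key == "user-agent" {
            if (!in_agents) star = 0   # first User-agent line of a new group
            in_agents = 1
            if (val == "*") star = 1   # this group applies to every robot
            next
        }
        { in_agents = 0 }              # any other line ends the User-agent run
        key == "disallow" && star && val == "/" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

Something like robots_blocks_all robots.txt && echo "whole site is disallowed" should flag the file posted above, since its final group is the catch-all one.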
     
Avon  (op)
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
Mar 23, 2003, 09:35 PM
 
Got it!

The trick is to create a .wgetrc file in your home directory. In it include:

robots = off
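
In case it helps anyone else, here is a small sketch of that setup. The helper name disable_wget_robots is hypothetical; it only appends the line if it is not already in the file. As far as I know, wget's -e option runs a .wgetrc-style command for a single invocation, so it should work as a one-off alternative, but check your version's man page.

```shell
# Hypothetical helper: add "robots = off" to a wgetrc file (default: ~/.wgetrc)
# unless that exact line is already present.
disable_wget_robots() {
    rc="${1:-$HOME/.wgetrc}"
    touch "$rc"
    grep -qx 'robots = off' "$rc" || printf 'robots = off\n' >> "$rc"
}

# Usage (left commented out so that sourcing this file changes nothing):
#   disable_wget_robots
#   wget -m -p http://www.site.com/
#
# One-off alternative that leaves ~/.wgetrc alone:
#   wget -e robots=off -m -p http://www.site.com/
```

Keep in mind that the site owner put that robots.txt there on purpose, so go easy on their server.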
     
Avon  (op)
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
Mar 23, 2003, 09:37 PM
 
now....

Does anyone know how to get curl to work? Thanks...
     
Mac Elite
Join Date: Dec 2001
Location: Atlanta, GA, USA
Mar 24, 2003, 12:07 PM
 
Originally posted by Avon:
now....

Does anyone know how to get curl to work? Thanks...
From the cURL FAQ:


1.3 What is cURL not?
Curl is *not* a wget clone even though that is a very common misconception.
Never, during curl's development, have we intended curl to replace wget or
compete on its market. Curl is targeted at single-shot file transfers.

Curl is not a web site mirroring program. If you wanna use curl to mirror
something: fine, go ahead and write a script that wraps around curl to make
it reality (like curlmirror.pl does).

Curl is not an FTP site mirroring program. Sure, get and send FTP with curl
but if you want systematic and sequential behavior you should write a
script (or write a new program that interfaces libcurl) and do it.
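
To give a rough idea of what "a script that wraps around curl" could look like, here is a minimal one-level sketch in shell. It is not curlmirror.pl and nowhere near a real mirror: the names extract_links and fetch_one_level are made up, it only follows double-quoted relative href links on a single page, and it assumes the base URL ends with a slash.

```shell
# Hypothetical helper: print the value of every double-quoted href="..." on stdin.
extract_links() {
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Hypothetical helper: fetch one page, then each relative link on it (no recursion).
fetch_one_level() {
    base="$1"    # e.g. http://www.site.com/dir/ (trailing slash assumed)
    curl -s -o index.html "$base" || return 1
    extract_links < index.html | while read -r link; do
        case "$link" in
            # Skip absolute URLs, root-relative paths, anchors, mail links,
            # anything containing "..", and directory-style links.
            *://*|/*|\#*|mailto:*|*..*|*/) continue ;;
        esac
        curl -s --create-dirs -o "$link" "$base$link"
    done
}
```

You would call it as fetch_one_level http://www.site.com/dir/ from an empty directory. For anything deeper than one page of links, wget -m is the easier tool.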
     
Dedicated MacNNer
Join Date: Mar 2002
Mar 24, 2003, 12:30 PM
 
Yep, wget -m with robots = off in .wgetrc is the solution.
     
   