How to mirror a web site with curl or wget?
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
I'm having trouble downloading an entire web site.
Can someone please post the correct switches for wget and curl? The only way I have been able to download entire web sites is with Anarchy. I have 4.1 and it's fairly old now. I'd rather not buy 6.1 and just use curl instead.
I have tried:
curl www.website.com/~mysite/
curl -o www.website.com/~mysite/
curl -O www.website.com/~mysite/
No go.
Also tried to do it with wget:
wget -m www.website/~mysite/
It downloads index.html and a robots.txt file it creates, and nothing more...
The man pages are of little help; I can't find my mistake...
Thanks in advance.
Mac Elite
Join Date: May 1999
Location: San Jose, CA
wget -m -p http://www.site.com/
works fine for me here. There are two things that might be catching you out.
First, your examples show 'www.website.com' rather than 'http://www.website.com/', so I'm not sure if the lack of a protocol is affecting you.
Second, and maybe more importantly, wget downloads a robots.txt file. The robots.txt file contains instructions to web spiders (like wget's mirror mode) on how to crawl the site, and it may include instructions NOT to download the site.
I don't know of a wget option to ignore robots.txt, but if the robots.txt is in your own directory, you should be able to change it so that wget can mirror the site.
What does the robots.txt file contain?
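A minimal sketch of the command above with the scheme made explicit. The helper names `with_scheme` and `mirror_site` are made up for illustration, and www.site.com is only a placeholder. wget normally assumes http:// when no scheme is given, so the helper is a precaution rather than a known fix.

```shell
# Prepend http:// when the argument has no scheme (hypothetical helper).
with_scheme() {
  case "$1" in
    *://*) printf '%s\n' "$1" ;;
    *)     printf 'http://%s\n' "$1" ;;
  esac
}

# -m turns on mirroring (recursion plus time-stamping);
# -p also fetches the images and stylesheets each page needs.
mirror_site() {
  wget -m -p "$(with_scheme "$1")"
}
```

With these defined, `mirror_site www.site.com/` would run `wget -m -p http://www.site.com/`.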
Gods don't kill people - people with Gods kill people.
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
Originally posted by Camelot:
Second, and maybe more importantly, wget downloads a robots.txt file. The robots.txt file contains instructions to web spiders (like wget's mirror mode) on how to crawl the site, and it may include instructions NOT to download the site.
I don't know of a wget option to ignore robots.txt, but if the robots.txt is in your own directory, you should be able to change it so that wget can mirror the site.
What does the robots.txt file contain?
Yes, this is the problem. When I try to get another site, like macnn.com, it works great.
This is the site I wanted to get before:
http://www.liv.ac.uk/~aicooper/
It downloads the robots.txt and stops.
Here it is:
-----------
User-Agent: htdig-3.1.5
User-Agent: Harvest-NG
User-Agent: Harvest
Disallow: /cgi-bin/
Disallow: /private_html/
Disallow: /Liverpool/Private/
User-Agent: Lycos
User-Agent: Lycos_Spider_(T-Rex)
User-Agent: Gulliver
User-Agent: Scooter
User-Agent: InfoSeek
User-Agent: Ultraseek
User-Agent: fido
User-Agent: NetMechanic
User-Agent: ArchitextSpider
User-Agent: Mirago
User-Agent: Slurp
User-Agent: Googlebot
User-Agent: Google
User-Agent: WebCrawler
User-Agent: ia_archiver
Disallow: /cgi-bin/
Disallow: /private_html/
Disallow: /local_html/
Disallow: /Admin/
Disallow: /harvest/
Disallow: /Harvest/
Disallow: /Liverpool/Private/
User-Agent: BioCrawler
Disallow: /a
Disallow: /b
Disallow: /ca
Disallow: /cb
Disallow: /cc
Disallow: /cd
Disallow: /ce
Disallow: /cf
Disallow: /cg
Disallow: /ch
Disallow: /cia
Disallow: /cib
Disallow: /cic
Disallow: /cid
Disallow: /cie
Disallow: /cif
Disallow: /cig
Disallow: /cih
Disallow: /cii
Disallow: /cij
Disallow: /cik
Disallow: /cim
Disallow: /cin
Disallow: /cio
Disallow: /cip
Disallow: /ciq
Disallow: /cir
Disallow: /cis
Disallow: /cit
Disallow: /ciu
Disallow: /civ
Disallow: /ciw
Disallow: /cix
Disallow: /ciy
Disallow: /ciz
Disallow: /cj
Disallow: /ck
Disallow: /cl
Disallow: /cm
Disallow: /cn
Disallow: /co
Disallow: /cp
Disallow: /cq
Disallow: /cr
Disallow: /cs
Disallow: /ct
Disallow: /cu
Disallow: /cv
Disallow: /cw
Disallow: /cx
Disallow: /cy
Disallow: /cz
Disallow: /d
Disallow: /e
Disallow: /f
Disallow: /g
Disallow: /h
Disallow: /i
Disallow: /j
Disallow: /k
Disallow: /l
Disallow: /m
Disallow: /n
Disallow: /o
Disallow: /p
Disallow: /q
Disallow: /r
Disallow: /s
Disallow: /t
Disallow: /u
Disallow: /v
Disallow: /w
Disallow: /x
Disallow: /y
Disallow: /z
Disallow: /A
Disallow: /B
Disallow: /C
Disallow: /D
Disallow: /E
Disallow: /F
Disallow: /G
Disallow: /H
Disallow: /I
Disallow: /J
Disallow: /K
Disallow: /L
Disallow: /M
Disallow: /N
Disallow: /O
Disallow: /P
Disallow: /Q
Disallow: /R
Disallow: /S
Disallow: /T
Disallow: /U
Disallow: /V
Disallow: /W
Disallow: /X
Disallow: /Y
Disallow: /z
Disallow: /cgi-bin/
Disallow: /private_html/
Disallow: /local_html/
Disallow: /Admin/
Disallow: /harvest/
Disallow: /Harvest/
Disallow: /Liverpool/Private/
User-agent: *
Disallow: /
--------------
I guess that means they don't want me to download the entire directory? But I want to!!! How can I make wget or curl ignore the robots.txt file?
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
Got it!
The trick is to create a .wgetrc file in your home directory. In it include:
robots = off
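As a sketch of the .wgetrc approach: the snippet below writes the setting into a scratch directory instead of the real home directory, so it is safe to try. `demo_home` is a name invented here. To apply it for real, point the redirect at "$HOME/.wgetrc". For a one-off run, wget's `-e` option accepts the same command on the command line, e.g. `wget -e robots=off -m -p http://www.site.com/`.

```shell
# Scratch directory standing in for $HOME (an assumption, for illustration only).
demo_home=$(mktemp -d)

# Append the setting; >> keeps any lines already in the file.
printf 'robots = off\n' >> "$demo_home/.wgetrc"

# Show the resulting file contents.
cat "$demo_home/.wgetrc"
```

Keep in mind that this switches off robots.txt handling for every site wget touches, so the `-e` form may be the gentler choice when only one site needs it.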
Senior User
Join Date: Mar 1999
Location: Livingston NJ USA
now....
Does anyone know how to get curl to work? Thanks...
Mac Elite
Join Date: Dec 2001
Location: Atlanta, GA, USA
Originally posted by Avon:
now....
Does anyone know how to get curl to work? Thanks...
From the cURL FAQ:
1.3 What is cURL not?
Curl is *not* a wget clone even though that is a very common misconception.
Never, during curl's development, have we intended curl to replace wget or
compete on its market. Curl is targeted at single-shot file transfers.

Curl is not a web site mirroring program. If you wanna use curl to mirror
something: fine, go ahead and write a script that wraps around curl to make
it reality (like curlmirror.pl does).

Curl is not an FTP site mirroring program. Sure, get and send FTP with curl
but if you want systematic and sequential behavior you should write a
script (or write a new program that interfaces libcurl) and do it.
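In the spirit of the FAQ's suggestion, here is a very rough sketch of a script that wraps curl. It is not curlmirror.pl and not a real mirror: it goes one level deep, follows only relative links written as lowercase double-quoted href attributes, and does not consult robots.txt at all. The function names `extract_links` and `fetch_one_level` are invented for this example.

```shell
# Print the value of every double-quoted href attribute read from stdin.
extract_links() {
  grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

# Fetch the page at base URL $1 (should end in /), then each relative link on it.
fetch_one_level() {
  base=$1
  curl -s "$base" -o index.html || return 1
  extract_links < index.html | while read -r link; do
    case "$link" in
      # Skip absolute URLs, root-relative paths, anchors and mail links.
      *://*|/*|'#'*|mailto:*) continue ;;
    esac
    # -O saves the file under its remote name in the current directory.
    curl -s -O "$base$link"
  done
}
```

Something like `fetch_one_level http://www.site.com/` would then pull the index page and the files it links to directly. Links that end in a slash have no file name for `-O` to use, so curl will skip those with an error. For anything deeper, wget -m is still the easier tool.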
Mac Pro 2x 2.66 GHz Dual core, Apple TV 160GB, two Windows XP PCs
Dedicated MacNNer
Join Date: Mar 2002
Yip, wget -m and robots=off is the solution.