Posted by: nowiremember | April 23, 2008

Create a mirror of a website with Wget

One of wget’s more advanced features is mirroring. It lets you create a complete local copy of a website, including stylesheets, images and other supporting files. Wget follows all the internal links and downloads the pages they point to (and their resources), until you have a complete copy of the site on your local machine.

In its most basic form, you use the mirror functionality like so:

$ wget -m http://www.example.com/
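If you’re wondering what -m actually switches on, the wget manual describes it as shorthand for a handful of other options. Here is a small sketch that spells them out; the function name mirror_longhand is just something I made up for illustration, not part of wget.

```shell
#!/bin/sh
# mirror_longhand is a hypothetical helper name, not a wget feature.
# Per the wget manual, -m (--mirror) is shorthand for:
#   -r                   recurse through the links found in each page
#   -N                   timestamping: skip files that are not newer than the local copy
#   -l inf               no limit on recursion depth
#   --no-remove-listing  keep the .listing files created by FTP retrievals
mirror_longhand() {
    # $1: the URL to mirror
    wget -r -N -l inf --no-remove-listing "$1"
}
```

Calling mirror_longhand http://www.example.com/ should behave the same as the -m command above.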

There are several issues you might have with this approach, however.

First of all, it’s not very useful for local browsing, as the links in the pages still point to the real URLs and not to your local downloads. If, say, you downloaded http://www.example.com/, the link on that page to http://www.example.com/page2.html would still point to example.com’s server, which is a right pain if you’re trying to browse your local copy of the site while offline.

To fix this, you can use the -k option in conjunction with the mirror option:

$ wget -mk http://www.example.com/

Now, the link I talked about earlier will point to the relative page2.html. The same happens with all images, stylesheets and other resources, so you should now get an authentic offline browsing experience.
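If you only want one section of a site for offline reading, a couple of extra options may help. The sketch below assumes the part you care about lives under a single directory (the /docs/ path in the usage note is hypothetical), and the function name mirror_section is my own invention. It adds -np so wget never climbs above the starting directory, and -p so the images and stylesheets each page needs are still fetched.

```shell
#!/bin/sh
# mirror_section is a hypothetical helper name, not a wget feature.
mirror_section() {
    # $1: starting URL, assumed to be a directory on the site
    #   -m   mirror (recursive, with timestamping)
    #   -k   convert links so they work in the local copy
    #   -p   also fetch page requisites (images, stylesheets) needed to display each page
    #   -np  never ascend to the parent directory of the starting URL
    wget -m -k -p -np "$1"
}
```

You would call it as mirror_section http://www.example.com/docs/, swapping in the real address.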

There’s one other major issue I haven’t covered yet: bandwidth. Quite apart from the bandwidth you’ll use on your own connection to pull down a whole site, you’ll be putting some strain on the remote server. Be kind and reduce the load on it (and on you), especially if the site is small and its bandwidth comes at a premium. Play nice.

One of the ways in which you can do this is to deliberately slow down the download by placing a delay between requests to the server.

$ wget -mk -w 20 http://www.example.com/

This places a delay of 20 seconds between requests. Change the number to suit, and optionally add a suffix of m for minutes, h for hours, or d for (yes) days if you want to slow the mirror down even further.
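If you mirror sites regularly, it might be worth wrapping the polite settings in a small shell function. This is only a sketch: the name mirror_polite, the 20 second default and the 200k rate cap are my own assumptions, so adjust them to taste. It uses two further wget options: --random-wait, which varies each pause between 0.5 and 1.5 times the -w value, and --limit-rate, which caps the download speed.

```shell
#!/bin/sh
# mirror_polite is a hypothetical helper name, not a wget feature.
mirror_polite() {
    # $1: URL to mirror
    # $2: delay between requests, e.g. 20, 1m or 2h (assumed default: 20 seconds)
    #   --random-wait      vary each pause between 0.5x and 1.5x the -w value
    #   --limit-rate=200k  cap the transfer rate (200k is an arbitrary choice)
    wget -m -k -w "${2:-20}" --random-wait --limit-rate=200k "$1"
}
```

For example, mirror_polite http://www.example.com/ 1m would wait roughly a minute between requests.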

Originally published on FOSSwire
