Blogging in Lisp


Mirroring a Blogspot Site

Submitted by Bill St. Clair on Thu, 13 Apr 2017 10:58:31 GMT

William Norman Grigg died yesterday. RIP.

When an important blogger passes, I often mirror their web site(s). I've been doing that with Mr. Grigg's Pro Libertate site. It's on Blogspot, so doing a simple "wget -mk" pulls a separate file for each comment to each post, even though all those copies are identical. I finally figured out how to tell wget to NOT keep those files. It still downloads them all, scans them for links (all of which it already knows about), and then deletes them, but at least they don't stay to waste disk space. I have found no way to tell it to completely ignore those files. Mirroring would be much faster if that were possible. It took a few minutes to pull the 1,030 html files, and then a long time to pull and discard all the "?showComment" files.
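If you already have a mirror made with a plain "wget -mk", the leftover comment copies can probably be counted and removed after the fact. This is only a sketch: the function names are made up for illustration, and it assumes the files were saved with a literal "?showComment" in their names, as wget does on Unix-like systems.

```shell
# Count the leftover "?showComment" files under a mirror directory.
# $1 = mirror directory (hypothetical; wherever wget put the site)
count_comment_files() {
    # [?] matches a literal question mark in find's -name pattern
    find "$1" -type f -name '*[?]showComment*' | wc -l | tr -d ' '
}

# Delete those files, leaving the real posts alone.
delete_comment_files() {
    find "$1" -type f -name '*[?]showComment*' -exec rm -f {} +
}

# Example use (directory name is an assumption):
#   count_comment_files ./mirror
#   delete_comment_files ./mirror
```

Run the count first and look at a few of the matches before deleting anything.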

I named the script that does this mirror-blogspot. The important line is:

wget -mk -R "*?showComment*" -pH \
     -D "$DOMAIN,,,," "$1"

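-m is --mirror, the standard wget option for mirroring. It enables recursive download with time-stamping and removes the depth limit on recursion (it is shorthand for -r -N -l inf --no-remove-listing). Recursive downloads stay on the starting host unless told otherwise, so no links to other hosts will be followed.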

-k is --convert-links. It causes internal links to be changed from absolute to relative, so <a href='$1/foo'> becomes <a href='foo'>, with all the right stuff done to make that work correctly. wget only does the conversion at the end, after all the downloads have finished. Unfortunately, there's no way to tell wget to do that process on an existing mirror, so if your mirror quits before it's done, you're SOL.

-R is --reject. It's the important thing I learned yesterday. It tells wget to reject files whose names match the argument, which is either a comma-separated list of file name suffixes or a wildcard pattern (not a regular expression).
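To get a feel for how that wildcard pattern behaves, here is a small sketch that uses the shell's own "case" matching as a stand-in. It is an approximation of what wget does, not wget itself, and the function and file names are invented for the example. Note that the "?" in the pattern is itself a wildcard for any single character, which happens to match the literal "?" in the file name.

```shell
# Approximate wget's -R wildcard matching with the shell's globbing.
# $1 = file name, $2 = wildcard pattern
matches_reject() {
    case "$1" in
        $2) echo "rejected" ;;   # unquoted $2 is treated as a pattern
        *)  echo "kept" ;;
    esac
}

# Hypothetical file names, as wget might save them:
matches_reject 'my-post.html?showComment=1492080000000' '*?showComment*'
matches_reject 'my-post.html' '*?showComment*'
```

The first call prints "rejected" and the second prints "kept".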

-p is --page-requisites. It tells wget to download everything a page needs to display properly, such as inline images and stylesheets, but will NOT by itself make it download from another domain.

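-H is --span-hosts. It lets the recursive download follow links to other hosts, which is needed here because the images are stored on other domains.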

-D is --domains. It takes a comma-separated list of domains that wget is allowed to follow. Without -H and -D, a recursive download only fetches files from the host named in the final argument. -D lets you add other domains, but the list must also include the blog's own domain; hence $DOMAIN in that list. The * domains are where Blogspot stores images.
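Putting the options together, a script along these lines might look like the sketch below. It is not the actual mirror-blogspot script. The helper names are invented, deriving the domain from the URL is an assumption about where $DOMAIN comes from, and the image domains are left as a parameter you supply yourself.

```shell
#!/bin/sh
# Sketch of a mirror-blogspot style script (not the original).

# Extract the host name from a URL such as https://example.blogspot.com/
url_host() {
    printf '%s\n' "$1" |
        sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|[/:?#].*$||'
}

# Build the -D argument: the blog's own host plus any extra domains.
# $1 = blog URL, $2 = comma-separated extra domains (may be empty)
domain_list() {
    host=$(url_host "$1")
    printf '%s\n' "$host${2:+,$2}"
}

# Mirror the blog, discarding the per-comment copies of each post.
# ASSUMPTION: $2 lists the image hosts; they are not filled in here.
mirror_blogspot() {
    wget -mk -R '*?showComment*' -pH -D "$(domain_list "$1" "$2")" "$1"
}

# Example use (both arguments are hypothetical):
#   mirror_blogspot "https://example.blogspot.com/" "images.example.net"
```

Keeping the domain list in its own function makes it easy to check what wget will be allowed to follow before starting a long download.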

Comments (1)

Submitted by Bill St. Clair on Thu, 13 Apr 2017 13:15:25 GMT

Maybe the next time I need to do this I'll try HTTrack.

