Submitted by Bill St. Clair on Thu, 13 Apr 2017 10:58:31 GMT
William Norman Grigg died yesterday. RIP.
When an important blogger passes, I often mirror their web site(s). I've been doing that with Mr. Grigg's Pro Libertate site. It's on Blogspot, so doing a simple "wget -mk" pulls a separate file for each comment to each post, even though all those copies are identical. I finally figured out how to tell wget to NOT keep those files. It still downloads them all, scans them for links that it already knows about, and then deletes them, but at least they don't stay to waste disk space. I have found no way to tell it to completely ignore those files. Mirroring would be much faster if that were possible. It took a few minutes to pull the 1,030 html files, and then a long time to pull and discard all the "?showComment" files.
I named the script that does this mirror-blogspot. The important line is:
wget -mk -R "*?showComment*" -pH -D "$DOMAIN,1.bp.blogspot.com,2.bp.blogspot.com,3.bp.blogspot.com,4.bp.blogspot.com" "$1"
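For context, here is a minimal sketch of how the rest of such a script might look. The post only shows the wget line, so the way DOMAIN gets derived from the URL is an assumption: the domain_of helper below is hypothetical and simply strips the scheme and path from the first argument.

```shell
#!/bin/sh
# mirror-blogspot (sketch): mirror a Blogspot site without keeping the
# per-comment "?showComment" copies of each post.

# Hypothetical helper, not from the original script: reduce a URL to its
# host name by dropping the scheme and everything after the first "/".
domain_of() {
  _url="$1"
  _url="${_url#*://}"   # drop "http://" or "https://" if present
  _url="${_url%%/*}"    # drop the path, if any
  printf '%s\n' "$_url"
}

# Only start the mirror when a URL was given on the command line.
if [ "$#" -ge 1 ]; then
  DOMAIN="$(domain_of "$1")"
  wget -mk -R "*?showComment*" -pH \
    -D "$DOMAIN,1.bp.blogspot.com,2.bp.blogspot.com,3.bp.blogspot.com,4.bp.blogspot.com" \
    "$1"
fi
```

It would be run as something like "mirror-blogspot http://example.blogspot.com/", where example.blogspot.com is a placeholder for the blog being mirrored.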
-m is --mirror, the standard wget mirroring option. It enables recursive download with time-stamping and removes the limit on recursion depth. By default, the recursion does not follow links to other hosts.
-k is --convert-links. It causes internal links to be changed from absolute to relative, so <a href='$1/foo'> becomes <a href='foo'>, with all the right stuff done to make that work correctly. The conversion only happens after the whole download has finished. Unfortunately, there's no way to tell wget to do that process on an existing mirror, so if your mirror quits before it's done, you're SOL.
-R is --reject. It's the important thing I learned yesterday. It tells wget to reject files whose names match the argument, which is a comma-separated list of file name suffixes or shell-style wildcard patterns (not regular expressions).
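To see what a wildcard pattern like that matches, here is a small sketch using the shell's own pattern matching, which follows the same kind of rules ("*" matches any run of characters, "?" matches exactly one). The matches helper and the sample file names are made up for illustration; wget does its own matching internally.

```shell
#!/bin/sh
# Hypothetical helper: succeed if NAME ($1) matches the wildcard PATTERN ($2).
matches() {
  case "$1" in
    $2) return 0 ;;   # unquoted, so $2 is treated as a wildcard pattern
    *)  return 1 ;;
  esac
}

# Sample file names (illustrative only).
for name in "post.html?showComment=1234567890" "post.html" "showComment"; do
  if matches "$name" "*?showComment*"; then
    echo "reject: $name"
  else
    echo "keep:   $name"
  fi
done
```

Note that the "?" in the pattern is itself a wildcard for one character, so a name has to have at least one character before "showComment" to match. In the Blogspot file names that character happens to be a literal "?".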
-p is --page-requisites. It tells wget to download everything needed to display a page, such as inline images and stylesheets, but will NOT by itself make it download from another domain.
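-H is --span-hosts. It allows the recursive download to fetch files from hosts other than the starting one, which is what lets -p pull images stored on another domain.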
-D is --domains. It takes a comma-separated list of the domains wget is allowed to follow, and it does not turn on -H by itself. Without -H, a recursive download only fetches files from the host mentioned as the final argument. -D lets you add other domains, but also requires that you include the site's own domain in the list; hence $DOMAIN in that list. The *.bp.blogspot.com domains are where Blogspot stores images.
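One possible alternative, which has not been tried on this mirror: wget 1.14 and later have a --reject-regex option that is matched against the complete URL, so it may keep the comment URLs from being fetched at all. The sketch below assumes that behavior and assumes DOMAIN is already set as in the command above. The is_comment_url helper is hypothetical and only previews which URLs the expression would match.

```shell
#!/bin/sh
# Sketch only: --reject-regex needs wget 1.14 or later and was not part of
# the original mirror-blogspot script.

# A regular expression (not a wildcard pattern) for Blogspot comment URLs.
REJECT_RE='.*[?]showComment=.*'

# Hypothetical helper: succeed if the URL would match the expression.
is_comment_url() {
  printf '%s\n' "$1" | grep -Eq "^${REJECT_RE}\$"
}

# DOMAIN is assumed to be set already, as in the original command.
if [ "$#" -ge 1 ]; then
  wget -mk --reject-regex "$REJECT_RE" -pH \
    -D "$DOMAIN,1.bp.blogspot.com,2.bp.blogspot.com,3.bp.blogspot.com,4.bp.blogspot.com" \
    "$1"
fi
```

If this works as described, the long "pull and discard" phase should mostly go away, since the rejected URLs would be skipped before download rather than deleted afterward.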