I would like to download my user activity page on this Unix & Linux site with wget, together with all pages that are linked from the activity list.

I tried

wget -m -l 2 

which should mirror the site recursively, limited to a recursion depth of two, but that is not a good solution; in particular, the stylesheets are not downloaded correctly.
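For reference, the fuller wget invocation I experimented with looks something like this (a sketch; cdn.sstatic.net is only my guess at the host serving the stylesheets):

# cdn.sstatic.net below is a guess at the stylesheet host
$ wget --mirror --level=2 --page-requisites --convert-links \
    --adjust-extension --span-hosts \
    --domains=unix.stackexchange.com,cdn.sstatic.net \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all'

Even with those flags I did not get everything (stylesheets in particular) to resolve locally.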

Is there a solution that also downloads all the needed CSS and images and keeps the links between the downloaded questions intact locally? Ideally, the downloaded questions would be displayed with everything intact, i.e. comments, etc.


Something like this with httrack will do what you want.

$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* -r2

This won't proceed past the first page of the activity list's pagination. You could likely modify it to do so, for example by looping over the pagination URLs (see the sketch after the next paragraph).

The above downloads 2 levels (-r2) and ignores all pages whose URLs do not match the pattern *question*.
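To pick up the later pages as well, one option is to loop over the pagination URLs and let httrack update the same mirror on each pass. A sketch, assuming the activity tab paginates via a page= query parameter (check the links at the bottom of the page to confirm):

# page= is an assumption about how the activity tab paginates
$ for page in 1 2 3; do
      httrack \
          "http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all&page=$page" \
          -* +*question* -r2
  done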

Commentary on this approach

With this type of download you'll likely have to run a more complex command a couple of times to confirm that you've gotten everything required to render the pages locally. Not to worry, though: you can keep running httrack in the same directory, and it will detect that it has already downloaded various pieces and either skip them or update them where appropriate.

NOTE: This is a by-product of the approach we're using, where we've explicitly excluded everything with -* and then selectively added things back in with +.... You could always cast the net wider and tell httrack to download more, but then you'll be pulling in a lot more data too.
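For example, to cast the net just wide enough for page assets, you could append standard filter patterns (a sketch; adjust the extensions to whatever the pages actually reference):

$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* +*.css +*.js +*.png +*.gif -r2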

Iterating the download

For example, here I'm running it multiple times as I identify additional files that I want it to pull down.

run #1
$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* +*sstatic.net* -r2 

There is an index.html and a hts-cache folder in the directory 
A site may have been mirrored here, that could mean that you want to update it
Be sure parameters are ok

Press <Y><Enter> to confirm, <N><Enter> to abort
Y
Mirror launched on Fri, 07 Nov 2014 14:01:35 by HTTrack Website Copier/3.48-19 [XR&CO'2014]
mirroring http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all -* +*question* +*sstatic.net* with the wizard help..
Done.: unix.stackexchange.com/questions/163334/connecting-to-irc-and-log-all-conversations (62646 bytes) - OK
Thanks for using HTTrack!
run #2
$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* +*sstatic.net* +*googleapis* -r2 

There is an index.html and a hts-cache folder in the directory 
A site may have been mirrored here, that could mean that you want to update it
Be sure parameters are ok

Press <Y><Enter> to confirm, <N><Enter> to abort
Y
Mirror launched on Fri, 07 Nov 2014 14:03:05 by HTTrack Website Copier/3.48-19 [XR&CO'2014]
mirroring http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all -* +*question* +*sstatic.net* +*googleapis* with the wizard help..
Done.: unix.stackexchange.com/questions/163334/connecting-to-irc-and-log-all-conversations (62646 bytes) - OK
Thanks for using HTTrack!

In the above I identified that Stack Exchange makes use of Google APIs, so I needed to add that pattern into the filter chain so that httrack knows to download files from that host as well.

I generally either use grep to look through the downloaded files to make sure I have everything, or use my web browser's "view source" feature to see which URLs still come from other sites rather than from my local system.
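For example, something like this lists the absolute URLs still embedded in the downloaded pages (a sketch; it assumes httrack wrote the mirror into a unix.stackexchange.com/ directory, and the regex is deliberately crude):

$ grep -rhoE 'https?://[^" ]+' unix.stackexchange.com/ | sort -u

Any URL in that output that points at a remote host is a candidate for another +... filter rule.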

NOTE: You can open the resulting download in Chrome via file:///path/to/httrack/download/index.html and navigate the contents.

@rubo77 - try what I showed. It downloaded the first page just fine and I could navigate it locally. – slm Nov 7 '14 at 6:05

Will try when I am out of bed later ☺ – rubo77 Nov 7 '14 at 6:07

@rubo77 - ha, I'm going to bed now. I'm pretty confident that httrack can handle the pagination as well; the URL you provided has your activity spread across multiple pages, i.e. the 1,2,3,...Last bit at the bottom. – slm Nov 7 '14 at 6:09

httrack works just great! Only one problem I couldn't solve: Mirroring stackexchange including external images – rubo77 Nov 7 '14 at 9:16

@rubo77 - please don't use exclamation points, simply pointing this out is sufficient. You just need to add additional path rules, since the approach I specified excludes everything (-*) and then adds paths back in (+...). So add a rule such as +*.css, etc. Also, the approach I'm showing is a model; you'll have to play with it to get exactly the results you're after, but it definitely works, as I've used it for a variety of projects over the years to achieve similar results. – slm Nov 7 '14 at 18:49

You can use a program called Black Widow: it has a GUI and will download the site to your hard drive.

Black Widow

