I would like to download my user activity page on this Unix & Linux site with wget, together with all pages that are linked from the activity list.

I tried

wget -m -l 2 

which should mirror the site recursively, limited to a recursion depth of two, but that is not a good solution; in particular, the stylesheets are not downloaded correctly.
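For reference, the fuller wget invocation I experimented with looks something like this (a sketch; cdn.sstatic.net is only my guess at the host serving the stylesheets):

# cdn.sstatic.net below is a guess at the stylesheet host
$ wget --mirror --level=2 --page-requisites --convert-links \
    --adjust-extension --span-hosts \
    --domains=unix.stackexchange.com,cdn.sstatic.net \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all'

Even with those flags I did not get everything (stylesheets in particular) to resolve locally.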

Is there a solution that also downloads all the needed CSS and images and keeps the links between the downloaded questions intact locally? Ideally, the downloaded questions would be displayed with everything intact, i.e. comments, etc.


Something like this with httrack will do what you want.

$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* -r2

This won't proceed past the first page of the activity list's pagination. You could likely modify it to do so, for example by looping over the pagination URLs (see the sketch after the next paragraph).

The above downloads 2 levels (-r2) and ignores all pages whose URLs do not match the pattern *question*.
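To pick up the later pages as well, one option is to loop over the pagination URLs and let httrack update the same mirror on each pass. A sketch, assuming the activity tab paginates via a page= query parameter (check the links at the bottom of the page to confirm):

# page= is an assumption about how the activity tab paginates
$ for page in 1 2 3; do
      httrack \
          "http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all&page=$page" \
          -* +*question* -r2
  done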

Commentary on this approach

With this type of download you'll likely have to run a more complex command a couple of times to confirm that you've gotten everything required to render the pages locally. Not to worry, though: you can keep running httrack in the same directory, and it will detect that it has already downloaded various pieces and either skip them or update them where appropriate.

NOTE: This is a by-product of the approach we're using, where we've explicitly excluded everything with -* and then selectively added things back in with +.... You could always cast the net wider and tell httrack to download more, but then you'll be pulling in a lot more data too.
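For example, to cast the net just wide enough for page assets, you could append standard filter patterns (a sketch; adjust the extensions to whatever the pages actually reference):

$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* +*.css +*.js +*.png +*.gif -r2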

Iterating the download

For example, here I'm running it multiple times as I identify additional files that I want it to pull down.

run #1
$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* +*sstatic.net* -r2 

There is an index.html and a hts-cache folder in the directory 
A site may have been mirrored here, that could mean that you want to update it
Be sure parameters are ok

Press <Y><Enter> to confirm, <N><Enter> to abort
Y
Mirror launched on Fri, 07 Nov 2014 14:01:35 by HTTrack Website Copier/3.48-19 [XR&CO'2014]
mirroring http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all -* +*question* +*sstatic.net* with the wizard help..
Done.: unix.stackexchange.com/questions/163334/connecting-to-irc-and-log-all-conversations (62646 bytes) - OK
Thanks for using HTTrack!
run #2
$ httrack \
    'http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all' \
    -* +*question* +*sstatic.net* +*googleapis* -r2 

There is an index.html and a hts-cache folder in the directory 
A site may have been mirrored here, that could mean that you want to update it
Be sure parameters are ok

Press <Y><Enter> to confirm, <N><Enter> to abort
Y
Mirror launched on Fri, 07 Nov 2014 14:03:05 by HTTrack Website Copier/3.48-19 [XR&CO'2014]
mirroring http://unix.stackexchange.com/users/20661/rubo77?tab=activity&sort=all -* +*question* +*sstatic.net* +*googleapis* with the wizard help..
Done.: unix.stackexchange.com/questions/163334/connecting-to-irc-and-log-all-conversations (62646 bytes) - OK
Thanks for using HTTrack!

In the above I identified that Stack Exchange makes use of Google APIs, so I needed to add that pattern into the filter chain so that httrack knows to download files from that host as well.

I generally either use grep to look through the downloaded files to make sure I have everything, or use my web browser's "view source" feature to see which URLs still come from other sites rather than from my local system.
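For example, something like this lists the absolute URLs still embedded in the downloaded pages (a sketch; it assumes httrack wrote the mirror into a unix.stackexchange.com/ directory, and the regex is deliberately crude):

$ grep -rhoE 'https?://[^" ]+' unix.stackexchange.com/ | sort -u

Any URL in that output that points at a remote host is a candidate for another +... filter rule.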

NOTE: You can open the resulting download in Chrome via file:///path/to/httrack/download/index.html and navigate the contents.

@rubo77 - try what I showed. It downloaded the first page just fine and I could navigate it locally. – slm Nov 7 '14 at 6:05

Will try when I am out of bed later ☺ – rubo77 Nov 7 '14 at 6:07

@rubo77 - ha, I'm going to bed now. I'm pretty confident that httrack can handle the pagination as well; the URL you provided has your activity spread across multiple pages, i.e. the 1,2,3,...Last bit at the bottom. – slm Nov 7 '14 at 6:09

httrack works just great! Only one problem I couldn't solve: Mirroring stackexchange including external images – rubo77 Nov 7 '14 at 9:16

@rubo77 - please don't use exclamation points, simply pointing this out is sufficient. You just need to add additional path rules, since the approach I specified excludes everything (-*) and then adds paths back in (+...). So add a rule such as +*.css, etc. Also, the approach I'm showing is a model; you'll have to play with it to get exactly the results you're after, but it definitely works, as I've used it for a variety of projects over the years to achieve similar results. – slm Nov 7 '14 at 18:49

You can use a program called Black Widow: it has a GUI and will download the site to your hard drive.

Black Widow

