Download full web page and save without a deep directory structure? Also, bypass paywall?
0
votes
1
answer
166
views
So, I want to be able to download a web page in a way similar to what https://archive.is does.
Using wget -p -E -k usually produces a decent result - but the result is somewhat hard to handle. For example, after wget -p -E -k https://news.sky.com/story/yazidi-woman-kidnapped-by-islamic-state-freed-from-gaza-after-decade-in-captivity-13227540
I got a directory named news.sky.com, and the page was available as news.sky.com/story/yazidi-woman-kidnapped-by-islamic-state-freed-from-gaza-after-decade-in-captivity-13227540.html, while the other files the page needs were scattered around in that same news.sky.com directory.
I'd prefer to have something similar to how browsers can "save a page" - the page file in the current directory plus a "something_files" subdirectory where the necessities are. I understand I can kinda do that by moving the site directory structure into that files subdirectory and creating a redirect page next to it, but I'd prefer to do it properly if possible.
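For illustration, this is roughly what I mean by that manual workaround - a minimal sketch for the sky.com example above, where the output names page.html and page_files/ are just placeholders I picked:

```sh
#!/bin/sh
# Manual workaround sketch: mirror the page with wget, move the whole
# host tree into a "_files" subdirectory, and put a small redirect page
# next to it. Names (page.html, page_files) are arbitrary.

url='https://news.sky.com/story/yazidi-woman-kidnapped-by-islamic-state-freed-from-gaza-after-decade-in-captivity-13227540'

wget -p -E -k "$url"

# Everything wget fetched lives under the host directory (news.sky.com here);
# -k made the links relative, so moving the tree as a whole keeps them working.
mkdir -p page_files
mv news.sky.com page_files/

# Stub page in the current directory that redirects to the saved article.
cat > page.html <<'EOF'
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="refresh" content="0; url=page_files/news.sky.com/story/yazidi-woman-kidnapped-by-islamic-state-freed-from-gaza-after-decade-in-captivity-13227540.html">
</head>
<body>
<a href="page_files/news.sky.com/story/yazidi-woman-kidnapped-by-islamic-state-freed-from-gaza-after-decade-in-captivity-13227540.html">Saved page</a>
</body>
</html>
EOF
```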
There are also cases of paywalls that archive.is successfully bypasses but wget -p -E -k does not. For example, with https://www.nytimes.com/2014/10/28/magazine/theo-padnos-american-journalist-on-being-kidnapped-tortured-and-released-in-syria.html , archive.is produced a perfect paywall-less copy, while wget -p -E -k produced only the start of the article, stuck on "verifying access". I'd like to be able to do what archive.is does.
Advice on how to achieve both of these things would be much appreciated.
Asked by Mikhail Ramendik
(538 rep)
Oct 14, 2024, 02:02 PM
Last activity: Nov 8, 2024, 04:59 PM