Will anyone find me a decent web archival program? [M$20,000 bounty]
2026 · 58% chance

I want something like Pappet: very simple, doesn't waste my time reading documentation and debugging, just lets me save a website in its entirety. Pappet doesn't work for me because it can't handle websites that are behind a login page, and any links in the .mhtml files it creates point to the live site rather than to other archived files. I also haven't tested its robustness; I imagine it might fail on JS-rendered content, miss "links" that aren't <a> elements, etc.

I only need a single snapshot, not a constantly-updated archive.

The website I want to archive right now is https://bugs-legacy.mojang.com/, so while I'd ideally like the tool to be more broadly applicable, I'll resolve the market based on whether the tool can archive that particular website to my satisfaction.

I'll give you 20k mana if you point me to a suitable tool. Resolves NO if no one does so before close.


Try HTTrack perhaps: https://www.httrack.com/

Instructions here for archiving a site that requires login: https://superuser.com/questions/157331/mirroring-a-web-site-behind-a-login-form#1274008
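For example, something along these lines should mirror the whole site (untested against Mojira; the filter pattern and output directory are just my guesses at a reasonable setup):

httrack "https://bugs-legacy.mojang.com/" -O ./mojira-mirror "+*bugs-legacy.mojang.com/*" -v

For the login part, one approach is to export your logged-in session cookies to a Netscape-format cookies.txt and drop it in the project directory before starting; HTTrack is supposed to pick it up from there.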

@A Seems to require me to browse the site manually?

@IsaacKing No, that's after you download it; they're explaining one way to view the downloaded site. The download itself is automated: you specify a page to start from and some settings for how deep to follow links recursively, and it proceeds automatically from that point.
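The depth is controlled with -rN (or --depth=N), if I'm remembering the option right, e.g. adding -r3 to the command above should stop the crawl after three levels of links.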

@A Seems broken; the resulting files display raw HTML or a blank page that says "click here". Did it work for you?

@IsaacKing I used it ~10 years ago and it worked then! I have not tried it recently though.

I've been using ArchiveBox running on a Raspberry Pi. It's the opposite of your non-functional requirements, though.

@AlexanderTheGreater Hmm, seems complicated. I'll count it if you can lay out a simple series of terminal commands that get me the end result I want.
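For reference, the basic flow is roughly the following (a sketch based on the ArchiveBox docs, not something I've run on a Pi, and the directory name is just an example):

pip install archivebox
mkdir mojira-archive && cd mojira-archive
archivebox init
archivebox add --depth=1 'https://bugs-legacy.mojang.com/'

As far as I know, archivebox add only crawls to depth 0 or 1, though, so it may not recurse through a whole site the way you want.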

I think you can use the SingleFile CLI for this. See the last two examples in the linked readme.

@GarrettBaker e.g. single-file https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"

@GarrettBaker Can this handle login info?

@IsaacKing My understanding is that it runs through your already-installed Chrome or Chromium, and that it should work if the cookies saved there log you into the relevant websites automatically.
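If the automatic login doesn't kick in, I believe the CLI also has a --browser-cookies-file option that takes an exported cookies file, so something like this might work (flag name from memory, double-check the readme):

single-file https://bugs-legacy.mojang.com/ --browser-cookies-file=cookies.txt --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=2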

@GarrettBaker Same linking issue as Pappet; all internal links are becoming external links in the archive.

It also doesn't seem to be loading JavaScript properly; all JavaScript functionality breaks in the archived files.

Confirmed that it can successfully get past a login wall, so that's an improvement over Pappet.

Ah, looks like it has a --crawl-replace-URLs flag to fix the first problem.

Hmm, started getting "Load timeout" errors on a bunch of URLs when I increased the crawl depth.
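Maybe raising --browser-load-max-time (which I think is the per-page load timeout in milliseconds, e.g. --browser-load-max-time=120000) would help with those; haven't tried it yet.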

OK, it appears the --crawl-replace-URLs flag doesn't actually do anything; links are still broken.

So this doesn't qualify unless someone can provide a fix for that issue, and it can successfully archive a large site like Mojira without leaving out pages due to errors.

I want to know one too.
