I want something like Pappet: very simple, doesn't waste my time reading documentation and debugging, just lets me save a website in its entirely. Pappet does not work for me because it cannot handle websites that are behind a login page, and any links in the .mhtlm files it creates point to the live site rather than other archived files. I also haven't tested its robustness, I imagine it might fail on JS-rendered content, miss "links" that aren't <a> elements, etc.
I only need a single snapshot, not a constantly-updated archive.
The website I want to archive right now is https://bugs-legacy.mojang.com/, so while I'd ideally like the tool to be more broadly applicable, I'll resolve the market based on whether the tool can archive that particular website to my satisfaction.
Will give you 20k mana if you point me to a suitable tool. Resolves NO if no one does before close.
[Snigus: VERY SORRY FOR SPAM. HAVE PUT UP LIMITS INCL ONLY ONE TOP-LEVEL COMMENT PER POST, AND 1COMMENT / HOUR RATE LIMIT]
I see you already tried Browsertrix Crawler and hit two issues: broken internal links and load timeouts. Here are fixes for both:
Broken links fix: The --crawl-replace-URLs flag is known to be unreliable. Instead, use the WARC output with ReplayWeb.page (https://replayweb.page). It runs entirely in-browser and replays WARC files with proper URL rewriting â all internal links work correctly because ReplayWeb intercepts navigation and serves from the WARC archive. No post-processing needed. This is the intended workflow: Browsertrix captures, ReplayWeb replays.
Load timeout fix: Three flags to tune:
--pageLoadTimeout 120(increase from default 90s)--pageExtraDelay 5(wait for lazy-loaded content)--maxPageLimit 0(no page limit, let it complete)Also try
--workers 1to reduce concurrent load on the server
For bugs-legacy.mojang.com specifically, the timeouts were likely from Jira pages being JS-heavy. Single worker + higher timeout should resolve this.
Alternative if WARC workflow is too heavy: Monolith (https://github.com/nicholasgasior/monolith) saves individual pages as fully self-contained HTML with all assets embedded inline. Pair with wget --spider -r to get a URL list, then monolith each page. Links between pages need manual fixing but each page is a perfect snapshot.
f
s
a
.
I see the SingleFile issues you are hitting (broken internal links, JS failures, timeouts). Browsertrix Crawler should solve all three because it uses a full Chromium instance and produces WACZ archives where links are preserved internally.
Exact commands for bugs-legacy.mojang.com:
# Install (one-time)
docker pull webrecord/browsertrix-crawler
# Run the crawl (no login needed for this site)
docker run -v $PWD/crawls:/crawls/ -it webrecord/browsertrix-crawler crawl --url https://bugs-legacy.mojang.com/ --scopeType host --generateWACZ --behaviors autoscroll,autoplay --pageLimit 500 --timeout 120
# View the result (open in browser)
# Go to https://replayweb.page and drag the .wacz file from ./crawls/ onto it
If you need login: create a browser profile first with --profile, log in manually, then pass --profile to the crawl command. All links stay internal inside the WACZ replay. If you want static HTML files instead of WACZ, use the --generateCDX flag and extract with wacz2warc + warc2html tools, though WACZ with ReplayWeb.page is the simplest path. The pageLimit flag prevents runaway crawls on huge sites â set it to 5000 or higher if you want the full archive. Increase --timeout for slow pages.
Try HTTrack perhaps: https://www.httrack.com/
Instructions here for archiving a site that requires login: https://superuser.com/questions/157331/mirroring-a-web-site-behind-a-login-form#1274008
@IsaacKing No, that's after you download it, they are explaining one way to view the downloaded site. But the download itself is automated -- you specify a page to start from and some settings about how deep to follow links recursively and it proceeds automatically from that point.
@A Seems broken, the resulting files display raw HTML or a blank page that says "click here". Did it work for you?
@AlexanderTheGreater Hmm, seems complicated. I'll count it if you can lay out a simple series of terminal commands that get me the end result I want.
@GarrettBaker eg single-file https://www.wikipedia.org --crawl-links=true --crawl-inner-links-only=true --crawl-max-depth=1 --crawl-rewrite-rule="^(.*)\\?.*$ $1"
@IsaacKing My understanding is that it acts through your already downloaded chrome or chromium, and that it should work if you get automatically logged into the relevant websites through the cookies saved there
@GarrettBaker Same linking issue as Pappet; all internal links are becoming external links in the archive.
It also seems to not be loading Javascript properly, all javascript functionality breaks in the archived files.
