Currently, all of that is broken. At one point, I had a traumatic experience where an archived HTML file kept redirecting to the live site, even though I already had all the content rendered, so I ended up disabling all JavaScript entirely.
I have a project for creating and archiving RSS feeds, keeping the full history from the time the crawler starts. I need to clean it up a bit, then I will open source it.
Docker is designed to be undetectable by default. The best way I have found is to set the env var IN_DOCKER=True manually in your Dockerfile, then check that there is no $DISPLAY configured and that you're on Linux. Usually, if all or most of those are true, you can safely add the Docker-specific flags: --no-sandbox, --disable-setuid-sandbox, --disable-dev-shm-usage, etc. That's what we do in https://github.com/ArchiveBox/ArchiveBox/blob/dev/Dockerfile...
But a compromise still lands on the host's kernel, since Docker doesn't provide kernel isolation (well, it does on macOS because it runs in Docker machine, but that's a side effect).
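For anyone who wants to try the heuristic above, here is a rough Python sketch. It is not ArchiveBox's actual code: the function names `looks_like_docker` and `chrome_args` are made up for illustration, and it requires all three signals to be true rather than "most" of them, which is the stricter reading.

```python
import os
import sys

# Chrome flags mentioned above as Docker-specific.
DOCKER_CHROME_FLAGS = [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-dev-shm-usage",
]


def looks_like_docker(env=None, platform=None):
    """Guess whether we are inside Docker (hypothetical helper).

    Relies on IN_DOCKER=True having been set manually in the Dockerfile,
    since Docker does not announce itself otherwise.
    """
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    signals = [
        env.get("IN_DOCKER", "").lower() == "true",  # set by hand in the Dockerfile
        not env.get("DISPLAY"),                      # no X display configured
        platform.startswith("linux"),                # running on Linux
    ]
    # Assumption: require every signal; relax this if "most" is good enough.
    return all(signals)


def chrome_args(base_args, env=None, platform=None):
    """Return base_args, plus the Docker-specific flags when in Docker."""
    args = list(base_args)
    if looks_like_docker(env, platform):
        args += [flag for flag in DOCKER_CHROME_FLAGS if flag not in args]
    return args
```

Calling `chrome_args(["--headless"])` with no overrides reads the real environment; the `env` and `platform` parameters exist only to make the check easy to test.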
I wonder if a better solution would be to play with seccomp or Linux capabilities so that Chrome is sandboxed even in Docker. Not sure how this would work tbh.
Answering here to get ideas: I saw your fix on Git and your request for feedback. I will try to review it and give it some thought once I find some time.
Making docs available offline was one of my main motivations for building this tool. I will try Apple Docs too.
I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages; I do not remember exactly. The output ended up being very large.
By the way, I forgot to add zstd compression support to my ZIM reader/writer. I will implement that in the next version.
For video downloading, I suggest wrapping around yt-dlp. It's an awesome tool.
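A minimal sketch of what such a wrapper could look like, assuming the `yt-dlp` binary is installed and on PATH. The helper names `build_ytdlp_command` and `download_video` and the output filename template are my own choices for illustration, not part of any existing tool.

```python
import shutil
import subprocess


def build_ytdlp_command(url, out_dir, binary="yt-dlp"):
    """Build the argument list for one yt-dlp invocation (hypothetical helper)."""
    return [
        binary,
        "--no-playlist",  # fetch only the video itself, not a surrounding playlist
        "-o", f"{out_dir}/%(title)s [%(id)s].%(ext)s",  # yt-dlp output template
        url,
    ]


def download_video(url, out_dir, binary="yt-dlp"):
    """Run yt-dlp for a single URL and return its exit code."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    # Passing a list (no shell) keeps the URL from being shell-interpreted.
    return subprocess.run(build_ytdlp_command(url, out_dir, binary)).returncode
```

Keeping the command construction separate from the subprocess call makes it easy to log or test the exact arguments before anything is downloaded.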