From e5534d57a1564eb43928ab55c53c0f16130a764d Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Tue, 12 Apr 2022 20:06:52 -0400
Subject: [PATCH] Updated Scheduled Archiving (markdown)

---
 Scheduled-Archiving.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/Scheduled-Archiving.md b/Scheduled-Archiving.md
index b58b015..2ae265b 100644
--- a/Scheduled-Archiving.md
+++ b/Scheduled-Archiving.md
@@ -30,9 +30,39 @@ ArchiveBox ignores links that are imported multiple times (keeping the earliest
 
 This means you can add cron jobs that regularly poll the same file or URL for new links, adding only new ones as necessary, or you can pass `--overwrite` to save a fresh copy each time the scheduled task runs.
 
+The list of defined scheduled tasks can be inspected and cleared with `archivebox schedule --show` and `archivebox schedule --clear`.
+
 ⚠️ Many popular sites such as Twitter, Reddit, Facebook, etc. take efforts to block/ratelimit/lazy-load content to avoid being scraped by bots like ArchiveBox.
 It may be better to use an alternative frontend with minimal JS when archiving those sites: https://github.com/mendel5/alternative-front-ends
 
+The scheduler can be run in `--foreground` mode to avoid relying on your host system's cron scheduler.
+In foreground mode, it will run all tasks previously added using `archivebox schedule` in a long-running foreground process.
+This is useful for running scheduled tasks inside docker-compose or supervisord.
+
+### Docker Usage
+
+```bash
+docker-compose run --rm archivebox schedule --every=day https://example.com
+docker-compose run --rm archivebox schedule --foreground
+# or
+docker run -v $PWD:/data -it archivebox/archivebox schedule --every=day 'https://example.com'
+docker run -v $PWD:/data -it archivebox/archivebox schedule --foreground
+```
+
+`docker-compose.yml`:
+```yaml
+services:
+    archivebox:
+        ...
+
+    archivebox_scheduler:
+        command: schedule --foreground
+        ...
+```
+For a full Docker Compose example config see here: https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml#L64
+
+---
+
 ### Example: Archive a Twitter user's profile once a week
 
 ```bash
@@ -80,6 +110,11 @@ archivebox schedule --every=month --extract=git --overwrite 'https://github.com/
 ```
 `--extract=git` tells it to only use the Git source extractor and skip saving the HTML/screenshot/etc. other extractor methods.
 
+### Example: Archive a list of URLs from the filesystem every 30 minutes
+
+```bash
+archivebox schedule --
+
 ---
 
 ## Manual Scheduling Using Cron