# Scheduled Archiving

ArchiveBox includes a built-in scheduler that can regularly pull in new URLs to archive, either from remote feeds or from files on the local filesystem.

```bash
archivebox schedule --help

usage: archivebox schedule [-h] [--quiet] [--add] [--every EVERY] [--depth {0,1}] [--overwrite] [--clear] [--show] [--foreground] [--run-all] [import_path]

Set ArchiveBox to regularly import URLs at specific times using cron

positional arguments:
  import_path       Check this path and import any new links on every run (can be either local file or remote URL)

optional arguments:
  -h, --help        show this help message and exit
  --quiet, -q       Don't warn about storage space.
  --add             Add a new scheduled ArchiveBox update job to cron
  --every EVERY     Run ArchiveBox once every [timeperiod] (hour/day/month/year or cron format e.g. "0 0 * * *")
  --depth {0,1}     Depth to archive to [0] or 1, see "add" command help for more info
  --overwrite       Re-archive any URLs that have been previously archived, overwriting existing Snapshots
  --clear           Stop all ArchiveBox scheduled runs (remove cron jobs)
  --show            Print a list of currently active ArchiveBox cron jobs
  --foreground, -f  Launch ArchiveBox scheduler as a long-running foreground task instead of using cron.
  --run-all         Run all the scheduled jobs once immediately, independent of their configured schedules, can be used together with --foreground
```

ArchiveBox ignores links that are imported multiple times (keeping the earliest version it has seen). This means you can add cron jobs that regularly poll the same file or URL for new links, adding only the new ones as necessary, or you can pass `--overwrite` to save a fresh copy each time the scheduled task runs.

⚠️ Many popular sites such as Twitter, Reddit, Facebook, etc. actively block, ratelimit, or lazy-load content to avoid being scraped by bots like ArchiveBox. It may be better to use an alternative frontend with minimal JS when archiving those sites:
https://github.com/mendel5/alternative-front-ends

### Example: Archive a Twitter user's profile once a week

```bash
archivebox schedule --every=week --overwrite https://nitter.net/ArchiveBoxApp
```

Nitter is an alternative frontend for Twitter, recommended because it formats the content better for archiving/bots and avoids ratelimits. `--overwrite` is passed to save a fresh copy each week; otherwise the URL would be ignored after the first run because it's already present in the collection.
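After adding a scheduled job, you can verify it was registered using the `--show` flag documented in the help text above:

```bash
# print the currently active ArchiveBox cron jobs
archivebox schedule --show
```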
### Example: Archive a Reddit subreddit and the discussions for every post once a week

```bash
# optionally limit archiving to Teddit (Reddit) URLs, so discussion and user pages are captured but outbound links are not
archivebox config --set URL_WHITELIST='^http(s)?:\/\/(.+)?teddit\.net\/?.*$'

archivebox schedule --every=week --overwrite --depth=1 'https://teddit.net/r/DataHoarder/'
```

Teddit is an alternative frontend for Reddit, recommended because it formats the content better for archiving/bots and avoids ratelimits.

### Example: Archive the HackerNews front page and all linked articles every 24 hours

```bash
# optionally exclude some URLs you don't want to archive
archivebox config --set URL_BLACKLIST='^http(s)?:\/\/(.+\.)?(youtube\.com|amazon\.com)\/.*$'

archivebox schedule --every=day --depth=1 'https://news.ycombinator.com'
```

### Example: Archive all URLs in an RSS feed from Pocket every 12 hours

This example imports your Pocket bookmark feed and archives any new links every 12 hours.

First, set your Pocket RSS feed to "public" under https://getpocket.com/privacy_controls.

Then tell ArchiveBox to pull it regularly (using the cron-format argument to `--every`, since there is no 12-hour shorthand):
```bash
archivebox schedule --every='0 */12 * * *' --depth=1 https://getpocket.com/users/yourusernamegoeshere/feed/all
```

### Example: Archive a GitHub repository's source code only once a month

```bash
archivebox schedule --every=month --extract=git --overwrite 'https://github.com/ArchiveBox'
```
`--extract=git` tells ArchiveBox to run only the git source-code extractor and skip the HTML/screenshot/etc. extractors.

---

## Manual Scheduling Using Cron

To schedule regular archiving you can use any task scheduler like `cron`, `at`, `systemd`, etc., or the built-in `archivebox schedule` command (which uses cron internally).

For some example configs, see the [`etc/cron.d`](https://github.com/ArchiveBox/ArchiveBox/blob/master/etc/cron.d) and [`etc/supervisord`](https://github.com/ArchiveBox/ArchiveBox/blob/master/etc/supervisord) folders.

### Example: Import Firefox browser history every 24 hours

This example exports your browser history to JSON and imports it once a day (the `export_browser_history.sh` helper ships with ArchiveBox; the exact invocation and paths may differ on your setup):

```bash
# export browser history to JSON, then pipe it into archivebox
./bin/export_browser_history.sh --firefox ./output/sources/firefox_history.json
archivebox add < ./output/sources/firefox_history.json >> /var/log/ArchiveBox.log
```

### Example: Import an RSS feed from Pocket every 12 hours

If you need to customize the import process or archive a password-protected RSS feed, you can do it manually with a bash script and cron. Create a script, e.g. `/home/ArchiveBox/archivebox/bin/scheduled_imports.sh`:

```bash
#!/bin/bash
cd /home/ArchiveBox/archivebox
curl --silent https://getpocket.com/users/yourusernamegoeshere/feed/all | archivebox add >> /home/ArchiveBox/archivebox/logs/scheduled_imports.log
# you can add additional flags to curl, e.g. to authenticate over HTTP:
# curl --silent -u username:password ... | archivebox add >> ...
```
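Make the script executable so cron can run it (assuming the path above):

```bash
chmod +x /home/ArchiveBox/archivebox/bin/scheduled_imports.sh
```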
Then create a cron job telling your system to run the script on your chosen regular interval, e.g. every 12 hours:

```bash
echo '0 */12 * * * archivebox /home/ArchiveBox/archivebox/bin/scheduled_imports.sh' > /etc/cron.d/archivebox_scheduled_imports
```
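If you'd rather avoid cron entirely (e.g. inside a Docker container or under a process supervisor like `supervisord`), the built-in scheduler can also run as a long-running foreground task via the `--foreground` flag from the help text above; a minimal sketch:

```bash
# run all previously-added schedules in the foreground instead of via cron;
# your supervisor (supervisord, systemd, Docker restart policy, etc.) is then
# responsible for keeping this process alive
archivebox schedule --foreground
```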