# Scheduled Archiving

## Using Cron

ArchiveBox contains a built-in scheduler that can regularly pull in new URLs to archive from remote feeds or from files on the local filesystem.

To schedule regular archiving you can use any task scheduler like `cron`, `at`, `systemd`, etc.

```bash
archivebox schedule
archivebox schedule --help

usage: archivebox schedule [-h] [--quiet] [--add] [--every EVERY] [--depth {0,1}] [--overwrite] [--clear] [--show] [--foreground] [--run-all] [import_path]

Set ArchiveBox to regularly import URLs at specific times using cron

positional arguments:
  import_path        Check this path and import any new links on every run (can be either local file or remote URL)

optional arguments:
  -h, --help         show this help message and exit
  --quiet, -q        Don't warn about storage space.
  --add              Add a new scheduled ArchiveBox update job to cron
  --every EVERY      Run ArchiveBox once every [timeperiod] (hour/day/month/year or cron format e.g. "0 0 * * *")
  --depth {0,1}      Depth to archive to [0] or 1, see "add" command help for more info
  --overwrite        Re-archive any URLs that have been previously archived, overwriting existing Snapshots
  --clear            Stop all ArchiveBox scheduled runs (remove cron jobs)
  --show             Print a list of currently active ArchiveBox cron jobs
  --foreground, -f   Launch ArchiveBox scheduler as a long-running foreground task instead of using cron.
  --run-all          Run all the scheduled jobs once immediately, independent of their configured schedules, can be used together with --foreground
```
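
For example, to add a new daily import job, list your active jobs, and then remove them all (the feed URL below is just a placeholder):

```bash
# add a new scheduled import job that checks a feed once a day
archivebox schedule --every=day --depth=0 'https://example.com/feed.rss'

# print a list of currently active ArchiveBox cron jobs
archivebox schedule --show

# stop all ArchiveBox scheduled runs (removes the cron jobs)
archivebox schedule --clear
```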

ArchiveBox ignores links that are imported multiple times (keeping the earliest version that it's seen). This means you can add cron jobs that regularly poll the same file or URL for new links, adding only new ones as necessary, or you can pass `--overwrite` to save a fresh copy each time the scheduled task runs.
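
For example, assuming you periodically export your bookmarks to a local file like `~/Downloads/bookmarks.html` (a hypothetical path), you could poll it hourly and only newly added links would be archived:

```bash
# re-scan this file every hour; links already in the collection are skipped automatically
archivebox schedule --every=hour --depth=0 ~/Downloads/bookmarks.html
```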

⚠️ Many popular sites such as Twitter, Reddit, Facebook, etc. take measures to block/ratelimit/lazy-load content to avoid being scraped by bots like ArchiveBox. It may be better to use an alternative frontend with minimal JS when archiving those sites:
https://github.com/mendel5/alternative-front-ends

### Example: Archive a Twitter user's profile once a week

```bash
archivebox schedule --every=week --overwrite https://nitter.net/ArchiveBoxApp
```

Nitter is an alternative frontend recommended for Twitter that formats the content better for archiving/bots and avoids ratelimits.
`--overwrite` is passed to save a fresh copy each week; otherwise the URL would be ignored after the first run, as it's already present in the collection.

### Example: Archive a Reddit subreddit and discussions for every post once a week

```bash
# optionally limit URLs to Teddit (aka Reddit) to capture discussion and user pages but not outbound URLs
archivebox config --set URL_WHITELIST='^http(s)?:\/\/(.+)?teddit\.net\/?.*$'

archivebox schedule --every=week --overwrite --depth=1 'https://teddit.net/r/DataHoarder/'
```

Teddit is an alternative frontend recommended for Reddit that formats the content better for archiving/bots and avoids ratelimits.

### Example: Archive the HackerNews front page and all linked articles every 24 hours

```bash
# optionally exclude some URLs you don't want to archive
archivebox config --set URL_BLACKLIST='^http(s)?:\/\/(.+\.)?(youtube|amazon)\.com\/.*$'

archivebox schedule --every=day --depth=1 'https://news.ycombinator.com'
```

### Example: Archive all URLs in an RSS feed from Pocket every 12 hours

This example imports your Pocket bookmark feed and archives any new links every 12 hours.

First, set your Pocket RSS feed to "public" under https://getpocket.com/privacy_controls.

Then tell ArchiveBox to pull it regularly (using the cron format accepted by `--every` to get a 12-hour interval):
```bash
archivebox schedule --every='0 */12 * * *' --depth=1 https://getpocket.com/users/yourusernamegoeshere/feed/all
```

### Example: Archive a Github repository's source code only once a month

```bash
archivebox schedule --every=month --extract=git --overwrite 'https://github.com/ArchiveBox'
```

`--extract=git` tells it to use only the Git source extractor and skip the other extractor methods (HTML, screenshot, etc.).

---

## Manual Scheduling Using Cron

To schedule regular archiving you can use any task scheduler like `cron`, `at`, `systemd`, etc. or the built-in scheduler `archivebox schedule` (which uses crontab internally).

For some example configs, see the [`etc/cron.d`](https://github.com/ArchiveBox/ArchiveBox/blob/master/etc/cron.d) and [`etc/supervisord`](https://github.com/ArchiveBox/ArchiveBox/blob/master/etc/supervisord) folders.
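
As a rough sketch (the user, paths, and schedule here are assumptions, not taken from those example configs), a manual `/etc/cron.d` entry that runs `archivebox update` on the whole collection nightly could look like:

```bash
# /etc/cron.d/archivebox-update  (hypothetical example)
# m h dom mon dow  user        command
0   3 *   *   *    archivebox  cd /home/ArchiveBox/archivebox && archivebox update >> logs/update.log 2>&1
```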

## Examples

### Example: Import Firefox browser history every 24 hours

This example exports your browser history and archives it once a day.
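
A minimal sketch of such an export script, assuming the `bin/export_browser_history.sh` helper from the ArchiveBox repo and a collection at `/opt/ArchiveBox` (the paths and output filename here are assumptions), saved as e.g. `/opt/ArchiveBox/bin/firefox_custom.sh`:

```bash
#!/bin/bash
# /opt/ArchiveBox/bin/firefox_custom.sh (hypothetical location)

cd /opt/ArchiveBox

# dump Firefox browsing history URLs to a JSON file in the sources dir
# (check ./bin/export_browser_history.sh --help for the exact arguments)
./bin/export_browser_history.sh --firefox ./output/sources/firefox_history.json

# import any new links from the dump, appending output to a log
archivebox add < ./output/sources/firefox_history.json >> /var/log/ArchiveBox.log
```

Then wire it up to cron the same way as the Pocket example below, e.g. with an `/etc/cron.d` line that runs it once a day.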

### Example: Import an RSS feed from Pocket every 12 hours

This example imports your Pocket bookmark feed and archives any new links every 12 hours.

First, set your Pocket RSS feed to "public" under https://getpocket.com/privacy_controls.

If you need to customize the import process or archive a password-protected RSS feed, you can do it manually with a bash script + cron. Create `/home/ArchiveBox/archivebox/bin/scheduled_imports.sh`:
```bash
#!/bin/bash

cd /home/ArchiveBox/archivebox
curl --silent https://getpocket.com/users/yourusernamegoeshere/feed/all | archivebox add >> /home/ArchiveBox/archivebox/logs/scheduled_imports.log

# you can add additional flags to curl here, e.g. to authenticate with HTTP:
# curl --silent -u username:password ... | archivebox add >> ...
```

Then create a cronjob telling your system to run the script on your chosen regular interval (e.g. every 12 hours):
```bash
# make sure the script is executable first
chmod +x /home/ArchiveBox/archivebox/bin/scheduled_imports.sh

# run it every 12 hours as the archivebox user
echo '0 */12 * * * archivebox /home/ArchiveBox/archivebox/bin/scheduled_imports.sh' > /etc/cron.d/archivebox_scheduled_imports
```