1
0
mirror of https://github.com/pirate/ArchiveBox.git synced 2025-09-03 03:13:12 +02:00

Updated Scheduled Archiving (markdown)

Nick Sweeting
2022-04-12 20:06:52 -04:00
parent dc9e128927
commit e5534d57a1

@@ -30,9 +30,39 @@ ArchiveBox ignores links that are imported multiple times (keeping the earliest
This means you can add cron jobs that regularly poll the same file or URL for new links, adding only new
ones as necessary, or you can pass `--overwrite` to save a fresh copy each time the scheduled task runs.
The list of defined scheduled tasks can be inspected and cleared with `archivebox schedule --show` and `archivebox schedule --clear`.
⚠️ Many popular sites such as Twitter, Reddit, Facebook, etc. take efforts to block/ratelimit/lazy-load content to avoid being scraped by bots like ArchiveBox. It may be better to use an alternative frontend with minimal JS when archiving those sites:
https://github.com/mendel5/alternative-front-ends
The scheduler can be run in `--foreground` mode to avoid relying on your host system's cron scheduler.
In foreground mode, it will run all tasks previously added using `archivebox schedule` in a long-running foreground process.
This is useful for running scheduled tasks inside docker-compose or supervisord.
### Docker Usage
```bash
docker-compose run --rm archivebox schedule --every=day https://example.com
docker-compose run --rm archivebox schedule --foreground
# or
docker run -v $PWD:/data -it archivebox/archivebox schedule --every=day 'https://example.com'
docker run -v $PWD:/data -it archivebox/archivebox schedule --foreground
```
`docker-compose.yml`:
```yaml
services:
archivebox:
...
archivebox_scheduler:
command: schedule --foreground
...
```
For a full Docker Compose example config see here: https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml#L64
---
### Example: Archive a Twitter user's profile once a week
```bash
@@ -80,6 +110,11 @@ archivebox schedule --every=month --extract=git --overwrite 'https://github.com/
```
`--extract=git` tells it to only use the Git source extractor and skip saving the HTML/screenshot/etc. other extractor methods.
### Example: Archive a list of URLs from the filesystem every 30 minutes
```bash
archivebox schedule --
---
## Manual Scheduling Using Cron