---
## Past Releases
To see how this spec has been scheduled / implemented / released so far, read these pull requests:
- ✅ [v0.2.x](https://github.com/pirate/ArchiveBox/tree/483a3bef9e2b1a7b80611947a3be99b0cf4f9959)
- ✅ [v0.4.x](https://github.com/pirate/ArchiveBox/pull/207)
- 🛠 [v0.5.x](https://github.com/pirate/ArchiveBox/pull/275)
**API:**
- [`pip install archivebox`](#-pip-install-archivebox)
- [`archivebox version`](#-archivebox-version--version)
- [`archivebox help`](#-archivebox-help-h--help)
- [`archivebox init`](#-archivebox-init)
- [`archivebox status`](#-archivebox-status)
- [`archivebox add`](#-archivebox-add)
- [`archivebox remove`](#-archivebox-remove)
- [`archivebox schedule`](#-archivebox-schedule)
- [`archivebox config`](#-archivebox-config)
- [`archivebox update`](#-archivebox-update)
- [`archivebox list`](#-archivebox-list)
- [`archivebox oneshot`](#-archivebox-oneshot)
- [`archivebox server`](#-archivebox-server)
- [`archivebox proxy`](#-archivebox-proxy)
- [`archivebox shell`](#-archivebox-shell)
- [`archivebox manage`](#-archivebox-manage)
- [`from archivebox import ...`](#api-for-normal-archivebox-usage)
- [`from archivebox.component import ...`](#api-for-all-useful-subcomponents)
**Design:**
- [Overview](#design)
- [Dependencies](#dependencies)
- [Code Layout](#code-folder-layout)
- [Data Layout](#collection-data-folder-layout)
- [Export Layout](#exported-folder-layout)
## CLI Usage
*Note, these ways to run ArchiveBox are equivalent:*
- `archivebox [subcommand] [...args]`
- `python3 -m archivebox [subcommand] [...args]`
- `docker run -v $PWD:/data nikisweeting/archivebox [subcommand] [...args]`
- `docker-compose run archivebox [subcommand] [...args]`
### `$ pip install archivebox`
```bash
...
Installing collected packages: archivebox
Running setup.py install for archivebox ... done
Successfully installed archivebox-0.4.9
```
Developers working on the ArchiveBox codebase should install the project in "linked" (editable) mode for development: `pipenv install; pip install -e .`.
### `$ archivebox [version|--version]`
```bash
ArchiveBox v0.4.9
[i] Dependency versions:
√ PYTHON_BINARY /opt/ArchiveBox/.venv/bin/python3.7 v3.7 valid
√ DJANGO_BINARY /opt/ArchiveBox/.venv/lib/python3.7/site-packages/django/bin/django-admin.py v2.2.0 valid
√ CURL_BINARY /usr/bin/curl v7.54.0 valid
√ WGET_BINARY /usr/local/bin/wget v1.20.1 valid
√ GIT_BINARY /usr/local/bin/git v2.20.1 valid
√ YOUTUBEDL_BINARY /opt/ArchiveBox/.venv/bin/youtube-dl v2019.04.17 valid
√ CHROME_BINARY /Applications/Google Chrome.app/Contents/MacOS/Google Chrome v74.0.3729.91 valid
[i] Folder locations:
√ REPO_DIR /opt/ArchiveBox 28 files valid
√ PYTHON_DIR /opt/ArchiveBox/archivebox 14 files valid
√ LEGACY_DIR /opt/ArchiveBox/archivebox/legacy 15 files valid
√ TEMPLATES_DIR /opt/ArchiveBox/archivebox/legacy/templates 7 files valid
√ OUTPUT_DIR /opt/ArchiveBox/archivebox/data 10 files valid
√ SOURCES_DIR /opt/ArchiveBox/archivebox/data/sources 1 files valid
√ LOGS_DIR /opt/ArchiveBox/archivebox/data/logs 0 files valid
√ ARCHIVE_DIR /opt/ArchiveBox/archivebox/data/archive 2 files valid
√ CHROME_USER_DATA_DIR /Users/squash/Library/Application Support/Chromium 2 files valid
- COOKIES_FILE - disabled - disabled
```
### `$ archivebox [help|-h|--help]`
```bash
ArchiveBox: The self-hosted internet archive.
Documentation:
https://github.com/pirate/ArchiveBox/wiki
UI Usage:
Open output/index.html to view your archive.
CLI Usage:
mkdir data; cd data/
archivebox init
echo 'https://example.com/some/page' | archivebox add
archivebox add https://example.com/some/other/page
archivebox add --depth=1 ~/Downloads/bookmarks_export.html
archivebox add --depth=1 https://example.com/feed.rss
archivebox update --resume=15109948213.123
```
### `$ archivebox init`
Initialize a new "collection" folder, aka a complete archive containing an ArchiveBox.conf config file, an index of all the archived pages, and the archived content for each page.
```bash
$ mkdir ~/my-archive && cd ~/my-archive
$ archivebox init
[+] Initializing a new ArchiveBox collection in this folder...
~/my-archive
------------------------------------------------------------------
[+] Building archive folder structure...
√ ~/my-archive/sources
√ ~/my-archive/archive
√ ~/my-archive/logs
[+] Building main SQL index and running migrations...
√ ~/my-archive/index.sqlite3
Operations to perform:
Apply all migrations: admin, auth, contenttypes, core, sessions
Running migrations:
Applying contenttypes.0001_initial... OK
Applying auth.0001_initial... OK
Applying admin.0001_initial... OK
...
[*] Collecting links from any existing index or archive folders...
√ Loaded 30 links from existing main index...
! Skipped adding 2 orphaned link data directories that would have overwritten existing data.
! Skipped adding 2 corrupted/unrecognized link data directories that could not be read.
For more information about the link data directories that were skipped, run:
archivebox status
archivebox list --status=invalid
archivebox list --status=orphaned
archivebox list --status=duplicate
[*] [2019-04-24 15:41:11] Writing 30 links to main index...
√ ~/my-archive/index.sqlite3
√ ~/my-archive/index.json
√ ~/my-archive/index.html
------------------------------------------------------------------
[] Done. A new ArchiveBox collection was initialized (30 links).
To view your archive index, open:
~/my-archive/index.html
To add new links, you can run:
archivebox add 'https://example.com'
For more usage and examples, run:
archivebox help
```
### `$ archivebox status`
Print out some info and statistics about the archive collection.
```bash
$ archivebox status
[*] Scanning archive collection main index...
/Users/squash/Documents/Code/ArchiveBox/data/*
Size: 209.3 KB across 3 files
> JSON Main Index: 30 links (found in index.json)
> SQL Main Index: 30 links (found in index.sqlite3)
> HTML Main Index: 30 links (found in index.html)
> JSON Link Details: 1 links (found in archive/*/index.json)
> Admin: 0 users (found in index.sqlite3)
Hint: You can create an admin user by running:
archivebox manage createsuperuser
[*] Scanning archive collection link data directories...
/Users/squash/Documents/Code/ArchiveBox/data/archive/*
Size: 1.6 MB across 46 files in 50 directories
> indexed: 30 (indexed links without checking archive status or data directory validity)
> archived: 1 (indexed links that are archived with a valid data directory)
> unarchived: 29 (indexed links that are unarchived with no data directory or an empty data directory)
> present: 1 (dirs that are expected to exist based on the main index)
> valid: 1 (dirs with a valid index matched to the main index and archived content)
> invalid: 0 (dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized)
> duplicate: 0 (dirs that conflict with other directories that have the same link URL or timestamp)
> orphaned: 0 (dirs that contain a valid index but aren't listed in the main index)
> corrupted: 0 (dirs that don't contain a valid index and aren't listed in the main index)
> unrecognized: 0 (dirs that don't contain recognizable archive data and aren't listed in the main index)
Hint: You can list link data directories by status like so:
archivebox list --status=<status> (e.g. indexed, corrupted, archived, etc.)
```
### `$ archivebox add`
#### `--only-new`
Controls whether to only add new links or also retry previously failed/skipped links.
#### `--index-only`
Pass this to only add the links to the main index without archiving them.
#### `--mirror`
Archive an entire site (finding all linked pages below it on the same domain)
#### `--depth`
Controls how far to follow links from the given url. `0` sets it to only archive the page, and not follow any outlinks. `1` sets it to archive the page and follow one link outwards and archive those pages. `2` sets it to follow a maximum of two hops outwards, and so on...
#### `--crawler=[type]`
Controls which crawler to use in order to find outlinks in a given page.
#### `url`
The URL of the page you want to archive.
#### `< stdin`
URLs to add can also be piped in via stdin instead of being passed as an argument.
```bash
$ archivebox add --depth=1 https://example.com
[+] [2019-03-30 18:36:41] Adding 1 new url and all pages 1 hop out: https://example.com
[*] [2019-03-30 18:36:42] Saving main index files...
√ ./index.json
√ ./index.html
[] [2019-03-30 18:36:42] Updating archive content...
[+] [2019-03-30 18:36:42] "Using Geolocation Data to Understand Consumer Behavior During Severe Weather Events"
https://orbitalinsight.com/using-geolocation-data-understand-consumer-behavior-severe-weather-events
> ./archive/1553789823
> wget
> warc
> media
> screenshot
[] [2019-03-30 18:39:00] Update of 37 pages complete (2.08 sec)
- 35 links skipped
- 0 links updated
- 2 links had errors
[*] [2019-03-30 18:39:00] Saving main index files...
√ ./index.json
√ ./index.html
To view your archive, open:
/Users/example/ArchiveBox/index.html
```
### `$ archivebox schedule`
Use `python-crontab` to add, remove, and edit regularly scheduled archive update jobs.
#### `--run-all`
Run all the scheduled jobs once immediately, independent of their configured schedules
#### `--foreground`
Launch ArchiveBox as a long-running foreground task instead of using cron.
#### `--show`
Print a list of currently active ArchiveBox cron jobs
#### `--clear`
Stop all scheduled ArchiveBox runs and clear them completely from cron.
#### `--add`
Add a new scheduled ArchiveBox update job to cron
#### `--quiet`
Don't warn about many jobs potentially using up storage space.
#### `--every=[schedule]`
The schedule on which to run the command; it can be either:
- `minute`/`hour`/`day`/`week`/`month`/`year`
- or a cron-formatted schedule like `"0/2 * * * *"`/`"* 0/10 * * * *"`/...
#### `import_path`
The path to a local file or remote URL to check for new links.
```bash
$ archivebox schedule --show
@hourly cd /opt/ArchiveBox/data && /opt/ArchiveBox/.venv/bin/archivebox add "https://getpocket.com/users/nikisweeting/feed/all" 2>&1 > /opt/ArchiveBox/data/logs/archivebox.log # archivebox_schedule
```
```bash
$ archivebox schedule --add --every=hour https://getpocket.com/users/nikisweeting/feed/all
[] Scheduled new ArchiveBox cron job for user: squash (1 jobs are active).
> @hourly cd /Users/squash/Documents/Code/ArchiveBox/data && /Users/squash/Documents/Code/ArchiveBox/.venv/bin/archivebox add "https://getpocket.com/users/nikisweeting/feed/all" 2>&1 > /Users/squash/Documents/Code/ArchiveBox/data/logs/archivebox.log # archivebox_schedule
[!] With the current cron config, ArchiveBox is estimated to run >365 times per year.
Congrats on being an enthusiastic internet archiver! 👌
Make sure you have enough storage space available to hold all the data.
Using a compressed/deduped filesystem like ZFS is recommended if you plan on archiving a lot.
```
### `$ archivebox config`
#### `(no args)`
Print the entire config to stdout.
#### `--get KEY`
Get the given config key:value and print it to stdout.
#### `--set KEY=VALUE`
Set the given config key:value in the current collection's config file.
#### `< stdin`
```bash
$ archivebox config
OUTPUT_DIR="output"
OUTPUT_PERMISSIONS=755
ONLY_NEW=False
...
```
```bash
$ archivebox config --get CHROME_VERSION
Google Chrome 74.0.3729.40 beta
```
```bash
$ archivebox config --set USE_CHROME=False
USE_CHROME=False
```
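For the `< stdin` form, the idea is to pipe config `KEY=VALUE` lines in rather than passing them as arguments; the exact syntax below is a sketch of the planned behavior, not confirmed output:
```bash
# Sketch only: assumes KEY=VALUE lines on stdin are treated like --set arguments
echo 'TIMEOUT=120' | archivebox config --set
```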
### `$ archivebox update`
**Check all subscribed feeds for new links, archive them and retry any previously failed pages.**
#### `(no args)`
Update the index and go through each page, retrying any that failed previously.
#### `--only-new`
By default, `update` always retries previously failed/skipped pages; pass this flag to archive only newly added links without going back through the whole archive to fix previously failed ones.
#### `--resume=[timestamp]`
Resume the update process from a specific URL timestamp.
#### `--snapshot`
[TODO] By default, ArchiveBox never re-archives a page after the first successful archive; pass this option to take a fresh snapshot of every page even when an existing version is present.
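A short usage sketch combining the flags above (the timestamp value is illustrative):
```bash
# Only archive newly added links, without retrying previously failed ones
archivebox update --only-new

# Resume an interrupted update starting from a specific link timestamp
archivebox update --resume=1554263415.0
```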
### `$ archivebox list`
#### `--csv=COLUMNS`
Print the output in CSV format, with the specified columns, e.g. `--csv=timestamp,base_url,is_archived`
#### `--json`
Print the output in JSON format (with all the link attributes included in the JSON output).
#### `--filter=REGEX`
Print only URLs matching a specified regex, e.g. `--filter='.*github.com.*'`
#### `--before=TIMESTAMP` / `--after=TIMESTAMP`
Print only URLs before or after a given timestamp, e.g. `--before=1554263415.2` or `--after=1554260000`
```bash
$ archivebox list --sort=timestamp
http://www.iana.org/domains/example
https://github.com/pirate/ArchiveBox/wiki
https://github.com/pirate/ArchiveBox/commit/0.4.0
https://github.com/pirate/ArchiveBox
https://archivebox.io
```
```bash
$ archivebox list --sort=timestamp --csv=timestamp,url
timestamp,url
1554260947,http://www.iana.org/domains/example
1554263415,https://github.com/pirate/ArchiveBox/wiki
1554263415.0,https://github.com/pirate/ArchiveBox/commit/0.4.0
1554263415.1,https://github.com/pirate/ArchiveBox
1554263415.2,https://archivebox.io
```
```bash
$ archivebox list --sort=timestamp --csv=timestamp,url --after=1554263415.0
timestamp,url
1554263415,https://github.com/pirate/ArchiveBox/wiki
1554263415.0,https://github.com/pirate/ArchiveBox/commit/0.4.0
1554263415.1,https://github.com/pirate/ArchiveBox
1554263415.2,https://archivebox.io
```
### `$ archivebox remove`
#### `--yes`
Proceed with removal without prompting the user for confirmation.
#### `--delete`
Also delete all the matching links snapshot data folders and content files.
#### `--filter-type`
Defaults to `exact`, but can be set to any of `exact`, `substring`, `domain`, or `regex`.
#### `pattern`
The filter pattern used to match links in the index. Matching links are removed.
#### `--before=TIMESTAMP` / `--after=TIMESTAMP`
Remove any URLs bookmarked before/after the given timestamp, e.g. `--before=1554263415.2` or `--after=1554260000`.
```bash
$ archivebox remove --delete --filter-type=regex 'http(s)?:\/\/(.+)?(demo\.dev|example\.com)\/?.*'
[*] Finding links in the archive index matching these regex patterns:
http(s)?:\/\/(.+)?(demo\.dev|example\.com)\/?.*
---------------------------------------------------------------------------------------------------
timestamp | is_archived | num_outputs | url
"1554984695" | true | 7 | "https://example.com"
---------------------------------------------------------------------------------------------------
[i] Found 1 matching URLs to remove.
1 Links will be de-listed from the main index, and their archived content folders will be deleted from disk.
(1 data folders with 7 archived files will be deleted!)
[?] Do you want to proceed with removing these 1 links?
y/[n]: y
[*] [2019-04-11 08:11:57] Saving main index files...
√ /opt/ArchiveBox/data/index.json
√ /opt/ArchiveBox/data/index.html
[] Removed 1 out of 1 links from the archive index.
Index now contains 0 links.
```
```bash
$ archivebox remove --yes --delete --filter-type=domain example.com
...
```
### `$ archivebox manage`
Run a Django management command in the context of the current archivebox data directory.
#### `[command] [...args]`
The name of the management command to run, e.g.: `help`, `migrate`, `changepassword`, `createsuperuser`, etc.
```bash
$ archivebox manage help
Type 'archivebox manage help <subcommand>' for help on a specific subcommand.
Available subcommands:
[auth]
changepassword
createsuperuser
[contenttypes]
remove_stale_contenttypes
[core]
archivebox
...
```
### `$ archivebox server`
#### `--bind=[ip:port]`
The address:port combo to run the web UI server on, defaults to `127.0.0.1:8000`.
```bash
$ archivebox server
[+] Starting ArchiveBox webserver...
Watching for file changes with StatReloader
Performing system checks...
System check identified no issues (0 silenced).
April 23, 2019 - 01:40:52
Django version 2.2, using settings 'core.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
```
### `$ archivebox proxy`
Run a live HTTP/HTTPS proxy that records all traffic into WARC files using pywb.
#### `--bind=[ip:port]`
The address:port combo to run the proxy on, defaults to `127.0.0.1:8010`.
#### `--record`
Save all traffic visited through the proxy to the archive.
#### `--replay`
Attempt to serve all pages visited through the proxy from the archive.
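A sketch of how the planned proxy could be invoked with the flags above (no real output is shown since the command isn't implemented yet):
```bash
# Record everything browsed through the proxy into WARCs in the archive
archivebox proxy --bind=127.0.0.1:8010 --record

# Serve previously archived pages back out of the archive where possible
archivebox proxy --bind=127.0.0.1:8010 --replay
```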
### `$ archivebox shell`
Drop into an ArchiveBox Django shell with access to all models and data.
```bash
$ archivebox shell
Loaded archive data folder ~/example_collection...
Python 3.7.2 (default, Feb 12 2019, 08:15:36)
In [1]: url_to_archive = Link.objects.filter(is_archived=True).values_list('url', flat=True)
...
```
### `$ archivebox oneshot`
Create a single URL archive folder with an index.json and index.html, and all the archive method outputs. You can run this to archive single pages without needing to create a whole collection with `archivebox init`.
#### `--out-dir=[path]`
Path to save the single archive folder to, e.g. `./example.com_archive`.
#### `[--all|--media|--wget|...]`
Which archive methods to use when saving the URL.
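A usage sketch based on the flags above (the output folder name is just an example):
```bash
# Archive one page into a standalone folder, without creating a full collection first
archivebox oneshot --all --out-dir=./example.com_archive 'https://example.com'
```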
## Python Usage
### API for normal ArchiveBox usage
```python
from archivebox import add, subscribe, update
add('https://example.com', depth=2)
subscribe('https://example.com/some/feed.rss')
update(only_new=True)
```
### API for All Useful Subcomponents
```python
from archivebox import oneshot
from archivebox.crawl import rss
from archivebox.extract import media
links = rss.crawl(open('feed.rss', 'r').read())
assets = media.extract('https://youtube.com/watch?v=example')
oneshot('https://example.com', depth=2, out_dir='~/Desktop/example.com_archive')
```
---
## Design
As of v0.4.0, ArchiveBox also writes the index to a `sqlite3` file using the Django ORM (in addition to the usual `json` and `html` formats, which aren't going away). To an end user it will still appear to be a single CLI application, and none of the Django complexity will be exposed. Django is used primarily because it allows safe migrations of the SQLite database: as the schema gets updated in the future, I don't want to break people's archives with every new version. It also lets the GUI server start with many safe defaults and share much of the same codebase with the CLI and library components, including maintaining the archive database and managing a worker pool.
(this is not set in stone, just a rough estimate)
There will be 3 primary use cases for ArchiveBox (CLI usage, library usage, and server mode), and all three will be served by the pip package.
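Because the main index described above is an ordinary SQLite file, it can also be inspected directly when debugging; a minimal sketch (it only lists whatever tables exist, since table and model names vary between versions):
```python
# Sketch: peek inside an ArchiveBox collection's SQLite index (run from the collection folder)
import sqlite3

with sqlite3.connect('index.sqlite3') as conn:
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    for (name,) in tables:
        print(name)  # e.g. django_migrations, auth_user, and the core link/snapshot table
```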
### `v0.5`: Remove live-updated JSON & HTML index in favor of `archivebox export`
- use SQLite as the main db and export staticfile indexes once at the *end* of the whole process instead of live-updating them during each extractor run (i.e. remove `patch_main_index`)
- create an `archivebox export` command
- create a public view to replace `index.html` / `old.html` for non-logged-in users
### `v0.6`: Code cleanup / refactor
- move config loading logic into settings.py
- move all the extractors into "plugin" style folders that register their own config
- right now, the paths of the extractor output are scattered all over the codebase, e.g. `output.pdf` (should be moved to constants at the top of the plugin config file)
- make `out_dir`, `link_dir`, and `extractor_dir` naming consistent across the codebase
- convert all `os.path` calls and raw string paths to `pathlib.Path` (see the short sketch after this list)
- simple CLI operation:
`archivebox.cli import add --depth=1 ./path/to/export.html` (similar to current `archivebox` CLI)
- use of individual components as a library:
`from archivebox.extract import screenshot` or `archivebox oneshot --screenshot ...`
- usage in server mode with a GUI to add/remove links and create exports:
`archivebox server`
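A minimal sketch of what the `os.path` → `pathlib` cleanup mentioned above might look like (the file name and helper are illustrative, not real ArchiveBox code):
```python
# Before (scattered raw strings + os.path):
#   out_file = os.path.join(link_dir, 'output.pdf')
#   size = os.path.getsize(out_file) if os.path.exists(out_file) else 0

# After (Path objects end to end):
from pathlib import Path

def output_size(link_dir: Path) -> int:
    out_file = link_dir / 'output.pdf'   # extractor output name kept as a constant elsewhere
    return out_file.stat().st_size if out_file.exists() else 0
```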
### `v0.7`: Schema improvements
- remove `timestamps` as primary keys in favor of hashes, UUIDs, or some other slug
- create a migration system for folder layout independent of the index (`mv` is atomic at the FS level, so we just need a `transaction.atomic(): move(oldpath, newpath); snap.data_dir = newpath; snap.save()`, sketched after this list)
- make `Tag` a real model with a `ManyToMany` relation to Snapshots
- allow multiple Snapshots of the same site over time, plus CLI / UI to manage them and a migration from the old-style `#2020-01-01` hack to proper versioned snapshots
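A rough sketch of the atomic folder-move idea from the list above (the `Snapshot` model and `data_dir` field are assumptions about the future schema):
```python
import shutil
from pathlib import Path
from django.db import transaction

def move_snapshot_dir(snapshot, new_dir: Path) -> None:
    """Move a snapshot's data folder and update its DB record together."""
    old_dir = Path(snapshot.data_dir)
    with transaction.atomic():
        shutil.move(str(old_dir), str(new_dir))  # a plain rename when staying on one filesystem
        snapshot.data_dir = str(new_dir)
        snapshot.save()
```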
### `v0.8`: Security
- Add CSRF/CSP/XSS protection to rendered archive pages
- Provide secure reverse proxy in front of archivebox server in docker-compose.yml
- Create UX flow for users to setup session cookies / auth for archiving private sites
- cookies for wget, curl, and other low-level commands (see the sketch after this list)
- localStorage, cookies, and IndexedDB setup for the Chrome-based archiving methods
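Under the hood, the low-level fetchers can already accept session cookies via existing flags; a sketch of what the planned private-site flow might pass through to them (the cookies.txt path is illustrative):
```bash
# wget and curl both accept a Netscape-format cookies file exported from a logged-in browser session
wget --load-cookies ./cookies.txt 'https://example.com/private/page'
curl --cookie ./cookies.txt 'https://example.com/private/page'
```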
### `v0.9`: Performance
- set up huey and break the archiving process into tasks on a queue that a worker pool executes (sketched below)
- set up pyppeteer2 to wrap Chrome so that it isn't opened/closed for every extractor
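A rough sketch of the huey-based queue idea (the queue filename and task name are illustrative; the real task breakdown is still to be decided):
```python
from huey import SqliteHuey

# One queue per collection; workers pull archiving jobs off it in the background
huey = SqliteHuey(filename='queue.sqlite3')

@huey.task()
def archive_snapshot(url: str) -> str:
    # ... run the extractor plugins for this URL here ...
    return url

# The CLI/server would enqueue work instead of archiving inline:
# archive_snapshot('https://example.com')
```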
## Dependencies
* django (required)
* sqlite (required)
* headless chrome (required)
* wget (required)
* redis (optional, for web GUI only)
* dramatiq (optional, for web GUI only)
### `v1.0`: Full headless browser control
- run user-scripts / extensions in the context of the page during archiving
- community userscripts for unrolling twitter threads, reddit threads, youtube comment sections, etc.
- pywb-based headless browser session recording and warc replay
- archive proxy support
- support sending upstream requests through an external proxy
- support for exposing a proxy that archives all downstream traffic
When launched in webserver mode, ArchiveBox will automatically spawn a pool of workers (dramatiq) sized to the number of available CPUs, used for crawling, archiving, and publishing.
...
When launched in CLI mode, it will use normal subprocesses for parallelism, without needing redis/dramatiq.
## Code Folder Layout
* archivebox/
* core/
* models.py
Archive = Dict[Page, Dict[Archiver, List[Asset]]] # A collection of archived pages
Crawl = List[Page] # list of links to add to an archive
Page # an archived page with unique url
Asset # a file archived from a page
* util.py
* settings.py
* crawl/
impl:
detect_crawlable(Import) -> bool
crawl(Import) -> List[Page]
* txt.py
* rss.py
* netscape.py
* pocket.py
* pinboard.py
* html.py
* extract/
impl:
detect_extractable(Page) -> bool
extract(Page) -> List[Asset]
* wget.py
* screenshot.py
* pdf.py
* dom.py
* youtubedl.py
* waybackmachine.py
* solana.py
* publish/
impl:
publish(Archive, output_format)
* html.py
* json.py
* csv.py
* sql.py
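A minimal sketch of what one `extract/` plugin module could look like under the `detect_extractable(Page) -> bool` / `extract(Page) -> List[Asset]` interface above (the `Page` attribute and return type here are stand-ins, not the real models):
```python
# extract/screenshot.py -- sketch of the plugin interface described above
from typing import List

def detect_extractable(page) -> bool:
    # Assumption: Page objects expose the original URL as page.url
    return page.url.startswith(('http://', 'https://'))

def extract(page) -> List[str]:
    output_path = 'screenshot.png'
    # ... invoke headless chrome here to write the screenshot next to the page's index ...
    return [output_path]   # file paths standing in for Asset objects
```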
### `v2.0`: Federated or distributed archiving + paid hosted service offering
- merkle tree for storing archive output subresource hashes
- DHT for assigning merkle tree hash:file shards to nodes
- tag system for tagging certain hashes with human-readable names, e.g. title, url, tags, filetype etc.
- distributed tag lookup system
## Collection Data Folder Layout
* ArchiveBox.conf
* database/
* sqlite.db
* archive/
* assets/\<hash>/
* logs/
* server.log
* crawl.log
* archive.log
## Exported Folder Layout
For publishing the archive as static html/json/csv/sql.
* index.html,json,csv,sql
* archive/
* \<timestamp>/
* index.html
* \<url>/
* index.html,json,csv,sql
* assets/
* hash.mp4
* hash.txt
* hash.mp3
---
The server will be runnable with docker / docker-compose as well:
```yaml
version: '3'
services:
archivebox:
image: archivebox
ports:
- '8098:80'
volumes:
- ./data/:/data
```
---