mirror of https://github.com/pirate/ArchiveBox.git synced 2025-08-22 06:03:23 +02:00

docs: Update usage page

Cristian
2020-07-21 13:52:27 -05:00
parent 2061184e3e
commit 899f696742
2 changed files with 97 additions and 51 deletions

@@ -2,30 +2,31 @@
## Overview
Running ArchiveBox with Docker allows you to manage it in a container without exposing it to the rest of your system. Usage with Docker is similar to usage of ArchiveBox normally, with a few small differences.
Make sure you have Docker installed and set up on your machine before following these instructions. If you don't already have Docker installed, follow the official install instructions for Linux, macOS, or Windows here: https://docs.docker.com/install/#supported-platforms.
<img src="https://i.imgur.com/qFAPRwC.png" width="20%" align="right">
- [Overview](#)
- [Docker Compose](#docker-compose) (recommended way)
  - [Setup](#setup)
  - [Usage](#usage)
  - [Accessing the data](#accessing-the-data)
  - [Configuration](#configuration)
- [Plain Docker](#docker)
  - [Setup](#setup-1)
  - [Usage](#usage-1)
  - [Accessing the data](#accessing-the-data-1)
  - [Configuration](#configuration-1)
**Official Docker Hub image:**
https://hub.docker.com/r/nikisweeting/archivebox
**Usage:**
```bash
echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox
echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox add
```
---
@@ -34,9 +35,10 @@ echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/ar
## Docker Compose
An example [`docker-compose.yml`](https://github.com/pirate/ArchiveBox/blob/master/docker-compose.yml) config with ArchiveBox and an Nginx server to serve the archive is included in the project root. You can edit it as you see fit, or just run it as it comes out-of-the-box.
Just make sure you have a Docker version that's [new enough](https://docs.docker.com/compose/compose-file/) to support `version: 3` format:
```bash
docker --version
Docker version 18.09.1, build 4c52b90 # must be >= 17.04.0
@@ -59,25 +61,29 @@ First, make sure you're `cd`'ed into the same folder as your `docker-compose.yml
To add new URLs, you can use docker-compose just like the normal `./archive` CLI.
**To add an individual link or list of links**, pass in URLs via stdin.
```bash
echo "https://example.com" | docker-compose exec -T archivebox /bin/archive
```
**To import links from a file** you can either `cat` the file and pass it via stdin like above, or move it into your data folder so that ArchiveBox can access it from within the container.
```bash
mv ~/Downloads/bookmarks.html data/sources/bookmarks.html
docker-compose exec archivebox /bin/archive /data/sources/bookmarks.html
```
**To pull in links from a feed or remote file**, pass the URL or path to the feed as an argument.
```bash
docker-compose exec archivebox /bin/archive https://example.com/some/feed.rss
```
Passing a URL as an argument here does not archive the specified URL; it downloads it and archives the links _inside_ of it, so only use it for RSS feeds or other _lists of links_ you want to add. To add an individual link you want to archive, use the instructions above and pass it via stdin instead of as an argument.
### Accessing the data
The outputted archive data is stored in `data/` (relative to the project root), or whatever folder path you specified in the `docker-compose.yml` `volumes:` section. Make sure the `data/` folder on the host has permissions initially set to `777` so that the ArchiveBox command is able to set it to the specified `OUTPUT_PERMISSIONS` config setting on the first run.
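As a minimal sketch of the permissions step above, assuming the default `./data` folder from the example `docker-compose.yml` (the `DATA_DIR` variable is only illustrative; substitute your own path if you changed the `volumes:` section):

```bash
# Assumed location of the data folder; adjust to match your volumes: section.
DATA_DIR="./data"

# Create the folder if it does not exist yet, then open up its permissions
# so ArchiveBox can apply OUTPUT_PERMISSIONS on the first run.
mkdir -p "$DATA_DIR"
chmod 777 "$DATA_DIR"

# Print the resulting mode string for a quick visual check.
ls -ld "$DATA_DIR" | cut -c1-10
```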
To access your archive, you can open `data/index.html` directly, or you can use the provided Nginx server running inside docker on [`http://127.0.0.1:8098`](http://127.0.0.1:8098).
@@ -88,6 +94,7 @@ ArchiveBox running with docker-compose accepts all the same environment variable
The recommended way to pass in config variables is to edit the `environment:` section in `docker-compose.yml` directly or add an `env_file: ./path/to/ArchiveBox.conf` line before `environment:` to import variables from an env file.
Example of adding config options to `docker-compose.yml`:
```yaml
...
@@ -105,7 +112,7 @@ services:
You can also specify an env file via CLI when running compose using `docker-compose --env-file=/path/to/config.env ...` although you must specify the variables in the `environment:` section that you want to have passed down to the ArchiveBox container from the passed env file.
If you want to access your archive server with HTTPS, put a reverse proxy like Nginx or Caddy in front of `http://127.0.0.1:8098` to do SSL termination. You can find many instructions to do this online if you search "SSL reverse proxy".
---
@@ -114,6 +121,7 @@ If you want to access your archive server with HTTPS, put a reverse proxy like N
### Setup
Fetch and run the ArchiveBox Docker image to create your initial archive.
```bash
echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox
```
@@ -124,7 +132,8 @@ Make sure the data folder you use host is either a new, uncreated path, or if it
### Usage
**To add a single URL to the archive** or a list of links from a file, pipe them in via stdin. This will archive each link passed in.
```bash
echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox
# or
@@ -132,27 +141,33 @@ cat bookmarks.html | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox
```
**To add a list of pages via feed URL or remote file,** pass the URL of the feed as an argument.
```bash
docker run -v ~/ArchiveBox:/data nikisweeting/archivebox /bin/archive 'https://example.com/some/rss/feed.xml'
```
Passing a URL as an argument here does not archive the specified URL; it downloads it and archives the links _inside_ of it, so only use it for RSS feeds or other _lists of links_ you want to add. To add an individual link, use the instructions above and pass it via stdin instead of as an argument.
### Accessing the data
#### Using a bind folder
Use the flag:
```bash
-v /full/path/to/folder/on/host:/data
```
This will use the folder `/full/path/to/folder/on/host` on your host to store the ArchiveBox output.
#### Using a named Docker data volume
```bash
docker volume create archivebox-data
```
Then use the flag:
```bash
-v archivebox-data:/data
```
@@ -161,6 +176,7 @@ You can mount your data volume using standard docker tools, or access the conten
`/var/lib/docker/volumes/archivebox-data/_data` (on most Linux systems)
On a Mac you'll have to enter the base Docker Linux VM first to access the volume data:
```bash
screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty
cd /var/lib/docker/volumes/archivebox-data/_data
@@ -171,11 +187,13 @@ cd /var/lib/docker/volumes/archivebox-data/_data
ArchiveBox in Docker accepts all the same environment variables as normal, see the list on the [[Configuration]] page.
To pass environment variables when running, you can use the env command.
```bash
echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox env FETCH_SCREENSHOT=False /bin/archive
```
Or you can create an `ArchiveBox.env` file (copy from the default `etc/ArchiveBox.conf.default`) and pass it in like so:
```bash
docker run -i -v ~/ArchiveBox:/data --env-file=ArchiveBox.env nikisweeting/archivebox
```
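A rough sketch of what such an env file could look like, using only options mentioned on these pages (`FETCH_SCREENSHOT`, `FETCH_MEDIA`, `MEDIA_TIMEOUT`); the values are illustrative, not recommendations. Docker's `--env-file` expects one `VAR=value` entry per line and treats lines starting with `#` as comments:

```bash
# Write an example env file; the values below are placeholders to adapt.
cat > ArchiveBox.env <<'EOF'
# Illustrative values only
FETCH_SCREENSHOT=False
FETCH_MEDIA=True
MEDIA_TIMEOUT=500
EOF

# Show the settings that would be passed (comments and blank lines skipped).
grep -v -e '^#' -e '^$' ArchiveBox.env
```

Values are left unquoted on purpose, since Docker passes anything after the `=` through literally, quotes included.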

@@ -1,63 +1,88 @@
# Usage
▶️ *Make sure the dependencies are [fully installed](https://github.com/pirate/ArchiveBox/wiki/Install) before running any ArchiveBox commands.*
▶️ _Make sure the dependencies are [fully installed](https://github.com/pirate/ArchiveBox/wiki/Install) before running any ArchiveBox commands._
**ArchiveBox API Reference:**
<img src="https://i.imgur.com/aQZZcku.png" width="20%" align="right"/>
- [Overview](#Overview): Program structure and outline of basic archiving process.
- [CLI Usage](#CLI-Usage): Docs and examples for the ArchiveBox command line interface.
- [UI Usage](#UI-Usage): Docs and screenshots for the outputted HTML archive interface.
- [Disk Layout](#Disk-Layout): Description of the archive folder structure and contents.
**Related:**
- [[Docker]]: Learn about ArchiveBox usage with Docker and Docker Compose
- [[Configuration]]: Learn about the various archive method options
- [[Scheduled Archiving]]: Learn how to set up automatic daily archiving
- [[Publishing Your Archive]]: Learn how to host your archive for others to access
- [[Troubleshooting]]: Resources if you encounter any problems
- [Screenshots](https://github.com/pirate/ArchiveBox#Screenshots): See what the CLI and outputted HTML look like
## Overview
The `./archive` binary is a shortcut to `bin/archivebox`. Piping RSS, JSON, [Netscape](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), or TXT lists of links into the `./archive` command will add them to your archive folder, and create a locally-stored browsable archive for each new URL.
The `./archive` binary is a shortcut to `bin/archivebox`.
You can create/initialize an archive folder anywhere you want. You can do this with:
The archiver produces an [output folder](#Disk-Layout) `output/` containing `index.html`, `index.json`, and archived copies of all the sites organized by timestamp bookmarked. It's powered by [Chrome headless](https://developers.google.com/web/updates/2017/04/headless-chrome), good ol' `wget`, and a few other common Unix tools.
```bash
archivebox init
```
You can add URLs using `stdin` or `args` along with the `add` command:
```bash
archivebox add https://example.com
archivebox add https://my-rss-feed.com --depth=1
```
Passing RSS, JSON, [Netscape](<https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx>), or TXT lists of links into the `./archive add` command will add them to your archive folder, and create a locally-stored browsable archive for each new URL.
The archiver will create new files for the index and the archived copies of all the sites organized by timestamp. It's powered by [Chrome headless](https://developers.google.com/web/updates/2017/04/headless-chrome), good ol' `wget`, and a few other common Unix tools.
## CLI Usage
<img src="https://i.imgur.com/biVfFYr.png" width="30%" align="right">
`./archive` refers to the executable shortcut in the root of the project, but you can also call ArchiveBox via `./bin/archivebox`. If you add `/path/to/ArchiveBox/bin` to your shell `$PATH` then you can call `archivebox` from anywhere on your system.
`archivebox` refers to the executable that is available when you install this project using pip.
If you're using Docker, the CLI interface is similar but needs to be prefixed by `docker-compose exec ...` or `docker run ...`, for examples see the [[Docker]] page.
- [Run ArchiveBox with configuration options](#Run-ArchiveBox-with-configuration-options)
- [Import a single URL or list of URLs via stdin](#Import-a-single-URL-or-list-of-URLs-via-stdin)
- [Import list of links exported from browser or another service](#Import-list-of-links-exported-from-browser-or-another-service)
- [Import list of URLs from a remote RSS feed or file](#Import-list-of-URLs-from-a-remote-RSS-feed-or-file)
- [Import list of links from browser history](#Import-list-of-links-from-browser-history)
---
### Run ArchiveBox with configuration options
You can set environment variables in your shell profile, a config file, or by using the `env` command.
```bash
env FETCH_MEDIA=True MEDIA_TIMEOUT=500 ./archive ...
```
See [[Configuration]] page for more details about the available options and ways to pass config.
If you're using Docker, also make sure to read the Configuration section on the [[Docker]] page.
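To illustrate how the `env` approach scopes settings to a single invocation, here is a small sketch in which `sh -c` stands in for the ArchiveBox command (the `PRINT_CONFIG` helper is hypothetical and only echoes what the child process sees):

```bash
# Start from a clean slate so the comparison is meaningful.
unset FETCH_MEDIA MEDIA_TIMEOUT

# Stand-in for the archiver: prints the config values visible to the process.
PRINT_CONFIG='echo "FETCH_MEDIA=${FETCH_MEDIA:-unset} MEDIA_TIMEOUT=${MEDIA_TIMEOUT:-unset}"'

# Variables passed through env apply only to that one command...
scoped=$(env FETCH_MEDIA=True MEDIA_TIMEOUT=500 sh -c "$PRINT_CONFIG")

# ...and are not set for later commands in the same shell.
after=$(sh -c "$PRINT_CONFIG")

echo "with env:    $scoped"
echo "without env: $after"
```

Settings exported from a shell profile, by contrast, would be visible to every later command in that shell.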
---
### Import a single URL or list of URLs via stdin
### Import a single URL
```bash
echo 'https://example.com' | ./archive
echo 'https://example.com' | archivebox add
# or
cat urls_to_archive.txt | ./archive
archivebox add https://example.com
```
### Import a list of URLs from a txt file
```bash
cat urls_to_archive.txt | archivebox add
```
You can also pipe in RSS, XML, Netscape, or any of the other supported import formats via stdin.
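As an optional tidy-up before importing, a plain-text list can be cleaned first. This is only a sketch: the sample URLs and the `.raw` file name are made up for illustration, and whether ArchiveBox needs the list deduplicated beforehand is not stated here.

```bash
URL_LIST="urls_to_archive.txt"

# Build a sample list; a blank line and a repeated URL are included on purpose.
printf '%s\n' \
    'https://example.com' \
    '' \
    'https://example.com/some/feed.rss' \
    'https://example.com' > "$URL_LIST.raw"

# Drop blank lines and duplicates while keeping the original order.
awk 'NF && !seen[$0]++' "$URL_LIST.raw" > "$URL_LIST"

cat "$URL_LIST"

# To import the cleaned list:
#   cat "$URL_LIST" | archivebox add
```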
---
@@ -72,24 +97,26 @@ You can also pipe in RSS, XML, Netscape, or any of the other supported import fo
./archive ~/Downloads/other_links.txt
```
Passing a file as an argument here does not archive the file; it parses it as a list of URLs and archives the links _inside of it_, so only use it for _lists of links_ to archive, not HTML files or other content you want added directly to the archive.
---
### Import list of URLs from a remote RSS feed or file
ArchiveBox will download the URL to a local file in `output/sources/` and attempt to autodetect the format and import any URLs found. Currently, Netscape HTML, JSON, RSS, and plain text links lists are supported.
```bash
./archive https://example.com/feed.rss
echo https://my-rss-feed | archivebox add --depth=1
# or
./archive https://example.com/links.txt
archivebox add https://my-rss-feed --depth=1
```
Passing a URL as an argument here does not archive the specified URL, it downloads it and archives the links *inside* of it, so only use it for RSS feeds or other *lists of links* you want to add. To add an individual link use the instruction above and pass the URL via stdin instead of as an argument.
Passing a URL as an argument here with `--depth=1` does archive the specified URL: it downloads it and also archives the links _inside_ of it, so you can use it for RSS feeds or other _lists of links_ you want to add.
---
### Import list of links from browser history
```bash
./bin/archivebox-export-browser-history --chrome
./archive output/sources/chrome_history.json
@@ -102,11 +129,11 @@ Passing a URL as an argument here does not archive the specified URL, it downloa
## UI Usage
To access your archive, open `output/index.html` in a browser. You should see something [like this](https://archive.sweeting.me).
You can sort by column, search using the box in the upper right, and see the total number of links at the bottom.
Click the Favicon under the "Files" column to go to the details page for each link.
<div align="center">
<img src="https://i.imgur.com/52RjhUM.png" width="45%">
@@ -128,7 +155,7 @@ The `output/` folder containing the UI HTML and archived data has the structure
- index.html
# Archive method outputs:
- warc/
- media/
- git/
...
@@ -145,17 +172,19 @@ The `output/` folder containing the UI HTML and archived data has the structure
### Large Archives
I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
Storage requirements go up immensely if you're using `FETCH_MEDIA=True` and are archiving many pages with audio & video.
You can run it in parallel by using the `resume` feature, or by manually splitting export.html into multiple files:
```bash
./archive export.html 1498800000 & # second argument is timestamp to resume downloading from
./archive export.html 1498810000 &
./archive export.html 1498820000 &
./archive export.html 1498830000 &
```
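For the manual-splitting route, a sketch along these lines may help. It assumes a plain-text list with one URL per line (splitting by line would not be safe for a Netscape HTML export), and the file names `links.txt` and `links_part_` are hypothetical:

```bash
# Sample plain-text list standing in for a real export.
printf 'https://example.com/%s\n' 1 2 3 4 5 6 7 8 > links.txt

# Split into chunks of 2 lines each: links_part_aa, links_part_ab, ...
split -l 2 links.txt links_part_

# List the chunks that were produced.
ls links_part_*

# Each chunk could then be archived in the background, for example:
#   for part in links_part_*; do ./archive "$part" & done; wait
```

Pick the chunk size to match how many parallel processes your machine and connection can comfortably handle.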
Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running).
If you already imported a huge list of bookmarks and want to import only new
@@ -163,7 +192,6 @@ bookmarks, you can use the `ONLY_NEW` environment variable. This is useful if
you want to import a bookmark dump periodically and want to skip broken links
which are already in the index.
## Python API Usage
```python