From 899f6967424f7072ca4a77d7b2ac4636bd794deb Mon Sep 17 00:00:00 2001 From: Cristian Date: Tue, 21 Jul 2020 13:52:27 -0500 Subject: [PATCH] docs: Update usage page --- Docker.md | 56 +++++++++++++++++++++------------ Usage.md | 92 ++++++++++++++++++++++++++++++++++++------------------- 2 files changed, 97 insertions(+), 51 deletions(-) diff --git a/Docker.md b/Docker.md index 4d03ccf..01c0dac 100644 --- a/Docker.md +++ b/Docker.md @@ -2,30 +2,31 @@ ## Overview -Running ArchiveBox with Docker allows you to manage it in a container without exposing it to the rest of your system. Usage with Docker is similar to usage of ArchiveBox normally, with a few small differences. +Running ArchiveBox with Docker allows you to manage it in a container without exposing it to the rest of your system. Usage with Docker is similar to usage of ArchiveBox normally, with a few small differences. -Make sure you have Docker installed and set up on your machine before following these instructions. If you don't already have Docker installed, follow the official install instructions for Linux, macOS, or Windows here: https://docs.docker.com/install/#supported-platforms. +Make sure you have Docker installed and set up on your machine before following these instructions. If you don't already have Docker installed, follow the official install instructions for Linux, macOS, or Windows here: https://docs.docker.com/install/#supported-platforms. 
- + - [Overview](#) - [Docker Compose](#docker-compose) (recommended way) - + [Setup](#setup) - + [Usage](#usage) - + [Accessing the data](#accessing-the-data) - + [Configuration](#configuration) + - [Setup](#setup) + - [Usage](#usage) + - [Accessing the data](#accessing-the-data) + - [Configuration](#configuration) - [Plain Docker](#docker) - + [Setup](#setup-1) - + [Usage](#usage-1) - + [Accessing the data](#accessing-the-data-1) - + [Configuration](#configuration-1) + - [Setup](#setup-1) + - [Usage](#usage-1) + - [Accessing the data](#accessing-the-data-1) + - [Configuration](#configuration-1) **Official Docker Hub image:** https://hub.docker.com/r/nikisweeting/archivebox **Usage:** + ```bash -echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox +echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox add ``` --- @@ -34,9 +35,10 @@ echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/ar ## Docker Compose -An example [`docker-compose.yml`](https://github.com/pirate/ArchiveBox/blob/master/docker-compose.yml) config with ArchiveBox and an Nginx server to serve the archive is included in the project root. You can edit it as you see fit, or just run it as it comes out-of-the-box. +An example [`docker-compose.yml`](https://github.com/pirate/ArchiveBox/blob/master/docker-compose.yml) config with ArchiveBox and an Nginx server to serve the archive is included in the project root. You can edit it as you see fit, or just run it as it comes out-of-the-box. Just make sure you have a Docker version that's [new enough](https://docs.docker.com/compose/compose-file/) to support `version: 3` format: + ```bash docker --version Docker version 18.09.1, build 4c52b90 # must be >= 17.04.0 @@ -59,25 +61,29 @@ First, make sure you're `cd`'ed into the same folder as your `docker-compose.yml To add new URLs, you can use docker-compose just like the normal `./archive` CLI. 
**To add an individual link or list of links**, pass in URLs via stdin. + ```bash echo "https://example.com" | docker-compose exec -T archivebox /bin/archive ``` **To import links from a file** you can either `cat` the file and pass it via stdin like above, or move it into your data folder so that ArchiveBox can access it from within the container. + ```bash mv ~/Downloads/bookmarks.html data/sources/bookmarks.html docker-compose exec archivebox /bin/archive /data/sources/bookmarks.html ``` **To pull in links from a feed or remote file**, pass the URL or path to the feed as an argument. + ```bash docker-compose exec archivebox /bin/archive https://example.com/some/feed.rss ``` -Passing a URL as an argument here does not archive the specified URL, it downloads it and archives the links *inside* of it, so only use it for RSS feeds or other *lists of links* you want to add. To add an individual link you want to archive use the instruction above and pass via stdin instead of by argument. + +Passing a URL as an argument here does not archive the specified URL, it downloads it and archives the links _inside_ of it, so only use it for RSS feeds or other _lists of links_ you want to add. To add an individual link you want to archive use the instruction above and pass via stdin instead of by argument. ### Accessing the data -The outputted archive data is stored in `data/` (relative to the project root), or whatever folder path you specified in the `docker-compose.yml` `volumes:` section. Make sure the `data/` folder on the host has permissions initially set to `777` so that the ArchiveBox command is able to set it to the specified `OUTPUT_PERMISSIONS` config setting on the first run. +The outputted archive data is stored in `data/` (relative to the project root), or whatever folder path you specified in the `docker-compose.yml` `volumes:` section. 
Make sure the `data/` folder on the host has permissions initially set to `777` so that the ArchiveBox command is able to set it to the specified `OUTPUT_PERMISSIONS` config setting on the first run. To access your archive, you can open `data/index.html` directly, or you can use the provided Nginx server running inside docker on [`http://127.0.0.1:8098`](http://127.0.0.1:8098). @@ -88,6 +94,7 @@ ArchiveBox running with docker-compose accepts all the same environment variable The recommended way to pass in config variables is to edit the `environment:` section in `docker-compose.yml` directly or add an `env_file: ./path/to/ArchiveBox.conf` line before `environment:` to import variables from an env file. Example of adding config options to `docker-compose.yml`: + ```yaml ... @@ -105,7 +112,7 @@ services: You can also specify an env file via CLI when running compose using `docker-compose --env-file=/path/to/config.env ...` although you must specify the variables in the `environment:` section that you want to have passed down to the ArchiveBox container from the passed env file. -If you want to access your archive server with HTTPS, put a reverse proxy like Nginx or Caddy in front of `http://127.0.0.1:8098` to do SSL termination. You can find many instructions to do this online if you search "SSL reverse proxy". +If you want to access your archive server with HTTPS, put a reverse proxy like Nginx or Caddy in front of `http://127.0.0.1:8098` to do SSL termination. You can find many instructions to do this online if you search "SSL reverse proxy". --- @@ -114,6 +121,7 @@ If you want to access your archive server with HTTPS, put a reverse proxy like N ### Setup Fetch and run the ArchiveBox Docker image to create your initial archive. 
+ ```bash echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox ``` @@ -124,7 +132,8 @@ Make sure the data folder you use host is either a new, uncreated path, or if it ### Usage -**To add a single URL to the archive** or a list of links from a file, pipe them in via stdin. This will archive each link passed in. +**To add a single URL to the archive** or a list of links from a file, pipe them in via stdin. This will archive each link passed in. + ```bash echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox # or @@ -132,27 +141,33 @@ cat bookmarks.html | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox ``` **To add a list of pages via feed URL or remote file,** pass the URL of the feed as an argument. + ```bash docker run -v -v ~/ArchiveBox:/data nikisweeting/archivebox /bin/archive 'https://example.com/some/rss/feed.xml' ``` -Passing a URL as an argument here does not archive the specified URL, it downloads it and archives the links *inside* of it, so only use it for RSS feeds or other *lists of links* you want to add. To add an individual link use the instruction above and pass via stdin instead of by argument. + +Passing a URL as an argument here does not archive the specified URL, it downloads it and archives the links _inside_ of it, so only use it for RSS feeds or other _lists of links_ you want to add. To add an individual link use the instruction above and pass via stdin instead of by argument. ### Accessing the data #### Using a bind folder Use the flag: + ```bash -v /full/path/to/folder/on/host:/data ``` + This will use the folder `/full/path/to/folder/on/host` on your host to store the ArchiveBox output. 
-#### Using a named Docker data volume +#### Using a named Docker data volume ```bash docker volume create archivebox-data ``` + Then use the flag: + ```bash -v archivebox-data:/data ``` @@ -161,6 +176,7 @@ You can mount your data volume using standard docker tools, or access the conten `/var/lib/docker/volumes/archivebox-data/_data` (on most Linux systems) On a Mac you'll have to enter the base Docker Linux VM first to access the volume data: + ```bash screen ~/Library/Containers/com.docker.docker/Data/vms/0/tty cd /var/lib/docker/volumes/archivebox-data/_data @@ -171,11 +187,13 @@ cd /var/lib/docker/volumes/archivebox-data/_data ArchiveBox in Docker accepts all the same environment variables as normal, see the list on the [[Configuration]] page. To pass environment variables when running, you can use the env command. + ```bash echo 'https://example.com' | docker run -i -v ~/ArchiveBox:/data nikisweeting/archivebox env FETCH_SCREENSHOT=False /bin/archive ``` Or you can create an `ArchiveBox.env` file (copy from the default `etc/ArchiveBox.conf.default`) and pass it in like so: + ```bash docker run -i -v --env-file=ArchiveBox.env nikisweeting/archivebox ``` diff --git a/Usage.md b/Usage.md index daaa601..9952e27 100644 --- a/Usage.md +++ b/Usage.md @@ -1,63 +1,88 @@ # Usage -▶️ *Make sure the dependencies are [fully installed](https://github.com/pirate/ArchiveBox/wiki/Install) before running any ArchiveBox commands.* +▶️ _Make sure the dependencies are [fully installed](https://github.com/pirate/ArchiveBox/wiki/Install) before running any ArchiveBox commands._ **ArchiveBox API Reference:** - + - - [Overview](#Overview): Program structure and outline of basic archiving process. - - [CLI Usage](#CLI-Usage): Docs and examples for the ArchiveBox command line interface. - - [UI Usage](#UI-Usage): Docs and screenshots for the outputted HTML archive interface. - - [Disk Layout](#Disk-Layout): Description of the archive folder structure and contents. 
+- [Overview](#Overview): Program structure and outline of basic archiving process. +- [CLI Usage](#CLI-Usage): Docs and examples for the ArchiveBox command line interface. +- [UI Usage](#UI-Usage): Docs and screenshots for the outputted HTML archive interface. +- [Disk Layout](#Disk-Layout): Description of the archive folder structure and contents. **Related:** - - [[Docker]]: Learn about ArchiveBox usage with Docker and Docker Compose - - [[Configuration]]: Learn about the various archive method options - - [[Scheduled Archiving]]: Learn how to set up automatic daily archiving - - [[Publishing Your Archive]]: Learn how to host your archive for others to access - - [[Troubleshooting]]: Resources if you encounter any problems - - [Screenshots](https://github.com/pirate/ArchiveBox#Screenshots): See what the CLI and outputted HTML look like + +- [[Docker]]: Learn about ArchiveBox usage with Docker and Docker Compose +- [[Configuration]]: Learn about the various archive method options +- [[Scheduled Archiving]]: Learn how to set up automatic daily archiving +- [[Publishing Your Archive]]: Learn how to host your archive for others to access +- [[Troubleshooting]]: Resources if you encounter any problems +- [Screenshots](https://github.com/pirate/ArchiveBox#Screenshots): See what the CLI and outputted HTML look like ## Overview -The `./archive` binary is a shortcut to `bin/archivebox`. Piping RSS, JSON, [Netscape](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), or TXT lists of links into the `./archive` command will add them to your archive folder, and create a locally-stored browsable archive for each new URL. +The `./archive` binary is a shortcut to `bin/archivebox`. +You can create/initialize an archive folder anywhere you want. You can do this with: -The archiver produces an [output folder](#Disk-Layout) `output/` containing `index.html`, `index.json`, and archived copies of all the sites organized by timestamp bookmarked.
It's powered by [Chrome headless](https://developers.google.com/web/updates/2017/04/headless-chrome), good 'ol `wget`, and a few other common Unix tools. +```bash +archivebox init +``` + +You can add URLs using `stdin` or `args` along with the `add` command: + +```bash +archivebox add https://example.com +archivebox add https://my-rss-feed.com --depth=1 +``` + +Passing RSS, JSON, [Netscape](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), or TXT lists of links into the `./archive add` command will add them to your archive folder, and create a locally-stored browsable archive for each new URL. + +The archiver will create new files for the index and the archived copies of all the sites organized by timestamp. It's powered by [Chrome headless](https://developers.google.com/web/updates/2017/04/headless-chrome), good 'ol `wget`, and a few other common Unix tools. ## CLI Usage -`./archive` refers to the executable shortcut in the root of the project, but you can also call ArchiveBox via `./bin/archivebox`. If you add `/path/to/ArchiveBox/bin` to your shell `$PATH` then you can call `archivebox` from anywhere on your system. +`archivebox` refers to the executable that is available when you install this project using pip. If you're using Docker, the CLI interface is similar but needs to be prefixed by `docker-compose exec ...` or `docker run ...`, for examples see the [[Docker]] page.
- - [Run ArchiveBox with configuration options](#Run-ArchiveBox-with-configuration-options) - - [Import a single URL or list of URLs via stdin](#Import-a-single-URL-or-list-of-URLs-via-stdin) - - [Import list of links exported from browser or another service](#Import-list-of-links-exported-from-browser-or-another-service) - - [Import list of URLs from a remote RSS feed or file](#Import-list-of-URLs-from-a-remote-RSS-feed-or-file) - - [Import list of links from browser history](#Import-list-of-links-from-browser-history) +- [Run ArchiveBox with configuration options](#Run-ArchiveBox-with-configuration-options) +- [Import a single URL or list of URLs via stdin](#Import-a-single-URL-or-list-of-URLs-via-stdin) +- [Import list of links exported from browser or another service](#Import-list-of-links-exported-from-browser-or-another-service) +- [Import list of URLs from a remote RSS feed or file](#Import-list-of-URLs-from-a-remote-RSS-feed-or-file) +- [Import list of links from browser history](#Import-list-of-links-from-browser-history) --- ### Run ArchiveBox with configuration options + You can set environment variables in your shell profile, a config file, or by using the `env` command. ```bash env FETCH_MEDIA=True MEDIA_TIMEOUT=500 ./archive ... ``` + See [[Configuration]] page for more details about the available options and ways to pass config. If you're using Docker, also make sure to read the Configuration section on the [[Docker]] page. --- -### Import a single URL or list of URLs via stdin +### Import a single URL + ```bash -echo 'https://example.com' | ./archive +echo 'https://example.com' | archivebox add # or -cat urls_to_archive.txt | ./archive +archivebox add https://example.com ``` + +### Import a list of URLs from a txt file + +```bash +cat urls_to_archive.txt | archivebox add +``` + You can also pipe in RSS, XML, Netscape, or any of the other supported import formats via stdin. 
--- @@ -72,24 +97,26 @@ You can also pipe in RSS, XML, Netscape, or any of the other supported import fo ./archive ~/Downloads/other_links.txt ``` -Passing a file as an argument here does not archive the file, it parses it as a list of URLs and archives the links *inside of it*, so only use it for *lists of links* to archive, not HTML files or other content you want added directy to the archive. +Passing a file as an argument here does not archive the file, it parses it as a list of URLs and archives the links _inside of it_, so only use it for _lists of links_ to archive, not HTML files or other content you want added directly to the archive. --- ### Import list of URLs from a remote RSS feed or file + ArchiveBox will download the URL to a local file in `output/sources/` and attempt to autodetect the format and import any URLs found. Currently, Netscape HTML, JSON, RSS, and plain text links lists are supported. ```bash -./archive https://example.com/feed.rss +echo https://my-rss-feed | archivebox add --depth=1 # or -./archive https://example.com/links.txt +archivebox add https://my-rss-feed --depth=1 ``` -Passing a URL as an argument here does not archive the specified URL, it downloads it and archives the links *inside* of it, so only use it for RSS feeds or other *lists of links* you want to add. To add an individual link use the instruction above and pass the URL via stdin instead of as an argument. +Passing a URL as an argument here does archive the specified URL, it downloads it and archives the links _inside_ of it, so you can use it for RSS feeds or other _lists of links_ you want to add. --- ### Import list of links from browser history + ```bash ./bin/archivebox-export-browser-history --chrome ./archive output/sources/chrome_history.json @@ -102,11 +129,11 @@ Passing a URL as an argument here does not archive the specified URL, it downloa ## UI Usage -To access your archive, open `output/index.html` in a browser.
You should see something [like this](https://archive.sweeting.me). +To access your archive, open `output/index.html` in a browser. You should see something [like this](https://archive.sweeting.me). You can sort by column, search using the box in the upper right, and see the total number of links at the bottom. -Click the Favicon under the "Files" column to go to the details page for each link. +Click the Favicon under the "Files" column to go to the details page for each link.
@@ -128,7 +155,7 @@ The `output/` folder containing the UI HTML and archived data has the structure - index.html # Archive method outputs: - - warc/ + - warc/ - media/ - git/ ... @@ -145,17 +172,19 @@ The `output/` folder containing the UI HTML and archived data has the structure ### Large Archives I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB. -Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV. +Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV. Storage requirements go up immensely if you're using `FETCH_MEDIA=True` and are archiving many pages with audio & video. You can run it in parallel by using the `resume` feature, or by manually splitting export.html into multiple files: + ```bash ./archive export.html 1498800000 & # second argument is timestamp to resume downloading from ./archive export.html 1498810000 & ./archive export.html 1498820000 & ./archive export.html 1498830000 & ``` + Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running). If you already imported a huge list of bookmarks and want to import only new @@ -163,7 +192,6 @@ bookmarks, you can use the `ONLY_NEW` environment variable. This is useful if you want to import a bookmark dump periodically and want to skip broken links which are already in the index. - ## Python API Usage ```python