diff --git a/Configuration.md b/Configuration.md
index 6efa3a1..07d66b9 100644
--- a/Configuration.md
+++ b/Configuration.md
@@ -15,60 +15,150 @@ env CHROME_BINARY=google-chrome-stable RESOLUTION=1440,900 FETCH_PDF=False ./arc
 As defined in [`archivebox/config.py`](https://github.com/pirate/ArchiveBox/blob/master/archivebox/config.py) and [`etc/ArchiveBox.conf.default`](https://github.com/pirate/ArchiveBox/blob/master/etc/ArchiveBox.conf.default).
-*Documentation format:*
- - description of setting: `ENVIRONMENT_VARIABLE_NAME`: [`default value`]/`example value`/...
-
 ### Shell Options
 #### `USE_COLOR`
+[`True`]/`False`
 Colorize console ouput.
-[`True`]/`False`
- - colorize console ouput: `USE_COLOR`=[`True`]/`False`
- - show progress bar: `SHOW_PROGRESS` value: [`True`]/`False`
- - archive permissions: `OUTPUT_PERMISSIONS` values: [`755`]/`644`/`...`
+#### `SHOW_PROGRESS`
+[`True`]/`False`
+Show real-time progress bar in console output.
 ### Dependency Options
- - path to Chrome: `CHROME_BINARY` values: [`chromium-browser`]/`/usr/local/bin/google-chrome`/`...`
- - path to wget: `WGET_BINARY` values: [`wget`]/`/usr/local/bin/wget`/`...`
-### Archive Settings
- - output directory: `OUTPUT_DIR` values: [`$REPO_DIR/output`]/`/srv/www/bookmarks`/`...` Optionally output the archives to an alternative directory.
- - maximum allowed download time per link: `TIMEOUT` values: [`60`]/`30`/`...`
- - maximum allowed download time per media file: `MEDIA_TIMEOUT` values: [`3600`]/`120`/`...`
- - import only new links: `ONLY_NEW` values `True`/[`False`]
+#### `CHROME_BINARY`
+[`chromium-browser`]/`/usr/local/bin/google-chrome`/...
+Path or name of the Google Chrome / Chromium binary to use for all the headless browser archive methods.
-### Archive Method Toggles
-Possible values: [`True`]/`False`
+Without setting this environment variable, ArchiveBox by default looks for the following binaries in `$PATH` in this order:
+ - `chromium-browser`
+ - `chromium`
+ - `google-chrome`
+ - `google-chrome-stable`
+ - `google-chrome-unstable`
+ - `google-chrome-beta`
+ - `google-chrome-canary`
+ - `google-chrome-dev`
- - fetch page with wget: `FETCH_WGET`
- - fetch images/css/js with wget: `FETCH_WGET_REQUISITES` (True is highly recommended)
- - print page as PDF: `FETCH_PDF`
- - fetch a screenshot of the page: `FETCH_SCREENSHOT`
- - fetch a DOM dump of the page: `FETCH_DOM`
- - fetch git repositories on the page: `FETCH_GIT`
- - fetch a WARC dump of the page: `FETCH_WARC`
- - fetch all audio and video on the page: `FETCH_MEDIA`
- - fetch a DOM dump of the page: `FETCH_DOM`
- - fetch a favicon for the page: `FETCH_FAVICON`
- - fetch and parse the title tag from html: `FETCH_TITLE`
- - submit the page to archive.org: `SUBMIT_ARCHIVE_DOT_ORG`
-
-### Archive Method Options
- - screenshot: `RESOLUTION` values: [`1440,900`]/`1024,768`/`...`
- - user agent: `WGET_USER_AGENT` values: [`Wget/1.19.1`]/`"Mozilla/5.0 ..."`/`...`
- - git domains: `GIT_DOMAINS` values: [`github.com,bitbucket.org,gitlab.com`]/`git.example.com`/`...`
- - cookies file: `COOKIES_FILE` values: [`None`]/`/path/to/cookies.txt`/`...`
- To capture sites that require a user to be logged in, you can specify a path to a [netscape-format](http://www.cookiecentral.com/faq/#3.5) `cookies.txt` file for wget to use. You can generate this file by using a browser extension to export your cookies in this format, or by using wget with `--save-cookies`.
- - chrome profile: `CHROME_USER_DATA_DIR` values: [`~/Library/Application\ Support/Google/Chrome/Default`]/`/tmp/chrome-profile`/`...`
- To capture sites that require a user to be logged in, you can specify a path to a chrome user profile (which loads the cookies needed for the user to be logged in). If you don't have an existing chrome profile, create one with `chromium-browser --disable-gpu --user-data-dir=/tmp/chrome-profile`, and log into the sites you need. Then set `CHROME_USER_DATA_DIR=/tmp/chrome-profile` to make ArchiveBox use that profile.
-
- (See defaults & more at the top of `config.py`)
-
-To tweak the outputted html index file's look and feel, just edit the HTML files in `archiver/templates/`.
+You can override the default behavior of searching for any available binary by setting the environment variable to your preferred Chrome binary name or path. The chrome/chromium dependency is _optional_ and only required for screenshots, PDF, and DOM dump output; it can be safely ignored if those three methods are disabled.
+#### `WGET_BINARY`
+[`wget`]/`/usr/local/bin/wget`/...
+Path or name of the wget binary to use.
+
+### Archive Settings
+
+#### `OUTPUT_DIR`
+[`$REPO_DIR/output`]/`/srv/www/bookmarks`/...
+Path to an output folder to store the archive in. Defaults to `output/` in the root directory of the repository.
+
+#### `OUTPUT_PERMISSIONS`
+[`755`]/`644`/...
+Permissions to set the output directory and file contents to.
+
+#### `ONLY_NEW`
+[`False`]/`True`
+Download files for only newly added links when running the `./archive` command. By default ArchiveBox will go through all links in the index and download any missing files on every run. Set this to `True` to only archive the fresh links added during this run without attempting to also update older archived links.
+
+#### `TIMEOUT`
+[`60`]/`30`/...
+Maximum allowed download time, in seconds, per archive method for each link. If you have a slow network connection or are seeing frequent timeout errors, you can raise this value.
+
+#### `MEDIA_TIMEOUT`
+[`3600`]/`120`/...
+Maximum allowed download time, in seconds, for fetching media when `FETCH_MEDIA=True`. This timeout is separate and usually much longer than `TIMEOUT` because media downloaded with `youtube-dl` can often be quite large and take many minutes/hours to download. Tweak this setting based on your network speed and maximum media file size you plan on downloading.
+
+#### `TEMPLATES_DIR`
+[`$REPO_DIR/archivebox/templates`]/`/path/to/custom/templates`/...
+Path to a directory containing custom index html templates for theming your archive output. The folder at the specified path must contain the following files:
+ - `static/`
+ - `index.html`
+ - `link_index.html`
+ - `index_row.html`
+
+You can copy the files in `archivebox/templates` into your own directory to start developing a custom theme, then edit `TEMPLATES_DIR` to point to your new custom templates directory.
+
+#### `FOOTER_INFO`
+[`Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.`]/`Operated by ACME Co.`/...
+Some text to display in the footer of the archive index. Useful for providing server admin contact info to respond to takedown requests.
+
+### Archive Method Toggles
+
+#### `FETCH_TITLE`
+[`True`]/`False`
+Fetch the page HTML and attempt to parse the links' title from any `