mirror of https://github.com/pirate/ArchiveBox.git synced 2025-08-17 03:54:08 +02:00

Updated Configuration (markdown)

Nick Sweeting
2021-04-23 21:12:52 -04:00
parent eaa6d464b7
commit 103d583ae5

@@ -39,15 +39,6 @@ In case this document is ever out of date, it's recommended to read the code tha
*General options around the archiving process, output format, and timing.* *General options around the archiving process, output format, and timing.*
---
#### `OUTPUT_DIR`
**Possible Values:** [`.`]/`~/archivebox`/...
Path to an output folder to store the archive in.
Defaults to the current folder you're in `./` (`$PWD`) when you run the `archivebox` command.
*Note: make sure the user running ArchiveBox has permissions set to allow writing to this folder!*
--- ---
#### `OUTPUT_PERMISSIONS` #### `OUTPUT_PERMISSIONS`
**Possible Values:** [`755`]/`644`/... **Possible Values:** [`755`]/`644`/...
@@ -82,13 +73,24 @@ Maximum allowed download time for fetching media when `SAVE_MEDIA=True` in secon
[`SAVE_MEDIA`](#save_media) [`SAVE_MEDIA`](#save_media)
--- ---
#### `TEMPLATES_DIR` #### `CUSTOM_TEMPLATES_DIR`
**Possible Values:** [`$REPO_DIR/archivebox/templates`]/`/path/to/custom/templates`/... **Possible Values:** [`None`]/`./path/to/custom_templates`/...
Path to a directory containing custom index html templates for theming your archive output. Files found in the folder at the specified path can override any of the defaults in the [`archivebox/themes`](https://github.com/ArchiveBox/ArchiveBox/tree/master/archivebox/themes) directory. If you've used `django` before, this works exactly the same way that `django` template overrides work (because it uses `django` under the hood). Path to a directory containing custom html/css/images for overriding the default UI styling. Files found in the folder at the specified path can override any of the defaults in the [`TEMPLATES_DIR`](https://github.com/ArchiveBox/ArchiveBox/tree/dev/archivebox/templates) directory (copy files from that default dir into your custom dir to get started making a custom theme).
If you've used `django` before, this works exactly the same way that `django` template overrides work (because it uses `django` under the hood).
*Related options:* *Related options:*
[`FOOTER_INFO`](#footer_info) [`FOOTER_INFO`](#footer_info)
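
The override behavior described for `CUSTOM_TEMPLATES_DIR` can be pictured as a "first match wins" search across directories. The following is only a minimal sketch of that idea, not ArchiveBox's or Django's actual loader; the function name `resolve_template` and the file names are hypothetical.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_template(name: str, search_dirs: Sequence[Path]) -> Optional[Path]:
    """Return the first existing file called `name`, checking dirs in order.

    Putting the custom directory ahead of the default one is what lets a
    custom file shadow the default file with the same relative name.
    """
    for directory in search_dirs:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None  # not found in any directory
```

With `search_dirs=[custom_dir, default_dir]`, a file that exists only in the default directory is still found, while a file copied into the custom directory takes precedence.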
---
#### `SNAPSHOTS_PER_PAGE`
**Possible Values:** [`40`]/`100`/...
Maximum number of Snapshots to show per page on Snapshot list pages. Lower this value on slower machines to make the UI faster.
*Related options:*
[`SEARCH_BACKEND_TIMEOUT`](#search_backend_timeout)
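
To illustrate why a lower `SNAPSHOTS_PER_PAGE` makes list pages lighter, here is a rough sketch of page slicing. It assumes simple offset-based pagination; the helper `page_slice` is hypothetical and is not how ArchiveBox necessarily implements its list views.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def page_slice(items: Sequence[T], page: int, per_page: int = 40) -> List[T]:
    """Return the items shown on a 1-indexed `page` of size `per_page`."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    # Only this slice has to be rendered, so a smaller per_page means
    # less work for each page load.
    return list(items[start:start + per_page])
```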
--- ---
#### `FOOTER_INFO` #### `FOOTER_INFO`
**Possible Values:** [`Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.`]/`Operated by ACME Co.`/... **Possible Values:** [`Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.`]/`Operated by ACME Co.`/...
@@ -192,7 +194,16 @@ Extract article text, summary, and byline using Mozilla's [Readability](https://
Unlike the other methods, this does not download any additional files, so it's practically free from a disk usage perspective. It works by using any existing downloaded HTML version (e.g. wget, DOM dump, singlefile) and piping it into readability. Unlike the other methods, this does not download any additional files, so it's practically free from a disk usage perspective. It works by using any existing downloaded HTML version (e.g. wget, DOM dump, singlefile) and piping it into readability.
*Related options:* *Related options:*
[`TIMEOUT`](#timeout), [`SAVE_WGET`](#save_wget), [`SAVE_DOM`](#save_dom), [`SAVE_SINGLEFILE`](#save_singlefile) [`TIMEOUT`](#timeout), [`SAVE_WGET`](#save_wget), [`SAVE_DOM`](#save_dom), [`SAVE_SINGLEFILE`](#save_singlefile), [`SAVE_MERCURY`](#save_mercury)
---
#### `SAVE_MERCURY`
**Possible Values:** [`True`]/`False`
Extract article text, summary, and byline using the [Mercury](https://github.com/postlight/mercury-parser) library.
Unlike the other methods, this does not download any additional files, so it's practically free from a disk usage perspective. It works by using any existing downloaded HTML version (e.g. wget, DOM dump, singlefile) and piping it into the Mercury parser.
*Related options:*
[`TIMEOUT`](#timeout), [`SAVE_WGET`](#save_wget), [`SAVE_DOM`](#save_dom), [`SAVE_SINGLEFILE`](#save_singlefile), [`SAVE_READABILITY`](#save_readability)
--- ---
@@ -248,7 +259,7 @@ Screenshot resolution in pixels width,height.
--- ---
#### `CURL_USER_AGENT` #### `CURL_USER_AGENT`
**Possible Values:** [`Curl/1.19.1`]/`"Mozilla/5.0 ..."`/... **Possible Values:** [`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/) curl/{CURL_VERSION}`]/`"Mozilla/5.0 ..."`/...
This is the user agent to use during curl archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you're getting blocked by servers for having an unknown/blacklisted user agent. This is the user agent to use during curl archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you're getting blocked by servers for having an unknown/blacklisted user agent.
*Related options:* *Related options:*
@@ -256,7 +267,7 @@ This is the user agent to use during curl archiving. You can set this to impers
--- ---
#### `WGET_USER_AGENT` #### `WGET_USER_AGENT`
**Possible Values:** [`Wget/1.19.1`]/`"Mozilla/5.0 ..."`/... **Possible Values:** [`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/) wget/{WGET_VERSION}`]/`"Mozilla/5.0 ..."`/...
This is the user agent to use during wget archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you're getting blocked by servers for having an unknown/blacklisted user agent. This is the user agent to use during wget archiving. You can set this to impersonate a more common browser like Chrome or Firefox if you're getting blocked by servers for having an unknown/blacklisted user agent.
*Related options:* *Related options:*
@@ -264,7 +275,7 @@ This is the user agent to use during wget archiving. You can set this to impers
--- ---
#### `CHROME_USER_AGENT` #### `CHROME_USER_AGENT`
**Possible Values:** [`"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/73.0.3683.75 Safari/537.36"`]/`"Mozilla/5.0 ..."`/... **Possible Values:** [`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)`]/`"Mozilla/5.0 ..."`/...
This is the user agent to use during Chrome headless archiving. If many sites are blocking you, you can set this to hide the `Headless` string that reveals to servers that you're using a headless browser. This is the user agent to use during Chrome headless archiving. If many sites are blocking you, you can set this to hide the `Headless` string that reveals to servers that you're using a headless browser.
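
The new user agent defaults above contain `{VERSION}`, `{CURL_VERSION}`, and `{WGET_VERSION}` placeholders. Assuming these are filled in with ordinary string substitution at runtime (an assumption, not something this page confirms), the result could look like the sketch below. The helper `render_user_agent` and the version numbers are hypothetical placeholders.

```python
# Default CURL_USER_AGENT template as listed above.
CURL_USER_AGENT_TEMPLATE = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 "
    "ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/) "
    "curl/{CURL_VERSION}"
)


def render_user_agent(template: str, **versions: str) -> str:
    """Fill the {NAME} placeholders in a user agent template."""
    return template.format(**versions)


# Hypothetical version numbers, used purely for illustration.
user_agent = render_user_agent(
    CURL_USER_AGENT_TEMPLATE, VERSION="0.0.0", CURL_VERSION="7.0.0"
)
```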
@@ -408,24 +419,47 @@ Path or name of the curl binary to use.
--- ---
#### `SINGLEFILE_BINARY` #### `SINGLEFILE_BINARY`
**Possible Values:** [`single-file`]/`/usr/local/bin/single-file`/... **Possible Values:** [`single-file`]/`./node_modules/single-file/cli/single-file`/...
Path or name of the SingleFile binary to use. Path or name of the SingleFile binary to use.
This can be installed using `npm install -g git+https://github.com/gildas-lormeau/SingleFile.git`. This can be installed using `npm install --no-audit --no-fund 'git+https://github.com/gildas-lormeau/SingleFile.git'`.
*Related options:* *Related options:*
[`SAVE_SINGLEFILE`](#save_singlefile), [`CHROME_BINARY`](#chrome_binary), [`CHROME_USER_DATA_DIR`](#chrome_user_data_dir), [`CHROME_HEADLESS`](#chrome_headless), [`CHROME_SANDBOX`](#chrome_sandbox) [`SAVE_SINGLEFILE`](#save_singlefile), [`CHROME_BINARY`](#chrome_binary), [`CHROME_USER_DATA_DIR`](#chrome_user_data_dir), [`CHROME_HEADLESS`](#chrome_headless), [`CHROME_SANDBOX`](#chrome_sandbox)
--- ---
#### `READABILITY_BINARY` #### `READABILITY_BINARY`
**Possible Values:** [`readability-extractor`]/`/usr/local/bin/readability-extractor`/... **Possible Values:** [`readability-extractor`]/`./node_modules/readability-extractor/readability-extractor`/...
Path or name of the Readability extractor binary to use. Path or name of the Readability extractor binary to use.
This can be installed using `npm install -g git+https://github.com/pirate/readability-extractor.git`. This can be installed using `npm install --no-audit --no-fund 'git+https://github.com/ArchiveBox/readability-extractor.git'`.
*Related options:* *Related options:*
[`SAVE_READABILITY`](#save_readability) [`SAVE_READABILITY`](#save_readability)
---
#### `MERCURY_BINARY`
**Possible Values:** [`mercury-parser`]/`./node_modules/@postlight/mercury-parser/cli.js`/...
Path or name of the Mercury parser extractor binary to use.
This can be installed using `npm install --no-audit --no-fund '@postlight/mercury-parser'`.
*Related options:*
[`SAVE_MERCURY`](#save_mercury)
---
#### `RIPGREP_BINARY`
**Possible Values:** [`rg`]/`rga`/...
Path or name of the ripgrep binary to use for full text search.
This can be installed using your system package manager, e.g. `apt install ripgrep` or `brew install ripgrep`.
Optionally switch this to use `ripgrep-all` for full-text search support across more filetypes (e.g. PDF): https://github.com/phiresky/ripgrep-all.
*Related options:*
[`SEARCH_BACKEND_ENGINE`](#search_backend_engine)
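
Because `RIPGREP_BINARY` accepts either a bare name (`rg`, `rga`) or a path, a small sketch of how such a value might be located and turned into a search command follows. The helper names are hypothetical, and the flags shown are standard ripgrep options chosen for illustration; they are not necessarily the ones ArchiveBox passes.

```python
import shutil
from typing import List, Optional, Sequence


def first_available(candidates: Sequence[str]) -> Optional[str]:
    """Return the resolved path of the first candidate found on PATH."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def build_search_command(binary: str, pattern: str, archive_dir: str) -> List[str]:
    """Build an argv list that prints the files containing `pattern`."""
    return [
        binary,
        "--files-with-matches",  # print matching file paths only
        "--ignore-case",         # case-insensitive match
        "--regexp", pattern,     # pass the pattern explicitly
        archive_dir,
    ]
```

For example, `build_search_command("rg", "example", "./archive")` yields a list that can be passed to `subprocess.run`; swapping in `rga` leaves the rest of the command unchanged.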
<img src="https://i.imgur.com/almAbwK.png" width="100%"/> <img src="https://i.imgur.com/almAbwK.png" width="100%"/>