mirror of https://github.com/pirate/ArchiveBox.git synced 2025-08-18 04:11:57 +02:00

Updated Configuration (markdown)

Nick Sweeting
2021-07-06 23:55:17 -04:00
parent 066f722da1
commit ae6529bc2f

@@ -128,7 +128,7 @@ A regex expression used to exclude all URLs that don't match the given pattern f
When building your whitelist, you can check whether a given URL matches your regex expression in `python` like so:
```python
>>> import re
->>> URL_WHITELIST = r'^http(s)?:\/\/(.+)?example\.org\/?.*$' # replace this with your regex to test
+>>> URL_WHITELIST = r'^http(s)?:\/\/(.+)?example\.com\/?.*$' # replace this with your regex to test
>>> test_url = 'https://test.example.com/example.php?abc=123'
>>> bool(re.compile(URL_WHITELIST, re.IGNORECASE | re.UNICODE | re.MULTILINE).search(test_url))
True # this URL would be archived
@@ -138,7 +138,7 @@ True # this URL would be archived
False # this URL would be excluded from archiving
```
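The interactive check above can also be wrapped in a small standalone helper for testing several URLs at once before committing to a regex (a sketch only; `should_archive` is a hypothetical name, not part of ArchiveBox, though the regex flags match the ones shown above):

```python
import re

# same flags used in the interactive example above
FLAGS = re.IGNORECASE | re.UNICODE | re.MULTILINE

def should_archive(url, whitelist_regex):
    # hypothetical helper: True if the URL matches the whitelist regex
    return bool(re.compile(whitelist_regex, FLAGS).search(url))

URL_WHITELIST = r'^http(s)?:\/\/(.+)?example\.com\/?.*$'

print(should_archive('https://test.example.com/example.php?abc=123', URL_WHITELIST))  # True
print(should_archive('https://sub.example.org/page', URL_WHITELIST))  # False
```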
-This option is useful for recursively archiving all the pages on a given domain (aka crawling/spidering), without following links to external domains.
+This option is useful for **recursive archiving** of all the pages under a given domain or subfolder (aka crawling/spidering), without following links to external domains / parent folders.
```bash
# temporarily enforce a whitelist by setting the option as an environment variable
export URL_WHITELIST='^http(s)?:\/\/(.+)?example\.com\/?.*$'