Mirror of https://github.com/pirate/ArchiveBox.git (synced 2025-08-18 12:21:42 +02:00)
Updated Configuration (markdown)
@@ -128,7 +128,7 @@ A regex expression used to exclude all URLs that don't match the given pattern f
 When building your blacklist, you can check whether a given URL matches your regex expression in `python` like so:
 ```python
 >>> import re
->>> URL_WHITELIST = r'^http(s)?:\/\/(.+)?example\.org\/?.*$' # replace this with your regex to test
+>>> URL_WHITELIST = r'^http(s)?:\/\/(.+)?example\.com\/?.*$' # replace this with your regex to test
 >>> test_url = 'https://test.example.com/example.php?abc=123'
 >>> bool(re.compile(URL_BLACKLIST, re.IGNORECASE | re.UNICODE | re.MULTILINE).search(test_url))
 True # this URL would be archived
@@ -138,7 +138,7 @@ True # this URL would be archived
 False # this URL would be excluded from archiving
 ```

-This option is useful for recursively archiving all the pages on a given domain (aka crawling/spidering), without following links to external domains.
+This option is useful for **recursive archiving** of all the pages under a given domain or subfolder (aka crawling/spidering), without following links to external domains / parent folders.
 ```bash
 # temporarily enforce a whitelist by setting the option as an environment variable
 export URL_WHITELIST='^http(s)?:\/\/(.+)?example\.com\/?.*$'
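
Note on the interpreter snippet in the first hunk: it assigns `URL_WHITELIST` but compiles `URL_BLACKLIST`, so pasted into a fresh session it would raise `NameError` unless `URL_BLACKLIST` is already defined. The same check can be sketched as a standalone script that uses one name throughout. This is a minimal sketch, not ArchiveBox's own code: `matches_whitelist` and `DEFAULT_WHITELIST` are hypothetical names introduced here, the flags are copied from the snippet, and reading `URL_WHITELIST` from the environment simply mirrors the `export` line in the second hunk.

```python
import os
import re

# Flags copied from the interpreter snippet in the first hunk.
FLAGS = re.IGNORECASE | re.UNICODE | re.MULTILINE

# Pattern from the updated line of the first hunk (example.com variant).
DEFAULT_WHITELIST = r'^http(s)?:\/\/(.+)?example\.com\/?.*$'


def matches_whitelist(url: str, pattern: str = DEFAULT_WHITELIST) -> bool:
    """Hypothetical helper: True if `url` matches `pattern` (it would be archived)."""
    return bool(re.compile(pattern, FLAGS).search(url))


if __name__ == "__main__":
    # Use the exported URL_WHITELIST if it is set, otherwise the default above.
    pattern = os.environ.get("URL_WHITELIST", DEFAULT_WHITELIST)
    test_urls = (
        "https://test.example.com/example.php?abc=123",
        "https://example.org/page",
        "https://other.test/?next=example.com",
    )
    for url in test_urls:
        print(matches_whitelist(url, pattern), url)
```

Running a few negative cases this way is worthwhile. With this particular pattern, `(.+)?` can match any characters, including slashes, so a URL on another host that merely contains `example.com` later in the string (the third test URL) also passes the whitelist.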