1
0
mirror of https://github.com/pirate/ArchiveBox.git synced 2025-08-16 19:44:08 +02:00

Updated Configuration (markdown)

Nick Sweeting
2021-07-07 07:00:17 -04:00
parent 90edcd4227
commit e1c059c595

@@ -124,9 +124,9 @@ True # this URL would not be archived because it matches the exclusion pattern
#### `URL_WHITELIST`
**Possible Values:** [`None`]/`^http(s)?:\/\/(.+)?example\.com\/?.*$`/...
A regex expression used to exclude all URLs that don't match the given pattern from archiving. You can use if there are certain domains, extensions, or other URL patterns that you want to ignore whenever they get imported. Blacklisted URLs wont be included in the index, and their page content wont be archived.
A regex expression used to exclude all URLs that don't match the given pattern from archiving. You can use if there are certain domains, extensions, or other URL patterns that you want to restrict the scope of archiving to (e.g. to only archive a single domain, subdirectory, or filetype, etc..
When building your blacklist, you can check whether a given URL matches your regex expression in `python` like so:
When building your whitelist, you can check whether a given URL matches your regex expression in `python` like so:
```python
>>> import re
>>> URL_WHITELIST = r'^http(s)?:\/\/(.+)?example\.com\/?.*$' # replace this with your regex to test