From 4f7de4b7d0864a9e1a17c50383e1fc0469cd2c50 Mon Sep 17 00:00:00 2001
From: Nick Sweeting
Date: Tue, 2 Apr 2019 17:39:11 -0400
Subject: [PATCH] Updated Configuration (markdown)

---
 Configuration.md | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/Configuration.md b/Configuration.md
index 04c8780..0378dc9 100644
--- a/Configuration.md
+++ b/Configuration.md
@@ -101,10 +101,19 @@ Some text to display in the footer of the archive index. Useful for providing s
 ---
 #### `URL_BLACKLIST`
-**Possible Values:** [`None`]/`.*\.exe$`/`(youtube\.com)|(amazon\.com)'`/...
+**Possible Values:** [`None`]/`.+\.exe$`/`http(s)?:\/\/(.+)?(ebay\.com)|(amazon\.com)\/.*'`/...
 A regex expression used to exclude certain URLs from the archive. You can use if there are certain domains, extensions, or other URL patterns that you want to ignore whenever they get imported. Blacklisted URLs wont be included in the index, and their page content wont be archived.
+When building your blacklist, you can check whether a given URL matches your regex expression like so:
+```python
+>>>import re
+>>>URL_BLACKLIST = r'http(s)?:\/\/(.+)?(youtube\.com)|(amazon\.com)\/.*' # replace this with your regex to test
+>>>test_url = 'https://test.youtube.com/example.php?abc=123'
+>>>bool(re.compile(URL_BLACKLIST, re.IGNORECASE).match(test_url))
+True
+```
+
 *Related options:* [`FETCH_MEDIA`](#FETCH_MEDIA), [`FETCH_GIT`](#FETCH_GIT), [`GIT_DOMAINS`](#GIT_DOMAINS)
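
A note on the example pattern added in this patch: in Python's `re` syntax, `|` has the lowest precedence, so `http(s)?:\/\/(.+)?(youtube\.com)|(amazon\.com)\/.*` is read as two separate alternatives, and with `.match()` the second alternative can only succeed when the URL itself starts with `amazon.com`. The sketch below reuses the same `re.compile(...).match(...)` check shown in the patch to illustrate this. The `is_blacklisted` helper, the `GROUPED` variant, and the Amazon test URL are assumptions made for illustration; they are not taken from the patch or from the project's code.

```python
import re

# Pattern copied verbatim from the example in the patch.
ORIGINAL = r'http(s)?:\/\/(.+)?(youtube\.com)|(amazon\.com)\/.*'

# Hypothetical variant (not from the patch): the two domains are wrapped in
# one group so the scheme prefix and the trailing path apply to both.
GROUPED = r'http(s)?:\/\/(.+)?((youtube\.com)|(amazon\.com))\/.*'


def is_blacklisted(pattern, url):
    """Hypothetical helper mirroring the check used in the patch's example."""
    return bool(re.compile(pattern, re.IGNORECASE).match(url))


youtube_url = 'https://test.youtube.com/example.php?abc=123'
amazon_url = 'https://www.amazon.com/example'  # assumed test URL

print(is_blacklisted(ORIGINAL, youtube_url))  # True
print(is_blacklisted(ORIGINAL, amazon_url))   # False: second alternative is anchored at the start
print(is_blacklisted(GROUPED, youtube_url))   # True
print(is_blacklisted(GROUPED, amazon_url))    # True
```

Whether the grouped form is the intended behavior depends on what the blacklist is meant to exclude, so it is worth running each candidate pattern against a few representative URLs before relying on it.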