
@@ -147,7 +153,7 @@ The `OUTPUT_DIR` folder (usually whatever folder you run `archivebox` in), conta
- index.html
# Archive method outputs:
- - warc/
+ - warc/
- media/
- git/
...
@@ -164,7 +170,7 @@ The `OUTPUT_DIR` folder (usually whatever folder you run `archivebox` in), conta
### Large Archives
I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
-Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
+Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
Storage requirements go up immensely if you're using `FETCH_MEDIA=True` and are archiving many pages with audio & video.
@@ -174,9 +180,25 @@ archivebox add < urls_chunk_1.txt &
archivebox add < urls_chunk_2.txt &
archivebox add < urls_chunk_3.txt &
```
+(though this may not be faster if you have a very large collection/main index)
+
Users have reported running it with 50k+ bookmarks with success (though it will take more RAM while running).
If you already imported a huge list of bookmarks and want to import only new
bookmarks, you can use the `ONLY_NEW` environment variable. This is useful if
you want to import a bookmark dump periodically and want to skip broken links
which are already in the index.
+
+## Python API Usage
+
+```python
+from archivebox.main import add, info, remove, check_data_folder
+
+out_dir = '~/path/to/my/data/folder'
+check_data_folder(out_dir=out_dir)
+add('https://example.com', index_only=True, out_dir=out_dir)
+info(out_dir=out_dir)
+remove('https://example.com', delete=True, yes=True, out_dir=out_dir)
+```
+
+For more information see the Python API Reference.
diff --git a/Web-Archiving-Community.md b/Web-Archiving-Community.md
index ad2cdd9..5434c62 100644
--- a/Web-Archiving-Community.md
+++ b/Web-Archiving-Community.md
@@ -1,3 +1,5 @@
+# Web Archiving Community
+
@@ -12,7 +14,6 @@ The internet archiving community is surprisingly far-reaching and almost univers
Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open source tool for your web archiving need, or just want to see where archivists hang out online, this is my attempt at an index of the entire web archiving community.
-## Contents

@@ -394,4 +395,4 @@ You can find more organizations and initiatives on these other lists:
[](https://archive.org/donate/)
^ Back to Top ^
-
\ No newline at end of file
+
diff --git a/archivebox.cli.rst b/archivebox.cli.rst
new file mode 100644
index 0000000..7c6a357
--- /dev/null
+++ b/archivebox.cli.rst
@@ -0,0 +1,142 @@
+archivebox.cli package
+======================
+
+Submodules
+----------
+
+archivebox.cli.archivebox module
+--------------------------------
+
+.. automodule:: archivebox.cli.archivebox
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_add module
+-------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_add
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_config module
+----------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_config
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_help module
+--------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_help
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_info module
+--------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_info
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_init module
+--------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_init
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_list module
+--------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_list
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_manage module
+----------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_manage
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_remove module
+----------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_remove
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_schedule module
+------------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_schedule
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_server module
+----------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_server
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_shell module
+---------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_shell
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_update module
+----------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_update
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.archivebox\_version module
+-----------------------------------------
+
+.. automodule:: archivebox.cli.archivebox_version
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.logging module
+-----------------------------
+
+.. automodule:: archivebox.cli.logging
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.cli.tests module
+---------------------------
+
+.. automodule:: archivebox.cli.tests
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox.cli
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/archivebox.config.rst b/archivebox.config.rst
new file mode 100644
index 0000000..b71af50
--- /dev/null
+++ b/archivebox.config.rst
@@ -0,0 +1,22 @@
+archivebox.config package
+=========================
+
+Submodules
+----------
+
+archivebox.config.stubs module
+------------------------------
+
+.. automodule:: archivebox.config.stubs
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox.config
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/archivebox.core.migrations.rst b/archivebox.core.migrations.rst
new file mode 100644
index 0000000..72c2291
--- /dev/null
+++ b/archivebox.core.migrations.rst
@@ -0,0 +1,30 @@
+archivebox.core.migrations package
+==================================
+
+Submodules
+----------
+
+archivebox.core.migrations.0001\_initial module
+-----------------------------------------------
+
+.. automodule:: archivebox.core.migrations.0001_initial
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.migrations.0002\_auto\_20190417\_0739 module
+------------------------------------------------------------
+
+.. automodule:: archivebox.core.migrations.0002_auto_20190417_0739
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox.core.migrations
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/archivebox.core.rst b/archivebox.core.rst
new file mode 100644
index 0000000..8b4682c
--- /dev/null
+++ b/archivebox.core.rst
@@ -0,0 +1,93 @@
+archivebox.core package
+=======================
+
+Subpackages
+-----------
+
+.. toctree::
+
+ archivebox.core.migrations
+
+Submodules
+----------
+
+archivebox.core.admin module
+----------------------------
+
+.. automodule:: archivebox.core.admin
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.apps module
+---------------------------
+
+.. automodule:: archivebox.core.apps
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.models module
+-----------------------------
+
+.. automodule:: archivebox.core.models
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.settings module
+-------------------------------
+
+.. automodule:: archivebox.core.settings
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.tests module
+----------------------------
+
+.. automodule:: archivebox.core.tests
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.urls module
+---------------------------
+
+.. automodule:: archivebox.core.urls
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.views module
+----------------------------
+
+.. automodule:: archivebox.core.views
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.welcome\_message module
+---------------------------------------
+
+.. automodule:: archivebox.core.welcome_message
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.core.wsgi module
+---------------------------
+
+.. automodule:: archivebox.core.wsgi
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox.core
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/archivebox.extractors.rst b/archivebox.extractors.rst
new file mode 100644
index 0000000..a8ba6a3
--- /dev/null
+++ b/archivebox.extractors.rst
@@ -0,0 +1,86 @@
+archivebox.extractors package
+=============================
+
+Submodules
+----------
+
+archivebox.extractors.archive\_org module
+-----------------------------------------
+
+.. automodule:: archivebox.extractors.archive_org
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.dom module
+--------------------------------
+
+.. automodule:: archivebox.extractors.dom
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.favicon module
+------------------------------------
+
+.. automodule:: archivebox.extractors.favicon
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.git module
+--------------------------------
+
+.. automodule:: archivebox.extractors.git
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.media module
+----------------------------------
+
+.. automodule:: archivebox.extractors.media
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.pdf module
+--------------------------------
+
+.. automodule:: archivebox.extractors.pdf
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.screenshot module
+---------------------------------------
+
+.. automodule:: archivebox.extractors.screenshot
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.title module
+----------------------------------
+
+.. automodule:: archivebox.extractors.title
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.extractors.wget module
+---------------------------------
+
+.. automodule:: archivebox.extractors.wget
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox.extractors
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/archivebox.index.rst b/archivebox.index.rst
new file mode 100644
index 0000000..49ab62c
--- /dev/null
+++ b/archivebox.index.rst
@@ -0,0 +1,54 @@
+archivebox.index package
+========================
+
+Submodules
+----------
+
+archivebox.index.csv module
+---------------------------
+
+.. automodule:: archivebox.index.csv
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.index.html module
+----------------------------
+
+.. automodule:: archivebox.index.html
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.index.json module
+----------------------------
+
+.. automodule:: archivebox.index.json
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.index.schema module
+------------------------------
+
+.. automodule:: archivebox.index.schema
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.index.sql module
+---------------------------
+
+.. automodule:: archivebox.index.sql
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox.index
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/archivebox.parsers.rst b/archivebox.parsers.rst
new file mode 100644
index 0000000..d3b902c
--- /dev/null
+++ b/archivebox.parsers.rst
@@ -0,0 +1,78 @@
+archivebox.parsers package
+==========================
+
+Submodules
+----------
+
+archivebox.parsers.generic\_json module
+---------------------------------------
+
+.. automodule:: archivebox.parsers.generic_json
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.parsers.generic\_rss module
+--------------------------------------
+
+.. automodule:: archivebox.parsers.generic_rss
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.parsers.generic\_txt module
+--------------------------------------
+
+.. automodule:: archivebox.parsers.generic_txt
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.parsers.medium\_rss module
+-------------------------------------
+
+.. automodule:: archivebox.parsers.medium_rss
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.parsers.netscape\_html module
+----------------------------------------
+
+.. automodule:: archivebox.parsers.netscape_html
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.parsers.pinboard\_rss module
+---------------------------------------
+
+.. automodule:: archivebox.parsers.pinboard_rss
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.parsers.pocket\_html module
+--------------------------------------
+
+.. automodule:: archivebox.parsers.pocket_html
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.parsers.shaarli\_rss module
+--------------------------------------
+
+.. automodule:: archivebox.parsers.shaarli_rss
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox.parsers
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/archivebox.rst b/archivebox.rst
new file mode 100644
index 0000000..b96e694
--- /dev/null
+++ b/archivebox.rst
@@ -0,0 +1,58 @@
+archivebox package
+==================
+
+Subpackages
+-----------
+
+.. toctree::
+
+ archivebox.cli
+ archivebox.config
+ archivebox.core
+ archivebox.extractors
+ archivebox.index
+ archivebox.parsers
+
+Submodules
+----------
+
+archivebox.main module
+----------------------
+
+.. automodule:: archivebox.main
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.manage module
+------------------------
+
+.. automodule:: archivebox.manage
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.system module
+------------------------
+
+.. automodule:: archivebox.system
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+archivebox.util module
+----------------------
+
+.. automodule:: archivebox.util
+ :members:
+ :undoc-members:
+ :show-inheritance:
+
+
+Module contents
+---------------
+
+.. automodule:: archivebox
+ :members:
+ :undoc-members:
+ :show-inheritance:
diff --git a/conf.py b/conf.py
new file mode 100644
index 0000000..d4daedd
--- /dev/null
+++ b/conf.py
@@ -0,0 +1,134 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# http://www.sphinx-doc.org/en/master/config
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os
+import sys
+
+import django
+import recommonmark
+from recommonmark.transform import AutoStructify
+
+os.environ['USE_CHROME'] = 'False'
+
+PYTHON_DIR = os.path.abspath(os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'archivebox'))
+
+sys.path.insert(0, os.path.abspath('.'))
+sys.path.insert(0, os.path.abspath('../'))
+sys.path.insert(0, PYTHON_DIR)
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "core.settings")
+django.setup()
+
+VERSION = open(os.path.join(PYTHON_DIR, 'VERSION'), 'r').read().strip()
+
+# -- Project information -----------------------------------------------------
+
+project = 'ArchiveBox'
+copyright = '2020, Nick Sweeting'
+author = 'Nick Sweeting'
+github_url = 'https://github.com/pirate/ArchiveBox'
+github_doc_root = 'https://github.com/pirate/ArchiveBox/tree/master/docs/'
+language = 'en'
+
+# The full version, including alpha/beta/rc tags
+release = VERSION
+
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+ 'sphinx.ext.autodoc',
+ 'sphinx.ext.napoleon',
+ 'sphinx.ext.viewcode',
+ 'recommonmark',
+]
+
+source_suffix = {
+ '.rst': 'restructuredtext',
+ '.txt': 'markdown',
+ '.md': 'markdown',
+}
+master_doc = 'index'
+napoleon_google_docstring = True
+napoleon_use_param = True
+napoleon_use_ivar = False
+napoleon_use_rtype = True
+napoleon_include_special_with_doc = False
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = [
+ '_build',
+ 'Thumbs.db',
+ '.DS_Store',
+ 'data',
+ 'output',
+ 'templates',
+ 'tests',
+ 'migrations',
+]
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages. See the documentation for
+# a list of builtin themes.
+#
+html_logo = 'logo.png'
+html_theme = 'sphinx_rtd_theme'
+html_theme_options = {
+ 'navigation_depth': 5,
+ 'collapse_navigation': False,
+ 'sticky_navigation': True,
+}
+html_show_sphinx = False
+
+texinfo_documents = [
+ (master_doc, 'archivebox', 'archivebox Documentation',
+ author, 'archivebox', 'The open-source self-hosted internet archive.',
+ 'Miscellaneous'),
+]
+
+autodoc_default_flags = ['members']
+autodoc_member_order = 'bysource'
+extensions += ['sphinx.ext.autosummary',]
+autosummary_generate = True
+
+pygments_style = 'sphinx'
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+
+man_pages = [
+ (master_doc, 'archivebox', 'archivebox Documentation',
+ [author], 1)
+]
+
+
+
+
+# Sphinx calls setup() on startup; register recommonmark's AutoStructify here.
+def setup(app):
+ app.add_config_value('recommonmark_config', {
+ # 'url_resolver': lambda url: github_doc_root + url,
+ 'auto_toc_tree_section': 'Documentation',
+ }, True)
+ app.add_transform(AutoStructify)
diff --git a/index.rst b/index.rst
new file mode 100644
index 0000000..86d821a
--- /dev/null
+++ b/index.rst
@@ -0,0 +1,40 @@
+.. sidebar:: Welcome to ArchiveBox!
+
+ Just getting started?
+ Check out the `Quickstart