The archiving is additive, so you can schedule `./archive` to run regularly and pull new links into the index.
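
For example, a nightly cron entry along these lines keeps the archive topped up (the schedule, repo location, export path, and log path are all placeholders; adjust them to your setup):

```bash
# Hypothetical crontab entry: re-run the archiver every night at 3:00,
# pulling any new links from the exported bookmarks file into the index.
# All paths are examples only; point them at your own checkout and export file.
0 3 * * * cd /home/you/ArchiveBox && ./archive /home/you/export.html >> /home/you/archive-cron.log 2>&1
```
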
All the saved content is static and indexed with JSON files, so it lives forever, is easily parseable, and requires no always-running backend.

[DEMO: archive.sweeting.me](https://archive.sweeting.me)

<img src="https://i.imgur.com/q3Oz9wN.png" width="75%" alt="Desktop Screenshot" align="top"><img src="https://i.imgur.com/TG0fGVo.png" width="25%" alt="Mobile Screenshot" align="top"><br/>

## Details
`ArchiveBox/archive` is the script that takes a [Pocket-format](https://getpocket.com/export), [JSON-format](https://pinboard.in/export/), [Netscape-format](https://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx), RSS, or plain-text-formatted list of links, and downloads a clone of each linked website to turn into a browsable archive that you can store locally or host online.
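
For example, if your bookmarks export is saved as `export.html` (the filename here is just an example), a basic run looks like this:

```bash
# Parse the exported list of links and download an archived copy of each one
# into the local output folder.
./archive export.html
```
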
The archiver produces an output folder `output/` containing an `index.html`, `index.json`, and archived copies of all the sites, organized by the timestamp each link was bookmarked. It's powered by [headless](https://developers.google.com/web/updates/2017/04/headless-chrome) Chromium and good ol' `wget`.
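
Roughly, the result looks like the sketch below. The `archive/<timestamp>/` naming follows the description above; the exact files inside each per-link folder vary by version and by which fetch methods are enabled, so treat this as illustrative only:

```
output/
├── index.html        # browsable HTML index of every archived link
├── index.json        # the same index as machine-readable JSON
└── archive/
    └── 1498800000/   # one folder per link, named by its bookmark timestamp
        └── ...       # wget clone of the page plus any other fetched outputs
```
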
Wget doesn't work on sites you need to be logged into, but headless Chrome does; see the [Configuration](#configuration) section for `CHROME_USER_DATA_DIR`.
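
As a rough sketch, you can point it at an existing browser profile so logged-in sessions get reused (the profile path below is only an example; substitute wherever your own Chrome/Chromium profile lives):

```bash
# Reuse an existing Chrome profile so pages behind a login can be archived.
# The profile directory shown here is an example, not a required location.
env CHROME_USER_DATA_DIR="$HOME/.config/google-chrome" ./archive export.html
```
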
### Large Archives
I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with 50Mbps down. YMMV.

Storage requirements go up immensely if you're using `FETCH_MEDIA=True` and are archiving many pages with audio & video.
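
If disk space is a concern, one option is to leave media fetching off for a run. This sketch assumes `False` is the off value, mirroring the `True` shown above:

```bash
# Skip audio/video downloads for this run to keep disk usage down.
env FETCH_MEDIA=False ./archive export.html
```
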
You can run it in parallel by using the `resume` feature, or by manually splitting `export.html` into multiple files:

```bash
./archive export.html 1498800000 & # second argument is the timestamp to resume downloading from
./archive export.html 1498810000 &
./archive export.html 1498820000 &
./archive export.html 1498830000 &
```
Users have reported successfully running it with 50k+ bookmarks (though it will use more RAM while running).

If you already imported a huge list of bookmarks and want to import only new bookmarks, you can use the `ONLY_NEW` environment variable. This is useful if you want to import a bookmark dump periodically and want to skip broken links which are already in the index.
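
A sketch of that periodic-import workflow, using the same env-style override as the examples above:

```bash
# Only archive links that aren't already in the index, instead of
# re-attempting everything (including known-broken links) on each import.
env ONLY_NEW=True ./archive export.html
```
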
---
<div align="center">