mirror of https://github.com/pirate/ArchiveBox.git synced 2025-08-23 22:53:00 +02:00

Updated Setting up Search (markdown)

Nick Sweeting
2024-05-07 01:03:25 -07:00
parent d9b180af52
commit 78dbe1f0ef

@@ -5,12 +5,13 @@ You can search your ArchiveBox data in a number of ways:
- using the CLI: `archivebox list --filter-type=search 'text to search'` - using the CLI: `archivebox list --filter-type=search 'text to search'`
- using the Web UI: both the `/public` index and `/admin/core/snapshot` pages provide a search box - using the Web UI: both the `/public` index and `/admin/core/snapshot` pages provide a search box
- using the REST API: `/api/v1/list?filter_type=search` provides the same search interface as the CLI - using the REST API: `/api/v1/list?filter_type=search` provides the same search interface as the CLI
- by searching the archive data on the filesystem with external tools (e.g. macOS Spotlight, [Cerebro](https://www.cerebroapp.com/), `ag`, `grep -r`, `SQLite FTS5`, etc.) - by searching the archive data folder directly with external tools (e.g. macOS Spotlight, [Cerebro](https://www.cerebroapp.com/), `ag`, `grep -r`, `SQLite FTS5`, etc.)
<br/>
> [!IMPORTANT] > *Note: ArchiveBox currently only returns the bare list of snapshots that match when performing a search.*
> *ArchiveBox currently only returns a plain list of snapshots that match when performing a search.* >
> This will be improved in the future to highlight the specific paragraph/line/area that matched within a Snapshot. > This will be [improved in the future](https://zulip.archivebox.io/#narrow/stream/154-support/topic/Full.20Text.20Search.20works.2E.2E.2E.20but.20is.20there.20a.20UI.3F) to highlight the *specific paragraph/line/area that matched* within a Snapshot.
> For now we recommend using Ctrl+F in the browser or one of the external tools listed above to further filter for a term within a Snapshot's contents. > For now we recommend using Ctrl+F in the browser or one of the external tools listed above to further filter for a term within a Snapshot's contents.
<br/> <br/>
@@ -22,26 +23,31 @@ You can search your ArchiveBox data in a number of ways:
```bash ```bash
# this setting controls which search backend ArchiveBox uses # this setting controls which search backend ArchiveBox uses
archivebox config --set SEARCH_BACKEND_ENGINE=[ripgrep]|sonic|sqlite archivebox config --set SEARCH_BACKEND_ENGINE=[ripgrep]|sonic|sqlite
# to see information about the backend you are currently using, run:
archivebox version
archivebox config --get SEARCH_BACKEND_ENGINE
``` ```
ArchiveBox provides search functionality out-of-the-box using a simple but efficient disk-search tool called [`ripgrep`](https://github.com/BurntSushi/ripgrep). ArchiveBox provides search functionality out-of-the-box using a simple but efficient tool called [`ripgrep`](https://github.com/BurntSushi/ripgrep).
Ripgrep is the fastest currently available search tool that works without maintaining a separate index. However, there are some fundamental limitations of scanning through every file on disk each time a search is done, so ArchiveBox provides a number of additional search backend options that users can choose from when they outgrow the `ripgrep` default. Ripgrep is the fastest currently available filesystem search tool that scans over the raw data directly. We chose it as the default so that beginners and 95% of users with small collections can have an experience that "just works" without needing to install and maintain complex additional dependencies or background workers.
> You should consider switching ArchiveBox to use one of its more powerful search backends if: However, there are some fundamental limitations of scanning through every file on disk each time a search is done, so ArchiveBox provides a number of additional search backend options that users can choose from when they outgrow the `ripgrep` default.
> [!TIP]
> **You should consider switching ArchiveBox to use `sonic` or another backend IF:**
> >
> - you have more than 1000 Snapshots in your archive > - you have more than 1,000 Snapshots saved in your archive
> - you're using a slow filesystem like a spinning hard drive or remote network mount > - your archive data is stored on a slower filesystem like a spinning hard drive or remote network mount
> - you want fuzzy-search features like stemming, boolean operators, searching binary files like PDFs, etc. > - you want more advanced search features like stemming, boolean operators, and the ability to search PDFs, eBooks, ZIP/tar files, etc.
<br/> <br/>
### `ripgrep` (aka `rg`, the default) ### `ripgrep` *(the default)*
> *Note: You must have `ripgrep` installed on your system to use this backend (it's available automatically if you use ArchiveBox in Docker)*
If you do not already have `ripgrep` installed, follow the [instructions here](https://github.com/BurntSushi/ripgrep#installation) to get it. If you do not already have `ripgrep` installed, follow the [instructions here](https://github.com/BurntSushi/ripgrep#installation) to get it.
You can then configure ArchiveBox to use it like so: ArchiveBox will use `ripgrep` by default if it is found; however, you can explicitly configure it to be used like so:
```bash ```bash
archivebox config --set SEARCH_BACKEND_ENGINE=ripgrep archivebox config --set SEARCH_BACKEND_ENGINE=ripgrep
@@ -57,12 +63,12 @@ archivebox list --filter-type=search 'text to search for'
#### Pros #### Pros
- supports advanced searching with regex patterns - supports advanced searching with regex patterns
- simple, few moving parts, and broadly available for all OSs and CPU architectures - simple, few moving parts, and broadly available for all OSs and CPU architectures
- lower idle resource use as there is no background worker using up resources - 0 idle resource use as there is no background indexer process running
- lower disk storage use as there is no separate search index containing copies of all the text - 0 additional disk storage needed as it searches the original data instead of maintaining a separate index
- reasonably fast on NVMe and SSD drives for small collections - reasonably fast on NVMe and SSD drives for small collections
#### Cons #### Cons
- very slow as archive collection size increases - very slow as archive collection size increases (doesn't scale well beyond 500~1,000 Snapshots)
- very slow if the underlying filesystem is slow (e.g. HDDs or network mounts) - very slow if the underlying filesystem is slow (e.g. HDDs or network mounts)
- doesn't support stemming, boolean operators, or other advanced full-text search features - doesn't support stemming, boolean operators, or other advanced full-text search features
@@ -84,28 +90,6 @@ archivebox version
# then try it out by searching via the Web UI or CLI: # then try it out by searching via the Web UI or CLI:
archivebox list --filter-type=search 'text to search for' archivebox list --filter-type=search 'text to search for'
``` ```
<br/>
### `ripgrep-all` (aka `rga`)
The same as ripgrep except that it supports searching more binary filetypes like PDFs, eBooks, Office documents, zip, tar.gz, etc.
To use it, follow the [install instructions for your OS](https://github.com/phiresky/ripgrep-all#installation), then configure ArchiveBox to use it like so:
```bash
archivebox config --set SEARCH_BACKEND_ENGINE=ripgrep
archivebox config --set RIPGREP_BINARY=rga
# check that archivebox detects the installed version:
archivebox version
# then try it out by searching via the Web UI or CLI:
archivebox list --filter-type=search 'text to search for'
```
#### Pros & Cons
Same as `ripgrep` with the addition of some extra supported filetypes, however `rga` is slightly less easy to install than `rg`.
<br/> <br/>
@@ -133,9 +117,9 @@ archivebox config --set RIPGREP_BINARY=ugrep+
- not as fast as `sonic`, but also not as simple as `ripgrep` - not as fast as `sonic`, but also not as simple as `ripgrep`
- not all of its features are fully integrated with ArchiveBox yet - not all of its features are fully integrated with ArchiveBox yet
<br/> <br/><br/>
### `sonic` ⭐️ (the recommended upgrade option for most people) ### `sonic` ⭐️ (the recommended upgrade path for most people)
Sonic is a fast, lightweight, Rust-based alternative to super-heavy traditional search backends like Elasticsearch. It is capable of normalizing natural language search queries, fuzzy matching, and searching Unicode, without needing to maintain a duplicate document store index of all the searchable text. Instead it works as an index store, storing only the IDs of the Snapshots with a super-compressed internal index. This allows it to scale to searching terabytes of archive data while maintaining an index only a fraction of that size. Sonic is a fast, lightweight, Rust-based alternative to super-heavy traditional search backends like Elasticsearch. It is capable of normalizing natural language search queries, fuzzy matching, and searching Unicode, without needing to maintain a duplicate document store index of all the searchable text. Instead it works as an index store, storing only the IDs of the Snapshots with a super-compressed internal index. This allows it to scale to searching terabytes of archive data while maintaining an index only a fraction of that size.