mirror of https://github.com/nextapps-de/flexsearch.git synced 2025-09-30 07:18:57 +02:00

Thomas Wilkerling 9b4654be77 update readme

2021-06-08 16:45:55 +02:00

FlexSearch v0.7.0

A long journey finally comes to an end. This document gives you some insight into the current state, plus an overview of the new features and changes.

I ended up rebuilding the whole library from scratch, starting from a blank file. FlexSearch was one of my first open source libraries, and it taught me a lot about how to keep a codebase ready for continuously adding new features. Bolting features onto the old codebase over and over again ends in a structural disaster. I also started an implementation in Rust, where I finally recognized that the old codebase had to be thrown away completely.

My first step was to express each issue and goal as "acceptance criteria". Most of them are about matching and scoring, but keeping the performance was also very important. The criteria for scoring are really hard; there is no library out there that could actually pass those tests. Half of them are currently covered, which is already a great result. Don't worry, the final release will not be delayed by the completion of those criteria.

Let's talk about the current state. The most difficult part (aka "the core development") is almost finished. I'm very satisfied with the result, which is an improvement in every single aspect. To be honest, it is still complex, probably more complex than the old generation was, so please don't expect too much simplicity from the code. My biggest focus was on code structure and process flow.

FlexSearch is now basically divided into two classes:

Index
Document

The index is a pure (flat) implementation, heavily optimized for speed. The document usually consists of multiple indexes (one for each field in the document).

In a document, every field/index can have its own configuration, including a custom encoder for each field.

You can use an instance of Index directly (very much like the old FlexSearch when you are not indexing documents but just IDs and text contents).

Load Library

ES6 Modules:

import Index from "./index.js";
import Document from "./document.js";
import WorkerIndex from "./worker/index.js";

const index = new Index(options);
const document = new Document(options);
const worker = new WorkerIndex(options);

Browser Bundle (ES5 Support):

Or load one of the builds from the dist folder into your HTML as a script and use it as follows:

var index = new FlexSearch.Index(options);
var document = new FlexSearch.Document(options);
var worker = new FlexSearch.Worker(options);

Node.js:

Install the non-published beta version:

npm install https://github.com/nextapps-de/flexsearch/tarball/0.7.0

Use as an npm package:

const { Index, Document, Worker } = require("flexsearch");

Basic Usage

The method signature stays almost unchanged:

var index = new Index(options);
index.add(id, text);
index.search(text, limit);
index.search(text, options);
index.search(text, limit, options);
index.search(options);

var document = new Document(options);
document.add(doc);
document.add(id, doc);
document.search(text, limit);
document.search(text, options);
document.search(text, limit, options);
document.search(options);

Source Code

The v0.7.0-pre-alpha source code is available here:
https://github.com/nextapps-de/flexsearch/tree/0.7.0/src

What is not included yet but comes soon?

~~WebWorker~~
~~Worker for Node.js~~
~~Inline Worker (optionally)~~
~~Offset-Pagination~~
~~Export/Import~~
~~Tags~~
~~Bundles: Light, Compact, Full~~
Test Automation (needs to be migrated)
Benchmark Suite (almost done)

What is not included yet and comes later?

Engines (Custom Indexes)

What will be dropped?

Where-Clause
Index Information index.info()
Paging Cursor (was replaced by offset)

Builtin Profiles

memory (primarily optimized for memory)
performance (primarily optimized for performance)
match (primarily optimized for matching)
score (primarily optimized for scoring)
default (the default balanced profile)

These profiles cover standard use cases. To get the best out of your situation, it is recommended to apply a custom configuration instead of using a profile. Every profile could be optimized further for its specific task, e.g. a configuration tuned for extreme performance or for extreme memory savings.

Improvements

Bidirectional Context (the order of words can now vary; using bidirectional context does not increase memory)
New memory-friendly strategy for indexes (switchable, saves up to 50% of memory for each index, slightly decreases performance)
Better scoring calculation (one of the biggest concerns with the old implementation was that the order in which arrays were processed in the intersection affected the order of relevance in the final result)
Fixed resolution (in some cases the resolution in the old implementation was not fully stretched through the whole range)
Skip words (optionally, automatically skip words from the context chain which are too short)
Hugely improved performance of long queries (up to 450x faster!) and also memory allocation (up to 250x less memory)
New fast-update strategy (optional, hugely improves performance of all updates and removals of indexed contents, up to 2850x)
Improved auto-balanced cache (keep and expire cache by popularity)
Append contents to already existing entries (already indexed documents or contents)
New method "contain" to check if an ID was already indexed
Access documents directly from internal store (read/write)
Suggestions are hugely improved and fall back from context search all the way down to single term match
Document descriptor now has array support (optionally adds array entries via the new append under the hood to provide a unique relevance context for each entry)
Improved document storage handler
Results from a document index are now grouped by field (this is one of the few bigger breaking changes which requires migration of your old code)
Boolean search has a new concept (use in combination with the new result structure)
Node.js Worker Threads
Improved default latin encoders
New parallelization model and workload distribution
Improved Export/Import
Tag Search
Offset pagination
Enhanced Field Search
Improved sorting by relevance (score)
Added Context Scoring (context index has its own resolution)
Enhanced charset normalization
Improved bundler (support for inline WebWorker)

The whole "context" strategy has been greatly improved.

Migration Quick Overview

Defining document fields as object keys is no longer supported due to the unification of all option payloads.

A full configuration example for a context-based index:

var index = new Index({
    tokenize: "strict",
    resolution: 9,
    minlength: 3,
    optimize: true,
    fastupdate: true,
    cache: 100,
    context: {
        depth: 1,
        resolution: 3,
        bidirectional: true
    }
});

The resolution can also be set for the contextual index.

A full configuration example for a document based index:

const index = new Document({
    tokenize: "forward",
    optimize: true,
    resolution: 9,
    cache: 100,
    worker: true,
    document: {
        id: "id",
        tag: "tag",
        store: [
            "title", "content"
        ],
        index: [{
            field: "title",
            tokenize: "forward",
            optimize: true,
            resolution: 9
        },{
            field:  "content",
            tokenize: "strict",
            optimize: true,
            resolution: 9,
            minlength: 3,
            context: {
                depth: 1,
                resolution: 3
            }
        }]
    }
});

A full configuration example for a document search:

index.search({
    enrich: true,
    bool: "and",
    tag: ["cat", "dog"],
    index: [{
        field: "title",
        query: "some query",
        limit: 100,
        suggest: true
    },{
        field: "content",
        query: "same or other query",
        limit: 100,
        suggest: true
    }]
});

Index Stack-flow

The process flow of an index can be switched between "memory-optimized" and "default".

The default flow performs slightly faster because it provides two additional optimizations: 1. "fast fail" (stop early when a term was not found) and 2. "fast success" (return early when the result has reached the limit).

Default flow:
{term} => [score] => [ids]

Memory-optimized flow:
[score] => {term} => [ids]

The contextual index has the same schema as the lexical index above, but goes one level deeper:

Default flow:
{keyword} => {term} => [score] => [ids]

Memory-optimized flow:
[score] => {keyword} => {term} => [ids]
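To make the two layouts more tangible, here is a toy model of the lexical index in both flows. It is only a sketch under assumptions: the sample data, the three score slots and the functions lookupDefault and lookupMemory are made up for illustration and are not the library's internal code.

```javascript
// Toy data (assumed): 3 score slots, slot 0 holds the most relevant IDs.

// Default flow: {term} => [score] => [ids]
const defaultIndex = new Map([
    ["cat", [[1, 4], [2], []]],
    ["dog", [[3], [], [1]]]
]);

// Memory-optimized flow: [score] => {term} => [ids]
const memoryIndex = [
    new Map([["cat", [1, 4]], ["dog", [3]]]),
    new Map([["cat", [2]]]),
    new Map([["dog", [1]]])
];

function lookupDefault(index, term, limit) {
    const slots = index.get(term);
    // "fast fail": one lookup is enough to know the term does not exist
    if (!slots) return [];
    const result = [];
    for (const ids of slots) {
        for (const id of ids) {
            result.push(id);
            // "fast success": stop as soon as the limit is reached
            if (result.length === limit) return result;
        }
    }
    return result;
}

function lookupMemory(index, term, limit) {
    const result = [];
    // the term has to be looked up in every score slot,
    // even when it does not exist at all
    for (const terms of index) {
        const ids = terms.get(term);
        if (ids) result.push(...ids);
    }
    return result.slice(0, limit);
}

console.log(lookupDefault(defaultIndex, "cat", 2));   // [ 1, 4 ]
console.log(lookupMemory(memoryIndex, "cat", 10));    // [ 1, 4, 2 ]
console.log(lookupDefault(defaultIndex, "bird", 10)); // []
```

Both layouts hold the same information and return the same IDs in the same order. In this toy model the memory-optimized layout does not need empty slot arrays per term, at the price of one lookup per score slot.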

Tokenizer

The tokenizers are also available again. They affect the capability of matching partials (parts of a term).

strict
forward
reverse
full

Changes

I decided to use fewer parameter variations to make the code cleaner and to make type definitions more practicable.

Async

The "async" option was removed. Instead, you can call each method in its async version, e.g. index.addAsync or index.searchAsync.

The advantage is that you can now use both variations on the same index, whereas the old version performed all methods asynchronously once the option flag was set.

You can assign callbacks to each async function:

index.addAsync(id, content, function(){
    console.log("Task Done");
});

index.searchAsync(query, function(result){
    console.log("Results: ", result);
});

Or don't pass a callback function and get back a Promise instead:

index.addAsync(id, content).then(function(){
    console.log("Task Done");
});

index.searchAsync(query).then(function(result){
    console.log("Results: ", result);
});

Or use async and await:

async function add(){
    await index.addAsync(id, content);
    console.log("Task Done");
}

async function search(){
    const results = await index.searchAsync(query);
    console.log("Results: ", results);
}

Auto-Balanced Cache (By Popularity)

The cache was improved and has a new strategy for balancing and expiring cache entries by popularity.

Also, to prevent recursive inner calls to the same function, the cache gets a new method. Again, this makes it possible to switch between both kinds of queries (cached/uncached) on the same index.

You still need to initialize the cache and its limit when creating the index:

const index = new Index({ cache: 100 });

const results = index.searchCache(query);

A common scenario for using a cache is an autocomplete or instant search when typing.
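The exact balancing strategy isn't spelled out here, so the following is just one way to picture "keep and expire by popularity". It is a toy cache and not the FlexSearch implementation; the class PopularityCache, its hit counter and the fakeSearch function are assumptions made for this sketch.

```javascript
// Toy cache: NOT the FlexSearch implementation, only one possible model of
// "keep and expire cache entries by popularity".
class PopularityCache {
    constructor(limit) {
        this.limit = limit;
        this.entries = new Map(); // query => { value, hits }
    }
    get(query, compute) {
        let entry = this.entries.get(query);
        if (entry) {
            entry.hits++; // the query becomes more popular
            return entry.value;
        }
        if (this.entries.size >= this.limit) {
            // evict the least popular entry (fewest hits, oldest first on ties)
            let victim = null, min = Infinity;
            for (const [key, cached] of this.entries) {
                if (cached.hits < min) { min = cached.hits; victim = key; }
            }
            this.entries.delete(victim);
        }
        entry = { value: compute(query), hits: 1 };
        this.entries.set(query, entry);
        return entry.value;
    }
}

// Stand-in for index.search(query), e.g. during instant search while typing
const fakeSearch = query => [query.length];

const cache = new PopularityCache(2);
cache.get("fl", fakeSearch);
cache.get("fl", fakeSearch);   // served from the cache
cache.get("fle", fakeSearch);
cache.get("flex", fakeSearch); // cache is full, "fle" is the least popular

console.log([...cache.entries.keys()]); // [ 'fl', 'flex' ]
```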

Append Contents

You can now append contents to an existing index like this:

index.append(id, content);

This will not overwrite the old indexed contents, as index.update(id, content) would. Keep in mind that index.add(id, content) will also perform an "update" under the hood when the ID was already indexed.

Appended contents have their own context and also their own full resolution. Therefore, the relevance isn't stacked onto the existing content but starts in a fresh context.

Let us take this example:

index.add(0, "some index");
index.append(0, "some appended content");

index.add(1, "some text");
index.append(1, "index appended content");

When you query index.search("index"), you will get ID 1 as the first entry in the result, because the context starts from zero for the appended data (it isn't stacked on the old context) and here "index" is the first term.

If you don't want this behavior, just use index.add(id, content) and provide the full content at once.

Check existence of already indexed IDs

You can check if an ID was already indexed by:

if(index.contain(1)){
    console.log("ID is already in index");
}

Document Index

One of the key improvements is how documents are indexed and processed. Such an index gets its own class Document, which holds an instance of Index for each field under the hood. One advantage is that you can query every Index of a document directly if you like. That comes close to the old "tag" feature but runs significantly faster, e.g. when querying against just one field. This way a document logically divides and distributes your contents across multiple indexes, and a query against one field performs faster than with a non-document approach where you put all your data into one index. Of course, a query through more than one field can't beat the non-document index performance-wise.

Document Descriptor

Every document needs an ID. When your documents have no ID, you need to create one by passing an index, a count or something else as the ID (a value of type number is highly recommended). Those IDs are unique references to a given content. This is important when you update or add contents through existing IDs. When referencing is not a concern, you can simply use something like count++.

Assuming our document has a data structure like this:

{ 
    "id": 0, 
    "content": "some text"
}

Old syntax FlexSearch v0.6.3 (not supported anymore!):

const index = new Document({
    doc: {
        id: "id",
        field: ["content"]
    }
});

The document descriptor has slightly changed. There is no field branch anymore; instead, it moves one level higher, so the key becomes a main member of the options.

In the new syntax the field "doc" was renamed to document and the field "field" was renamed to index:

const index = new Document({
    document: {
        id: "id",
        index: ["content"]
    }
});

index.add({ 
    id: 0, 
    content: "some text"
});

The field id describes where the ID or unique key lives inside your documents. When not passed, the key defaults to id, so you can shorten the example above to:

const index = new Document({
    document: {
        index: ["content"]
    }
});

The member index holds the list of fields you want to be indexed from your documents. When selecting just one field, you can pass a string. When also using the default key id, this shortens to just:

const index = new Document({ document: "content" });
index.add({ id: 0, content: "some text" });

Assuming you have several fields, you can also pass custom options for each field:

const index = new Document({
    id: "id",
    index: [{
        field: "title",
        tokenize: "forward",
        optimize: true,
        resolution: 9
    },{
        field:  "content",
        tokenize: "strict",
        optimize: true,
        resolution: 9,
        minlength: 3,
        context: {
            depth: 1,
            resolution: 3
        }
    }]
});

Field options are inherited from the global options when those are also passed, e.g.:

const index = new Document({
    tokenize: "strict",
    optimize: true,
    resolution: 9,
    document: {
        id: "id",
        index:[{
            field: "title",
            tokenize: "forward"
        },{
            field: "content",
            minlength: 3,
            context: {
                depth: 1,
                resolution: 3
            }
        }]
    }
});

Note: The context options of the field "content" are in turn inherited from the corresponding field options, which themselves are inherited from the global options.
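One way to picture this inheritance is a shallow merge in which field options override global options. The helper effectiveFieldOptions below is hypothetical and only models that assumption; it does not cover the context inheritance from the note above and is not how the library resolves options internally.

```javascript
// Assumed behavior for illustration: a shallow merge where field options
// override global options. Context inheritance is not modeled here.
function effectiveFieldOptions(options) {
    const { document: descriptor, ...globalOptions } = options;
    return descriptor.index.map(entry => {
        // a plain string is treated as a field without custom options
        const fieldOptions = typeof entry === "string" ? { field: entry } : entry;
        return { ...globalOptions, ...fieldOptions };
    });
}

const resolved = effectiveFieldOptions({
    tokenize: "strict",
    optimize: true,
    resolution: 9,
    document: {
        id: "id",
        index: [
            { field: "title", tokenize: "forward" },
            { field: "content", minlength: 3, context: { depth: 1, resolution: 3 } }
        ]
    }
});

console.log(resolved[0].tokenize, resolved[0].resolution); // forward 9
console.log(resolved[1].tokenize, resolved[1].minlength);  // strict 3
```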

Nested Data Fields

{
  "record": {
    "id": 0,
    "title": "some title",
    "content": {
      "header": "some text",
      "footer": "some text"
    }
  }
}

Use the colon ":" syntax to name each field hierarchically starting from the root, e.g.:

const index = new Document({
    document: {
        id: "record:id",
        index: [
            "record:title",
            "record:content:header",
            "record:content:footer"
        ]
    }
});
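The colon notation can be read as a path through the nested object. The helper resolveField below is hypothetical (it is not part of FlexSearch) and only sketches how such a path maps to a value in the example document:

```javascript
// Hypothetical helper, not part of FlexSearch: resolves a colon-separated
// field path such as "record:content:header" against a plain object.
function resolveField(doc, path) {
    let value = doc;
    for (const key of path.split(":")) {
        // stop when the path leads through something that is not an object
        if (value === null || typeof value !== "object") return undefined;
        value = value[key];
    }
    return value;
}

const doc = {
    record: {
        id: 0,
        title: "some title",
        content: { header: "some text", footer: "some text" }
    }
};

console.log(resolveField(doc, "record:id"));             // 0
console.log(resolveField(doc, "record:content:header")); // some text
console.log(resolveField(doc, "record:missing:key"));    // undefined
```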

Only add fields you want to query against. Do not add fields to the index that you only need in the result (but do not query against). For this purpose you can store documents independently of the index (read below).

When you want to query through a field, you have to pass the exact key of the field as you defined it in the document descriptor (with colon syntax):

index.search(query, {
    index: [
        "record:title",
        "record:content:header",
        "record:content:footer"
    ]
});

Same as:

index.search(query, [
    "record:title",
    "record:content:header",
    "record:content:footer"
]);

Using field-specific options:

index.search([{
    field: "record:title",
    query: "some query",
    limit: 100,
    suggest: true
},{
    field: "record:title",
    query: "some other query",
    limit: 100,
    suggest: true
}]);

You can perform a search through the same field with different queries.

When passing field-specific options you need to provide the full configuration for each field. They are not inherited like in the document descriptor.

Complex Documents

You need to follow 2 rules for your documents:

The document cannot start with an Array at the root index. This would introduce sequential data and isn't supported yet. See below for a workaround for such data.

[ // <-- not allowed as document start!
  {
    "id": 0,
    "title": "title"
  }
]

The ID can't be nested inside an array (and none of its parent fields can be an array either). This would introduce sequential data and isn't supported yet. See below for a workaround for such data.

{
  "records": [ // <-- not allowed when ID or tag lives inside!
    {
      "id": 0,
      "title": "title"
    }
  ]
}

Here is an example of a supported complex document:

{
  "meta": {
    "tag": "cat",
    "id": 0
  },
  "contents": [
    {
      "body": {
        "title": "some title",
        "footer": "some text"
      },
      "keywords": ["some", "key", "words"]
    },
    {
      "body": {
        "title": "some title",
        "footer": "some text"
      },
      "keywords": ["some", "key", "words"]
    }
  ]
}

The corresponding document descriptor (when all fields should be indexed) looks like:

const index = new Document({
    document: {
        id: "meta:id",
        tag: "meta:tag",
        index: [
            "contents[]:body:title",
            "contents[]:body:footer",
            "contents[]:keywords"
        ]
    }
});

Again, when searching you have to use the same colon-separated string from your field definition.

index.search(query, { 
    index: "contents[]:body:title"
});

Not Supported Documents (Sequential Data)

This example breaks both rules from above:

[ // <-- not allowed as document start!
  {
    "tag": "cat",
    "records": [ // <-- not allowed when ID or tag lives inside!
      {
        "id": 0,
        "body": {
          "title": "some title",
          "footer": "some text"
        },
        "keywords": ["some", "key", "words"]
      },
      {
        "id": 1,
        "body": {
          "title": "some title",
          "footer": "some text"
        },
        "keywords": ["some", "key", "words"]
      }
    ]
  }
]

You need to apply some kind of structure normalization.

A workaround for such a data structure looks like this:

const index = new Document({
    document: {
        id: "record:id",
        tag: "tag",
        index: [
            "record:body:title",
            "record:body:footer",
            "record:keywords"
        ]
    }
});

function add(sequential_data){

    for(let x = 0, data; x < sequential_data.length; x++){

        data = sequential_data[x];

        for(let y = 0, record; y < data.records.length; y++){

            record = data.records[y];

            index.add({
                id: record.id,
                tag: data.tag,
                record: record
            });
        }
    }  
}

// now just use add() helper method as usual:

add([{
    // sequential structured data
    // take the data example above
}]);

You can skip the first loop when the outer array of your document data has just one entry.

Join / Append Arrays

In the complex example above, the field keywords is an array, but the notation did not use brackets like keywords[]. The array will still be detected, but instead of appending each entry to a new context, the array will be joined into one large string and added to the index.

The difference between both kinds of adding array contents is the relevance when searching. When each item of an array is added via append() to its own context by using the syntax field[], the last entry has the same relevance as the first entry. When you leave out the brackets in the notation, the array is joined into one whitespace-separated string. Here the first entry has the highest relevance, whereas the last entry has the lowest relevance.

So assuming the keywords from the example above are pre-sorted by relevance or popularity, you will want to keep this order (information of relevance). For this purpose do not add brackets to the notation. Otherwise, each entry would be put into a new scoring context (and the old order gets lost).

You can also leave out the bracket notation for better performance and a smaller memory footprint. Do so when you don't need the granularity of relevance per entry.
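The difference between both notations can be sketched with a small toy model. The functions toEntries and firstPositions are hypothetical, and the sketch assumes that a lower position of a term inside its context stands for a higher relevance; the real scoring is more involved.

```javascript
// Hypothetical sketch, not FlexSearch code: turn an array field into the
// text entries that get indexed.
function toEntries(values, separate) {
    // field[] : every item becomes its own entry (its own context)
    // field   : all items are joined into one whitespace-separated string
    return separate ? values.slice() : [values.join(" ")];
}

// Position of the first occurrence of each term inside its entry.
// Assumption: a lower position stands for a higher relevance.
function firstPositions(entries) {
    const positions = new Map();
    for (const entry of entries) {
        entry.split(" ").forEach((term, position) => {
            if (!positions.has(term)) positions.set(term, position);
        });
    }
    return Object.fromEntries(positions);
}

const keywords = ["some", "key", "words"];

// keywords[] : every keyword starts its own context at position 0
console.log(firstPositions(toEntries(keywords, true)));  // { some: 0, key: 0, words: 0 }

// keywords : the original order is kept as relevance information
console.log(firstPositions(toEntries(keywords, false))); // { some: 0, key: 1, words: 2 }
```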

Field Search

Search through all fields:

index.search(query);

Search through a specific field:

index.search(query, { index: "title" });

Search through a given set of fields:

index.search(query, { index: ["title", "content"] });

Same as:

index.search(query, ["title", "content"]);

Pass custom modifiers and queries to each field:

index.search([{
    field: "content",
    query: "some query",
    limit: 100,
    suggest: true
},{
    field: "content",
    query: "some other query",
    limit: 100,
    suggest: true
}]);

You can perform a search through the same field with different queries.

New Result Set

One of the few breaking changes which requires migration of your old implementation is the result set. I thought about it for a long time and came to the conclusion that this new structure might look weird at first, but it also comes with some nice new capabilities.

Schema of the result-set:

fields[] => { field, result[] => { document }}

The first level is an array of the fields the query was applied to. Each of these fields has a record (object) with 2 properties, "field" and "result". The "result" is also an array and holds the result for this specific field. The result can be an array of IDs or, when enriched, an array of entries with stored document data.

A non-enriched result set now looks like:

[{
    field: "title",
    result: [0, 1, 2]
},{
    field: "content",
    result: [3, 4, 5]
}]

An enriched result set now looks like:

[{
    field: "title",
    result: [
        { id: 0, doc: { /* document */ }},
        { id: 1, doc: { /* document */ }},
        { id: 2, doc: { /* document */ }}
    ]
},{
    field: "content",
    result: [
        { id: 3, doc: { /* document */ }},
        { id: 4, doc: { /* document */ }},
        { id: 5, doc: { /* document */ }}
    ]
}]

When using pluck instead of "field" you can explicitly select just one field and get back a flat representation:

index.search(query, { pluck: "title", enrich: true });

[
    { id: 0, doc: { /* document */ }},
    { id: 1, doc: { /* document */ }},
    { id: 2, doc: { /* document */ }}
]

This change is basically based on "boolean search". Instead of applying your bool logic to a nested object (which almost always ends in structural hell), you can apply your logic yourself on top of the result set dynamically. This opens up huge capabilities in how you process the results. Therefore, the results from the fields aren't squashed into one result anymore. That keeps some important information, like the name of the field as well as the relevance of each field's results, which no longer get mixed.

A field search applies a query with the boolean "or" logic by default. Each field has its own result for the given query.

There is one situation where the bool property is still supported: when you want to switch the default "or" logic of the field search to "and", e.g.:

index.search(query, { 
    index: ["title", "content"],
    bool: "and" 
});

You will only get results which contain the query in both fields. That's it.
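Applying your own logic on top of the grouped result set can be as small as the two helpers below. They are a sketch that only assumes the non-enriched result shape shown above; the names union and intersect and the sample IDs are made up for illustration.

```javascript
// Result set shaped like the non-enriched example: fields[] => { field, result[] }
const resultSet = [
    { field: "title", result: [0, 1, 2] },
    { field: "content", result: [2, 3, 0] }
];

// "or": every ID that matched in at least one field, first occurrence wins
function union(fields) {
    return [...new Set(fields.flatMap(entry => entry.result))];
}

// "and": only IDs that matched in every field, in the order of the first field
function intersect(fields) {
    if (fields.length === 0) return [];
    const [first, ...rest] = fields.map(entry => entry.result);
    const sets = rest.map(ids => new Set(ids));
    return first.filter(id => sets.every(set => set.has(id)));
}

console.log(union(resultSet));     // [ 0, 1, 2, 3 ]
console.log(intersect(resultSet)); // [ 0, 2 ]
```

Because the field name stays attached to each partial result, you could also weight fields differently before merging, which a pre-squashed result would not allow.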

Tag Search

You can also fetch results from one or more tags when no query was passed:

index.search({ tag: ["cat", "dog"] });

In this case the result-set looks like:

[{
    tag: "cat",
    result: [ /* all cats */ ]
},{
    tag: "dog",
    result: [ /* all dogs */ ]
}]

Limit & Offset

By default, every query is limited to 100 entries. Unbounded queries lead to issues. You need to set the limit as an option to adjust the size.

You can set the limit and the offset for each query:

index.search(query, { limit: 20, offset: 100 });

You cannot pre-count the size of the result set. That's a limitation by design of FlexSearch. When you really need a count of all the results you want to page through, just assign a high enough limit, get back all results and apply your paging offset manually (this also works on the server side). FlexSearch is fast enough that this isn't an issue.
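Manual paging over a complete result could look like the following sketch. The helper paginate is hypothetical, and the array of 45 IDs merely stands in for what a search with a high enough limit would return.

```javascript
// Hypothetical helper: pages through an already fetched, complete result array.
function paginate(results, limit, offset) {
    return {
        total: results.length,                      // the count you could not get upfront
        page: results.slice(offset, offset + limit) // the entries of the requested page
    };
}

// Stand-in for a complete result, e.g. fetched with a high limit
const all = Array.from({ length: 45 }, (_, i) => i);

const { total, page } = paginate(all, 20, 40);
console.log(total, page); // 45 [ 40, 41, 42, 43, 44 ]
```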

Document Stores

Do not use a store when: 1. an array of IDs as the result is good enough, or 2. you already have the contents/documents stored elsewhere (outside the index).

Only a document index can have a store. You can use a document index instead of a flat index to get this functionality even when only storing ID-content pairs.

This will add the whole original content to the store:

const index = new Document({
    document: { 
        index: "content",
        store: true
    }
});
index.add({ id: 0, content: "some text" });

Access documents from internal store

You can get indexed documents from the store:

var data = index.get(1);

You can update/change store contents directly without changing the index by:

index.set(1, data);

To update the store and also update the index, just use index.update, index.add or index.append.

When you perform a query, whether on a document index or a flat index, you will get back an array of IDs by default.

Optionally you can enrich the query results automatically with stored contents by:

index.search(query, { enrich: true });

Your results now look like:

[{
    id: 0,
    doc: { /* content from store */ }
},{
    id: 1,
    doc: { /* content from store */ }
}]

Configure Storage (Recommended)

This will add just specific fields from a document to the store (the ID isn't necessary to keep in store):

const index = new Document({
    document: {
        index: "content",
        store: ["author", "email"]
    }
});

index.add(id, content);

You can configure independently what should be indexed and what should be stored. It is highly recommended to make use of this whenever you can.

Here is a useful example of configuring doc and store:

const index = new Document({
    document: { 
        index: "content",
        store: ["author", "email"] 
    }
});

index.add({
    id: 0,
    author: "Jon Doe",
    email: "john@mail.com",
    content: "Some content for the index ..."
});

You can query through the contents and will get back the stored values instead:

index.search("some content", { enrich: true });

Your results now look like:

[{
    field: "content",
    result: [{
        id: 0,
        doc: {
            author: "Jon Doe",
            email: "john@mail.com",
        }
    }]
}]

Both fields "author" and "email" are not indexed.

Worker Parallelism (Browser + Node.js)

The whole worker implementation has changed, this time keeping Node.js support in mind. The good news is that the library now also supports workers in Node.js.

One important change is how workers divide their tasks and how contents are distributed. One big issue was that in the old model the workers were cycled for each task (round robin). Theoretically that provides an optimal balance of workload and storage. But it breaks the internal architecture of this search library, and almost every performance optimization gets lost.

Let us take an example. Assuming you have 4 workers and you add 4 contents to the index, each content is delegated to one worker (a perfect balance, but each worker's index becomes a partial index).

Old syntax FlexSearch v0.6.3 (not supported anymore!):

const index = new FlexSearch({ worker: 4 });
index.add(1, "some")
     .add(2, "content")
     .add(3, "to")
     .add(4, "index");

Worker 1: { 1: "some" }
Worker 2: { 2: "content" }
Worker 3: { 3: "to" }
Worker 4: { 4: "index" }

The issue starts when you query a term. Each worker has to resolve the search on its own index and has to send the results back so the intersection calculation can be applied. That's the problem. None of the workers can solve a search task completely; they have to transmit intermediate results back. Therefore, no optimization path can be applied early, because every worker has to send back the full (non-limited) result first.

The new worker model from v0.7.0 is divided by the "fields" of the document (1 worker = 1 field index). This way each worker is able to solve tasks (subtasks) completely. The downside of this paradigm is that the workers might not be perfectly balanced in storing contents (fields may have different lengths of content). On the other hand, there is no indication that balancing the storage gives any advantage (they all require the same amount in total).

const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});

index.add({ 
    id: 1, tag: "cat", name: "Tom", title: "some", text: "some" 
}).add({
    id: 2, tag: "dog", name: "Ben", title: "title", text: "content" 
}).add({ 
    id: 3, tag: "cat", name: "Max", title: "to", text: "to" 
}).add({ 
    id: 4, tag: "dog", name: "Tim", title: "index", text: "index" 
});

Worker 1: { 1: "cat", 2: "dog", 3: "cat", 4: "dog" }
Worker 2: { 1: "Tom", 2: "Ben", 3: "Max", 4: "Tim" }
Worker 3: { 1: "some", 2: "title", 3: "to", 4: "index" }
Worker 4: { 1: "some", 2: "content", 3: "to", 4: "index" }

When you perform a field search through all fields, this task is perfectly balanced across all workers, which can solve their subtasks independently.
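The two distribution models can be compared with plain objects standing in for the workers. This is only a sketch: distributeRoundRobin and distributeByField are hypothetical helpers, and no real worker is involved.

```javascript
// Old model (sketch): contents are cycled through the workers one by one,
// so every worker ends up with a partial index.
function distributeRoundRobin(contents, workerCount) {
    const workers = Array.from({ length: workerCount }, () => ({}));
    contents.forEach(([id, text], i) => {
        workers[i % workerCount][id] = text;
    });
    return workers;
}

// New model (sketch): one worker per field, each one holding that field
// for all documents, so it can answer a field query on its own.
function distributeByField(docs, fields) {
    return fields.map(field => {
        const worker = {};
        for (const doc of docs) worker[doc.id] = doc[field];
        return worker;
    });
}

const contents = [[1, "some"], [2, "content"], [3, "to"], [4, "index"]];
console.log(distributeRoundRobin(contents, 4));
// [ { '1': 'some' }, { '2': 'content' }, { '3': 'to' }, { '4': 'index' } ]

const docs = [
    { id: 1, tag: "cat", name: "Tom" },
    { id: 2, tag: "dog", name: "Ben" }
];
console.log(distributeByField(docs, ["tag", "name"]));
// [ { '1': 'cat', '2': 'dog' }, { '1': 'Tom', '2': 'Ben' } ]
```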

WorkerIndex (Adapter)

Above we have seen that documents create a worker automatically for each field. You can also create a WorkerIndex directly (the same way as using Index instead of Document).

Use as ES6 module:

import WorkerIndex from "./worker/index.js";
const index = new WorkerIndex(options);
index.add(1, "some")
     .add(2, "content")
     .add(3, "to")
     .add(4, "index");

Or when the bundled version is used instead:

var index = new FlexSearch.Worker(options);
index.add(1, "some")
     .add(2, "content")
     .add(3, "to")
     .add(4, "index");

Such a WorkerIndex works pretty much the same as an instance of Index.

A WorkerIndex only supports the async variant of all methods. That means when you call index.search() on a WorkerIndex, it will also run asynchronously, the same way index.searchAsync() does.

Worker Threads (Node.js)

The worker model for Node.js is based on "worker threads" and works exactly the same way:

const { Document } = require("flexsearch");

const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});

Or create a single worker instance for a non-document index:

const { Worker } = require("flexsearch");
const index = new Worker(options);

The Worker Async Model (Best Practices)

A worker always runs asynchronously. On a query method call you should always handle the returned promise (e.g. use await) or pass a callback function as the last parameter.

const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});

All requests and sub-tasks will run in parallel (prioritize "all tasks completed"):

index.searchAsync(query, callback);
index.searchAsync(query, callback);
index.searchAsync(query, callback);

Also (prioritize "all tasks completed"):

index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);

Or, when you need just one callback once all requests are done, simply use Promise.all(), which also prioritizes "all tasks completed":

Promise.all([
    index.searchAsync(query),
    index.searchAsync(query),
    index.searchAsync(query)
]).then(callback);

Inside the callback of Promise.all() you also get an array of results as the first parameter, with one entry for each query you passed in.

With await you can prioritize the order instead (prioritize "first task completed"): requests are resolved one by one, and only the sub-tasks of each request run in parallel:

await index.searchAsync(query);
await index.searchAsync(query);
await index.searchAsync(query);

The same applies to index.add(), index.append(), index.remove() and index.update(). There is one special case here which the library does not prevent, but which you need to keep in mind when using workers.

When you call the "synced" version on a worker index:

index.add(doc);
index.add(doc);
index.add(doc);
// the contents aren't indexed yet,
// they are just queued on the message channel

Of course you can do that, but keep in mind that the main thread does not have an additional queue for distributed worker tasks. Running these calls in a long loop floods the message channel, because each call internally invokes worker.postMessage(). Luckily, the browser and Node.js handle such incoming tasks for you automatically (as long as enough free RAM is available). When you use the "synced" version on a worker index, the content is not yet indexed on the next line, because all calls are treated as async by default.

When adding, updating or removing large amounts of content (or at high frequency), it is recommended to use the async version together with async/await to keep a low memory footprint during long-running processes.
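
The recommendation above can be sketched as follows. This is a minimal sketch under assumptions: addInBatches and its batchSize parameter are hypothetical names, not part of the library, and the index is assumed to expose addAsync(id, content) returning a promise (the async counterpart of add()).

```javascript
// Hypothetical helper (not part of FlexSearch): sends documents to the
// index in small batches and waits for each batch to finish, so that at
// most `batchSize` messages are pending on the channel at any time.
async function addInBatches(index, docs, batchSize = 100) {
    let added = 0;
    for (let i = 0; i < docs.length; i += batchSize) {
        const batch = docs.slice(i, i + batchSize);
        // Assumption: index.addAsync(id, content) returns a promise.
        await Promise.all(
            batch.map(doc => index.addAsync(doc.id, doc.content))
        );
        added += batch.length;
    }
    return added;
}
```

With batchSize set to 1 this degrades to a strictly sequential loop; larger values trade a higher peak memory footprint for fewer round trips.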

Export / Import

Export

The export has changed slightly. It now consists of several smaller parts instead of one large bulk. You need to pass a callback function which takes two arguments, "key" and "data". This callback function is called for each part, e.g.:

index.export(function(key, data){ 
    
    // you need to store both the key and the data!
    // e.g. use the key for the filename and save your data
    
    localStorage.setItem(key, data);
});

Exporting data to localStorage isn't really good practice, but if size is not a concern, then use it if you like. The export primarily exists for use in Node.js, or to store indexes you want to deliver from a server to the client.

The size of the export corresponds to the memory consumption of the library. To reduce the export size, use a configuration with a smaller memory footprint (see the table at the bottom for information about configurations and their memory allocation).

When your save routine runs asynchronously you have to return a promise:

index.export(function(key, data){ 
    
    return new Promise(function(resolve){
        
        // do the saving as async

        resolve();
    });
});

You cannot export the additional table used by the "fastupdate" feature. This table consists of references, and when stored these get fully serialized and become too large. The library handles this automatically for you. When importing data, the index automatically disables "fastupdate".

Import

Before you can import data, you need to create your index first. For document indexes, provide the same document descriptor you used when exporting the data. This configuration isn't stored in the export.

var index = new Index({ ... });

To import the data just pass a key and data:

index.import(key, localStorage.getItem(key));

You need to import every key! Otherwise, your index will not work. You need to store the keys from the export and use these keys for the import (the order of the keys can differ).

This is just for demonstration and is not recommended, because you might have other keys in your localStorage which are not valid import data:

var keys = Object.keys(localStorage);

for(let i = 0, key; i < keys.length; i++){
    
    key = keys[i];
    index.import(key, localStorage.getItem(key));
}
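
A safer variant of the loop above is to remember exactly which keys the export produced and import only those. The helpers below are a sketch, not part of FlexSearch: exportToStore, importFromStore and the store object (anything with get and set, such as a Map) are hypothetical names. The sketch also assumes the export has delivered all parts before the key list is read; the text above does not say how completion is signalled.

```javascript
// Hypothetical helper: exports every part into `store` and records its key.
function exportToStore(index, store) {
    const keys = [];
    index.export(function (key, data) {
        keys.push(key);       // remember every key for the later import
        store.set(key, data); // store both the key and the data
    });
    return keys;              // filled as the parts arrive
}

// Hypothetical helper: imports only the keys recorded during the export,
// so unrelated entries in the same storage are never passed to the index.
function importFromStore(index, store, keys) {
    for (const key of keys) {
        index.import(key, store.get(key));
    }
}
```

The key list itself has to be persisted alongside the parts (for example as one extra JSON entry) so that it is available when the index is restored.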

Benchmark (Search)

Contextual Search

FlexSearch v0.6.3 (fastest profile):

query-single 4313828 op/s, Memory: 1
query-multi  1526028 op/s, Memory: 1
query-long     57181 op/s, Memory: 8
query-dupes  1460489 op/s, Memory: 1
not-found    2423155 op/s, Memory: 1

FlexSearch v0.7.0 (equivalent profile):

query-single 7344119 op/s, Memory: 1
query-multi  2460401 op/s, Memory: 1
query-long    931957 op/s, Memory: 1
query-dupes  2137628 op/s, Memory: 1
not-found    3028110 op/s, Memory: 1

This is a performance gain of up to 16x.

Lexical Search

FlexSearch v0.6.3 (fastest profile):

query-single 4154241 op/s, Memory: 1
query-multi   175687 op/s, Memory: 3
query-long      1453 op/s, Memory: 516
query-dupes   969917 op/s, Memory: 1
not-found    2289013 op/s, Memory: 1

There was a performance bottleneck with extra-long queries (for this test I've picked a worst-case scenario).

FlexSearch v0.7.0 (equivalent profile):

query-single 7362096 op/s, Memory: 1
query-multi   580524 op/s, Memory: 4
query-long    645983 op/s, Memory: 2
query-dupes  2136893 op/s, Memory: 1
not-found    3061433 op/s, Memory: 1

This is a performance gain of up to 450x, along with a reduction in memory allocation of up to 250x.

Search + Cache

FlexSearch v0.6.3:

query-single  2342487 op/s, Memory: 1
query-multi   2445660 op/s, Memory: 1
query-long    3823374 op/s, Memory: 1
query-dupes   4162607 op/s, Memory: 1
not-found     3858238 op/s, Memory: 1

A fun fact is that the new version is almost as fast as the old version with cache enabled.

FlexSearch v0.7.0:

query-single 29266333 op/s, Memory: 1
query-multi  35164612 op/s, Memory: 1
query-long   33610046 op/s, Memory: 1
query-dupes  30240771 op/s, Memory: 1
not-found    36181951 op/s, Memory: 1

This is a performance gain of up to 14x.

Benchmark (Add, Update, Delete)

One part which was also massively improved is the update and removal of indexed contents via index.update(id, content) or index.remove(id). This was the worst-case scenario for FlexSearch.

The new option flag "fastupdate" makes use of an additional register and speeds up all updates and removals of already indexed contents by a factor of up to 2850x. This additional register comes with a moderate memory cost (+5%). When already indexed contents need to be updated frequently, this option is highly recommended. When you only add new contents (with new IDs), this option is useless and the extra memory cost isn't worth it.

FlexSearch v0.6.3 (fastest profile):

add            84788 op/s, Memory: 166
update           717 op/s, Memory: 785
remove          1186 op/s, Memory: 535

FlexSearch v0.7.0 (equivalent profile):

add           261529 op/s, Memory: 3
update          3043 op/s, Memory: 113
remove          5572 op/s, Memory: 530

This is a performance gain of up to 5x.

FlexSearch v0.7.0 + "fastupdate" enabled:

add           261172 op/s, Memory: 3
update        238025 op/s, Memory: 1
remove       3430364 op/s, Memory: 1

This is a performance gain of up to 2850x.
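
The factors quoted in this section can be reproduced from the op/s columns of the three tables above. The snippet below is only a small arithmetic check; the speedup helper is a hypothetical name introduced for illustration.

```javascript
// op/s figures copied from the three tables above.
const v063     = { add: 84788,  update: 717,    remove: 1186 };
const v070     = { add: 261529, update: 3043,   remove: 5572 };
const v070fast = { add: 261172, update: 238025, remove: 3430364 };

// Hypothetical helper: ratio candidate/baseline per operation,
// rounded to one decimal place.
function speedup(baseline, candidate) {
    const result = {};
    for (const op of Object.keys(baseline)) {
        result[op] = Number((candidate[op] / baseline[op]).toFixed(1));
    }
    return result;
}

console.log(speedup(v063, v070));     // { add: 3.1, update: 4.2, remove: 4.7 }
console.log(speedup(v063, v070fast)); // { add: 3.1, update: 332, remove: 2892.4 }
```

Adding new contents is unaffected by "fastupdate"; the large factors come entirely from update and remove.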

Contextual Search

The advantage of a contextual search is relevance scoring which takes the distance between terms in the indexed documents into account. That brings relevance search to a completely new level compared to TF-IDF. In fact, TF-IDF tells you nothing about the relevance of a query which consists of multiple terms. TF-IDF is only useful for single-term queries, and FlexSearch also uses it as a fallback for this purpose.

The context applies to queries which have more than one term and grows with each additional term. Often you will need 3 or 4 words to get the perfect match in a complex document. A nice bonus is the performance boost you get because the intersection calculations on multi-term queries are cut down internally.

The contextual index is an additional index to the pre-scored lexical standard index. This addition comes with a memory cost.
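
To make the idea of distance-based scoring concrete, here is a toy illustration. It is explicitly not FlexSearch's internal algorithm: contextScore and its parameters are hypothetical, and the scoring formula (1 divided by the distance) is an arbitrary choice made for this sketch. It only shows why terms that appear close together can rank higher than the same terms far apart.

```javascript
// Toy illustration only, NOT the library's implementation.
// For each pair of neighbouring query terms, find the smallest distance
// (up to `depth` positions, in both directions) at which they occur in
// the text, and add 1/distance to the score.
function contextScore(text, query, depth = 2) {
    const tokens = text.toLowerCase().split(/\s+/).filter(Boolean);
    const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
    let score = 0;
    for (let t = 0; t + 1 < terms.length; t++) {
        let best = Infinity;
        for (let i = 0; i < tokens.length; i++) {
            if (tokens[i] !== terms[t]) continue;
            for (let d = 1; d <= depth; d++) {
                if (tokens[i + d] === terms[t + 1] ||
                    tokens[i - d] === terms[t + 1]) {
                    best = Math.min(best, d);
                    break; // closest neighbour for this occurrence found
                }
            }
        }
        if (best !== Infinity) score += 1 / best;
    }
    return score;
}

console.log(contextScore("the quick fox", "quick fox"));       // 1
console.log(contextScore("the quick brown fox", "quick fox")); // 0.5
console.log(contextScore("quick a b c fox", "quick fox"));     // 0
```

Each additional query term adds another pair to check, which mirrors the statement above that the context grows with every additional term.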

Memory Allocation

The book "Gulliver's Travels" (Jonathan Swift, 1726) was fully indexed for the examples below.

The most memory-optimized meaningful setting allocates just 1.2 MB for the whole indexed book! This is probably the smallest memory footprint you will get from a search library.

import { encode } from "./lang/latin/extra.js";

const index = new Index({
    encode: encode,
    tokenize: "strict",
    optimize: true,
    resolution: 1,
    minlength: 3,
    fastupdate: false,
    context: false
});

Compare Impact of Memory Allocation

by default a lexical index is very small:
depth: 0, bidirectional: 0, resolution: 3, minlength: 0 => 2.1 Mb

a higher resolution will increase the memory allocation:
depth: 0, bidirectional: 0, resolution: 9, minlength: 0 => 2.9 Mb

using the contextual index will increase the memory allocation:
depth: 1, bidirectional: 0, resolution: 9, minlength: 0 => 12.5 Mb

a higher contextual depth will increase the memory allocation:
depth: 2, bidirectional: 0, resolution: 9, minlength: 0 => 21.5 Mb

a higher minlength will decrease memory allocation:
depth: 2, bidirectional: 0, resolution: 9, minlength: 3 => 19.0 Mb

using bidirectional will decrease memory allocation:
depth: 2, bidirectional: 1, resolution: 9, minlength: 3 => 17.9 Mb

enabling the option "fastupdate" will increase memory allocation:
depth: 2, bidirectional: 1, resolution: 9, minlength: 3 => 6.3 Mb

Full Comparison Table

Every search library constantly has to balance these 4 competing properties:

Memory Allocation
Performance
Matching Capabilities
Relevance Order (Scoring)

FlexSearch provides many parameters you can use to find the optimal balance for your specific use case.


Modifier	Memory Impact *	Performance Impact **	Matching Impact **	Scoring Impact **
resolution	+1 (per level)	+1 (per level)	0	+2 (per level)
depth	+4 (per level)	-1 (per level)	-10 + depth	+10
minlength	-2 (per level)	+2 (per level)	-3 (per level)	+2 (per level)
bidirectional	-2	0	+3	-1
fastupdate	+1	+10 (update, remove)	0	0
optimize: true	-7	-1	0	-3
encoder: "icase"	0	0	0	0
encoder: "simple"	-2	-1	+2	0
encoder: "advanced"	-3	-2	+4	0
encoder: "extra"	-5	-5	+6	0
encoder: "soundex"	-6	-2	+8	0
tokenize: "strict"	0	0	0	0
tokenize: "forward"	+3	-2	+5	0
tokenize: "reverse"	+5	-4	+7	0
tokenize: "full"	+8	-5	+10	0
document index	+3 (per field)	-1 (per field)	0	0
document tags	+1 (per tag)	-1 (per tag)	0	0
store: true	+5 (per document)	0	0	0
store: [fields]	+1 (per field)	0	0	0
cache: true	+10	+10	0	0
cache: 100	+1	+9	0	0
type of ids: number	0	0	0	0
type of ids: string	+3	-3	0	0

* range from -10 to 10, lower is better (-10 => big decrease, 0 => unchanged, +10 => big increase)
** range from -10 to 10, higher is better
