Worker Parallelism (Browser + Node.js)
The worker model introduced in v0.7.0 splits a document index by its fields (1 worker = 1 field index). This way each worker can solve its tasks (subtasks) completely on its own. The downside of this paradigm is that the workers might not be perfectly balanced in terms of stored contents (fields may differ in content length). On the other hand there is no indication that balancing the storage gives any advantage (the total amount stays the same either way).
When using a document index, just apply the option "worker":
const index = new Document({
index: ["tag", "name", "title", "text"],
worker: true
});
index.add({
id: 1, tag: "cat", name: "Tom", title: "some", text: "some"
}).add({
id: 2, tag: "dog", name: "Ben", title: "title", text: "content"
}).add({
id: 3, tag: "cat", name: "Max", title: "to", text: "to"
}).add({
id: 4, tag: "dog", name: "Tim", title: "index", text: "index"
});
The fields are then distributed to dedicated workers like this:
Worker 1: { 1: "cat", 2: "dog", 3: "cat", 4: "dog" }
Worker 2: { 1: "Tom", 2: "Ben", 3: "Max", 4: "Tim" }
Worker 3: { 1: "some", 2: "title", 3: "to", 4: "index" }
Worker 4: { 1: "some", 2: "content", 3: "to", 4: "index" }
When you perform a search across all fields, the task is balanced perfectly across all workers, which resolve their subtasks independently.
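For example, a single query over all fields is dispatched to every worker at once. A minimal sketch against the index above (the result shape follows the usual document search, grouped per field):
// each worker resolves its own field independently
const results = await index.searchAsync("some");
// e.g. [{ field: "title", result: [1] }, { field: "text", result: [1] }]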
Worker Index
Above we have seen that documents will create a worker automatically for each field. You can also create a WorkerIndex directly (just like using Index instead of Document).
Use as ES6 module:
import WorkerIndex from "./worker/index.js";
const index = new WorkerIndex(options);
index.add(1, "some")
.add(2, "content")
.add(3, "to")
.add(4, "index");
Or when using the bundled version instead:
var index = new FlexSearch.Worker(options);
index.add(1, "some")
.add(2, "content")
.add(3, "to")
.add(4, "index");
Such a WorkerIndex works pretty much the same as a created instance of Index.
A WorkerIndex only supports the async variant of all methods. That means when you call index.search() on a WorkerIndex, it performs async in the same way as index.searchAsync() does.
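For example (a minimal sketch), both calls below behave identically on a WorkerIndex:
// both return a promise, regardless of which method name is used
const result = await index.search("content");
const same = await index.searchAsync("content");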
Worker Threads (Node.js)
The worker model for Node.js is based on "worker threads" and works exactly the same way:
const { Document } = require("flexsearch");
const index = new Document({
index: ["tag", "name", "title", "text"],
worker: true
});
Or create a single worker instance for a non-document index:
const { Worker } = require("flexsearch");
const index = new Worker(options);
The Worker Async Model (Best Practices)
A worker always performs async. On a query method call you should always handle the returned promise (e.g. use await) or pass a callback function as the last parameter.
const index = new Document({
index: ["tag", "name", "title", "text"],
worker: true
});
All requests and sub-tasks will run in parallel (prioritizing "all tasks completed"):
index.searchAsync(query, callback);
index.searchAsync(query, callback);
index.searchAsync(query, callback);
Also (prioritizing "all tasks completed"):
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
Or when you want just one callback after all requests are done, simply use Promise.all(), which also prioritizes "all tasks completed":
Promise.all([
index.searchAsync(query),
index.searchAsync(query),
index.searchAsync(query)
]).then(callback);
Inside the callback of Promise.all() you will get an array of results as the first parameter, one entry for each query you passed in.
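A minimal sketch of how the resolved array maps back to the requests (query_1 to query_3 are placeholder queries):
Promise.all([
    index.searchAsync(query_1),
    index.searchAsync(query_2),
    index.searchAsync(query_3)
]).then(function(results){
    // the results keep the same order as the requests
    const [result_1, result_2, result_3] = results;
});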
When using await you can prioritize the order (prioritizing "first task completed") and resolve the requests one by one, while the sub-tasks are still processed in parallel:
await index.searchAsync(query);
await index.searchAsync(query);
await index.searchAsync(query);
The same applies to index.add(), index.append(), index.remove() or index.update(). There is one special case here which isn't disabled by the library, but which you need to keep in mind when using Workers.
When you call the "synced" version on a worker index:
index.add(doc);
index.add(doc);
index.add(doc);
// contents aren't indexed yet,
// they are just queued on the message channel
Of course, you can do that, but keep in mind that the main thread does not have an additional queue for distributed worker tasks. Running these calls in a long loop fires content massively to the message channel via worker.postMessage() internally. Luckily the browser and Node.js will handle such incoming tasks automatically (as long as enough free RAM is available). When using the "synced" version on a worker index, the content isn't guaranteed to be indexed on the very next line, because all calls are treated as async by default.
When adding/updating/removing large bulks of content (or at high frequency), it is recommended to use the async variants along with async/await to keep a low memory footprint during long-running processes.
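A minimal sketch of such a bulk insert (docs is a placeholder for your array of documents, added to a document index as above):
// await each add so the message channel doesn't pile up
for(let i = 0; i < docs.length; i++){
    await index.addAsync(docs[i]);
}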
Extern Worker Configuration
When using Worker and also assigning custom functions to the options, e.g.:
- Custom Encoder
- Custom Encoder methods (normalize, prepare, finalize)
- Custom Score (function)
- Custom Filter (function)
- Custom Fields (function)
... then you'll need to move your field configuration into a file which exports the configuration as a default export. The field configuration is not the whole Document-Descriptor.
When not using custom functions in combination with Worker you can skip this part.
Since every field resolves into a dedicated Worker, every field which includes custom functions should have its own configuration file accordingly.
Let's take this document descriptor:
{
document: {
index: [{
// this is the field configuration
// ---->
field: "custom_field",
custom: function(data){
return "custom field content";
}
// <------
}]
}
};
The configuration which needs to be available as a default export is:
{
field: "custom_field",
custom: function(data){
return "custom field content";
}
};
You're welcome to make suggestions on how to improve the handling of extern configuration.
Example Node.js:
An extern configuration for one WorkerIndex; let's assume it is located in ./custom_field.js:
const { Charset } = require("flexsearch");
const { LatinSimple } = Charset;
// it requires a default export:
module.exports = {
encoder: LatinSimple,
tokenize: "forward",
// custom function:
custom: function(data){
return "custom field content";
}
};
Create the Worker Index with the configuration above:
const { Document } = require("flexsearch");
const flexsearch = new Document({
worker: true,
document: {
index: [{
// the field name needs to be set here
field: "custom_field",
// path to your config from above:
config: "./custom_field.js",
}]
}
});
Browser (ESM)
An extern configuration for one WorkerIndex; let's assume it is located in ./custom_field.js:
import { Charset } from "./dist/flexsearch.bundle.module.min.js";
const { LatinSimple } = Charset;
// it requires a default export:
export default {
encoder: LatinSimple,
tokenize: "forward",
// custom function:
custom: function(data){
return "custom field content";
}
};
Create Worker Index with the configuration above:
import { Document } from "./dist/flexsearch.bundle.module.min.js";
// you will need to await the response!
const flexsearch = await new Document({
worker: true,
document: {
index: [{
// the field name needs to be set here
field: "custom_field",
// Absolute URL to your config from above:
config: "http://localhost/custom_field.js"
}]
}
});
Here an absolute URL is needed, because the WorkerIndex context is of type Blob and you can't use relative URLs from within this context.
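If you prefer not to hard-code the host, one option (a sketch, assuming an ESM context) is to resolve the absolute URL at runtime:
// resolve the relative path against the current module URL
const config_url = new URL("./custom_field.js", import.meta.url).href;
// then pass it as: config: config_url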
Test Case
As a test the whole IMDB data collection was indexed, consisting of:
- JSON Documents: 9,273,132
- Fields: 83,458,188
- Tokens: 128,898,832
The index configuration used has 2 fields (with a bidirectional context of depth: 1), 1 custom field, 2 tags and a full datastore of all input JSON documents.
A non-Worker Document index requires 181 seconds to index all contents.
The Worker index takes just 32 seconds to index them all, by processing every field and tag in parallel. For such a large amount of content this is quite an impressive result.
Export / Import Worker Indexes (Node.js)
Workers save/load their data on their own (dedicated) and do not need the message channel for the data transfer.
Basic Worker Index
This feature follows the strategy of using Extern Worker Configuration in combination with Basic Export Import.
Example (CommonJS): basic-worker-export-import
Example (ESM): basic-worker-export-import
Provide the index configuration and keep it, because it isn't stored. Provide a parameter config which includes the filepath to the extern configuration file:
const dirname = import.meta.dirname;
const config = {
tokenize: "forward",
config: dirname + "/config.js"
};
Any changes you make to the configuration will almost certainly require a full re-index.
Provide the extern configuration file, e.g. /config.js, as a default export including the methods export and import:
import { promises as fs } from "fs";
export default {
tokenize: "forward",
export: async function(key, data){
// like the usual export write files by key + data
await fs.writeFile("./export/" + key, data, "utf8");
},
import: async function(index){
// get the file list of the export directory
const files = await fs.readdir("./export/");
// loop through the files and push their contents to the index
// by also passing the filename as the first parameter
for(let i = 0; i < files.length; i++){
const data = await fs.readFile("./export/" + files[i], "utf8");
index.import(files[i], data);
}
}
};
Create your index by assigning the configuration file from above:
import { Worker as WorkerIndex } from "flexsearch";
const index = await new WorkerIndex(config);
// add data to the index
// ...
Export the index:
await index.export();
Import the index:
// create the same type of index you have used by .export()
// along with the same configuration
const index = await new WorkerIndex(config);
await index.import();
Document Worker Index
This feature follows the strategy of using Extern Worker Configuration in combination with Document Export Import.
A Document Worker exports all of its features, including:
- Multi-Tag Indexes
- Context-Search Indexes
- Document-Store
Example (CommonJS): document-worker-export-import
Example (ESM): document-worker-export-import
Provide the index configuration and keep it, because it isn't stored. Provide a parameter config which includes the filepath to the extern configuration file:
const dirname = import.meta.dirname;
const config = {
worker: true,
document: {
id: "tconst",
store: true,
index: [{
field: "primaryTitle",
config: dirname + "/config.primaryTitle.js"
},{
field: "originalTitle",
config: dirname + "/config.originalTitle.js"
}],
tag: [{
field: "startYear"
},{
field: "genres"
}]
}
};
Any changes you make to the configuration will almost certainly require a full re-index.
Provide the extern configuration file as a default export including the methods export and import:
import { promises as fs } from "fs";
export default {
tokenize: "forward",
export: async function(key, data){
// like the usual export write files by key + data
await fs.writeFile("./export/" + key, data, "utf8");
},
import: async function(file){
// instead of looping, the filename is passed to this handler;
// just return the loaded contents as a string
return await fs.readFile("./export/" + file, "utf8");
}
};
Create your index by assigning the configuration file from above:
import { Document } from "flexsearch";
import { promises as fs } from "fs";
const document = await new Document(config);
// add data to the index
// ...
Export the index by providing a key-data handler:
await document.export(async function(key, data){
await fs.writeFile("./export/" + key, data, "utf8");
});
Import the index:
const files = await fs.readdir("./export/");
// create the same type of index you have used by .export()
// along with the same configuration
const document = await new Document(config);
await Promise.all(files.map(async file => {
const data = await fs.readFile("./export/" + file, "utf8");
// call import (async)
await document.import(file, data);
}));
CSP-friendly Worker (Browser)
When just enabling a worker by passing the option worker: true, the worker will be created via code generation under the hood. This might cause issues with strict CSP settings.
You can overcome this issue by passing the filepath to the worker file, like worker: "./worker.js". The original worker file is located at src/worker/worker.js.
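A minimal sketch, assuming you copied src/worker/worker.js next to your application and serve it from there:
const index = new Document({
    index: ["tag", "name", "title", "text"],
    // path/URL to the worker file instead of inline code generation
    worker: "./worker.js"
});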