1
0
mirror of https://github.com/nextapps-de/flexsearch.git synced 2025-08-22 13:42:51 +02:00
Files
flexsearch/doc/worker.md
2025-04-06 14:16:20 +02:00

429 lines
12 KiB
Markdown

# Worker Parallelism (Browser + Node.js)
### Worker Index
Using a Worker-Index is pretty much the same as using a standard `Index`.
> Worker-Index always return a `Promise` for all methods called on the index.
> When adding/updating/removing large bulks of content to the index, it is recommended to use the async version of each method to prevent blocking issues on the main thread. Read more about [Asynchronous Runtime Balancer](async.md)
### Worker Document
> The internal worker model is distributed by document fields and will solve subtasks in parallel.
Documents will create worker automatically for each field by just apply the option `worker: true`:
```js
const index = new Document({
worker: true,
document: {
id: "id",
index: ["name", "title"],
tag: ["cat"]
}
});
index.add({
id: 1, cat: "catA", name: "Tom", title: "some"
}).add({
id: 2, cat: "catA", name: "Ben", title: "title"
}).add({
id: 3, cat: "catB", name: "Max", title: "to"
}).add({
id: 4, cat: "catB", name: "Tim", title: "index""
});
```
When you perform a field search through multiple fields then this task is being well-balanced through all involved workers, which can solve their subtasks independently.
## Examples
### ES6 Module (Bundle):
When using one of the bundles from `/dist/` you can create a Worker-Index:
```js
import { Worker } from "./dist/flexsearch.bundle.module.min.js";
const index = new Worker({/* options */ });
await index.add(1, "some");
await index.add(2, "content");
await index.add(3, "to");
await index.add(4, "index");
```
### ES6 Module (Non-Bundle):
When not using a bundle you can take the worker file from `/dist/` folder as follows:
```js
import Worker from "./dist/module/worker.js";
const index = new Worker({/* options */ });
index.add(1, "some")
.add(2, "content")
.add(3, "to")
.add(4, "index");
```
### Browser Legacy (Bundle):
When loading a legacy bundle via script tag (non-modules):
```js
const index = new FlexSearch.Worker({/* options */ });
await index.add(1, "some");
await index.add(2, "content");
await index.add(3, "to");
await index.add(4, "index");
```
### Worker (Node.js)
The worker model for Node.js is based on native worker threads and works exactly the same way:
```js
const { Document } = require("flexsearch");
const index = new Document({
worker: true,
document: {
id: "id",
index: ["name", "title"],
tag: ["cat"]
}
});
```
Or create a single worker instance for a non-document index:
```js
const { Worker } = require("flexsearch");
const index = new Worker({ options });
```
### Worker-Index Options
> Worker-Index Options extends the default [Index Options](../README.md#index-options), you can apply also.
<table>
<tr></tr>
<tr>
<td>Option</td>
<td>Values</td>
<td>Description</td>
</tr>
<tr>
<td>config</td>
<td>String</td>
<td>Either the absolute URL to the config file when used in Browser context (should match the Same-Origin-Policy) or the filepath to the configuration file when used in Node.js context</td>
</tr>
<tr></tr>
<tr>
<td>export</td>
<td>function</td>
<td>The export handler function. Read more about <a href="export-import.md">Export</a></td>
</tr>
<tr></tr>
<tr>
<td>import</td>
<td>function</td>
<td>The export handler function. Read more about <a href="export-import.md">Import</a></td>
</tr>
</table>
## Extern Worker Configuration
When using Worker by __also__ assign custom functions to the options e.g.:
- Custom Encoder
- Custom Encoder methods (normalize, prepare, finalize)
- Custom Score (function)
- Custom Filter (function)
- Custom Fields (function)
... then you'll need to move your __field configuration__ into a file which exports the configuration as a `default` export. The field configuration is not the whole Document-Descriptor.
When not using custom functions in combination with Worker you can skip this part.
Since every field resolves into a dedicated Worker, also every field which includes custom functions should have their own configuration file accordingly.
Let's take this document descriptor:
```js
{
document: {
index: [{
// this is the field configuration
// ---->
field: "custom_field",
custom: function(data){
return "custom field content";
}
// <------
}]
}
};
```
The configuration which needs to be available as a default export is:
```js
{
field: "custom_field",
custom: function(data){
return "custom field content";
}
};
```
You're welcome to make some suggestions how to improve the handling of extern configuration.
### Example Node.js:
An extern configuration for one WorkerIndex, let's assume it is located in `./custom_field.js`:
```js
const { Charset } = require("flexsearch");
const { LatinSimple } = Charset;
// it requires a default export:
module.exports = {
encoder: LatinSimple,
tokenize: "forward",
// custom function:
custom: function(data){
return "custom field content";
}
};
```
Create Worker Index along the configuration above:
```js
const { Document } = require("flexsearch");
const flexsearch = new Document({
worker: true,
document: {
index: [{
// the field name needs to be set here
field: "custom_field",
// path to your config from above:
config: "./custom_field.js",
}]
}
});
```
### Browser (ESM)
An extern configuration for one WorkerIndex, let's assume it is located in `./custom_field.js`:
```js
import { Charset } from "./dist/flexsearch.bundle.module.min.js";
const { LatinSimple } = Charset;
// it requires a default export:
export default {
encoder: LatinSimple,
tokenize: "forward",
// custom function:
custom: function(data){
return "custom field content";
}
};
```
Create Worker Index with the configuration above:
```js
import { Document } from "./dist/flexsearch.bundle.module.min.js";
// you will need to await for the response!
const flexsearch = await new Document({
worker: true,
document: {
index: [{
// the field name needs to be set here
field: "custom_field",
// Absolute URL to your config from above:
config: "http://localhost/custom_field.js"
}]
}
});
```
Here it needs the __absolute URL__, because the WorkerIndex context is from type `Blob` and you can't use relative URLs starting from this context.
### Test Case
As a test the whole IMDB data collection was indexed, containing of:
JSON Documents: 9,273,132<br>
Fields: 83,458,188<br>
Tokens: 128,898,832<br>
The used index configuration has 2 fields (using bidirectional context of `depth: 1`), 1 custom field, 2 tags and a full datastore of all input json documents.
A non-Worker Document index requires 181 seconds to index all contents.<br>
The Worker index just takes 32 seconds to index them all, by processing every field and tag in parallel. For such large content it is a quite impressive result.
## Export / Import Worker Indexes (Node.js)
Worker will save/load their data dedicated and does not need the message channel for the data transfer.
### Basic Worker Index
> This feature follows the strategy of using [Extern Worker Configuration](#extern-worker-configuration) in combination with [Basic Export Import](../example/nodejs-commonjs/basic-export-import).
Example (CommonJS): [basic-worker-export-import](../example/nodejs-commonjs/basic-worker-export-import)<br>
Example (ESM): [basic-worker-export-import](../example/nodejs-esm/basic-worker-export-import)
Provide the index configuration and keep it, because it isn't stored. Provide a parameter `config` which is including the filepath to the extern configuration file:
```js
const dirname = import.meta.dirname;
const config = {
tokenize: "forward",
config: dirname + "/config.js"
};
```
> Any changes you made to the configuration will almost require a full re-index.
Provide the extern configuration file e.g. `/config.js` as a default export including the methods `export` and `import`:
```js
import { promises as fs } from "fs";
export default {
tokenize: "forward",
export: async function(key, data){
// like the usual export write files by key + data
await fs.writeFile("./export/" + key, data, "utf8");
},
import: async function(index){
// get the file contents of the export directory
let files = await fs.readdir("./export/");
files = await Promise.all(files);
// loop through the files and push their contents to the index
// by also passing the filename as the first parameter
for(let i = 0; i < files.length; i++){
const data = await fs.readFile("./export/" + files[i], "utf8");
index.import(files[i], data);
}
}
};
```
Create your index by assigning the configuration file from above:
```js
import { Worker as WorkerIndex } from "flexsearch";
const index = await new WorkerIndex(config);
// add data to the index
// ...
```
Export the index:
```js
await index.export();
```
Import the index:
```js
// create the same type of index you have used by .export()
// along with the same configuration
const index = await new WorkerIndex(config);
await index.import();
```
### Document Worker Index
> This feature follows the strategy of using [Extern Worker Configuration](#extern-worker-configuration) in combination with [Document Export Import](../example/nodejs-esm/document-export-import).
Document Worker exports all their feature including:
- Multi-Tag Indexes
- Context-Search Indexes
- Document-Store
Example (CommonJS): [document-worker-export-import](../example/nodejs-commonjs/document-worker-export-import)<br>
Example (ESM): [document-worker-export-import](../example/nodejs-esm/document-worker-export-import)
Provide the index configuration and keep it, because it isn't stored. Provide a parameter `config` which is including the filepath to the extern configuration file:
```js
const dirname = import.meta.dirname;
const config = {
worker: true,
document: {
id: "tconst",
store: true,
index: [{
field: "primaryTitle",
config: dirname + "/config.primaryTitle.js"
},{
field: "originalTitle",
config: dirname + "/config.originalTitle.js"
}],
tag: [{
field: "startYear"
},{
field: "genres"
}]
}
};
```
> Any changes you made to the configuration will almost require a full re-index.
Provide the extern configuration file as a default export including the methods `export` and `import`:
```js
import { promises as fs } from "fs";
export default {
tokenize: "forward",
export: async function(key, data){
// like the usual export write files by key + data
await fs.writeFile("./export/" + key, data, "utf8");
},
import: async function(file){
// instead of looping you will get the filename as 2nd paramter
// just return the loaded contents as a string
return await fs.readFile("./export/" + file, "utf8");
}
};
```
Create your index by assigning the configuration file from above:
```js
import { Document } from "flexsearch";
const document = await new Document(config);
// add data to the index
// ...
```
Export the index by providing a key-data handler:
```js
await document.export(async function(key, data){
await fs.writeFile("./export/" + key, data, "utf8");
});
```
Import the index:
```js
const files = await fs.readdir("./export/");
// create the same type of index you have used by .export()
// along with the same configuration
const document = await new Document(config);
await Promise.all(files.map(async file => {
const data = await fs.readFile("./export/" + file, "utf8");
// call import (async)
await document.import(file, data);
}));
```
## CSP-friendly Worker (Browser)
When using worker via one of the bundled versions (e.g. `flexearch.bundle.min.js`), the worker will be created by code generation under the hood. This might have issues when using strict CSP settings.
You can overcome this issue by using the non-bundled versions e.g. `dist/module/` or by passing the filepath to the worker file instead of `true` like `worker: "dist/module/worker/worker.js"`.