mirror of https://github.com/nextapps-de/flexsearch.git synced 2025-08-12 09:04:35 +02:00

Thomas Wilkerling 6aadad9c3d re-add CJK encoder supporting graphemes

2025-04-12 11:05:00 +02:00

Encoder

Important

You shouldn't miss this part as it is one of the most important aspects of FlexSearch.

Search capabilities depend heavily on language processing. The Encoder class is one of the most important core functionalities of FlexSearch.

Encoders are basically responsible for "fuzziness". Read here about Phonetic Search/Fuzzy Search

Default Encoder

The default Encoder (when passing no options on creation) uses this configuration:

const encoder = new Encoder({
    normalize: true,
    dedupe: true,
    cache: true,
    include: {
        letter: true,
        number: true,
        symbol: false,
        punctuation: false,
        control: false,
        char: ""
    }
});

The default configuration will:

apply charset normalization, e.g. "é" to "e"
apply letter deduplication, e.g. "missing" to "mising"
just index alphanumeric content and filter everything else out

This is important to keep in mind, because when you need a different configuration you'll have to change those settings accordingly.
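The three effects listed above can be approximated in plain JavaScript. This is only an illustrative sketch: `approximateDefaultEncode` is a hypothetical helper that is not part of FlexSearch, and the library's real normalization and deduplication rules may differ in detail.

```javascript
// Hypothetical helper, NOT FlexSearch's implementation: it only mimics the
// three documented effects of the default configuration.
function approximateDefaultEncode(text) {
    return text
        // decompose accented letters ("é" becomes "e" + combining accent) ...
        .normalize("NFKD")
        // ... and drop the combining marks, leaving the base letter
        .replace(/\p{M}/gu, "")
        .toLowerCase()
        // keep letters and numbers only; everything else separates terms
        .split(/[^\p{L}\p{N}]+/u)
        .filter(Boolean)
        // collapse consecutive identical letters: "missing" -> "mising"
        .map(term => term.replace(/(\p{L})\1+/gu, "$1"));
}

console.log(approximateDefaultEncode("Café, missing!")); // [ 'cafe', 'mising' ]
```

A query has to pass through the same steps as the indexed content, which is why "missing" and "mising" end up matching each other.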

Let's assume you want to include the symbols "#", "@" and "-", because those are needed to differentiate search results (otherwise the results would be useless), and let's say you don't need numeric content indexed. You can do this by:

const encoder = new Encoder({
    // default configuration is applied
    // extend or override:
    include: {
        // by default everything is set to false
        letter: true,
        number: false,
        char: ["#", "@", "-"]
    }
});
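To make the effect of such an `include` definition more tangible, here is a minimal sketch that turns the flags into a "split on everything else" regular expression. The helper `buildSplitPattern` and the mapping of each flag to a Unicode property class are assumptions made for illustration; FlexSearch may resolve these options differently internally.

```javascript
// Hypothetical helper: converts include flags into a RegExp that matches
// every run of characters which is NOT included (i.e. the separators).
function buildSplitPattern(include) {
    let allowed = "";
    if (include.letter) allowed += "\\p{L}";
    if (include.number) allowed += "\\p{N}";
    if (include.symbol) allowed += "\\p{S}";
    if (include.punctuation) allowed += "\\p{P}";
    if (include.control) allowed += "\\p{Cc}";
    // `char` may be a string or an array of strings
    const chars = Array.isArray(include.char) ? include.char.join("") : (include.char || "");
    for (const ch of chars) {
        // code point escapes avoid any special meaning inside the character class
        allowed += "\\u{" + ch.codePointAt(0).toString(16) + "}";
    }
    return new RegExp("[^" + allowed + "]+", "u");
}

const pattern = buildSplitPattern({ letter: true, number: false, char: ["#", "@", "-"] });
console.log("user@example #tag-1 42".split(pattern).filter(Boolean));
// [ 'user@example', '#tag-' ]
```

Numbers act as separators here because `number` is `false`, while "#", "@" and "-" stay part of the terms.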

Built-In Universal Encoders

Charset.Exact
Charset.Normalize (Charset.Default)

Built-In Latin Encoders

Charset.LatinBalance
Charset.LatinAdvanced
Charset.LatinExtra
Charset.LatinSoundex

Built-In CJK Encoder

Charset.CJK

Basic Usage

const encoder = new Encoder({
    normalize: true,
    dedupe: true,
    cache: true,
    include: {
        letter: true,
        number: true,
        symbol: false,
        punctuation: false,
        control: false,
        char: "@"
    }
});

You can use an exclude instead of an include definition:

const encoder = new Encoder({
    exclude: {
        letter: false,
        number: false,
        symbol: true,
        punctuation: true,
        control: true
    }
});

Instead of using include or exclude you can pass a regular expression or a string to the field split:

const encoder = new Encoder({ 
    split: /\s+/
});

E.g. this split configuration will tokenize every single symbol/char of the content:

const encoder = new Encoder({ 
    split: ""
});

The definitions include and exclude are a replacement for split. You can define only one of those three.
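For orientation, this is how the two `split` values shown above behave with the standard `String.prototype.split`. It is only a sketch of the underlying string operation, assuming the Encoder applies the expression in a comparable way.

```javascript
// Plain String.prototype.split, shown only to illustrate the two values above.
const bySpace = "fuzzy  search\tengine".split(/\s+/);
const byChar = "flex".split("");

console.log(bySpace); // [ 'fuzzy', 'search', 'engine' ]
console.log(byChar);  // [ 'f', 'l', 'e', 'x' ]
```

Note that splitting a plain string by `""` works on UTF-16 code units, so characters outside the Basic Multilingual Plane would be cut in half by this naive version.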

Adding custom functions to the encoder pipeline:

const encoder = new Encoder({
    normalize: function(str){
        return str.toLowerCase();
    },
    prepare: function(str){
        return str.replace(/&/g, " and ");
    },
    finalize: function(arr){
        return arr.filter(term => term.length > 2);
    }
});
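The order in which these three custom functions take effect can be sketched without the library. `runStages` is a hypothetical stand-in that replaces the whole built-in pipeline with a simple whitespace split, so it only shows where each callback sits relative to tokenization.

```javascript
// The same three callbacks as in the configuration above.
const stages = {
    normalize: str => str.toLowerCase(),
    prepare: str => str.replace(/&/g, " and "),
    finalize: arr => arr.filter(term => term.length > 2)
};

// Hypothetical runner: normalize -> prepare -> split -> finalize.
function runStages(text, { normalize, prepare, finalize }) {
    const prepared = prepare(normalize(text)); // both work on the full string
    const terms = prepared.split(/\s+/).filter(Boolean);
    return finalize(terms);                    // works on the array of terms
}

console.log(runStages("Tom & Jerry at 9", stages)); // [ 'tom', 'and', 'jerry' ]
```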

Further reading: Encoder Processing Workflow

Assign an encoder to an index:

const index = new Index({ 
    encoder: encoder
});

Define language specific normalizations/transformations:

const encoder = new Encoder({
    stemmer: new Map([
        ["ly", ""]
    ]),
    filter: new Set([
        "and",
    ]),
    mapper: new Map([
        ["é", "e"]
    ]),
    matcher: new Map([
        ["xvi", "16"]
    ]),
    replacer: [
        /[´`’ʼ]/g, "'"
    ],
});

Further reading: Encoder Processing Workflow

Or use built-in helpers alternatively:

const encoder = new Encoder()
    .addStemmer("ly", "")
    .addFilter("and")
    .addMapper("é", "e")
    .addMatcher("xvi", "16")
    .addReplacer(/[´`’ʼ]/g, "'");

Some of the built-in helpers will automatically detect inputs and use the proper helper under the hood. So theoretically you can lazily just write:

const encoder = new Encoder()
    .addStemmer("ly", "")
    .addFilter("and")
    .addReplacer("é", "e")
    .addReplacer("xvi", "16")
    .addReplacer(/[´`’ʼ]/g, "'");
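One plausible way to picture this automatic detection is a small dispatch on the type and length of the first argument. The function `classifyRule` below is purely hypothetical and only mirrors the documented distinction (single char, char sequence, regular expression); it does not claim to reproduce FlexSearch's internal logic.

```javascript
// Hypothetical classifier: which helper would fit a given rule?
function classifyRule(match) {
    if (match instanceof RegExp) return "replacer";   // custom regex
    if (typeof match === "string") {
        // count code points, so characters outside the BMP count as one char
        return Array.from(match).length === 1 ? "mapper" : "matcher";
    }
    throw new TypeError("expected a string or a RegExp");
}

console.log(classifyRule("\u00e9"));   // mapper   (single char "é")
console.log(classifyRule("xvi"));      // matcher  (char sequence)
console.log(classifyRule(/[´`’ʼ]/g));  // replacer (regular expression)
```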

You can also use presets and extend them with custom options:

import EnglishBookPreset from "flexsearch/lang/en";
const encoder = new Encoder(
    EnglishBookPreset,
    // use the preset but don't filter terms
    { filter: false }
);

Equivalent:

const encoder = new Encoder(EnglishBookPreset);
encoder.assign({ filter: false });

Assign multiple extensions to the encoder instance:

import { Encoder, Charset } from "flexsearch";
import EnglishBookPreset from "flexsearch/lang/en";
// stack definitions to the encoder instance
const encoder = new Encoder()
    .assign(Charset.LatinSoundex)
    .assign(EnglishBookPreset)
    // extend or override preset options:
    .assign({ minlength: 3 });
    // assign further presets ...

When adding an extension to the encoder, every previously assigned configuration stays intact. The same applies to custom functions: previously added functions will still execute.

Add custom transformations to an existing encoder:

const encoder = new Encoder(Charset.Normalize);
// filter terms
encoder.addFilter("and");
// replace single chars
encoder.addMapper("é", "e");
// replace char sequences
encoder.addMatcher("xvi", "16");
// replace single chars or char sequences
// at the end of a term
encoder.addStemmer("ly", "");
// custom regex replace
encoder.addReplacer(/[´`’ʼ]/g, "'");

Using a custom filter:

encoder.addFilter(function(str){
    // return true to keep the content
    return str.length > 1;
});
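The stop-word set and the custom filter function follow the same idea: a term is kept only if it passes. A minimal stand-alone sketch, assuming both kinds of filter are combined (the helper `applyFilters` is hypothetical):

```javascript
// Stop words (blacklist) plus a custom predicate that returns true to keep a term.
const stopwords = new Set(["and", "to", "be"]);
const keepTerm = term => term.length > 1;

// Hypothetical helper: a term survives only if it passes both checks.
function applyFilters(terms) {
    return terms.filter(term => !stopwords.has(term) && keepTerm(term));
}

console.log(applyFilters(["to", "be", "or", "not", "a", "bee"]));
// [ 'or', 'not', 'bee' ]
```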

Shortcut for just assigning one encoder configuration to an index:

const index = new Index({ 
    encoder: Charset.Normalize
});

Encoder Options

Option	Values	Description	Default
You can just choose one of those 3 options:
`include`	Encoder Split Options	Define which of the string contents should be included (inclusion properties default to false)	{ letter: true, number: true }
`exclude`	Encoder Split Options	Define which of the string contents should be excluded (exclusion properties default to true)	false
`split`	false RegExp String Encoder Split Options	The expression used to split the content into terms	→ include { letter: true, number: true }
Other options:
`dedupe`	Boolean	Deduplicate consecutive letters, e.g. "missing" to "mising"	true
`numeric`	Boolean	By default, the extended numeric support (Triplets) inherits from the chosen Encoder Split Options. In some cases you might want to disable Triplets to get more exact results (fewer entries).	true
`minlength`	Number	Set the minimum term length which should be added to the index. This limit does not apply to the `forward` tokenizer: you still get results when typing just "f" for the term "flexsearch", even when e.g. `minlength: 4` was used.	1
`maxlength`	Number	Set the maximum term length which should be added to the index. Longer terms are dropped.	1024
`rtl`	Boolean	Force Right-To-Left encoding (you should just apply this when the string content was not already encoded as RTL)	false
`normalize`	`true` enable normalization (default) `false` disable normalization `function(str) => str` custom function	The normalization stage will apply basic charset normalization e.g. by replacing "é" to "e"	true
`prepare`	`function(str) => str` custom function	The preparation stage is a custom function executed directly after normalization	false
`finalize`	`function([str]) => [str]` custom function	The finalization stage is a custom function executed as the last task in the encoding pipeline (it receives an array of tokens and needs to return an array of tokens)	false
`filter`	`Set(["and", "to", "be"])` `function(str) => bool` custom function `encoder.addFilter("and")`	Stop-word filter is like a blacklist of words to be filtered out from indexing at all (e.g. "and", "to" or "be"). This is also very useful when using Context Search	false
`stemmer`	`Map([["ing", ""], ["ies", "y"]])` `encoder.addStemmer("ing", "")`	Stemmer will normalize several linguistic mutations of the same word (e.g. "run" and "running", or "property" and "properties"). This is also very useful when using Context Search	false
`mapper`	`Map([["é", "e"], ["ß", "ss"]])` `encoder.addMapper("é", "e")`	Mapper will replace a single char (e.g. "é" into "e")	false
`matcher`	`Map([["and", "&"], ["usd", "$"]])` `encoder.addMatcher("and", "&")`	Matcher does the same as Mapper, but replaces char sequences instead of single chars	false
`replacer`	`[/[^a-z0-9]/g, "", /([^aeo])h(.)/g, "$1$2"]` `encoder.addReplacer(/[^a-z0-9]/g, "")`	Replacer takes custom regular expressions and can't be optimized in the same way as Mapper or Matcher. Use it as the last option, when no other replacement can do the same.	false
`cache`	Boolean	In some very rare situations (large consecutive content with high cardinality) it might be useful to disable the internal event-loop-cache	true

Tip

The methods .addMapper(), .addMatcher() and .addReplacer() might be confusing. For this reason they automatically resolve to the right one when you use the same method for every rule. You can simplify this, e.g. by just using .addReplacer() for each of these 3 rules.

Encoder Split Options

Option	Values	Description	Default
`letter`	Boolean	Toggle inclusion of letters on/off	true
`number`	Boolean	Toggle inclusion of numerics on/off	true
`symbol`	Boolean	Toggle inclusion of symbols on/off	false
`punctuation`	Boolean	Toggle inclusion of punctuation on/off	false
`control`	Boolean	Toggle inclusion of control chars on/off	false
`char`	String Array[String]	Toggle inclusion of specific chars on/off	false

Custom Encoder

Since it is very simple to create a custom Encoder, you are welcome to create your own, e.g.:

function customEncoder(content){
   const tokens = [];
   // split content into terms/tokens
   // apply your changes to each term/token
   // you will need to return an Array of terms/tokens
   // so just iterate through the input string and
   // push tokens to the array
   // ...
   return tokens;
}

const index = new Index({
   // set to strict when your tokenization was already done
   tokenize: "strict",
   encode: customEncoder
});
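A concrete body for such a function could look like the following sketch. The name `simpleEncoder` and its tokenization rules (lower-casing, splitting on everything that is not a letter or number) are arbitrary choices for illustration; the only requirement taken from the text above is that the function returns an array of terms.

```javascript
// Hypothetical example encoder: lower-cases the input, splits on anything
// that is not a letter or number, and drops empty tokens.
function simpleEncoder(content) {
    const tokens = [];
    const parts = String(content).toLowerCase().split(/[^\p{L}\p{N}]+/u);
    for (const part of parts) {
        if (part) tokens.push(part);
    }
    return tokens;
}

console.log(simpleEncoder("Hello, World 42!")); // [ 'hello', 'world', '42' ]
```

It would be passed to the index the same way as `customEncoder` above.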

You can't extend the built-in tokenizers "exact", "forward", "bidirectional" or "full". If none of them is applicable to your task, you should tokenize everything inside your custom encoder function.

If you get good results, please feel free to create a pull request to share your encoder with the community.

Encoder Processing Workflow

1. Charset Normalization
2. Custom Preparation
3. Split Content (into terms, apply includes/excludes)
4. Filter: Pre-Filter
5. Stemmer (substitute term endings)
6. Filter: Post-Filter
7. Replace Chars (Mapper)
8. Letter Deduplication
9. Matcher (substitute partials)
10. Custom Regex (Replacer)
11. Custom Finalize

This workflow schema might help you to understand each step in the iteration: [Figure: Encoder processing workflow schema]
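As a textual companion to the stage list above, the following sketch walks through the same eleven steps in plain JavaScript. It is a simplified approximation under stated assumptions: `encodeSketch` and `sketchConfig` are hypothetical names, every stage is a rough stand-in, and details such as caching, numeric triplets, `minlength` and `maxlength` are left out.

```javascript
// Simplified, hypothetical walk through the documented stage order.
// It is NOT FlexSearch's implementation; every stage is a rough stand-in.
function encodeSketch(text, cfg = {}) {
    // 1. charset normalization
    let str = text.normalize("NFKD").replace(/\p{M}/gu, "").toLowerCase();
    // 2. custom preparation
    if (cfg.prepare) str = cfg.prepare(str);
    // 3. split content into terms (letters and numbers only)
    const terms = str.split(/[^\p{L}\p{N}]+/u).filter(Boolean);
    const isFiltered = term => Boolean(cfg.filter && cfg.filter.has(term));
    const result = [];
    for (let term of terms) {
        // 4. pre-filter
        if (isFiltered(term)) continue;
        // 5. stemmer: substitute the first matching term ending
        for (const [ending, replacement] of cfg.stemmer || []) {
            if (term.length > ending.length && term.endsWith(ending)) {
                term = term.slice(0, term.length - ending.length) + replacement;
                break;
            }
        }
        // 6. post-filter (the stemmed form may be a stop word)
        if (isFiltered(term)) continue;
        // 7. mapper: replace single chars
        if (cfg.mapper) {
            term = Array.from(term, ch => cfg.mapper.has(ch) ? cfg.mapper.get(ch) : ch).join("");
        }
        // 8. letter deduplication
        term = term.replace(/(\p{L})\1+/gu, "$1");
        // 9. matcher: substitute partials
        for (const [from, to] of cfg.matcher || []) {
            term = term.split(from).join(to);
        }
        // 10. replacer: flat list of [RegExp, replacement, RegExp, replacement, ...]
        const rules = cfg.replacer || [];
        for (let i = 0; i < rules.length; i += 2) {
            term = term.replace(rules[i], rules[i + 1]);
        }
        if (term) result.push(term);
    }
    // 11. custom finalize
    return cfg.finalize ? cfg.finalize(result) : result;
}

const sketchConfig = {
    stemmer: new Map([["ly", ""]]),
    filter: new Set(["and"]),
    matcher: new Map([["xvi", "16"]])
};
console.log(encodeSketch("Quickly and XVI", sketchConfig)); // [ 'quick', '16' ]
```

The position of each step matters: the stemmer runs before the mapper and matcher, so stemmer rules have to be written against the normalized but not yet replaced form of a term.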

Right-To-Left Support

Note

When a string is already encoded/interpreted as Right-To-Left, you don't need to use this. This option is only useful when the source content wasn't encoded as RTL.

Just set the property rtl: true when creating the Encoder:

const encoder = new Encoder({ rtl: true });

CJK Word Break (Chinese, Japanese, Korean)

const index = new Index({ encoder: Charset.CJK });
index.add(0, "一个单词");
var results = index.search("单词");
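Why the query "单词" can match the longer entry becomes clearer when the content is broken into single characters. The sketch below assumes a per-character tokenization, comparable to the `split: ""` configuration described earlier; `splitCharacters` is a hypothetical helper and not the actual `Charset.CJK` implementation.

```javascript
// Hypothetical stand-in for a per-character tokenizer.
function splitCharacters(text) {
    // Array.from iterates by code point, so surrogate pairs stay intact
    return Array.from(text).filter(ch => ch.trim() !== "");
}

const indexedChars = new Set(splitCharacters("一个单词"));
const queryChars = splitCharacters("单词");

// every character of the query is present in the indexed content
console.log(queryChars.every(ch => indexedChars.has(ch))); // true
```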

Built-In Language Packs

English: en
German: de
French: fr

Import Language Packs: ES6 Modules

The simplest way to assign charset/language specific encoding via modules is:

import { Index } from "flexsearch";
import EnglishPreset from "flexsearch/lang/en";
const index = new Index({
    encoder: EnglishPreset
});

You can stack up and combine multiple presets:

import { Index, Encoder, Charset } from "flexsearch";
import EnglishPreset from "flexsearch/lang/en";

const index = new Index({
    encoder: new Encoder(
        Charset.LatinAdvanced,
        EnglishPreset,
        { minlength: 3 }
    )
});

You can also assign the encoder preset directly:

const index = new Index({
    encoder: Charset.Default
});

Import Language Packs: ES5 Legacy Browser

When loading language packs, make sure that the library was loaded before:

<script src="dist/flexsearch.compact.min.js"></script>
<script src="dist/lang/en.min.js"></script>

The language packs are registered on FlexSearch.Language:

const index = new FlexSearch.Index({
    encoder: FlexSearch.Language["en"]
});

You can stack up and combine multiple presets:

const index = new FlexSearch.Index({
    encoder: new FlexSearch.Encoder(
        FlexSearch.Charset.LatinAdvanced,
        FlexSearch.Language["en"],
        { minlength: 3 }
    )
});

Import Language Packs: Node.js

In Node.js all built-in language pack files are available via the package scope:

const { Index } = require("flexsearch");
const EnglishPreset = require("flexsearch/lang/en");
const index = new Index({
    encoder: EnglishPreset
});

Assigning the Encoder instance to the top-level configuration shares the encoder with all fields. Avoid this when the fields don't hold the same type of content (e.g. one field contains terms, another contains numeric IDs).

Sharing the encoder can improve encoding efficiency and memory allocation, but when used improperly it can also hurt performance. You can share encoders with any type of index, and also across multiple index instances (including documents).

You should group similar types of content under one encoder each. When you have different content types, define one encoder for each of them.

In this example there are two Document indexes for two different documents, "orders" and "billings". You can also share an encoder between different fields of a single document.

// usual term encoding
const encoder_terms = new Encoder(
    Charset.LatinAdvanced,
    // just add letters (no numbers)
    { include: { letter: true } }
);
// numeric encoding
const encoder_numeric = new Encoder(Charset.Default);

const orders = new Document({
   document: {
       id: "id",
       index: [{
          field: "product_title",
          encoder: encoder_terms
       },{
          field: "product_details",
          encoder: encoder_terms
       },{
          field: "order_date",
          encoder: encoder_numeric
       },{
          field: "customer_id",
          encoder: encoder_numeric
       }]
   }
});

const billings = new Document({
   document: {
      id: "id",
      index: [{
         field: "product_title",
         encoder: encoder_terms
      },{
         field: "product_content",
         encoder: encoder_terms
      },{
         field: "billing_date",
         encoder: encoder_numeric
      },{
         field: "customer_id",
         encoder: encoder_numeric
      }]
   }
});

Merge Documents

When you have multiple document types (indexed by multiple indexes) but some of the data shares the same fields (as in the example above) and you can refer to them by an identifier or key, you should consider merging those documents into one. This can reduce the index size considerably.

E.g. when you merge "orders" and "billings" from the example above by ID, you can use just one index:

const encoder_terms = new Encoder(
    Charset.LatinAdvanced,
    // just add letters (no numbers)
    { include: { letter: true } }
);
const encoder_numeric = new Encoder(
    Charset.Default
);

const merged = new Document({
   document: {
       id: "id",
       index: [{
          field: "product_title",
          encoder: encoder_terms
       },{
          field: "product_details",
          encoder: encoder_terms
       },{
          field: "order_date",
          encoder: encoder_numeric
       },{
           field: "billing_date",
           encoder: encoder_numeric
       },{
          field: "customer_id",
          encoder: encoder_numeric
       }]
   }
});
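Before such a merged index can be filled, the records themselves have to be combined by their shared key. A minimal sketch in plain JavaScript, assuming both collections use the same `id` for related records; the helper `mergeById` and the sample data are made up for illustration.

```javascript
// Hypothetical helper: merges records from several collections by their `id`.
// Later collections extend (and on conflict override) earlier ones.
function mergeById(...collections) {
    const merged = new Map();
    for (const records of collections) {
        for (const record of records) {
            merged.set(record.id, { ...merged.get(record.id), ...record });
        }
    }
    return Array.from(merged.values());
}

// Made-up sample data using the field names from the example above.
const orderRecords = [
    { id: 1, product_title: "Phone", product_details: "128 GB", order_date: "2025-01-02", customer_id: 7 }
];
const billingRecords = [
    { id: 1, product_title: "Phone", billing_date: "2025-01-05", customer_id: 7 }
];

const mergedRecords = mergeById(orderRecords, billingRecords);
console.log(mergedRecords.length);          // 1
console.log(mergedRecords[0].billing_date); // 2025-01-05
```

Each merged record could then be added to the single document index, so shared fields such as "product_title" and "customer_id" are indexed only once per ID.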
