## Encoder

> [!IMPORTANT]
> Don't skip this part. It covers one of the most important aspects of FlexSearch.

Search capabilities depend heavily on language processing. The Encoder class is one of the most important core features of FlexSearch.

> Encoders are basically responsible for "fuzziness". [Read here about Phonetic Search/Fuzzy Search](../README.md#fuzzy-search)

### Default Encoder

The default Encoder (when passing no options on creation) uses this configuration:

```js
const encoder = new Encoder({
    normalize: true,
    dedupe: true,
    cache: true,
    include: {
        letter: true,
        number: true,
        symbol: false,
        punctuation: false,
        control: false,
        char: ""
    }
});
```

The default configuration will:

1. apply charset normalization, e.g. "é" to "e"
2. apply letter deduplication, e.g. "missing" to "mising"
3. index only alphanumeric content and filter everything else out

Keep this in mind: when you need a different configuration, you have to change those settings accordingly.

Suppose you want to include the characters "#", "@" and "-" because they are needed to differentiate search results, and you don't need numeric content indexed. You can do this with:

```js
const encoder = new Encoder({
    // default configuration is applied
    // extend or override:
    include: {
        // by default everything is set to false
        letter: true,
        number: false,
        char: ["#", "@", "-"]
    }
});
```

#### Built-In Universal Encoders

1. Charset.Exact
2. Charset.Normalize (Charset.Default)

#### Built-In Latin Encoders

1. Charset.LatinBalance
2. Charset.LatinAdvanced
3. Charset.LatinExtra
4. Charset.LatinSoundex

#### Built-In CJK Encoder

1. Charset.CJK

### Basic Usage

```js
const encoder = new Encoder({
    normalize: true,
    dedupe: true,
    cache: true,
    include: {
        letter: true,
        number: true,
        symbol: false,
        punctuation: false,
        control: false,
        char: "@"
    }
});
```

You can use an `exclude` __instead__ of an `include` definition:

```js
const encoder = new Encoder({
    exclude: {
        letter: false,
        number: false,
        symbol: true,
        punctuation: true,
        control: true
    }
});
```

Instead of using `include` or `exclude` you can pass a regular expression or a string to the field `split`:

```js
const encoder = new Encoder({
    split: /\s+/
});
```

For example, this split configuration tokenizes every single character of the content:

```js
const encoder = new Encoder({
    split: ""
});
```

> The definitions `include` and `exclude` are a replacement for `split`. You can define only one of those three.

Adding custom functions to the encoder pipeline:

```js
const encoder = new Encoder({
    normalize: function(str){
        return str.toLowerCase();
    },
    prepare: function(str){
        return str.replace(/&/g, " and ");
    },
    finalize: function(arr){
        return arr.filter(term => term.length > 2);
    }
});
```

Further reading: [Encoder Processing Workflow](#encoder-processing-workflow)

Assign an encoder to an index:

```js
const index = new Index({
    encoder: encoder
});
```

Define language-specific normalizations/transformations:

```js
const encoder = new Encoder({
    stemmer: new Map([
        ["ly", ""]
    ]),
    filter: new Set([
        "and",
    ]),
    mapper: new Map([
        ["é", "e"]
    ]),
    matcher: new Map([
        ["xvi", "16"]
    ]),
    replacer: [
        /[´`’ʼ]/g, "'"
    ],
});
```

Further reading: [Encoder Processing Workflow](#encoder-processing-workflow)

Alternatively, use the built-in helpers:

```js
const encoder = new Encoder()
    .addStemmer("ly", "")
    .addFilter("and")
    .addMapper("é", "e")
    .addMatcher("xvi", "16")
    .addReplacer(/[´`’ʼ]/g, "'");
```

Some of the built-in helpers automatically detect their inputs and use the proper helper under the hood. So in theory you can simply write:

```js
const encoder = new Encoder()
    .addStemmer("ly", "")
    .addFilter("and")
    .addReplacer("é", "e")
    .addReplacer("xvi", "16")
    .addReplacer(/[´`’ʼ]/g, "'");
```

You can also use presets and extend them with custom options:

```js
import { Encoder } from "flexsearch";
import EnglishBookPreset from "flexsearch/lang/en";

const encoder = new Encoder(
    EnglishBookPreset,
    // use the preset but don't filter terms
    { filter: false }
);
```

Equivalent:

```js
const encoder = new Encoder(EnglishBookPreset);
encoder.assign({ filter: false });
```

Assign multiple extensions to the encoder instance:

```js
import { Charset, Encoder } from "flexsearch";
import EnglishBookPreset from "flexsearch/lang/en";

// stack definitions to the encoder instance
const encoder = new Encoder()
    .assign(Charset.LatinSoundex)
    .assign(EnglishBookPreset)
    // extend or override preset options:
    .assign({ minlength: 3 });
    // assign further presets ...
```

> When you add an extension to the encoder, every previously assigned configuration stays intact. The same applies to custom functions: previously added functions will still execute.

Add custom transformations to an existing encoder:

```js
const encoder = new Encoder(Charset.Normalize);
// filter terms
encoder.addFilter("and");
// replace single chars
encoder.addMapper("é", "e");
// replace char sequences
encoder.addMatcher("xvi", "16");
// replace single chars or char sequences
// at the end of a term
encoder.addStemmer("ly", "");
// custom regex replace
encoder.addReplacer(/[´`’ʼ]/g, "'");
```

Using a custom filter:

```js
encoder.addFilter(function(str){
    // return true to keep the content
    return str.length > 1;
});
```

Shortcut for assigning just one encoder configuration to an index:

```js
const index = new Index({
    encoder: Charset.Normalize
});
```

## Encoder Options

| Option | Values | Description | Default |
|--------|--------|-------------|---------|
| **You can choose only one of these three options:** | | | |
| `include` | Encoder Split Options | Define which of the string contents should be included (inclusion properties default to `false`) | `{ letter: true, number: true }` |
| `exclude` | Encoder Split Options | Define which of the string contents should be excluded (exclusion properties default to `true`) | `false` |
| `split` | `false`<br>`RegExp`<br>`String`<br>Encoder Split Options | The expression used to split the content into terms | → `include` `{ letter: true, number: true }` |
| **Other options:** | | | |
| `dedupe` | Boolean | Deduplicate consecutive letters, e.g. "missing" to "mising" | `true` |
| `numeric` | Boolean | By default, the extended numeric support (triplets) is inherited from the chosen Encoder Split Options. In some cases you might want to disable triplets to get more exact results (fewer entries). | `true` |
| `minlength` | Number | Set the minimum length a term must have to be added to the index. This limit does not apply to the forward tokenizer: you still get results when typing just "f" for the term "flexsearch", even when e.g. `minlength: 4` was used. | `1` |
| `maxlength` | Number | Set the maximum length a term may have to be added to the index. Longer terms are dropped. | `1024` |
| `rtl` | Boolean | Force right-to-left encoding (apply this only when the string content is not already encoded as RTL) | `false` |
| `normalize` | `true` enable normalization (default)<br>`false` disable normalization<br>`function(str) => str` custom function | The normalization stage applies basic charset normalization, e.g. replacing "é" with "e" | `true` |
| `prepare` | `function(str) => str` custom function | The preparation stage is a custom function that runs directly after normalization | `false` |
| `finalize` | `function([str]) => [str]` custom function | The finalization stage is a custom function executed as the last task in the encoding pipeline (it receives an array of tokens and needs to return an array of tokens) | `false` |
| `filter` | `Set(["and", "to", "be"])`<br>`function(str) => bool` custom function<br>`encoder.addFilter("and")` | The stop-word filter works like a blacklist of words that are excluded from indexing entirely (e.g. "and", "to" or "be"). This is also very useful when using Context Search. | `false` |
| `stemmer` | `Map([["ing", ""], ["ies", "y"]])`<br>`encoder.addStemmer("ing", "")` | The stemmer normalizes several linguistic mutations of the same word (e.g. "run" and "running", or "property" and "properties"). This is also very useful when using Context Search. | `false` |
| `mapper` | `Map([["é", "e"], ["ß", "ss"]])`<br>`encoder.addMapper("é", "e")` | The mapper replaces a single char (e.g. "é" with "e") | `false` |
| `matcher` | `Map([["and", "&"], ["usd", "$"]])`<br>`encoder.addMatcher("and", "&")` | The matcher does the same as the mapper, but replaces char sequences instead of single chars | `false` |
| `replacer` | `[/[^a-z0-9]/g, "", /([^aeo])h(.)/g, "$1$2"]`<br>`encoder.addReplacer(/[^a-z0-9]/g, "")` | The replacer takes custom regular expressions and can't be optimized in the same way as the mapper or matcher. Use it as a last resort when no other replacement option can do the same. | `false` |
| `cache` | Boolean | In some very rare situations (large consecutive content with high cardinality) it might be useful to disable the internal event-loop cache | `true` |
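
To make the effect of a few of the options above more tangible, here is a minimal plain-JavaScript sketch that imitates `normalize`, `mapper`, `filter`, `stemmer`, `dedupe` and `minlength` on a single string. The function `sketchEncode` is hypothetical and not part of FlexSearch. The order of the stages and the way each stage is implemented are assumptions made for illustration only, and the real encoder may behave differently.

```js
// Hypothetical helper, NOT part of FlexSearch: a simplified stand-in
// that imitates a few of the encoder options listed in the table.
function sketchEncode(str, options = {}) {
    const {
        filter = new Set(),   // stop words to drop
        stemmer = new Map(),  // suffix -> replacement
        mapper = new Map(),   // single char -> replacement
        dedupe = true,
        minlength = 1
    } = options;

    // normalize: lowercase and strip combining diacritics ("é" -> "e")
    let text = str.toLowerCase()
        .normalize("NFKD")
        .replace(/[\u0300-\u036f]/g, "");

    // mapper: replace single chars
    text = Array.from(text, ch => (mapper.has(ch) ? mapper.get(ch) : ch)).join("");

    // split: keep letters and numbers only, drop everything else
    const terms = text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);

    const result = [];
    for (let term of terms) {
        // filter: skip stop words
        if (filter.has(term)) continue;
        // stemmer: replace the first matching suffix
        for (const [suffix, replacement] of stemmer) {
            if (term.length > suffix.length && term.endsWith(suffix)) {
                term = term.slice(0, -suffix.length) + replacement;
                break;
            }
        }
        // dedupe: collapse consecutive identical letters
        if (dedupe) term = term.replace(/(.)\1+/g, "$1");
        // minlength: drop terms that are too short
        if (term.length < minlength) continue;
        result.push(term);
    }
    return result;
}

const terms = sketchEncode("Missing cafés and properties", {
    filter: new Set(["and"]),
    stemmer: new Map([["ies", "y"]]),
    mapper: new Map([["ß", "ss"]])
});
console.log(terms); // [ 'mising', 'cafes', 'property' ]
```

The sketch treats every option as an independent step on plain strings, which is enough to see why "properties" and "property" end up as the same indexed term.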

## Encoder Split Options

| Option | Values | Description | Default |
|--------|--------|-------------|---------|
| `letter` | Boolean | Toggle inclusion of letters on/off | `true` |
| `number` | Boolean | Toggle inclusion of numerics on/off | `true` |
| `symbol` | Boolean | Toggle inclusion of symbols on/off | `false` |
| `punctuation` | Boolean | Toggle inclusion of punctuation on/off | `false` |
| `control` | Boolean | Toggle inclusion of control chars on/off | `false` |
| `char` | `String`<br>`Array[String]` | Toggle inclusion of specific chars on/off | `false` |
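
One way to picture these split options is as switches that build a single "split" regular expression from Unicode character classes. The sketch below is an assumption about how such a mapping could look, not FlexSearch's actual implementation. The helper `buildSplitRegex` is hypothetical, and so is the choice of `\p{L}`, `\p{N}`, `\p{S}`, `\p{P}` and `\p{C}` for the five categories.

```js
// Hypothetical helper, NOT part of FlexSearch: turns an include-style
// definition into a RegExp that matches everything which should be dropped.
function buildSplitRegex(include = {}) {
    const {
        letter = false,
        number = false,
        symbol = false,
        punctuation = false,
        control = false,
        char = ""
    } = include;

    let keep = "";
    if (letter) keep += "\\p{L}";       // Unicode letters
    if (number) keep += "\\p{N}";       // Unicode numbers
    if (symbol) keep += "\\p{S}";       // Unicode symbols
    if (punctuation) keep += "\\p{P}";  // Unicode punctuation
    if (control) keep += "\\p{C}";      // control and other chars

    // char may be a string or an array of single chars
    const chars = Array.isArray(char) ? char.join("") : char;
    // escape chars that have a special meaning inside a character class
    keep += chars.replace(/[\\\]\[^-]/g, "\\$&");

    // split on every run of chars that is NOT in the keep list
    return new RegExp("[^" + keep + "]+", "u");
}

const split = buildSplitRegex({
    letter: true,
    number: false,
    char: ["#", "@", "-"]
});

const tokens = "user@mail #tag x-ray 42".split(split).filter(Boolean);
console.log(tokens); // [ 'user@mail', '#tag', 'x-ray' ]
```

With `number: false` the trailing "42" is dropped, while "#", "@" and "-" survive because they were listed in `char`.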