1
0
mirror of https://github.com/nextapps-de/flexsearch.git synced 2025-08-31 09:31:59 +02:00

push v0.8 to master

# Conflicts:
#	README.md
This commit is contained in:
Thomas Wilkerling
2025-03-17 01:14:11 +01:00
parent 44153a67ad
commit 1820491b05
27 changed files with 6591 additions and 287 deletions

183
doc/encoder.md Normal file
View File

@@ -0,0 +1,183 @@
## Encoder
Search capabilities highly depends on language processing. The old workflow wasn't really practicable. The new Encoder class is a huge improvement and fully replaces the encoding part. Some FlexSearch options was moved to the new `Encoder` instance.
New Encoding Pipeline:
1. charset normalization
2. custom preparation
3. split into terms (apply includes/excludes)
4. filter (pre-filter)
5. matcher (substitute terms)
6. stemmer (substitute term endings)
7. filter (post-filter)
8. replace chars (mapper)
9. custom regex (replacer)
10. letter deduplication
11. apply finalize
### Example
```js
const encoder = new Encoder({
normalize: true,
dedupe: true,
cache: true,
include: {
letter: true,
number: true,
symbol: false,
punctuation: false,
control: false,
char: "@"
}
});
```
You can use an `include` __instead__ of an `exclude` definition:
```js
const encoder = new Encoder({
exclude: {
letter: false,
number: false,
symbol: true,
punctuation: true,
control: true
}
});
```
Instead of using `include` or `exclude` you can pass a regular expression to the field `split`:
```js
const encoder = new Encoder({
split: /\s+/
});
```
> The definitions `include` and `exclude` is a replacement for `split`. You can just define one of those 3.
Adding custom functions to the encoder pipeline:
```js
const encoder = new Encoder({
normalize: function(str){
return str.toLowerCase();
},
prepare: function(str){
return str.replace(/&/g, " and ");
},
finalize: function(arr){
return arr.filter(term => term.length > 2);
}
});
```
Assign encoder to an index:
```js
const index = new Index({
encoder: encoder
});
```
Define language specific transformations:
```js
const encoder = new Encoder({
replacer: [
/[´`ʼ]/g, "'"
],
filter: new Set([
"and",
]),
matcher: new Map([
["xvi", "16"]
]),
stemmer: new Map([
["ly", ""]
]),
mapper: new Map([
["é", "e"]
])
});
```
Or use predefined language and extend it with custom options:
```js
import EnglishBookPreset from "./lang/en.js";
const encoder = new Encoder(EnglishBookPreset, {
filter: false
});
```
Equivalent:
```js
import EnglishBookPreset from "./lang/en.js";
const encoder = new Encoder(EnglishBookPreset);
encoder.assign({ filter: false });
```
Assign extensions to the encoder instance:
```js
import LatinEncoderPreset from "./charset/latin/simple.js";
import EnglishBookPreset from "./lang/en.js";
// stack definitions to the encoder instance
const encoder = new Encoder()
.assign(LatinEncoderPreset)
.assign(EnglishBookPreset)
// override preset options ...
.assign({ minlength: 3 });
// assign further presets ...
```
> When adding extension to the encoder every previously assigned configuration is still intact, very much like Mixins, also when assigning custom functions.
Add custom transformations to an existing index:
```js
import LatinEncoderPreset from "./charset/latin/default.js";
const encoder = new Encoder(LatinEncoderPreset);
encoder.addReplacer(/[´`ʼ]/g, "'");
encoder.addFilter("and");
encoder.addMatcher("xvi", "16");
encoder.addStemmer("ly", "");
encoder.addMapper("é", "e");
```
Shortcut for just assigning one encoder configuration to an index:
```js
import LatinEncoderPreset from "./charset/latin/default.js";
const index = new Index({
encoder: LatinEncoderPreset
});
```
### Custom Encoder
Since it is very simple to create a custom Encoder, you are welcome to create your own.
e.g.
```js
function customEncoder(content){
const tokens = [];
// split content into terms/tokens
// apply your changes to each term/token
// you will need to return an Array of terms/tokens
// so just iterate through the input string and
// push tokens to the array
// ...
return tokens;
}
const index = new Index({
// set to strict when your tokenization was already done
tokenize: "strict",
encode: customEncoder
});
```
If you get some good results please feel free to share your encoder.