diff --git a/README.md b/README.md index b89a6c3..91d4bcc 100644 --- a/README.md +++ b/README.md @@ -15,13 +15,13 @@ FlexSearch v0.8: [Overview and Migration Guide](doc/0.8.0.md) [Basic Start](#load-library)  •  [API Reference](#api-overview)  •  -Encoder  •  -Document Search  •  -Persistent Indexes  •  -Using Worker  •  -Tag Search  •  -Resolver  •  -Changelog +[Encoder](doc/encoder.md)  •  +[Document Search](doc/document-search.md)  •  +[Persistent Indexes](doc/persistent.md)  •  +[Using Worker](doc/worker.md)  •  +[Tag Search](doc/document-search.md#tag-search)  •  +[Resolver](doc/resolver.md)  •  +[Changelog](CHANGELOG.md) - ```bash npm install flexsearch ``` @@ -301,7 +301,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/`
Download Builds - +
@@ -397,8 +397,9 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/`
Compare Bundles: Light, Compact, Bundle +
-> The Node.js package includes all features from `flexsearch.bundle.js`. +> The Node.js package includes all features.
@@ -419,7 +420,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` @@ -428,7 +429,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` @@ -437,7 +438,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` @@ -446,7 +447,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` @@ -455,7 +456,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` @@ -473,16 +474,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` - - - - - - - @@ -491,7 +483,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` @@ -509,7 +501,7 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` @@ -517,28 +509,28 @@ The **_dist_** folder is located in: `node_modules/flexsearch/dist/` - + - + - + - + @@ -700,15 +692,15 @@ const index = new FlexSearch.Index(/* ... */); Or require FlexSearch members separately by: ```js -const { Index, Document, Encoder, Charset, Resolver, Worker, IndexedDB } = require("flexsearch"); +const { Index, Document, Encoder, Charset, Resolver, Worker } = require("flexsearch"); const index = new Index(/* ... */); ``` When using ESM instead of CommonJS: ```js -import { Index, Document, Encoder, Charset, Resolver, Worker, IndexedDB } from "flexsearch"; -const index = new FlexSearch.Index(/* ... */); +import { Index, Document, Encoder, Charset, Resolver, Worker } from "flexsearch"; +const index = new Index(/* ... */); ``` Language packs are accessible via: @@ -746,19 +738,17 @@ index.add(id, text); const result = index.search(text, options); ``` -```js -worker.add(id, text); -const result = worker.search(text, options); -``` - ```js document.add(doc); const result = document.search(text, options); ``` -Each of these index types have a persistent model (optionally). So, persistent index isn't a new 4th index type, instead it extends the existing ones. 
+```js +await worker.add(id, text); +const result = await worker.search(text, options); +``` -> Every method called on a `Worker` index is treated as async. You will get back a `Promise` or you can provide a callback function as the last parameter additionally. +> Every method called on a `Worker` index is treated as async. You will get back a `Promise` or you can provide a callback function as the last parameter alternatively. ### Common Code Examples @@ -766,7 +756,7 @@ The documentation will refer to several examples. A list of all examples:
-Examples Node.js (CommonJS) +Examples Node.js (CommonJS)
- [basic](example/nodejs-commonjs/basic) - [basic-suggestion](example/nodejs-commonjs/basic-suggestion) @@ -786,7 +776,7 @@ The documentation will refer to several examples. A list of all examples:
-Examples Node.js (ESM/Module) +Examples Node.js (ESM/Module)
- [basic](example/nodejs-esm/basic) - [basic-suggestion](example/nodejs-esm/basic-suggestion) @@ -808,7 +798,7 @@ The documentation will refer to several examples. A list of all examples:
-Examples Browser (Legacy) +Examples Browser (Legacy)
- [basic](example/browser-legacy/basic) - [basic-suggestion](example/browser-legacy/basic-suggestion) @@ -823,7 +813,7 @@ The documentation will refer to several examples. A list of all examples:
-Examples Browser (ESM/Module) +Examples Browser (ESM/Module)
- [basic](example/browser-module/basic) - [basic-suggestion](example/browser-module/basic-suggestion) @@ -888,19 +878,19 @@ Global Members: `Document` Methods: -- document.__add__(\, document) -- ~~document.__append__(\, document)~~ -- document.__update__(\, document) -- document.__remove__(id) -- document.__remove__(document) -- document.__search__(string, \, \) -- document.__search__(options) -- document.__searchCache__(...) -- document.__contain__(id) -- document.__clear__() -- document.__cleanup__() -- document.__get__(id) -- document.__set__(\, document) +- document.__add__(\, document)\ +- ~~document.__append__(\, document)~~\ +- document.__update__(\, document)\ +- document.__remove__(id)\ +- document.__remove__(document)\ +- document.__search__(string, \, \)\ +- document.__search__(options)\ +- document.__searchCache__(...)\ +- document.__contain__(id)\ +- document.__clear__()\ +- document.__cleanup__()\ +- document.__get__(id)\ +- document.__set__(\, document)\ - _async_ document.__export__(handler) @@ -970,12 +960,14 @@ Methods `export` and also `import` are always async as well as every method you --- -`Charset` Encoder Preset: +`Charset` Universal Encoder Preset: - Charset.__Exact__ - Charset.__Default__ - Charset.__Normalize__ +- Charset.__Dedupe__ +`Charset` Latin-specific Encoder Preset: - Charset.__LatinBalance__ - Charset.__LatinAdvanced__ @@ -1064,53 +1056,55 @@ Encoding is one of the most important task and heavily influence:
+ + - - - - - - - - + + + - + + - + + - + + + +
- Async Processing + Async Processing
- Workers (Web + Node.js) + Workers (Web + Node.js) -
- Context Search + Context Search
- Document Search + Document Search
- Document Store + Document Store
- Relevance Scoring -
- Auto-Balanced Cache by Popularity/Last Queries + Auto-Balanced Cache by Popularity/Last Queries
- Tag Search + Tag Search
- Phonetic Search (Fuzzy Search) + Phonetic Search (Fuzzy Search)
EncoderEncoder
Export / Import IndexesExport / Import Indexes -
ResolverResolver - -
Persistent Index (IndexedDB)Persistent Index (IndexedDB) - -
Option DescriptionCharset Type Compression Ratio
Exact Bypass encoding and take exact inputUniversal (multi-lang) 0%
DefaultCase in-sensitive encoding3%
NormalizeCase in-sensitive encoding
Charset normalization
Normalize (Default)Case in-sensitive encoding
Charset normalization
Letter deduplication
Universal (multi-lang) ~ 7%
LatinBalanceCase in-sensitive encoding
Charset normalization
Phonetic basic transformation
Case in-sensitive encoding
Charset normalization
Letter deduplication
Phonetic basic transformation
Latin ~ 30%
LatinAdvancedCase in-sensitive encoding
Charset normalization
Phonetic advanced transformation
Case in-sensitive encoding
Charset normalization
Letter deduplication
Phonetic advanced transformation
Latin ~ 45%
LatinExtraCase in-sensitive encoding
Charset normalization
Soundex-like transformation
Case in-sensitive encoding
Charset normalization
Letter deduplication
Soundex-like transformation
Latin ~ 60%
LatinSoundex Full Soundex transformationLatin ~ 70%
function(str) => [str] Pass a custom encoding function to the EncoderLatin
@@ -1121,22 +1115,22 @@ Encoding is one of the most important task and heavily influence: #### Create a new index ```js -var index = new Index(); +const index = new Index(); ``` -Create a new index and choosing one of the presets: +Create a new index and choosing one of the [Presets](#presets): ```js -var index = new Index("performance"); +const index = new Index("match"); ``` Create a new index with custom options: ```js -var index = new Index({ - charset: "latin:extra", - tokenize: "reverse", - resolution: 9 +const index = new Index({ + tokenize: "forward", + resolution: 9, + fastupdate: true }); ``` @@ -1150,6 +1144,19 @@ var index = new FlexSearch({ }); ``` +Create a new index and assign an [Encoder](doc/encoder.md): + +```js +//import { Charset } from "./dist/module/charset.js"; +import { Charset } from "flexsearch"; +const index = new Index({ + tokenize: "forward", + encoder: Charset.LatinBalance +}); +``` + + + The resolution refers to the maximum count of scoring slots on which the content is divided into. > A formula to determine a well-balanced value for the `resolution` is: $2*floor(\sqrt{content.length})$ where content is the value pushed by `index.add()`. Here the maximum length of all contents should be used. 
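The resolution formula quoted above can be sketched as a tiny helper. This is only an illustration of the arithmetic; the helper name `suggestedResolution` is hypothetical and not part of the FlexSearch API:

```js
// Sketch of the suggested resolution: 2 * floor(sqrt(maxLength)),
// where maxLength is the longest content ever passed to index.add().
function suggestedResolution(contents) {
    let maxLength = 0;
    for (const text of contents) {
        if (text.length > maxLength) maxLength = text.length;
    }
    return 2 * Math.floor(Math.sqrt(maxLength));
}

// contents up to 25 characters long => 2 * floor(sqrt(25)) = 10
suggestedResolution(["short", "a".repeat(25)]); // => 10
```

The resulting value would then be passed as the `resolution` option when creating the index.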
@@ -1182,6 +1189,7 @@ Limit the result: index.search("John", 10); ``` + #### Check existence of already indexed IDs You can check if an ID was already indexed by: @@ -1192,34 +1200,6 @@ if(index.contain(1)){ } ``` - - #### Update item from an index @@ -1238,24 +1218,136 @@ index.update(0, "Max Miller"); index.remove(0); ``` - -## Document Search (Field-Search) - -[Read here](doc/document-search.md) - - ### Chaining Simply chain methods like: ```js -var index = Index.create().addMatcher({'â': 'a'}).add(0, 'foo').add(1, 'bar'); +const index = Index.create().addMatcher({'â': 'a'}).add(0, 'foo').add(1, 'bar'); ``` ```js index.remove(0).update(1, 'foo').add(2, 'foobar'); ``` +## Suggestions + +Any query on each of the index types is supporting the option `suggest: true`. Also within some of the `Resolver` stages (and, not, xor) you can add this option for the same purpose. + +When suggestions is enabled, it allows results which does not perfectly match to the given query e.g. when one term was not included. Suggestion-Search will keep track of the scoring, therefore the first result entry is the closest one to a perfect match. + +```js +const index = Index.create().add(1, "cat dog bird"); +const result = index.search("cat fish"); +// result => [] +``` + +Same query with suggestion enabled: + +```js +const result = index.search("cat fish", { suggest: true }); +// result => [ 1 ] +``` + +At least one match (or partial match) has to be found to get back any result: + +```js +const result = index.search("horse fish", { suggest: true }); +// result => [] +``` + +## Fuzzy-Search + +Fuzzysearch describes a basic concept of how making queries more tolerant. FlexSearch provides several methods to achieve fuzziness: + +1. Use a tokenizer: `forward`, `reverse` or `full` +2. Don't forget to use any of the builtin encoder `simple` > `balance` > `advanced` > `extra` > `soundex` (sorted by fuzziness) +3. Use one of the language specific presets e.g. 
`/lang/en.js` for en-US specific content +4. Enable suggestions by passing the search option `suggest: true` + +Additionally, you can apply custom `Mapper`, `Replacer`, `Stemmer`, `Filter` or by assigning a custom `normalize(str)`, `prepare(str)` or `finalize(arr)` function to the Encoder. + +### Compare Built-In Encoder Preset + +Original term which was indexed: "Struldbrugs" + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Encoder:ExactNormalize (Default)LatinBalanceLatinAdvancedLatinExtraLatinSoundex
Index Size3.1 Mb1.9 Mb1.7 Mb1.6 Mb1.1 Mb0.7 Mb
Struldbrugs
strũlldbrųĝgs
strultbrooks
shtruhldbrohkz
zdroltbrykz
struhlbrogger
+ +The index size was measured after indexing the book "Gulliver's Travels". + ## Context Search @@ -1292,6 +1384,24 @@ var index = new FlexSearch({ > The contextual index requires additional amount of memory depending on depth. +## Auto-Balanced Cache (By Popularity) + +You need to initialize the cache and its limit of available cache slots during the creation of the index: + +```js +const index = new Index({ cache: 100 }); +``` + +> The method `.searchCache(query)` is available for each type of index. + +```js +const results = index.searchCache(query); +``` + +> The cache automatically balance stored entries related to their popularity. + +The cache also stores latest queries. A common scenario is an autocomplete or instant search when typing. + ## Index Memory Allocation The book "Gulliver's Travels" (Swift Jonathan 1726) was indexed for this test. @@ -1331,9 +1441,18 @@ You can pass a preset during creation/initialization of the index. ## Best Practices +### Page-Load / Fast-Boot + +There are several options to optimize either the page load or when booting up or populate an index on server-side: + +- Using [Fast-Boot Serialization](doc/export-import.md#fast-boot-serialization-for-server-side-rendering-php-python-ruby-rust-java-go-nodejs-) for small and simple indexes +- Using [Non-Blocking Runtime Balancer (Async)](doc/async.md) for populating larger amounts of contents while doing other processes in parallel +- Using [Worker Indexes](doc/worker.md) will distribute the workload to dedicated balanced threads +- Using [Persistent Indexes](doc/persistent.md) when targeting a zero-latency boot-up + ### Use numeric IDs -It is recommended to use numeric id values as reference when adding content to the index. The byte length of passed ids influences the memory consumption significantly. 
If this is not possible you should consider to use a index table and map the ids with indexes, this becomes important especially when using contextual indexes on a large amount of content. +It is recommended to use id values from type `number` as reference when adding content to the index. The reserved byte length of passed ids influences the memory consumption significantly. When stringified numeric IDs are included in your datasets consider replacing these by `parseInt(...)` before pushing to the index. --- diff --git a/doc/cache.md b/doc/cache.md deleted file mode 100644 index e80150e..0000000 --- a/doc/cache.md +++ /dev/null @@ -1,17 +0,0 @@ -## Auto-Balanced Cache (By Popularity) - -You need to initialize the cache and its limit during the creation of the index: - -```js -const index = new Index({ cache: 100 }); -``` - -```js -const results = index.searchCache(query); -``` - -A common scenario for using a cache is an autocomplete or instant search when typing. - -> When passing a number as a limit the cache automatically balance stored entries related to their popularity. - -> When just using "true" the cache is unbounded and perform actually 2-3 times faster (because the balancer do not have to run). diff --git a/doc/custom-builds.md b/doc/custom-builds.md index 2103590..e41c164 100644 --- a/doc/custom-builds.md +++ b/doc/custom-builds.md @@ -12,7 +12,7 @@ You can't resolve build flags with: - rollup - Terser -These are some of the basic builds located in the `/dist/` folder: +You can run any of the basic builds located in the `/dist/` folder, e.g.: ```bash npm run build:bundle diff --git a/doc/document-search.md b/doc/document-search.md index 215bdd0..4850b02 100644 --- a/doc/document-search.md +++ b/doc/document-search.md @@ -1,10 +1,24 @@ +# Document Search (Field-Search) - -## Index Documents (Field-Search) +Whereas the simple `Index` can just consume id-content pairs, the `Document`-Index is able to process more complex data structures like JSON. 
+Technically, a `Document`-Index is a layer on top of several default indexes. You can create multiple independent Document-Indexes in parallel, any of them can use the `Worker` or `Persistent` model optionally. -### The Document Descriptor +FlexSearch Documents also contain these features: -Assuming our document has a data structure like this: +- Document Store including Enrichment +- Multi-Field-Search +- Multi-Tag-Search +- Resolver (Chain Complex Queries) +- Result Highlighting +- Export/Import +- Worker +- Persistent + +## The Document Descriptor + +When creating a `Document`-Index you will need to define a document descriptor in the field `document`. This descriptor is including any specific information about how the document data should be indexed. + +Assuming our document has a simple data structure like this: ```json { @@ -13,42 +27,32 @@ Assuming our document has a data structure like this: } ``` -> The document descriptor has slightly changed, there is no `field` branch anymore, instead just apply one level higher, so `key` becomes a main member of options. +An appropriate Document Descriptor has always to define at least 2 things: -For the new syntax the field "doc" was renamed to `document` and the field "field" was renamed to `index`: +1. the property `id` describes the location of the document ID within a document item +2. the property `index` (or `tag`) containing one or multiple fields from the document, which should be indexed for searching ```js +// create a document index const index = new Document({ document: { id: "id", - index: ["content"] + index: "content" } }); +// add documents to the index index.add({ id: 0, content: "some text" }); ``` -The field `id` describes where the ID or unique key lives inside your documents. The default key gets the value `id` by default when not passed, so you can shorten the example from above to: +As briefly explained above, the field `id` describes where the ID or unique key lives inside your documents. 
When not passed it will always take the field `id` from the top level scope of your data. -```js -const index = new Document({ - document: { - index: ["content"] - } -}); -``` +The property `index` takes all fields you would like to have indexed. When just selecting one field, then you can pass a string. -The member `index` has a list of fields which you want to be indexed from your documents. When just selecting one field, then you can pass a string. When also using default key `id` then this shortens to just: - -```js -const index = new Document({ document: "content" }); -index.add({ id: 0, content: "some text" }); -``` - -Assuming you have several fields, you can add multiple fields to the index: +The next example will add 2 fields `title` and `content` to the index: ```js var docs = [{ @@ -69,7 +73,7 @@ const index = new Document({ }); ``` -You can pass custom options for each field: +Add both fields to the document descriptor and pass individual [Index-Options](options.md) for each field: ```js const index = new Document({ @@ -77,47 +81,37 @@ const index = new Document({ index: [{ field: "title", tokenize: "forward", - optimize: true, + encoder: Charset.LatinAdvanced, resolution: 9 },{ field: "content", - tokenize: "strict", - optimize: true, - resolution: 5, - minlength: 3, - context: { - depth: 1, - resolution: 3 - } + tokenize: "forward", + encoder: Charset.LatinAdvanced, + resolution: 3 }] }); ``` -Field options gets inherited when also global options was passed, e.g.: +Field options inherits from top level options when passed, e.g.: ```js const index = new Document({ - tokenize: "strict", - optimize: true, + tokenize: "forward", + encoder: Charset.LatinAdvanced, resolution: 9, document: { id: "id", index:[{ - field: "title", - tokenize: "forward" + field: "title" },{ field: "content", - minlength: 3, - context: { - depth: 1, - resolution: 3 - } + resolution: 3 }] } }); ``` -Note: The context options from the field "content" also gets inherited by the 
corresponding field options, whereas this field options was inherited by the global option. +> Assigning the `Encoder` instance to the top level configuration will share the encoder to all fields. You should avoid this when contents of fields don't have the same type of content (e.g. one field contains terms, another contains numeric IDs). ### Nested Data Fields (Complex Objects) @@ -136,7 +130,7 @@ Assume the document array looks more complex (has nested branches etc.), e.g.: } ``` -Then use the colon separated notation `root:child:child` to define hierarchy within the document descriptor: +Then use the colon separated notation `root:child:child` as a name for each field defining the hierarchy which corresponds to the document: ```js const index = new Document({ @@ -150,9 +144,11 @@ const index = new Document({ } }); ``` -> Just add fields you want to query against. Do not add fields to the index, you just need in the result (but did not query against). For this purpose you can store documents independently of its index (read below). -When you want to query through a field you have to pass the exact key of the field you have defined in the `doc` as a field name (with colon syntax): +> [!TIP] +> Just add fields you want to query against. Do not add fields to the index, you just need in the result. For this purpose you can store documents independently of its index (read below). 
+ +To query against one or multiple specific fields you have to pass the exact key of the field you have defined in the document descriptor as a field name (with colon syntax): ```js index.search(query, { @@ -176,6 +172,20 @@ index.search(query, [ Using field-specific options: +```js +index.search("some query", [{ + field: "record:title", + limit: 100, + suggest: true +},{ + field: "record:content:header", + limit: 100, + suggest: false +}]); +``` + +You can also perform a search through the same field with different queries: + ```js index.search([{ field: "record:title", @@ -190,15 +200,11 @@ index.search([{ }]); ``` -You can perform a search through the same field with different queries. - -> When passing field-specific options you need to provide the full configuration for each field. They get not inherited like the document descriptor. - ### Complex Documents You need to follow 2 rules for your documents: -1. The document cannot start with an Array at the root index. This will introduce sequential data and isn't supported yet. See below for a workaround for such data. +1. The document cannot start with an Array __at the root__. This will introduce sequential data and isn't supported yet. See below for a workaround for such data. ```js [ // <-- not allowed as document start! @@ -209,7 +215,7 @@ You need to follow 2 rules for your documents: ] ``` -2. The id can't be nested inside an array (also none of the parent fields can't be an array). This will introduce sequential data and isn't supported yet. See below for a workaround for such data. +2. The document ID can't be nested __inside an Array__. This will introduce sequential data and isn't supported yet. See below for a workaround for such data. 
```js { @@ -255,27 +261,29 @@ The corresponding document descriptor (when all fields should be indexed) looks const index = new Document({ document: { id: "meta:id", - tag: "meta:tag", index: [ - "contents[]:body:title", - "contents[]:body:footer", - "contents[]:keywords" + "contents:body:title", + "contents:body:footer" + ], + tag: [ + "meta:tag", + "contents:keywords" ] } }); ``` -Again, when searching you have to use the same colon-separated-string from your field definition. +Remember when searching you have to use the same colon-separated-string as a key from your field definition. ```js index.search(query, { - index: "contents[]:body:title" + index: "contents:body:title" }); ``` ### Not Supported Documents (Sequential Data) -This example breaks both rules from above: +This example breaks both rules described above: ```js [ // <-- not allowed as document start! @@ -303,90 +311,83 @@ This example breaks both rules from above: ] ``` -You need to apply some kind of structure normalization. +You need to unroll your data within a simple loop before adding to the index. 
-A workaround to such a data structure looks like this: +A workaround to such a data structure from above could look like: ```js const index = new Document({ document: { - id: "record:id", - tag: "tag", + id: "id", index: [ - "record:body:title", - "record:body:footer", - "record:body:keywords" + "body:title", + "body:footer" + ], + tag: [ + "tag", + "keywords" ] } }); function add(sequential_data){ - for(let x = 0, data; x < sequential_data.length; x++){ + for(let x = 0, item; x < sequential_data.length; x++){ - data = sequential_data[x]; + item = sequential_data[x]; - for(let y = 0, record; y < data.records.length; y++){ - - record = data.records[y]; - - index.add({ - id: record.id, - tag: data.tag, - record: record - }); + for(let y = 0, record; y < item.records.length; y++){ + record = item.records[y]; + // append tag to each record + record.tag = item.tag; + // add to index + index.add(record); } } } // now just use add() helper method as usual: - add([{ // sequential structured data // take the data example above }]); ``` -You can skip the first loop when your document data has just one index as the outer array. 
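Stripped of any FlexSearch specifics, the unrolling loop above reduces to flattening the nested records and copying the parent `tag` onto each record. A minimal sketch with plain objects (the `unroll` helper is hypothetical):

```js
// Flatten sequential data: every outer item carries a tag plus a list
// of records; the tag is copied onto each record so it can be indexed.
function unroll(sequentialData) {
    const flat = [];
    for (const item of sequentialData) {
        for (const record of item.records) {
            flat.push({ ...record, tag: item.tag });
        }
    }
    return flat;
}

const flat = unroll([{
    tag: "dog",
    records: [
        { id: 1, body: { title: "Rex" } },
        { id: 2, body: { title: "Bello" } }
    ]
}]);
// flat => two records, each carrying tag: "dog"
```

Each entry of `flat` could then be passed to `index.add(record)` as shown above.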
- -### Add/Update/Remove Documents to/from the Index +### Add/Update/Remove Documents Add a document to the index: ```js index.add({ - id: 0, - title: "Foo", - content: "Bar" - }); -``` - -Update index with a single object or an array of objects: - -```js -index.update({ - data:{ - id: 0, - title: "Foo", - body: { - content: "Bar" - } - } + id: 0, + title: "Foo", + content: "Bar" }); ``` -Remove a single object or an array of objects from the index: +Update index: ```js -index.remove(docs); +index.update({ + id: 0, + title: "Foo", + content: "Foobar" +}); ``` -When the id is known, you can also simply remove by (faster): +Remove a document and all its contents from an index, by ID: ```js index.remove(id); ``` +Or by the document data: + +```js +index.remove(doc); +``` + + -### Field-Search +## Field-Search Search through all fields: @@ -417,13 +419,7 @@ Search through a given set of fields: index.search(query, { index: ["title", "content"] }); ``` -Same as: - -```js -index.search(query, ["title", "content"]); -``` - -Pass custom modifiers and queries to each field: +Pass custom options and/or queries to each field: ```js index.search([{ @@ -439,11 +435,21 @@ index.search([{ }]); ``` -You can perform a search through the same field with different queries. +### Limit & Offset -See all available field-search options. +> By default, every query is limited to 100 entries. Unbounded queries leads into issues. You need to set the limit as an option to adjust the size. -### The Result Set +You can set the limit and the offset for each query: + +```js +index.search(query, { limit: 20, offset: 100 }); +``` + +> You cannot pre-count the size of the result-set. That's a limit by the design of FlexSearch. When you really need a count of all results you are able to page through, then just assign a high enough limit and get back all results and apply your paging offset manually (this works also on server-side). FlexSearch is fast enough that this isn't an issue. 
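The manual paging suggested above can be applied to a plain result array. In this sketch `results` stands in for whatever `index.search()` returned with a high enough limit, and `pageResults` is a hypothetical helper, not a FlexSearch method:

```js
// Page through a result array that was fetched once with a high limit.
function pageResults(results, pageSize, pageIndex) {
    const offset = pageIndex * pageSize;
    return {
        total: results.length,
        pages: Math.ceil(results.length / pageSize),
        entries: results.slice(offset, offset + pageSize)
    };
}

const page = pageResults([1, 2, 3, 4, 5, 6, 7], 3, 1);
// page.entries => [4, 5, 6], page.pages => 3, page.total => 7
```

This works the same way on server-side, where the full result set can be cached between page requests.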
+ +[See all available field-search options](options.md) + +## The Result Set Schema of the result-set: @@ -566,51 +572,98 @@ index.search(query, { This gives you result which are tagged with one of the given tag. -> Multiple tags will apply as the boolean "or" by default. It just needs one of the tags to be existing. - -This is another situation where the `bool` property is still supported. When you like to switch the default "or" logic from the tag search into "and", e.g.: - -```js -index.search(query, { - index: "content", - tag: ["dog", "animal"], - bool: "and" -}); -``` You will just get results which contains both tags (in this example there is just one records which has the tag "dog" and "animal"). -### Tag Search - -You can also fetch results from one or more tags when no query was passed: +## Multi-Tag-Search +Assume this document schema (a dataset from IMDB): ```js -index.search({ tag: ["cat", "dog"] }); +{ + "tconst": "tt0000001", + "titleType": "short", + "primaryTitle": "Carmencita", + "originalTitle": "Carmencita", + "isAdult": 0, + "startYear": "1894", + "endYear": "", + "runtimeMinutes": "1", + "genres": [ + "Documentary", + "Short" + ] +} ``` -In this case the result-set looks like: - +An appropriate document descriptor could look like: ```js -[{ - tag: "cat", - result: [ /* all cats */ ] -},{ - tag: "dog", - result: [ /* all dogs */ ] -}] +import Charset from "flexsearch"; +const flexsearch = new Document({ + encoder: Charset.Normalize, + resolution: 3, + document: { + id: "tconst", + //store: true, // document store + index: [{ + field: "primaryTitle", + tokenize: "forward" + },{ + field: "originalTitle", + tokenize: "forward" + }], + tag: [ + "startYear", + "genres" + ] + } +}); +``` +The field contents of `primaryTitle` and `originalTitle` are encoded by the forward tokenizer. The field contents of `startYear` and `genres` are added as tags. 
+ +Get all entries of a specific tag: +```js +const result = flexsearch.search({ + //enrich: true, // enrich documents + tag: { "genres": "Documentary" }, + limit: 1000, + offset: 0 +}); ``` -### Limit & Offset - -> By default, every query is limited to 100 entries. Unbounded queries leads into issues. You need to set the limit as an option to adjust the size. - -You can set the limit and the offset for each query: - +Get entries of multiple tags (intersection): ```js -index.search(query, { limit: 20, offset: 100 }); +const result = flexsearch.search({ + //enrich: true, // enrich documents + tag: { + "genres": ["Documentary", "Short"], + "startYear": "1894" + } +}); ``` -> You cannot pre-count the size of the result-set. That's a limit by the design of FlexSearch. When you really need a count of all results you are able to page through, then just assign a high enough limit and get back all results and apply your paging offset manually (this works also on server-side). FlexSearch is fast enough that this isn't an issue. 
+Combine tags with queries (intersection): +```js +const result = flexsearch.search({ + query: "Carmen", // forward tokenizer + tag: { + "genres": ["Documentary", "Short"], + "startYear": "1894" + } +}); +``` + +Alternative declaration: +```js +const result = flexsearch.search("Carmen", { + tag: [{ + field: "genres", + tag: ["Documentary", "Short"] + },{ + field: "startYear", + tag: "1894" + }] +}); +``` ## Document Store @@ -770,99 +823,6 @@ By passing the search option `merge: true` the result set will be merged into (g }] ``` - - -## Multi-Tag-Search - -Assume this document schema (a dataset from IMDB): -```js -{ - "tconst": "tt0000001", - "titleType": "short", - "primaryTitle": "Carmencita", - "originalTitle": "Carmencita", - "isAdult": 0, - "startYear": "1894", - "endYear": "", - "runtimeMinutes": "1", - "genres": [ - "Documentary", - "Short" - ] -} -``` - -An appropriate document descriptor could look like: -```js -import LatinEncoder from "./charset/latin/simple.js"; - -const flexsearch = new Document({ - encoder: LatinEncoder, - resolution: 3, - document: { - id: "tconst", - //store: true, // document store - index: [{ - field: "primaryTitle", - tokenize: "forward" - },{ - field: "originalTitle", - tokenize: "forward" - }], - tag: [ - "startYear", - "genres" - ] - } -}); -``` -The field contents of `primaryTitle` and `originalTitle` are encoded by the forward tokenizer. The field contents of `startYear` and `genres` are added as tags. 
- -Get all entries of a specific tag: -```js -const result = flexsearch.search({ - //enrich: true, // enrich documents - tag: { "genres": "Documentary" }, - limit: 1000, - offset: 0 -}); -``` - -Get entries of multiple tags (intersection): -```js -const result = flexsearch.search({ - //enrich: true, // enrich documents - tag: { - "genres": ["Documentary", "Short"], - "startYear": "1894" - } -}); -``` - -Combine tags with queries (intersection): -```js -const result = flexsearch.search({ - query: "Carmen", // forward tokenizer - tag: { - "genres": ["Documentary", "Short"], - "startYear": "1894" - } -}); -``` - -Alternative declaration: -```js -const result = flexsearch.search("Carmen", { - tag: [{ - field: "genres", - tag: ["Documentary", "Short"] - },{ - field: "startYear", - tag: "1894" - }] -}); -``` - ## Filter Fields (Index / Tags / Datastore) ```js @@ -898,7 +858,6 @@ const flexsearch = new Document({ }); ``` - ## Custom Fields (Index / Tags / Datastore) Dataset example: @@ -979,3 +938,6 @@ const result = flexsearch.search({ }); ``` +### Best Practices: Merge Documents + +[Read here](encoder.md#merge-documents) diff --git a/doc/encoder-workflow.svg b/doc/encoder-workflow.svg new file mode 100644 index 0000000..735783f --- /dev/null +++ b/doc/encoder-workflow.svg @@ -0,0 +1,3 @@ + + +
[encoder-workflow.svg: Input Content → Normalize → Prepare → Split (include/exclude) → Filter → Stemmer → Filter → Mapper → Deduplication → Matcher → Replacer → Finalize → Encoded Content]
\ No newline at end of file
diff --git a/doc/encoder.md b/doc/encoder.md
index a160f00..6da2335 100644
--- a/doc/encoder.md
+++ b/doc/encoder.md
@@ -1,19 +1,80 @@
 ## Encoder
 
-Search capabilities highly depends on language processing. The old workflow wasn't really practicable. The new Encoder class is a huge improvement and fully replaces the encoding part. Some FlexSearch options was moved to the new `Encoder` instance.
+> [!IMPORTANT]
+> You shouldn't miss this part, as it is one of the most important aspects of FlexSearch.
 
-New Encoding Pipeline:
-1. charset normalization
-2. custom preparation
-3. split into terms (apply includes/excludes)
-4. filter (pre-filter)
-5. matcher (substitute terms)
-6. stemmer (substitute term endings)
-7. filter (post-filter)
-8. replace chars (mapper)
-9. custom regex (replacer)
-10. letter deduplication
-11. apply finalize
+Search capabilities highly depend on language processing. The Encoder class is one of the most important core functionalities of FlexSearch.
+
+Current Encoding Pipeline:
+
+1. Charset Normalization
+2. Custom Preparation
+3. Split Content (into terms, apply includes/excludes)
+4. Filter: Pre-Filter
+5. Matcher (substitute partials)
+6. Stemmer (substitute term endings)
+7. Filter: Post-Filter
+8. Replace Chars (Mapper)
+9. Custom Regex (Replacer)
+10. Letter Deduplication
+11. Custom Finalize
+
+> Encoders are basically responsible for "fuzziness". [Read here about Phonetic Search/Fuzzy Search](../README.md#fuzzy-search)
+
+### Default Encoder
+
+The default Encoder (when passing no options on creation) uses this configuration:
+
+```js
+const encoder = new Encoder({
+    normalize: true,
+    dedupe: true,
+    cache: true,
+    include: {
+        letter: true,
+        number: true,
+        symbol: false,
+        punctuation: false,
+        control: false,
+        char: ""
+    }
+});
+```
+
+The default configuration will:
+
+1. apply charset normalization, e.g. "é" to "e"
+2. apply letter deduplication, e.g. "missing" to "mising"
+3. 
just index alphanumeric content and filter everything else out
+
+This is important to keep in mind, because when you need a different configuration you'll have to change those settings accordingly.
+
+Let's assume you want to include the symbols "#", "@" and "-" because they are needed to differentiate search results (otherwise the search would be useless), and let's say you don't need numeric content indexed. You can do this by:
+
+```js
+const encoder = new Encoder({
+    // default configuration is applied
+    // extend or override:
+    include: {
+        // by default everything is set to false
+        letter: true,
+        number: false,
+        char: ["#", "@", "-"]
+    }
+});
+```
+
+#### Built-In Universal Encoders
+
+1. Charset.Exact
+2. Charset.Normalize (Charset.Default)
+
+#### Built-In Latin Encoders
+
+1. Charset.LatinBalance
+2. Charset.LatinAdvanced
+3. Charset.LatinExtra
+4. Charset.LatinSoundex
 
 ### Example
 
@@ -47,14 +108,22 @@ const encoder = new Encoder({
 });
 ```
 
-Instead of using `include` or `exclude` you can pass a regular expression to the field `split`:
+Instead of using `include` or `exclude` you can pass a regular expression or a string to the field `split`:
 
 ```js
-const encoder = new Encoder({
+const encoder = new Encoder({
     split: /\s+/
 });
 ```
 
+E.g. this split configuration will tokenize every single char of the content:
+
+```js
+const encoder = new Encoder({
+    split: ""
+});
+```
+
> The definitions `include` and `exclude` are a replacement for `split`. You can define just one of those 3.
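As a plain-JavaScript illustration (using `String.prototype.split`, not FlexSearch internals), the two `split` configurations above behave like this:

```javascript
// Conceptual illustration of the `split` option using plain
// String.prototype.split (not FlexSearch internals):

// split: /\s+/ yields whole terms separated by whitespace
const terms = "Hello World".split(/\s+/);
console.log(terms); // ["Hello", "World"]

// split: "" yields every single char as its own token
const chars = "word".split("");
console.log(chars); // ["w", "o", "r", "d"]
```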
Adding custom functions to the encoder pipeline: @@ -73,7 +142,9 @@ const encoder = new Encoder({ }); ``` -Assign encoder to an index: +Further reading: [Encoder Processing Workflow](#encoder-processing-workflow) + +Assign an encoder to an index: ```js const index = new Index({ @@ -81,82 +152,117 @@ const index = new Index({ }); ``` -Define language specific transformations: +Define language specific normalizations/transformations: ```js const encoder = new Encoder({ - replacer: [ - /[´`’ʼ]/g, "'" - ], + stemmer: new Map([ + ["ly", ""] + ]), filter: new Set([ "and", ]), + mapper: new Map([ + ["é", "e"] + ]), matcher: new Map([ ["xvi", "16"] ]), - stemmer: new Map([ - ["ly", ""] - ]), - mapper: new Map([ - ["é", "e"] - ]) + replacer: [ + /[´`’ʼ]/g, "'" + ], }); ``` -Or use predefined language and extend it with custom options: +Further reading: [Encoder Processing Workflow](#encoder-processing-workflow) + +Or use built-in helpers alternatively: ```js -import EnglishBookPreset from "./lang/en.js"; -const encoder = new Encoder(EnglishBookPreset, { - filter: false -}); +const encoder = new Encoder() + .addStemmer("ly", "") + .addFilter("and") + .addMapper("é", "e") + .addMatcher("xvi", "16") + .addReplacer(/[´`’ʼ]/g, "'"); +``` + +Some of the built-in helpers will automatically detect inputs and use the proper helper under the hood. 
So theoretically you can lazily just write:
+
+```js
+const encoder = new Encoder()
+    .addStemmer("ly", "")
+    .addFilter("and")
+    .addReplacer("é", "e")
+    .addReplacer("xvi", "16")
+    .addReplacer(/[´`’ʼ]/g, "'");
+```
+
+You can also use presets and extend them with custom options:
+
+```js
+import EnglishBookPreset from "flexsearch/lang/en";
+const encoder = new Encoder(
+    EnglishBookPreset,
+    // use the preset but don't filter terms
+    { filter: false }
+);
```
 
 Equivalent:
 
 ```js
-import EnglishBookPreset from "./lang/en.js";
 const encoder = new Encoder(EnglishBookPreset);
 encoder.assign({ filter: false });
 ```
 
-Assign extensions to the encoder instance:
+Assign multiple extensions to the encoder instance:
 
 ```js
-import LatinEncoderPreset from "./charset/latin/simple.js";
-import EnglishBookPreset from "./lang/en.js";
+import { Charset } from "flexsearch";
+import EnglishBookPreset from "flexsearch/lang/en";
 // stack definitions to the encoder instance
 const encoder = new Encoder()
-    .assign(LatinEncoderPreset)
+    .assign(Charset.LatinSoundex)
     .assign(EnglishBookPreset)
-// override preset options ...
+    // extend or override preset options:
     .assign({ minlength: 3 });
-// assign further presets ...
+    // assign further presets ...
 ```
 
-> When adding extension to the encoder every previously assigned configuration is still intact, very much like Mixins, also when assigning custom functions.
+> When adding extensions to the encoder, every previously assigned configuration stays intact, very much like mixins. This also applies to custom functions: previously added functions will still execute.
Add custom transformations to an existing index: ```js -import LatinEncoderPreset from "./charset/latin/default.js"; -const encoder = new Encoder(LatinEncoderPreset); -encoder.addReplacer(/[´`’ʼ]/g, "'"); +const encoder = new Encoder(Charset.Normalize); +// filter terms encoder.addFilter("and"); -encoder.addMatcher("xvi", "16"); -encoder.addStemmer("ly", ""); +// replace single chars encoder.addMapper("é", "e"); +// replace char sequences +encoder.addMatcher("xvi", "16"); +// replace single chars or char sequences +// at the end of a term +encoder.addStemmer("ly", ""); +// custom regex replace +encoder.addReplacer(/[´`’ʼ]/g, "'"); ``` Shortcut for just assigning one encoder configuration to an index: ```js -import LatinEncoderPreset from "./charset/latin/default.js"; const index = new Index({ - encoder: LatinEncoderPreset + encoder: Charset.Normalize }); ``` +### Encoder Processing Workflow + +This workflow schema might help you to understand each step in the iteration: +

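The mapper, matcher, stemmer and replacer transformations differ mainly in where they match within a term. A plain-JavaScript sketch of the concepts (not the actual FlexSearch internals):

```javascript
// Conceptual sketch of the four term transformations
// (plain JS, not FlexSearch internals):

// Mapper: replaces single chars anywhere in a term
const mapped = "café".replace(/é/g, "e");        // "cafe"

// Matcher: replaces char sequences anywhere in a term
const matched = "xvii".replace(/xvi/g, "16");    // "16i"

// Stemmer: replaces a sequence only at the end of a term
const stemmed = "lovely".replace(/ly$/, "");     // "love"

// Replacer: any custom regex replacement
const replaced = "it´s".replace(/[´`’ʼ]/g, "'"); // "it's"

console.log(mapped, matched, stemmed, replaced);
```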
+
+
 ### Custom Encoder
 
 Since it is very simple to create a custom Encoder, you are welcome to create your own.
@@ -180,29 +286,12 @@ const index = new Index({
 });
 ```
 
+You can't extend the built-in tokenizers "exact", "forward", "bidirectional" or "full".
+If none of them is applicable to your task, you should tokenize everything inside your custom encoder function.
+
 If you get some good results please feel free to share your encoder.
 
-
-
-#### Add custom tokenizer
-
-> A tokenizer split words/terms into components or partials.
-
-Define a private custom tokenizer during creation/initialization:
-```js
-var index = new FlexSearch({
-
-    tokenize: function(str){
-
-        return str.split(/\s-\//g);
-    }
-});
-```
-
-> The tokenizer function gets a string as a parameter and has to return an array of strings representing a word or term. In some languages every char is a term and also not separated via whitespaces.
-
-
-#### Add language-specific stemmer and/or filter
+### Add language-specific stemmer and/or filter
 
 > __Stemmer:__ several linguistic mutations of the same word (e.g. "run" and "running")
 
@@ -210,8 +299,7 @@
 Assign a private custom stemmer or filter during creation/initialization:
 ```js
-var index = new FlexSearch({
-
+const index = new Index({
     stemmer: {
 
         // object {key: replacement}
@@ -235,63 +323,37 @@
 Using a custom filter, e.g.:
 ```js
-var index = new FlexSearch({
-
+const index = new Index({
     filter: function(value){
 
-        // just add values with length > 1 to the index
-
         return value.length > 1;
     }
 });
 ```
 
-Or assign stemmer/filters globally to a language:
+Load language packs with legacy browser support (non-modules):
 
-> Stemmer are passed as a object (key-value-pair), filter as an array.
-
-```js
-FlexSearch.registerLanguage("us", {
-
-    stemmer: { /* ... */ },
-    filter: [ /* ... */ ]
-});
-```
-
-Or use some pre-defined stemmer or filter of your preferred languages:
 ```html
-
+
 ...
```
 
-Now you can assign built-in stemmer during creation/initialization:
-```js
-var index_en = new FlexSearch.Index({
-    language: "en"
-});
-
-var index_de = new FlexSearch.Index({
-    language: "de"
-});
-```
-
-In Node.js all built-in language packs files are available:
+In Node.js all built-in language pack files are available via their scope:
 
 ```js
-const { Index } = require("flexsearch");
-
-var index_en = new Index({
-    language: "en"
+const EnglishPreset = require("flexsearch/lang/en");
+const index = new Index({
+    encoder: EnglishPreset
 });
 ```
 
-### Right-To-Left Support
+## Right-To-Left Support
 
 > Set the tokenizer at least to "reverse" or "full" when using RTL.
 
@@ -306,7 +368,7 @@ var index = new Index({
 ```
 
-### CJK Word Break (Chinese, Japanese, Korean)
+## CJK Word Break (Chinese, Japanese, Korean)
 
 Set a custom tokenizer which fits your needs, e.g.:
 
@@ -326,544 +388,169 @@ index.add(0, "一个单词");
 var results = index.search("单词");
 ```
 
+## Built-In Language Packs
+
+- English: `en`
+- German: `de`
+- French: `fr`
 
-## Fuzzy-Search
-
-Fuzzysearch describes a basic concept of how making queries more tolerant. FlexSearch provides several methods to achieve fuzziness:
-
-1. Use a tokenizer: `forward`, `reverse` or `full`
-2. Don't forget to use any of the builtin encoder `simple` > `balance` > `advanced` > `extra` > `soundex` (sorted by fuzziness)
-3. Use one of the language specific presets e.g. `/lang/en.js` for en-US specific content
-4. Enable suggestions by passing the search option `suggest: true`
-
-Additionally, you can apply custom `Mapper`, `Replacer`, `Stemmer`, `Filter` or by assigning a custom `normalize(str)`, `prepare(str)` or `finalize(arr)` function to the Encoder.
-
-### Compare Fuzzy-Search Encoding
-
-Original term which was indexed: "Struldbrugs"
-
-| Encoder: | LatinExact | LatinDefault | LatinSimple | LatinBalance | LatinAdvanced | LatinExtra | LatinSoundex |
-|---|---|---|---|---|---|---|---|
-| Index Size | 3.1 Mb | 1.9 Mb | 1.8 Mb | 1.7 Mb | 1.6 Mb | 1.1 Mb | 0.7 Mb |
-
-Queried term variants: "Struldbrugs", "struldbrugs", "strũldbrųĝgs", "strultbrooks", "shtruhldbrohkz", "zdroltbrykz", "struhlbrogger"
- -The index size was measured after indexing the book "Gulliver's Travels". - - -## Encoder - -Search capabilities highly depends on language processing. The old workflow wasn't really practicable. The new Encoder class is a huge improvement and fully replaces the encoding part. Some FlexSearch options was moved to the new `Encoder` instance. - -New Encoding Pipeline: -1. charset normalization -2. custom preparation -3. split into terms (apply includes/excludes) -4. filter (pre-filter) -5. matcher (substitute terms) -6. stemmer (substitute term endings) -7. filter (post-filter) -8. replace chars (mapper) -9. custom regex (replacer) -10. letter deduplication -11. apply finalize - -### Example - -```js -const encoder = new Encoder({ - normalize: true, - dedupe: true, - cache: true, - include: { - letter: true, - number: true, - symbol: false, - punctuation: false, - control: false, - char: "@" - } -}); -``` - -You can use an `include` __instead__ of an `exclude` definition: - -```js -const encoder = new Encoder({ - exclude: { - letter: false, - number: false, - symbol: true, - punctuation: true, - control: true - } -}); -``` - -Instead of using `include` or `exclude` you can pass a regular expression to the field `split`: - -```js -const encoder = new Encoder({ - split: /\s+/ -}); -``` - -> The definitions `include` and `exclude` is a replacement for `split`. You can just define one of those 3. 
- -Adding custom functions to the encoder pipeline: - -```js -const encoder = new Encoder({ - normalize: function(str){ - return str.toLowerCase(); - }, - prepare: function(str){ - return str.replace(/&/g, " and "); - }, - finalize: function(arr){ - return arr.filter(term => term.length > 2); - } -}); -``` - -Assign encoder to an index: - -```js -const index = new Index({ - encoder: encoder -}); -``` - -Define language specific transformations: - -```js -const encoder = new Encoder({ - replacer: [ - /[´`’ʼ]/g, "'" - ], - filter: new Set([ - "and", - ]), - matcher: new Map([ - ["xvi", "16"] - ]), - stemmer: new Map([ - ["ly", ""] - ]), - mapper: new Map([ - ["é", "e"] - ]) -}); -``` - -Or use predefined language and extend it with custom options: - -```js -import EnglishBookPreset from "./lang/en.js"; -const encoder = new Encoder(EnglishBookPreset, { - filter: false -}); -``` - -Equivalent: - -```js -import EnglishBookPreset from "./lang/en.js"; -const encoder = new Encoder(EnglishBookPreset); -encoder.assign({ filter: false }); -``` - -Assign extensions to the encoder instance: - -```js -import LatinEncoderPreset from "./charset/latin/simple.js"; -import EnglishBookPreset from "./lang/en.js"; -// stack definitions to the encoder instance -const encoder = new Encoder() - .assign(LatinEncoderPreset) - .assign(EnglishBookPreset) - // override preset options ... - .assign({ minlength: 3 }); - // assign further presets ... -``` - -> When adding extension to the encoder every previously assigned configuration is still intact, very much like Mixins, also when assigning custom functions. 
- -Add custom transformations to an existing index: - -```js -import LatinEncoderPreset from "./charset/latin/default.js"; -const encoder = new Encoder(LatinEncoderPreset); -encoder.addReplacer(/[´`’ʼ]/g, "'"); -encoder.addFilter("and"); -encoder.addMatcher("xvi", "16"); -encoder.addStemmer("ly", ""); -encoder.addMapper("é", "e"); -``` - -Shortcut for just assigning one encoder configuration to an index: - -```js -import LatinEncoderPreset from "./charset/latin/default.js"; -const index = new Index({ - encoder: LatinEncoderPreset -}); -``` - -### Custom Encoder - -Since it is very simple to create a custom Encoder, you are welcome to create your own. -e.g. -```js -function customEncoder(content){ - const tokens = []; - // split content into terms/tokens - // apply your changes to each term/token - // you will need to return an Array of terms/tokens - // so just iterate through the input string and - // push tokens to the array - // ... - return tokens; -} - -const index = new Index({ - // set to strict when your tokenization was already done - tokenize: "strict", - encode: customEncoder -}); -``` - -If you get some good results please feel free to share your encoder. - -## Languages - -Language-specific definitions are being divided into two groups: - -1. Charset - 1. ___encode___, type: `function(string):string[]` - 2. ___rtl___, type: `boolean` -2. Language - 1. ___matcher___, type: `{string: string}` - 2. ___stemmer___, type: `{string: string}` - 3. ___filter___, type: `string[]` - -The charset contains the encoding logic, the language contains stemmer, stopword filter and matchers. Multiple language definitions can use the same charset encoder. Also this separation let you manage different language definitions for special use cases (e.g. names, cities, dialects/slang, etc.). 
- -To fully describe a custom language __on the fly__ you need to pass: - -```js -const index = FlexSearch({ - // mandatory: - encode: (content) => [words], - // optionally: - rtl: false, - stemmer: {}, - matcher: {}, - filter: [] -}); -``` - -When passing no parameter it uses the `latin:default` schema by default. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Field | Category | Description |
-|---|---|---|
-| encode | charset | The encoder function. Has to return an array of separated words (or an empty string). |
-| rtl | charset | A boolean property which indicates right-to-left encoding. |
-| filter | language | Filter are also known as "stopwords", they completely filter out words from being indexed. |
-| stemmer | language | Stemmer removes word endings and is a kind of "partial normalization". A word ending just matched when the word length is bigger than the matched partial. |
-| matcher | language | Matcher replaces all occurrences of a given string regardless of its position and is also a kind of "partial normalization". |
- -### 1. Language Packs: ES6 Modules +### 1. Import Language Packs: ES6 Modules The most simple way to assign charset/language specific encoding via modules is: ```js -import charset from "./dist/module/lang/latin/advanced.js"; -import lang from "./dist/module/lang/en.js"; - -const index = FlexSearch({ - charset: charset, - lang: lang +import EnglishPreset from "flexsearch/lang/en"; +const index = Index({ + charset: EnglishPreset }); ``` -Just import the __default export__ by each module and assign them accordingly. - -The full qualified example from above is: +You can stack up and combine multiple presets: ```js -import { encode, rtl } from "./dist/module/lang/latin/advanced.js"; -import { stemmer, filter, matcher } from "./dist/module/lang/en.js"; +import { Charset } from "flexsearch"; +import EnglishPreset from "flexsearch/lang/en"; -const index = FlexSearch({ - encode: encode, - rtl: rtl, - stemmer: stemmer, - matcher: matcher, - filter: filter +const index = Index({ + charset: new Encoder( + Charset.LatinAdvanced, + EnglishPreset, + { minlength: 3 } + ) }); ``` -The example above is the standard interface which is at least exported from each charset/language. - -You can also define the encoder directly and left all other options: +You can also assign the encoder preset directly: ```js -import simple from "./dist/module/lang/latin/simple.js"; - -const index = FlexSearch({ - encode: simple +const index = Index({ + encoder: Charset.Default }); ``` -#### Available Latin Encoders - -1. default -2. simple -3. balance -4. advanced -5. extra - -You can assign a charset by passing the charset during initialization, e.g. `charset: "latin"` for the default charset encoder or `charset: "latin:soundex"` for a encoder variant. - -#### Dialect / Slang - -Language definitions (especially matchers) also could be used to normalize dialect and slang of a specific language. - -### 2. 
Language Packs: ES5 (Language Packs)
-
-You need to make the charset and/or language definitions available by:
-
-1. All charset definitions are included in the `flexsearch.bundle.js` build by default, but no language-specific definitions are included
-2. You can load packages located in `/dist/lang/` (files refers to languages, folders are charsets)
-3. You can make a custom build
+### 2. Import Language Packs: ES5 Legacy Browser
 
 When loading language packs, make sure that the library was loaded before:
 
 ```html
-
-
+
 ```
 
-When using the full "bundle" version the built-in latin encoders are already included and you just have to load the language file:
-
-```html
-
-
-```
-
-Because you loading packs as external packages (non-ES6-modules) you have to initialize them by shortcuts:
+The language packs are registered on `FlexSearch.Language`:
 
 ```js
-const index = FlexSearch({
-    charset: "latin:soundex",
-    lang: "en"
+const index = FlexSearch.Index({
+    encoder: FlexSearch.Language["en"]
 });
 ```
 
-> Use the `charset:variant` notation to assign charset and its variants. When just passing the charset without a variant will automatically resolve as `charset:default`.
-
-You can also override existing definitions, e.g.:
+You can stack up and combine multiple presets:
 
 ```js
-const index = FlexSearch({
-    charset: "latin",
-    lang: "en",
-    matcher: {}
+const index = FlexSearch.Index({
+    charset: new FlexSearch.Encoder(
+        FlexSearch.Charset.LatinAdvanced,
+        FlexSearch.Language["en"],
+        { minlength: 3 }
+    )
 });
 ```
 
-> Passed definitions will __not__ extend default definitions, they will replace them.
+### Share Encoders
 
-When you like to extend a definition just create a new language file and put in all the logic.
+Assigning an `Encoder` instance to the top-level configuration will share the encoder across all fields. You should avoid this when fields don't contain the same type of content (e.g. one field contains terms, another contains numeric IDs).
-#### Encoder Variants
+
+Sharing the encoder can improve encoding efficiency and memory allocation, but when not used properly it can also have a negative effect on performance. You can share encoders with any type of index, also through multiple instances of indexes (also documents).
 
-It is pretty straight forward when using an encoder variant:
+You should group similar types of content under one encoder respectively. When you have different content types, define one encoder for each of them.
 
-```html
-
-
-
-
-```
-
-When using the full "bundle" version the built-in latin encoders are already included and you just have to load the language file:
-
-```html
-
-
-```
+In this example there are two Document-Indexes for two different documents, "orders" and "billings". You can also share an encoder between different fields of just one document.
 
 ```js
-const index_advanced = FlexSearch({
-    charset: "latin:advanced"
-});
-
-const index_extra = FlexSearch({
-    charset: "latin:extra"
-});
-```
-
-
-### Language Processing Pipeline
-
-This is the default pipeline provided by FlexSearch:
-

- -

-
-#### Custom Pipeline
-
-At first take a look into the default pipeline in `src/common.js`. It is very simple and straight forward. The pipeline will process as some sort of inversion of control, the final encoder implementation has to handle charset and also language specific transformations. This workaround has left over from many tests.
-
-Inject the default pipeline by e.g.:
-
-```js
-this.pipeline(
-
-    /* string: */ str.toLowerCase(),
-    /* normalize: */ false,
-    /* split: */ split,
-    /* collapse: */ false
+// usual term encoding
+const encoder_terms = Encoder(
+    Charset.LatinAdvanced,
+    // just add letters (no numbers)
+    { include: { letter: true } }
 );
+// numeric encoding
+const encoder_numeric = new Encoder(Charset.Default);
+
+const orders = Document({
+    document: {
+        id: "id",
+        index: [{
+            field: "product_title",
+            encoder: encoder_terms
+        },{
+            field: "product_details",
+            encoder: encoder_terms
+        },{
+            field: "order_date",
+            encoder: encoder_numeric
+        },{
+            field: "customer_id",
+            encoder: encoder_numeric
+        }]
+    }
+});
+
+const billings = Document({
+    document: {
+        id: "id",
+        index: [{
+            field: "product_title",
+            encoder: encoder_terms
+        },{
+            field: "product_content",
+            encoder: encoder_terms
+        },{
+            field: "billing_date",
+            encoder: encoder_numeric
+        },{
+            field: "customer_id",
+            encoder: encoder_numeric
+        }]
+    }
+});
 ```
 
-Use the pipeline schema from above to understand the iteration and the difference of pre-encoding and post-encoding. Stemmer and matchers needs to be applied after charset normalization but before language transformations, filters also.
+### Merge Documents
 
-Here is a good example of extending pipelines: `src/lang/latin/extra.js` → `src/lang/latin/advanced.js` → `src/lang/latin/simple.js`.
+When you have multiple document types (indexed by multiple indexes) where some of the data shares the same fields (like in the example above) and you can refer to them by a common identifier or key, you should consider merging those documents into one.
This will hugely reduce the index size.
 
-### How to contribute?
-
-Search for your language in `src/lang/`, if it exists you can extend or provide variants (like dialect/slang). If the language doesn't exist create a new file and check if any of the existing charsets (e.g. latin) fits to your language. When no charset exist, you need to provide a charset as a base for the language.
-
-A new charset should provide at least:
-
-1. `encode` A function which normalize the charset of a passed text content (remove special chars, lingual transformations, etc.) and __returns an array of separated words__. Also stemmer, matcher or stopword filter needs to be applied here. When the language has no words make sure to provide something similar, e.g. each chinese sign could also be a "word". Don't return the whole text content without split.
-3. `rtl` A boolean flag which indicates right-to-left encoding
-
-Basically the charset needs just to provide an encoder function along with an indicator for right-to-left encoding:
-
+E.g.
when you merge "orders" and "billings" from example above by ID, then you can use just one index: ```js -export function encode(str){ return [str] } -export const rtl = false; +const encoder_terms = Encoder( + Charset.LatinAdvanced, + // just add letters (no numbers) + { include: { letter: true } } +); +const encoder_numeric = new Encoder( + Charset.Default +); + +const merged = Document({ + document: { + id: "id", + index: [{ + field: "product_title", + encoder: encoder_terms + },{ + field: "product_details", + encoder: encoder_terms + },{ + field: "order_date", + encoder: encoder_numeric + },{ + field: "billing_date", + encoder: encoder_numeric + },{ + field: "customer_id", + encoder: encoder_numeric + }] + } +}); ``` diff --git a/doc/fuzzy-search.md b/doc/fuzzy-search.md deleted file mode 100644 index 6962d85..0000000 --- a/doc/fuzzy-search.md +++ /dev/null @@ -1,109 +0,0 @@ -## Fuzzy-Search - -Fuzzysearch describes a basic concept of how making queries more tolerant. FlexSearch provides several methods to achieve fuzziness: - -1. Use a tokenizer: `forward`, `reverse` or `full` -2. Don't forget to use any of the builtin encoder `simple` > `balance` > `advanced` > `extra` > `soundex` (sorted by fuzziness) -3. Use one of the language specific presets e.g. `/lang/en.js` for en-US specific content -4. Enable suggestions by passing the search option `suggest: true` - -Additionally, you can apply custom `Mapper`, `Replacer`, `Stemmer`, `Filter` or by assigning a custom `normalize(str)`, `prepare(str)` or `finalize(arr)` function to the Encoder. - -### Compare Fuzzy-Search Encoding - -Original term which was indexed: "Struldbrugs" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-| Encoder: | LatinExact | LatinDefault | LatinSimple | LatinBalance | LatinAdvanced | LatinExtra | LatinSoundex |
-|---|---|---|---|---|---|---|---|
-| Index Size | 3.1 Mb | 1.9 Mb | 1.8 Mb | 1.7 Mb | 1.6 Mb | 1.1 Mb | 0.7 Mb |
-
-Queried term variants: "Struldbrugs", "struldbrugs", "strũldbrųĝgs", "strultbrooks", "shtruhldbrohkz", "zdroltbrykz", "struhlbrogger"
- -The index size was measured after indexing the book "Gulliver's Travels". diff --git a/doc/options.md b/doc/options.md index 11e2f48..190e4a4 100644 --- a/doc/options.md +++ b/doc/options.md @@ -402,3 +402,43 @@ "or" + +## Encoder Options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+| Field | Category | Description |
+|---|---|---|
+| encode | charset | The encoder function. Has to return an array of separated words (or an empty string). |
+| rtl | charset | A boolean property which indicates right-to-left encoding. |
+| filter | language | Filters are also known as "stopwords"; they completely exclude words from being indexed. |
+| stemmer | language | Stemmer removes word endings and is a kind of "partial normalization". A word ending is only matched when the word length is bigger than the matched partial. |
+| matcher | language | Matcher replaces all occurrences of a given string regardless of its position and is also a kind of "partial normalization". |