
update readme for v0.7.0

Author: Thomas Wilkerling
Date: 2021-05-11 14:22:52 +02:00
parent 2de3f31fe6
commit addab0a3c3
3 changed files with 1521 additions and 232 deletions

README.md

```diff
@@ -13,7 +13,6 @@
 <h2></h2>
 ### FlexSearch v0.7.0 (Beta)
-=======
 Beta is now available. Please test the new version and post back issues and suggestions. The Beta will pushed to the master branch in 2 weeks.
@@ -29,13 +28,15 @@ Source Code v0.7.0-beta available here:<br>
 <h1></h1>
 <h3>Web's fastest and most memory-flexible full-text search library with zero dependencies.</h3>
+<a href="#installation">Installation Guide</a> &ensp;&bull;&ensp; <a href="#api">API Reference</a> &ensp;&bull;&ensp; <a href="#builds">Custom Builds</a> &ensp;&bull;&ensp; <a target="_blank" href="https://github.com/nextapps-de/flexsearch-server">Flexsearch Server</a> &ensp;&bull;&ensp; <a href="CHANGELOG.md">Changelog</a>
 When it comes to raw search speed <a href="https://raw.githack.com/nextapps-de/flexsearch/master/test/benchmark.html" target="_blank">FlexSearch outperforms every single searching library out there</a> and also provides flexible search capabilities like multi-field search, phonetic transformations or partial matching.
 Depending on the used <a href="#options">options</a> it also provides the <a href="#memory">most memory-efficient index</a>. FlexSearch introduce a new scoring algorithm called <a href="#contextual">"contextual index"</a> based on a <a href="#dictionary">pre-scored lexical dictionary</a> architecture which actually performs queries up to 1,000,000 times faster compared to other libraries.
 FlexSearch also provides you a non-blocking asynchronous processing model as well as web workers to perform any updates or queries on the index in parallel through dedicated balanced threads.
-FlexSearch Server is available here: <a target="_blank" href="https://github.com/nextapps-de/flexsearch-server">https://github.com/nextapps-de/flexsearch-server</a>.
+FlexSearch Server is available here:
+<a target="_blank" href="https://github.com/nextapps-de/flexsearch-server">https://github.com/nextapps-de/flexsearch-server</a>.
-<a href="#installation">Installation Guide</a> &ensp;&bull;&ensp; <a href="#api">API Reference</a> &ensp;&bull;&ensp; <a href="#builds">Custom Builds</a> &ensp;&bull;&ensp; <a target="_blank" href="https://github.com/nextapps-de/flexsearch-server">Flexsearch Server</a> &ensp;&bull;&ensp; <a href="CHANGELOG.md">Changelog</a>
 Supported Platforms:
 - Browser
```

doc/0.7.0-lang.md (new file, 258 lines)
## Documentation 0.7.0-rev2
### Language Handler
Handling languages was completely replaced by a more generic approach. All language-specific definitions have been excluded and optimized for maximum dead-code elimination when using a compiler/bundler. Each language consists of 5 definitions, which are divided into two groups:
1. Charset
   1. ___encode___, type: `function(string):string[]`
   2. ___rtl___, type: `boolean`
2. Language
   1. ___matcher___, type: `{string: string}`
   2. ___stemmer___, type: `{string: string}`
   3. ___filter___, type: `string[]`
The charset contains the encoding logic, the language contains the stemmer, stopword filter and matchers. Multiple language definitions can use the same charset encoder. This separation also lets you manage different language definitions for special use cases (e.g. names, cities, dialects/slang, etc.).
To fully describe a custom language __on the fly__ you need to pass:
```js
const index = FlexSearch({
// mandatory:
encode: (str) => [str],
// optionally:
rtl: false,
stemmer: {},
matcher: {},
filter: []
});
```
When passing no parameters, the `latin:default` schema is used by default.
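Assuming the `charset:variant` notation described further below, this default can presumably be made explicit:

```js
// A minimal sketch: both indexes should behave identically, because omitting
// all charset/language options falls back to the latin:default schema.
const index_implicit = FlexSearch();
const index_explicit = FlexSearch({ charset: "latin:default" });
```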
<table>
<tr></tr>
<tr>
<td>Field</td>
<td>Category</td>
<td>Description</td>
</tr>
<tr>
<td><b>encode</b></td>
<td>charset</td>
<td>The encoder function. Has to return an array of separated words (or an empty array).</td>
</tr>
<tr></tr>
<tr>
<td><b>rtl</b></td>
<td>charset</td>
<td>A boolean property which indicates right-to-left encoding.</td>
</tr>
<tr></tr>
<tr>
<td><b>filter</b></td>
<td>language</td>
<td>Filters are also known as "stopwords". They completely exclude words from being indexed.</td>
</tr>
<tr></tr>
<tr>
<td><b>stemmer</b></td>
<td>language</td>
<td>The stemmer removes word endings and is a kind of "partial normalization". A word ending is only matched when the word length is greater than the matched partial.</td>
</tr>
<tr></tr>
<tr>
<td><b>matcher</b></td>
<td>language</td>
<td>Matcher replaces all occurrences of a given string regardless of its position and is also a kind of "partial normalization".</td>
</tr>
</table>
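To make these fields concrete, here is a hedged sketch of custom definitions passed on the fly (the values are illustrative and not taken from the shipped language packs):

```js
const index = FlexSearch({
    // charset group:
    encode: (str) => str.toLowerCase().split(/\s+/), // must return an array of words
    rtl: false,
    // language group (illustrative values):
    filter: ["and", "or", "the"],      // stopwords, excluded from the index
    stemmer: { "ing": "", "ly": "" },  // word endings, stripped or replaced
    matcher: { "ä": "a", "ö": "o" }    // replaced at any position inside a word
});
```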
### 1. Language Packs: ES6 Modules
The simplest way to assign charset/language-specific encoding via modules is:
```js
import charset from "./dist/module/lang/latin/soundex.js";
import lang from "./dist/module/lang/en.js";
const index = FlexSearch({
charset: charset,
lang: lang
});
```
Just import the __default export__ of each module and assign them accordingly.
The fully qualified example from above is:
```js
import { encode, rtl, tokenize } from "./dist/module/lang/latin/soundex.js";
import { stemmer, filter, matcher } from "./dist/module/lang/en.js";
const index = FlexSearch({
encode: encode,
// assign forced tokenizer first:
tokenize: tokenize || "forward",
rtl: rtl,
stemmer: stemmer,
matcher: matcher,
filter: filter
});
```
The example above shows the standard interface which is at least exported from each charset/language.
__Note:__ Some of the encoder variants limit the use of the built-in tokenizer (e.g. soundex). To be safe, prioritize the forced tokenizer and fall back to your choice, e.g. `tokenize || "forward"`.
#### Encoder Variants
You remember the encoding variants like `simple`, `advanced`, `extra`, or `balanced`? These are also supported and provide several variants of encoding (which differ in performance and degree of normalization).
It is pretty straightforward to use an encoder variant:
```js
import advanced from "./dist/module/lang/latin/advanced.js";
import { encode } from "./dist/module/lang/latin/extra.js";
const index_advanced = FlexSearch({
// apply all definitions:
charset: advanced
});
const index_extra = FlexSearch({
// just apply the encoder:
encode: encode
});
```
#### Available Latin Encoders
1. default
2. simple
3. advanced
4. extra
5. balance
6. soundex
You can assign a charset by passing it during initialization, e.g. `charset: "latin"` for the default charset encoder or `charset: "latin:soundex"` for an encoder variant.
#### Dialect / Slang
Language definitions (especially matchers) could also be used to normalize dialect and slang of a specific language.
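For instance, a hedged sketch of slang normalization via a matcher (the terms are illustrative):

```js
// Map slang/shorthand spellings onto their canonical forms, so both the
// indexed text and the query are normalized to the same token.
const index = FlexSearch({
    matcher: {
        "thru": "through",
        "nite": "night"
    }
});
```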
### 2. Language Packs: ES5 Modules
You need to make the charset and/or language definitions available by one of the following:
1. All charset definitions are included in the `flexsearch.min.js` build by default, but no language-specific definitions are included
2. You can load the packages located in `/dist/lang/` (files refer to languages, folders are charsets)
3. You can make a custom build
When loading language packs, make sure that the library itself is loaded first:
```html
<script src="dist/flexsearch.min.js"></script>
<script src="dist/lang/latin/default.min.js"></script>
<script src="dist/lang/en.min.js"></script>
```
Because you are loading packs as external packages (non-ES6 modules), you have to initialize them via shortcuts:
```js
const index = FlexSearch({
charset: "latin:soundex",
lang: "en"
});
```
> Use the `charset:variant` notation to assign the charset and its variants. Just passing the charset without a variant will automatically resolve to `charset:default`.
You can also override existing definitions, e.g.:
```js
const index = FlexSearch({
charset: "latin",
lang: "en",
matcher: {}
});
```
Passed definitions will __not__ extend the default definitions; they will replace them. If you would like to extend a definition, create a new language file and include all the content.
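Accordingly, to "extend" the default English stopword filter you would have to pass the complete list yourself; a hedged sketch (the contents are illustrative, not the shipped defaults):

```js
// The passed filter replaces the default one entirely, so it has to contain
// the default stopwords plus your additions.
const my_filter = ["a", "an", "and", "or", "the", /* ...defaults... */ "customword"];
const index = FlexSearch({
    charset: "latin",
    lang: "en",
    filter: my_filter
});
```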
#### Encoder Variants
It is pretty straightforward to use an encoder variant:
```html
<script src="dist/flexsearch.min.js"></script>
<script src="dist/lang/latin/advanced.min.js"></script>
<script src="dist/lang/latin/extra.min.js"></script>
<script src="dist/lang/en.min.js"></script>
```
```js
const index_advanced = FlexSearch({
charset: "latin:advanced"
});
const index_extra = FlexSearch({
charset: "latin:extra"
});
```
Again, use the `charset:variant` notation to define the charset and its variants.
### Partial Tokenizer
In FlexSearch you can't provide your own partial tokenizer, because it is a direct dependency of the core unit. The built-in tokenizer of FlexSearch splits each word into chunks by different patterns (illustrated in the sketch after this list):
1. strict (supports contextual index)
2. forward
3. reverse / both
4. full
5. ngram (supports contextual index, coming soon)
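A hedged sketch of how these patterns presumably chunk the word "cats" (the exact chunks may differ between versions):

```js
// Illustrative chunking of the word "cats" by the built-in tokenizers:
// strict:  ["cats"]                         (whole word only)
// forward: ["c", "ca", "cat", "cats"]       (incremental prefixes)
// reverse: suffixes like ["s", "ts", "ats"] plus the forward chunks ("both")
// full:    every inner substring, e.g. "a", "at", "ats", ...
const index = FlexSearch({ tokenize: "forward" });
```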
### Language Processing Pipeline
This is the default pipeline provided by FlexSearch:
<p>
<img src="https://cdn.jsdelivr.net/gh/nextapps-de/flexsearch@45d02844dd65a43b0c46633c509762ae0446bb97/doc/pipeline.svg?2">
</p>
#### Custom Pipeline
First, take a look at the default pipeline in `src/common.js`. It is very simple and straightforward. The pipeline processes as a sort of inversion of control: the final encoder implementation has to handle charset- and also language-specific transformations. This workaround is what remained after many tests.
Inject the default pipeline, e.g.:
```js
this.pipeline(
/* string: */ str.toLowerCase(),
/* normalize: */ false,
/* split: */ split,
/* collapse: */ false
);
```
Use the pipeline schema from above to understand the iteration and the difference between pre-encoding and post-encoding. Stemmers and matchers need to be applied after charset normalization but before language transformations; the same applies to filters.
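A minimal, self-contained sketch of a custom encoder that respects this ordering (all definitions are illustrative, not the shipped ones):

```js
const filter = ["and", "the"];   // stopwords
const stemmer = { "ing": "" };   // word endings
const matcher = { "ä": "a" };    // partial replacements

export function encode(str){
    // 1. charset normalization
    str = str.toLowerCase();
    // 2. matcher: replace partials regardless of their position
    for(const partial in matcher){
        str = str.split(partial).join(matcher[partial]);
    }
    // 3. split into words and apply the stopword filter
    const words = str.split(/\s+/).filter((word) => !filter.includes(word));
    // 4. stemmer: strip word endings (only when the word is longer than the ending)
    return words.map((word) => {
        for(const ending in stemmer){
            if(word.length > ending.length && word.endsWith(ending)){
                return word.substring(0, word.length - ending.length) + stemmer[ending];
            }
        }
        return word;
    });
}
```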
Here is a good example of extending pipelines: `src/lang/latin/extra.js`, `src/lang/latin/advanced.js`, `src/lang/latin/simple.js`.
### How to contribute?
Search for your language in `src/lang/`. If it exists, you can extend it or provide variants (like dialect/slang). If the language doesn't exist, create a new file and check if any of the existing charsets (e.g. latin) fits your language. When no charset exists, you need to provide a charset as a base for the language.
A new charset should provide at least:
1. `encode` A function which normalizes the charset of the passed text content (removes special chars, applies lingual transformations, etc.) and __returns an array of separated words__. Stemmer, matcher or stopword filters also need to be applied here. When the language has no words, make sure to provide something similar, e.g. each Chinese sign could also be a "word". Don't return the whole text content without splitting.
2. `rtl` A boolean flag which indicates right-to-left encoding
Basically the charset just needs to provide an encoder function along with an indicator for right-to-left encoding:
```js
export function encode(str){ return [str] }
export const rtl = false;
```
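A matching language file would then export the three language-group definitions; a bare sketch (the filename `src/lang/xx.js` is hypothetical):

```js
// e.g. src/lang/xx.js: the language group that complements the charset
export const filter = [];   // stopwords of the language
export const stemmer = {};  // word-ending replacements
export const matcher = {};  // partial replacements
```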

File diff suppressed because it is too large