mirror of https://github.com/nextapps-de/flexsearch.git synced 2025-09-03 10:53:41 +02:00

extended result highlighting

This commit is contained in:
Thomas Wilkerling
2025-05-02 13:29:07 +02:00
parent 4add0aaf14
commit f774be9646
46 changed files with 9155 additions and 7257 deletions


@@ -1,258 +0,0 @@
## Documentation 0.7.0-rev2
### Language Handler
Language handling was completely replaced by a more generic approach. All language-specific definitions have been extracted and optimized for maximum dead-code elimination when using a compiler/bundler. Each language consists of 5 definitions, which are divided into two groups:
1. Charset
1. ___encode___, type: `function(string):string[]`
2. ___rtl___, type: `boolean`
2. Language
1. ___matcher___, type: `{string: string}`
2. ___stemmer___, type: `{string: string}`
3. ___filter___, type: `string[]`
The charset contains the encoding logic; the language contains the stemmer, the stopword filter and the matchers. Multiple language definitions can use the same charset encoder. This separation also lets you manage different language definitions for special use cases (e.g. names, cities, dialects/slang, etc.).
To fully describe a custom language __on the fly__ you need to pass:
```js
const index = FlexSearch({
// mandatory:
encode: (str) => [str],
// optionally:
rtl: false,
stemmer: {},
matcher: {},
filter: []
});
```
When no parameter is passed, the `latin:default` schema is used.
<table>
<tr></tr>
<tr>
<td>Field</td>
<td>Category</td>
<td>Description</td>
</tr>
<tr>
<td><b>encode</b></td>
<td>charset</td>
<td>The encoder function. Has to return an array of separated words (or an empty string).</td>
</tr>
<tr></tr>
<tr>
<td><b>rtl</b></td>
<td>charset</td>
<td>A boolean property which indicates right-to-left encoding.</td>
</tr>
<tr></tr>
<tr>
<td><b>filter</b></td>
<td>language</td>
<td>Filters are also known as "stopwords". They completely exclude the listed words from being indexed.</td>
</tr>
<tr></tr>
<tr>
<td><b>stemmer</b></td>
<td>language</td>
<td>The stemmer removes word endings and is a kind of "partial normalization". A word ending is only matched when the word is longer than the matched partial.</td>
</tr>
<tr></tr>
<tr>
<td><b>matcher</b></td>
<td>language</td>
<td>Matcher replaces all occurrences of a given string regardless of its position and is also a kind of "partial normalization".</td>
</tr>
</table>
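The following is a minimal sketch of how the three language definitions described in the table could act on a list of words. It is an illustration only, not FlexSearch's internal implementation: the helper names `applyMatcher`, `applyStemmer` and `normalizeSketch` are hypothetical, and the order of the steps (filter, then matcher, then stemmer) is an assumption.

```js
// Illustration only: hypothetical helpers, not part of the FlexSearch API.

// Matcher: replaces every occurrence of a string, regardless of its position.
function applyMatcher(word, matcher) {
  for (const [from, to] of Object.entries(matcher)) {
    word = word.split(from).join(to);
  }
  return word;
}

// Stemmer: replaces a word ending, but only when the word is longer
// than the matched ending.
function applyStemmer(word, stemmer) {
  for (const [ending, replacement] of Object.entries(stemmer)) {
    if (word.length > ending.length && word.endsWith(ending)) {
      return word.slice(0, word.length - ending.length) + replacement;
    }
  }
  return word;
}

// Filter: drops stopwords entirely (assumed here to run before the other steps).
function normalizeSketch(text, { matcher = {}, stemmer = {}, filter = [] } = {}) {
  const stopwords = new Set(filter);
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .filter((word) => !stopwords.has(word))
    .map((word) => applyStemmer(applyMatcher(word, matcher), stemmer));
}

const normalized = normalizeSketch("Photographing the dolphins", {
  filter: ["the"],
  matcher: { ph: "f" },
  stemmer: { ing: "", s: "" }
});
console.log(normalized); // [ 'fotograf', 'dolfin' ]
```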
### 1. Language Packs: ES6 Modules
The simplest way to assign charset/language-specific encoding via modules is:
```js
import charset from "./dist/module/lang/latin/soundex.js";
import lang from "./dist/module/lang/en.js";
const index = FlexSearch({
charset: charset,
lang: lang
});
```
Just import the __default export__ from each module and assign them accordingly.
The fully qualified equivalent of the example above is:
```js
import { encode, rtl, tokenize } from "./dist/module/lang/latin/soundex.js";
import { stemmer, filter, matcher } from "./dist/module/lang/en.js";
const index = FlexSearch({
encode: encode,
// assign forced tokenizer first:
tokenize: tokenize || "forward",
rtl: rtl,
stemmer: stemmer,
matcher: matcher,
filter: filter
});
```
The example above shows the standard interface, which every charset/language exports at a minimum.
__Note:__ Some of the encoder variants limit the use of the built-in tokenizers (e.g. soundex). To be safe, prioritize the forced tokenizer and fall back to your own choice, e.g. `tokenize || "forward"`.
#### Encoder Variants
Remember the encoder variants such as `simple`, `advanced`, `extra`, or `balance`? They are still supported and give you several encoding variants, which differ in performance and degree of normalization.
Using an encoder variant is pretty straightforward:
```js
import advanced from "./dist/module/lang/latin/advanced.js";
import { encode } from "./dist/module/lang/latin/extra.js";
const index_advanced = FlexSearch({
// apply all definitions:
charset: advanced
});
const index_extra = FlexSearch({
// just apply the encoder:
encode: encode
});
```
#### Available Latin Encoders
1. default
2. simple
3. advanced
4. extra
5. balance
6. soundex
You can assign a charset by passing it during initialization, e.g. `charset: "latin"` for the default charset encoder or `charset: "latin:soundex"` for an encoder variant.
#### Dialect / Slang
Language definitions (especially matchers) can also be used to normalize the dialect and slang of a specific language.
### 2. Language Packs: ES5 Modules
You need to make the charset and/or language definitions available. Note the following:
1. All charset definitions are included in the `flexsearch.min.js` build by default, but no language-specific definitions are included
2. You can load packages located in `/dist/lang/` (files refer to languages, folders are charsets)
3. You can make a custom build
When loading language packs, make sure the library is loaded first:
```html
<script src="dist/flexsearch.min.js"></script>
<script src="dist/lang/latin/default.min.js"></script>
<script src="dist/lang/en.min.js"></script>
```
Because you are loading the packs as external packages (non-ES6 modules), you have to initialize them by their shortcuts:
```js
const index = FlexSearch({
charset: "latin:soundex",
lang: "en"
});
```
> Use the `charset:variant` notation to assign a charset and its variant. Passing just the charset without a variant automatically resolves to `charset:default`.
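
As a rough sketch of the resolution rule in the note above, the notation could be normalized as follows. The function `resolveCharset` is a hypothetical helper written for this illustration; it is not a FlexSearch export.

```js
// Illustration only: hypothetical helper, not part of the FlexSearch API.
// Expands "charset" to "charset:default" and leaves "charset:variant" untouched.
function resolveCharset(name) {
  const [charset, variant] = name.split(":");
  return `${charset}:${variant || "default"}`;
}

console.log(resolveCharset("latin"));         // latin:default
console.log(resolveCharset("latin:soundex")); // latin:soundex
```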
You can also override existing definitions, e.g.:
```js
const index = FlexSearch({
charset: "latin",
lang: "en",
matcher: {}
});
```
Passed definitions will __not__ extend the default definitions; they will replace them. If you want to extend a definition, create a new language file and put all the content in it.
#### Encoder Variants
Using an encoder variant is pretty straightforward:
```html
<script src="dist/flexsearch.min.js"></script>
<script src="dist/lang/latin/advanced.min.js"></script>
<script src="dist/lang/latin/extra.min.js"></script>
<script src="dist/lang/en.min.js"></script>
```
```js
const index_advanced = FlexSearch({
charset: "latin:advanced"
});
const index_extra = FlexSearch({
charset: "latin:extra"
});
```
Again, use the `charset:variant` notation to define the charset and its variant.
### Partial Tokenizer
In FlexSearch you can't provide your own partial tokenizer, because it is a direct dependency of the core unit. The built-in tokenizer of FlexSearch splits each word into chunks according to one of these patterns:
1. strict (supports contextual index)
2. forward
3. reverse / both
4. full
5. ngram (supports contextual index, coming soon)
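To make the patterns more concrete, here is a small sketch of the chunks each mode would produce for a single word. It assumes the commonly documented meaning of the modes (strict = whole word, forward = prefixes, reverse/both = prefixes and suffixes, full = every substring) and ignores options such as a minimum length. The function `partials` is hypothetical and not the library's tokenizer; `ngram` is left out because it is not described here.

```js
// Illustration only: hypothetical helper, not FlexSearch's built-in tokenizer.
function partials(word, mode) {
  const out = new Set();
  switch (mode) {
    case "strict":
      out.add(word);
      break;
    case "forward":
      for (let end = 1; end <= word.length; end++) out.add(word.slice(0, end));
      break;
    case "reverse":
      // prefixes first, then the remaining suffixes
      for (let end = 1; end <= word.length; end++) out.add(word.slice(0, end));
      for (let start = 1; start < word.length; start++) out.add(word.slice(start));
      break;
    case "full":
      for (let start = 0; start < word.length; start++) {
        for (let end = start + 1; end <= word.length; end++) {
          out.add(word.slice(start, end));
        }
      }
      break;
    default:
      throw new Error(`unknown mode: ${mode}`);
  }
  return [...out];
}

console.log(partials("cat", "strict"));  // [ 'cat' ]
console.log(partials("cat", "forward")); // [ 'c', 'ca', 'cat' ]
console.log(partials("cat", "reverse")); // [ 'c', 'ca', 'cat', 'at', 't' ]
console.log(partials("cat", "full"));    // [ 'c', 'ca', 'cat', 'a', 'at', 't' ]
```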
### Language Processing Pipeline
This is the default pipeline provided by FlexSearch:
<p>
<img src="https://cdn.jsdelivr.net/gh/nextapps-de/flexsearch@45d02844dd65a43b0c46633c509762ae0446bb97/doc/pipeline.svg?2">
</p>
#### Custom Pipeline
First, take a look at the default pipeline in `src/common.js`. It is very simple and straightforward. The pipeline works as a kind of inversion of control: the final encoder implementation has to handle the charset and also the language-specific transformations. This workaround is left over from many tests.
Inject the default pipeline by e.g.:
```js
this.pipeline(
/* string: */ str.toLowerCase(),
/* normalize: */ false,
/* split: */ split,
/* collapse: */ false
);
```
Use the pipeline schema from above to understand the iteration and the difference between pre-encoding and post-encoding. Stemmers and matchers need to be applied after charset normalization but before language transformations, and the same applies to filters.
Good examples of extending pipelines are `src/lang/latin/extra.js`, `src/lang/latin/advanced.js` and `src/lang/latin/simple.js`.
### How to contribute?
Search for your language in `src/lang/`. If it exists, you can extend it or provide variants (like dialect/slang). If the language doesn't exist, create a new file and check whether any of the existing charsets (e.g. latin) fits your language. If no charset exists, you need to provide a charset as a base for the language.
A new charset should provide at least:
1. `encode` A function which normalizes the charset of the passed text content (removes special chars, applies lingual transformations, etc.) and __returns an array of separated words__. The stemmer, matcher or stopword filter also need to be applied here. When the language has no words, make sure to provide something similar, e.g. each Chinese character could also be a "word". Don't return the whole text content without splitting it.
2. `rtl` A boolean flag which indicates right-to-left encoding
Basically, the charset just needs to provide an encoder function along with an indicator for right-to-left encoding:
```js
export function encode(str){ return [str] }
export const rtl = false;
```
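The stub above returns the whole string as a single word, which is the minimum the interface accepts. A slightly more realistic sketch, under the assumption that lowercasing, stripping diacritics and splitting on non-letter characters is a reasonable normalization for your language, could look like this. Both `encodeWords` and `encodeSigns` are hypothetical names chosen for this illustration; in a charset module they would take the place of the exported `encode`.

```js
// Illustration only: hypothetical encoders, not shipped with FlexSearch.

// For languages with word boundaries: normalize, then split into words.
function encodeWords(str) {
  return str
    .toLowerCase()
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining diacritics
    .split(/[^\p{L}\p{N}]+/u)        // split on anything that is not a letter or digit
    .filter(Boolean);
}

// For languages without word boundaries: treat every sign as a "word".
function encodeSigns(str) {
  return Array.from(str.replace(/\s+/g, ""));
}

console.log(encodeWords("Crème brûlée, s'il vous plaît!"));
// [ 'creme', 'brulee', 's', 'il', 'vous', 'plait' ]
console.log(encodeSigns("你好 世界"));
// [ '你', '好', '世', '界' ]
```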

File diff suppressed because it is too large


@@ -1,6 +1,9 @@
## Custom Builds
The `/src/` folder of this repository requires some compilation to resolve the build flags. Those are your options:
The `/src/` folder of this repository requires some compilation to resolve the build flags.
<!--
Those are your options:
- Closure Compiler (Advanced Compilation)
- Babel + Plugin `babel-plugin-conditional-compile`
@@ -11,6 +14,7 @@ You can't resolve build flags with:
- esbuild
- rollup
- Terser
-->
You can run any of the basic builds located in the `/dist/` folder, e.g.:
@@ -62,115 +66,123 @@ The custom build will be saved to `dist/flexsearch.custom.xxxx.min.js` or when f
<tr>
<td>SUPPORT_WORKER</td>
<td>true, false</td>
<td></td>
<td>Worker Indexes</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_ENCODER</td>
<td>true, false</td>
<td></td>
<td>When not included you'll need to pass a custom <code>encode</code> method when creating an index</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_CHARSET</td>
<td>true, false</td>
<td></td>
<td>Includes: <code>LatinBalance</code>, <code>LatinAdvanced</code>, <code>LatinExtra</code>, <code>LatinSoundex</code></td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_CACHE</td>
<td>true, false</td>
<td></td>
<td>Support for <code>index.searchCache()</code></td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_ASYNC</td>
<td>true, false</td>
<td>Asynchronous Rendering (support Promises)</td>
<td>The async version of index standard methods</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_STORE</td>
<td>true, false</td>
<td></td>
<td>Document Datastore</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_SUGGESTION</td>
<td>true, false</td>
<td></td>
<td>Use the option <code>suggestions</code> when searching</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_SERIALIZE</td>
<td>true, <b>false</b></td>
<td></td>
<td>Export / Import / Serialize Index</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_DOCUMENT</td>
<td>true, false</td>
<td></td>
<td>Document Indexes</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_TAGS</td>
<td>true, false</td>
<td></td>
<td>Tag-Search</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_PERSISTENT</td>
<td>true, false</td>
<td></td>
<td>Use any of the persistent indexes</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_KEYSTORE</td>
<td>true, false</td>
<td></td>
<td>Extended size for InMemory indexes</td>
</tr>
<!--
<tr></tr>
<tr>
<td>SUPPORT_COMPRESSION</td>
<td>true, false</td>
<td></td>
</tr>
-->
<tr></tr>
<tr>
<td>SUPPORT_RESOLVER</td>
<td>true, false</td>
<td></td>
<td>Apply complex queries by chaining boolean operations</td>
</tr>
<tr></tr>
<tr>
<td>SUPPORT_HIGHLIGHTING</td>
<td>true, false</td>
<td>Result Highlighting for Document-Search (also requires <code>SUPPORT_STORE</code>)</td>
</tr>
<tr>
<td colspan="3"><br><b>Compiler Flags</b></td>
</tr>
<tr>
<td>DEBUG</td>
<td>true, <b>false</b></td>
<td>Output debug information to the console (default: false)</td>
<td>true, false</td>
<td>Apply common checks and throw errors more frequently, output debug information and helpful hints to the console</td>
</tr>
<tr></tr>
<tr>
<td>RELEASE<br><br><br><br><br></td>
<td><b>custom</b><br>custom.module<br>bundle<br>bundle.module<br>es5<br>light<br>compact</td>
<td></td>
<td>RELEASE</td>
<td>custom<br>custom.module</td>
<td>Choose build schema: custom = Legacy Browser (<code>window.FlexSearch</code>), custom.module = ES6 Modules (ESM)</td>
</tr>
<tr></tr>
<tr>
<td>POLYFILL</td>
<td>true, <b>false</b></td>
<td>Include Polyfills (based on LANGUAGE_OUT)</td>
<td>true, false</td>
<td>Include Polyfills (based on <code>LANGUAGE_OUT</code>)</td>
</tr>
<tr></tr>
<tr>
<td>PROFILER</td>
<td>true, <b>false</b></td>
<td>true, false</td>
<td>Just used for automatic performance tests</td>
</tr>
<tr></tr>
<tr>
<td>LANGUAGE_OUT<br><br><br><br><br><br><br><br><br><br><br></td>
<td>LANGUAGE_OUT</td>
<td>ECMASCRIPT3<br>ECMASCRIPT5<br>ECMASCRIPT_2015<br>ECMASCRIPT_2016<br>ECMASCRIPT_2017<br>ECMASCRIPT_2018<br>ECMASCRIPT_2019<br>ECMASCRIPT_2020<br>ECMASCRIPT_2021<br>ECMASCRIPT_2022<br>ECMASCRIPT_NEXT<br>STABLE</td>
<td>Target language</td>
</tr>


@@ -4,12 +4,13 @@ Demo: <a href="https://raw.githack.com/nextapps-de/flexsearch/master/demo/autoco
> Result highlighting can only be enabled when using a `Document` index with an enabled document store, i.e. by passing the option `store` on creation.
Alternatively simply upgrade id-content-pairs to a flat document on-the-fly when calling `.add()`.
Alternatively you can simply upgrade id-content-pairs to a flat document when calling `.add(...)`.
```js
// create the document index
// 1. create the document index
const index = new Document({
document: {
// using store is required
store: true,
index: [{
field: "title",
@@ -19,7 +20,7 @@ const index = new Document({
}
});
// add data
// 2. add data
index.add({
"id": 1,
"title": "Carmencita"
@@ -29,38 +30,297 @@ index.add({
"title": "Le clown et ses chiens"
});
// perform a query
// 3. perform a query
const result = index.search({
query: "karmen or clown or not found",
// also get results when query has no exact match
suggest: true,
// set enrich to true (required)
enrich: true,
// highlight template
// $1 is a placeholder for the matched partial
highlight: "<b>$1</b>"
// use highlighting options or pass a template, where $1 is
// a placeholder for the matched partial
highlight: "<b>$1</b>",
// optionally pick and apply search to just
// one field and get back a flat result
pluck: "title"
});
```
The result will look like:
```js
```json
[{
"field": "title",
"result": [{
"id": 1,
"doc": {
"id": 1,
"title": "Carmencita"
},
"highlight": "<b>Carmen</b>cita"
},{
"id": 2,
"doc": {
"id": 2,
"title": "Le clown et ses chiens"
},
"highlight": "Le <b>clown</b> et ses chiens"
}
]
"id": 1,
"highlight": "<b>Carmen</b>cita"
},{
"id": 2,
"highlight": "Le <b>clown</b> et ses chiens"
}]
```
There are several options to customize result highlighting.
### Highlighting Options
<table>
<tr></tr>
<tr>
<td>Option</td>
<td>Values</td>
<td>Description</td>
<td>Default</td>
</tr>
<tr>
<td><code>template</code></td>
<td>
String
</td>
<td>The template to be applied on matches (e.g. <code>"&lt;b>$1&lt;/b>"</code>), where <code>$1</code> is a placeholder for the matched partial</td>
<td style="font-style: italic">(mandatory)</td>
</tr>
<tr></tr>
<tr>
<td><code>boundary</code></td>
<td>
<a href="#highlighting-boundary-options">Boundary Options</a><br>
Number
</td>
<td>Limit the total length of highlighted content (an ellipsis is added by default). The template markup does not count toward the total length.</td>
<td><code>false</code></td>
</tr>
<tr></tr>
<tr>
<td><code>ellipsis</code></td>
<td>
<a href="#highlighting-ellipsis-options">Ellipsis Options</a><br>
Boolean<br>
String
</td>
<td>
Define a custom ellipsis or disable
</td>
<td><code>"..."</code></td>
</tr>
<tr></tr>
<tr>
<td><code>merge</code></td>
<td>
Boolean
</td>
<td>Wrap consecutive matches by just a single template</td>
<td><code>false</code></td>
</tr>
<tr></tr>
<tr>
<td><code>clip</code></td>
<td>
Boolean
</td>
<td>Allow terms to be clipped</td>
<td><code>true</code></td>
</tr>
<tr>
<td colspan="4"><a id="highlighting-boundary-options"></a>Boundary Options</td>
</tr>
<tr>
<td><code>boundary.total</code></td>
<td>
Number
</td>
<td>Limit the total length of highlighted content</td>
<td><code>false</code></td>
</tr>
<tr></tr>
<tr>
<td><code>boundary.before</code></td>
<td>
Number
</td>
<td>Limit the length of content before highlighted parts</td>
<td style="font-style: italic">(auto)</td>
</tr>
<tr></tr>
<tr>
<td><code>boundary.after</code></td>
<td>
Number
</td>
<td>Limit the length of content after highlighted parts</td>
<td style="font-style: italic">(auto)</td>
</tr>
<tr>
<td colspan="4"><a id="highlighting-ellipsis-options"></a>Ellipsis Options</td>
</tr>
<tr>
<td><code>ellipsis.template</code></td>
<td>
String
</td>
<td>The template to be applied on ellipsis (e.g. <code>"&lt;i>$1&lt;/i>"</code>), where <code>$1</code> is a placeholder for the ellipsis</td>
<td style="font-style: italic">(mandatory)</td>
</tr>
<tr></tr>
<tr>
<td><code>ellipsis.pattern</code></td>
<td>
Boolean<br>
String
</td>
<td>
Define a custom ellipsis or disable
</td>
<td><code>"..."</code></td>
</tr>
</table>
### Boundaries & Alignment
You can limit the length of the highlighted content and also define a custom ellipsis.
By default, all matches are automatically aligned to fit into the total size. You can customize these boundaries by also passing limits for the surrounding text.
Add some content to the index:
```js
index.add({
"id": 1,
"title": "Lorem ipsum dolor sit amet consetetur sadipscing elitr."
});
```
Perform a highlighted search (no boundaries):
```js
const result = index.search({
query: "sit amet",
highlight: "<b>$1</b>"
});
```
Result:
```js
"Lorem ipsum dolor <b>sit</b> <b>amet</b> consetetur sadipscing elitr."
```
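To illustrate what the `$1` placeholder does, the following sketch wraps the matched beginning of each word in the template. It is a simplified stand-in, not FlexSearch's highlighting algorithm: the function `highlightSketch` is hypothetical, matches only at the start of a word, and does not apply any encoder normalization.

```js
// Illustration only: hypothetical helper, not FlexSearch's highlighter.
function highlightSketch(content, query, template) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return content
    .split(" ")
    .map((word) => {
      const lower = word.toLowerCase();
      const term = terms.find((t) => lower.startsWith(t));
      if (!term) return word;
      // keep the original casing of the matched partial
      const matched = word.slice(0, term.length);
      // a replacer function avoids special handling of "$" in the matched text
      return template.replace("$1", () => matched) + word.slice(term.length);
    })
    .join(" ");
}

console.log(highlightSketch(
  "Lorem ipsum dolor sit amet consetetur sadipscing elitr.",
  "sit amet",
  "<b>$1</b>"
));
// Lorem ipsum dolor <b>sit</b> <b>amet</b> consetetur sadipscing elitr.
```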
___
#### Limit total boundary
```js
const result = index.search({
query: "sit amet",
highlight: {
template: "<b>$1</b>",
boundary: 32
}
});
```
> The highlight markup does not count toward the total length.
>
Result:
```js
"...um dolor <b>sit</b> <b>amet</b> consetet..."
```
___
#### Define custom ellipsis (text)
```js
const result = index.search({
query: "sit amet",
highlight: {
template: "<b>$1</b>",
boundary: 32,
ellipsis: "[...]"
}
});
```
Result:
```js
"[...] dolor <b>sit</b> <b>amet</b> conset[...]"
```
You can also apply `""` or `false` to remove ellipsis.
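The arithmetic behind the two results above can be sketched as follows: the total boundary is reduced by the length of the matched span and by both ellipses, and the remaining room is split evenly between the text before and after the match. This reproduces the documented outputs for this example, but it is an assumption about the alignment, not the library's actual algorithm. The function `clipSketch` is hypothetical and expects lowercase terms that occur literally in the content.

```js
// Illustration only: hypothetical helper, not FlexSearch's implementation.
function clipSketch(content, terms, { template, boundary, ellipsis = "..." }) {
  // find the span from the first to the last matched term
  let start = Infinity;
  let end = -1;
  for (const term of terms) {
    const at = content.indexOf(term);
    if (at < 0) continue;
    start = Math.min(start, at);
    end = Math.max(end, at + term.length);
  }
  if (end < 0) return content;

  // room left for surrounding text; the markup is not counted
  const room = boundary - (end - start) - 2 * ellipsis.length;
  const side = Math.max(0, Math.floor(room / 2));
  const from = Math.max(0, start - side);
  const to = Math.min(content.length, end + side);

  // wrap each term in the template after clipping
  const mark = (text) => terms.reduce(
    (out, term) => out.split(term).join(template.replace("$1", () => term)),
    text
  );
  return (from > 0 ? ellipsis : "") + mark(content.slice(from, to)) +
    (to < content.length ? ellipsis : "");
}

const text = "Lorem ipsum dolor sit amet consetetur sadipscing elitr.";
console.log(clipSketch(text, ["sit", "amet"], { template: "<b>$1</b>", boundary: 32 }));
// ...um dolor <b>sit</b> <b>amet</b> consetet...
console.log(clipSketch(text, ["sit", "amet"], { template: "<b>$1</b>", boundary: 32, ellipsis: "[...]" }));
// [...] dolor <b>sit</b> <b>amet</b> conset[...]
```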
___
#### Do not clip terms
```js
const result = index.search({
query: "sit amet",
highlight: {
template: "<b>$1</b>",
boundary: 32,
clip: false
}
});
```
Result:
```js
"... dolor <b>sit</b> <b>amet</b> ..."
```
---
#### Merge consecutive matches
```js
const result = index.search({
query: "sit amet",
highlight: {
template: "<b>$1</b>",
boundary: 32,
merge: true
}
});
```
Result:
```js
"...um dolor <b>sit amet</b> consetet..."
```
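One way to picture the effect of `merge` is as a post-processing step that joins two adjacent highlights separated by a single space. The sketch below does exactly that on an already highlighted string. It is only meant to show the difference in output: the function `mergeSketch` is hypothetical, and the library may well merge matches before the template is applied.

```js
// Illustration only: hypothetical helper, not FlexSearch's implementation.
function mergeSketch(highlighted, template) {
  // split the template into its opening and closing markup
  const [open, close] = template.split("$1");
  // "</b> <b>" becomes " ", so consecutive matches share one wrapper
  return highlighted.split(close + " " + open).join(" ");
}

console.log(mergeSketch("...um dolor <b>sit</b> <b>amet</b> consetet...", "<b>$1</b>"));
// ...um dolor <b>sit amet</b> consetet...
```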
---
#### Limit surrounding text
> Each of the boundary limits is optional. Combine them as needed.
```js
const result = index.search({
query: "sit amet",
highlight: {
template: "<b>$1</b>",
boundary: {
// length before match
before: 3,
// length after match
after: 15,
// overall length
total: 32
}
}
});
```
Result:
```js
"...or <b>sit</b> <b>amet</b> consetetur sad..."
```
#### Use custom ellipsis (markup)
When using markup within `ellipsis`, the markup length counts toward the total boundary. To apply the total boundary properly, you can also provide a `template` for the ellipsis, so that the markup length is not counted.
```js
const result = index.search({
query: "sit amet",
highlight: {
template: "<b>$1</b>",
// limit the total length to 32 chars
boundary: 32,
ellipsis: {
// pass a template, where $1 is
// a placeholder for the ellipsis
template: "<i>$1</i>",
// define custom ellipsis
pattern: "..."
}
}
});
```
Result:
```js
"<i>...</i> dolor <b>sit</b> <b>amet</b> conset<i>...</i>"
```