Creates a new Kiwi instance. Note: even though this method is async, the construction of the Kiwi instance happens in the same JavaScript context. This means that this method can hang your application if it is not called in a worker.
Returns a Kiwi instance that is ready for morphological analysis.
Static create: Creates a new KiwiBuilder instance. This internally loads the wasm file.
Path to the kiwi-wasm.wasm file. This is located at /dist/kiwi-wasm.wasm in the npm package.
It is up to the user to serve this file. See the package-demo project for an example of how to include this file as a static asset with vite.
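For example, assuming the wasm file has been copied into the application's static assets (the URL below is an assumption about your setup):

import { KiwiBuilder } from 'kiwi-nlp';

// The wasm file was copied from node_modules/kiwi-nlp/dist/ into the
// application's static assets; adjust the URL to wherever you serve it.
const builder = await KiwiBuilder.create('/assets/kiwi-wasm.wasm');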
Describes matching options used when performing morphological analysis. These options can be combined using the bitwise OR operator.
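A minimal sketch of how the flags are combined; only Match.allWithNormalizing appears in the usage example further down this page, so any other member name should be checked against the enum shipped with the package:

import { Match } from 'kiwi-nlp';

// Match values are bit flags; combine them with | and pass the result as the
// matchOptions argument of analyze/tokenize (see the usage example below).
// With a single flag, the combination is just that flag.
const matchOptions = Match.allWithNormalizing; // e.g. Match.flagA | Match.flagB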
In addition to the requirements of the main project, you need to install Emscripten and npm.
To build the package, run ./build.sh. This is currently only supported on Linux and macOS; on Windows you can run the build script using WSL.
You can pass the --demo flag to build the demo in package-demo as well. If you pass --demo-dev, a development server for the demo will be started.
Running the above command also automatically upgrades the package version if it doesn't match the version in the main project.
The documentation for the package can be generated by running npm run doc inside the package directory.
The main entry point for the API is KiwiBuilder, which is used to create Kiwi instances.
import { KiwiBuilder, Match } from 'kiwi-nlp';

async function example() {
    const builder = await KiwiBuilder.create('path to kiwi-wasm.wasm');
    const kiwi = await builder.build({
        modelFiles: {
            'combiningRule.txt': '/path/to/model/combiningRule.txt',
            'default.dict': '/path/to/model/default.dict',
            'extract.mdl': '/path/to/model/extract.mdl',
            'multi.dict': '/path/to/model/multi.dict',
            'sj.knlm': '/path/to/model/sj.knlm',
            'sj.morph': '/path/to/model/sj.morph',
            'skipbigram.mdl': '/path/to/model/skipbigram.mdl',
            'typo.dict': '/path/to/model/typo.dict',
        }
    });
    const tokens = kiwi.analyze('다음은 예시 텍스트입니다.', Match.allWithNormalizing);
    /* Output: {
        "score": -39.772212982177734,
        "tokens": [
            {
                "length": 2,
                "lineNumber": 0,
                "pairedToken": 4294967295,
                "position": 0,
                "score": -6.5904083251953125,
                "sentPosition": 0,
                "str": "다음",
                "subSentPosition": 0,
                "tag": "NNG",
                "typoCost": 0,
                "typoFormId": 0,
                "wordPosition": 0
            },
            {
                "length": 1,
                "lineNumber": 0,
                "pairedToken": 4294967295,
                "position": 2,
                "score": -1.844599723815918,
                "sentPosition": 0,
                "str": "은",
                "subSentPosition": 0,
                "tag": "JX",
                "typoCost": 0,
                "typoFormId": 0,
                "wordPosition": 0
            },
            ...
        ]
    } */
}
Optional integrateAllomorph: If true, unify phonological variants. Outputs endings that change form depending on the positivity or negativity of the preceding vowel, such as /아/ and /어/ or /았/ and /었/, as a single form. Defaults to true.
Optional loadDefaultDict: If true, the default dictionary is loaded. The default dictionary consists of proper noun headings extracted from Wikipedia and Namuwiki. Defaults to true.
Optional loadMultiDict: If true, the built-in polysemous dictionary is loaded. The polysemous dictionary consists of proper nouns listed in WikiData. Defaults to true.
Optional loadTypoDict: If true, the built-in typo dictionary is loaded. The typo dictionary consists of a subset of common misspellings and variant endings that are commonly used on the internet. Defaults to true.
modelFiles: The model files to load. Required.
Optional modelType: Specifies the language model to use for morphological analysis. Defaults to 'knlm'.
knlm: Fast, and models the relationships between morphemes within a short distance (usually two or three) with high accuracy. However, it cannot take relationships between morphemes over a long distance into account.
sbg: Internally calibrates the results of a SkipBigram model against the results of KNLM. For a processing time increase of about 30% compared to KNLM, it can model relationships between morphemes over long distances (up to 8 real morphemes) with moderate accuracy.
Optional preanalyzedWords: Preanalyzed words to load.
Optional typoCostThreshold: The maximum typo cost to consider when correcting typos. Typos beyond this cost will not be explored. Defaults to 2.5.
Optional typos: The typo information to use for correction. Can be one of the built-in none, basic, continual, or basicWithContinual typo sets, or a custom TypoTransformer. Defaults to none, which disables typo correction.
Optional userDicts: Additional user dictionaries to load. The files used must appear in the modelFiles object.
Optional userWords: Additional user words to load. A sketch combining several of these build options follows below.
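A sketch that combines several of the options above in one build call. The modelFiles shape is taken from the usage example near the top of this page; the other option names are written as reconstructed above and should be verified against the package's type definitions:

const kiwi = await builder.build({
    modelFiles: {
        'sj.knlm': '/path/to/model/sj.knlm',
        'sj.morph': '/path/to/model/sj.morph',
        // ...the remaining model files, as in the example near the top of this page
    },
    modelType: 'sbg',            // longer-range context for ~30% more processing time
    typos: 'basicWithContinual', // enable one of the built-in typo sets
    typoCostThreshold: 2.5,      // do not explore typos beyond this cost
});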
Interface that performs the actual morphological analysis. It cannot be constructed directly; use KiwiBuilder to create a new instance.
Performs morphological analysis. Returns a single list of tokens along with an analysis score. Use tokenize if the result score is not needed. Use analyzeTopN if you need multiple results.
String to analyze
Optional matchOptions: Match. Specifies which special string patterns are extracted. This can be set to any combination of Match values by using the bitwise OR operator.
Optional blockList: number | Morph[]. Specifies a list of morphemes to prohibit from appearing as candidates in the analysis.
Optional pretokenized: PretokenizedSpan[]. Predefines the result of morphological analysis for specific segments of the text prior to analysis. The sections of text defined by this value will always be tokenized in that way only.
Returns a single TokenResult object.
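For example, using the kiwi instance built in the example near the top of this page (the field names match the JSON output shown there):

const result = kiwi.analyze('다음은 예시 텍스트입니다.', Match.allWithNormalizing);
console.log(result.score); // analysis score of the returned result
// Print each morpheme as form/tag, e.g. 다음/NNG 은/JX ...
console.log(result.tokens.map((t) => `${t.str}/${t.tag}`).join(' '));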
Performs morphological analysis. Returns multiple lists of tokens, each along with an analysis score. Use tokenizeTopN if the result scores are not needed. Use analyze if you need only one result.
String to analyze
Number of results to return.
Optional matchOptions: Match. Specifies which special string patterns are extracted. This can be set to any combination of Match values by using the bitwise OR operator.
Optional blockList: number | Morph[]. Specifies a list of morphemes to prohibit from appearing as candidates in the analysis.
Optional pretokenized: PretokenizedSpan[]. Predefines the result of morphological analysis for specific segments of the text prior to analysis. The sections of text defined by this value will always be tokenized in that way only.
Returns a list of TokenResult objects.
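A sketch of requesting several candidate analyses at once, with the same kiwi instance:

// Ask for the three best analyses of an ambiguous sentence; each candidate
// carries its own score.
const candidates = kiwi.analyzeTopN('나는 밤을 좋아해', 3, Match.allWithNormalizing);
for (const { score, tokens } of candidates) {
    console.log(score, tokens.map((t) => `${t.str}/${t.tag}`).join(' '));
}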
Creates a reusable morpheme set from a list of morphemes. This is intended to be used as the blockList parameter of the analyze and tokenize methods. NOTE: the morpheme set must be destroyed using destroyMorphemeSet when it is no longer needed; otherwise it will cause a memory leak. If you are using the morpheme set only once, you can pass the morpheme list directly to the blockList parameter instead of creating a morpheme set.
List of morphemes to create a set from.
Returns a handle to the created morpheme set.
Destroys a morpheme set created by createMorphemeSet.
Handle to the morpheme set to destroy.
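A sketch of the intended lifecycle, again with the kiwi instance from the example above. The handle-based flow (create, pass as blockList, destroy) is documented here; the shape of an individual morpheme (a form/tag pair) is an assumption, not something this page specifies:

// Block the verb 먹/VV from appearing in analysis candidates.
const blocked = kiwi.createMorphemeSet([{ form: '먹', tag: 'VV' }]);
try {
    const a = kiwi.tokenize('사과를 먹었다', Match.allWithNormalizing, blocked);
    const b = kiwi.tokenize('밥을 먹는다', Match.allWithNormalizing, blocked);
    console.log(a.length, b.length);
} finally {
    // The handle must be released explicitly, otherwise the set leaks memory.
    kiwi.destroyMorphemeSet(blocked);
}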
Tells you whether the current Kiwi object was created with typo correction turned on.
Returns true if typo correction is turned on.
Combines morphemes and restores them into a sentence. Endings are changed to the appropriate form to match the preceding morpheme.
List of morphemes to combine.
Optional lmSearch: boolean. When there is an ambiguous morpheme that can be restored in more than one form, if this value is true, the language model is explored to select the best form. If false, no exploration is performed, but restoration is faster.
Optional withRanges: boolean. Whether to include the ranges of the morphemes in the returned SentenceJoinResult object.
Tells whether the current Kiwi object is ready to perform morphological analysis.
Returns true if it is ready for morphological analysis.
Returns the input text split into sentences. This method performs morphological analysis internally during the sentence splitting process, so it can also be used to obtain analysis results at the same time as splitting sentences.
String to split.
Optional matchOptions: Match. Specifies which special string patterns are extracted. This can be set to any combination of Match values by using the bitwise OR operator.
Optional withTokenResult: boolean. Specifies whether to include the result of morphological analysis in the returned SentenceSplitResult object.
Returns a SentenceSplitResult object.
Performs morphological analysis. Returns a single list of tokens. Use analyze if the result score is needed. Use tokenizeTopN if you need multiple results.
String to analyze
Optional matchOptions: Match. Specifies which special string patterns are extracted. This can be set to any combination of Match values by using the bitwise OR operator.
Optional blockList: number | Morph[]. Specifies a list of morphemes to prohibit from appearing as candidates in the analysis.
Optional pretokenized: PretokenizedSpan[]. Predefines the result of morphological analysis for specific segments of the text prior to analysis. The sections of text defined by this value will always be tokenized in that way only.
Returns a list of TokenInfo objects.
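A minimal sketch, using the kiwi instance from the example near the top of this page:

// tokenize returns the token list directly, without an overall analysis score.
const tokens = kiwi.tokenize('형태소 분석 예시입니다.', Match.allWithNormalizing);
for (const t of tokens) {
    console.log(t.str, t.tag, t.position, t.length);
}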
Performs morphological analysis. Returns multiple lists of tokens. Use analyzeTopN if the result scores are needed. Use tokenize if you need only one result.
String to analyze
Number of results to return.
Optional matchOptions: Match. Specifies which special string patterns are extracted. This can be set to any combination of Match values by using the bitwise OR operator.
Optional blockList: number | Morph[]. Specifies a list of morphemes to prohibit from appearing as candidates in the analysis.
Optional pretokenized: PretokenizedSpan[]. Predefines the result of morphological analysis for specific segments of the text prior to analysis. The sections of text defined by this value will always be tokenized in that way only.
Returns a list of lists of TokenInfo objects.
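A corresponding sketch for the TopN variant:

// Three candidate tokenizations; unlike analyzeTopN, no scores are returned.
const candidates = kiwi.tokenizeTopN('나는 밤을 좋아해', 3, Match.allWithNormalizing);
for (const tokens of candidates) {
    console.log(tokens.map((t) => `${t.str}/${t.tag}`).join(' '));
}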
Optional end: End position of the token in the preanalyzed word. If omitted, all token positions are calculated automatically.
Form of the token.
Optional start: Start position of the token in the preanalyzed word. If omitted, all token positions are calculated automatically.
Part-of-speech tag of the token.
Describes a single morpheme in the input string of the morphological analysis.
Length of the morpheme in the input string.
Line index in the input string.
The id of the morpheme information in the Kiwi object used. -1 indicates OOV.
For morphemes with the SSO or SSC part-of-speech tags, the position of the paired morpheme (-1 means there is no corresponding morpheme).
The start position in the input string.
Language model score of the morpheme.
Sentence index in the input string.
The form of the morpheme.
The index of the sub-sentence enclosed in quotation marks or parentheses, starting at 1. A value of 0 indicates that it is not a sub-sentence.
Part-of-speech tag of the morpheme.
Cost of the typo that was corrected. If no typo correction was performed, this value is 0.
Typo correction form if typo correction was performed; id of the pretokenized span if no typo correction was performed.
Word index in the input string (space based).
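Collected into a TypeScript shape for reference; the field names below are taken from the JSON output in the example near the top of this page (the morpheme id field described above does not appear in that output and is omitted here):

interface TokenLike {
    str: string;             // form of the morpheme
    tag: string;             // part-of-speech tag
    position: number;        // start position in the input string
    length: number;          // length in the input string
    lineNumber: number;      // line index
    sentPosition: number;    // sentence index
    subSentPosition: number; // sub-sentence index (0 = not a sub-sentence)
    wordPosition: number;    // word index (space based)
    pairedToken: number;     // position of the paired SSO/SSC morpheme
    score: number;           // language model score
    typoCost: number;        // cost of the corrected typo (0 = none)
    typoFormId: number;      // typo correction form / pretokenized span id
}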
Optional condition: Conditions under which typos can be replaced. One of none, any (after any letter), vowel (after a vowel), or applosive (after an applosive). Defaults to none when omitted.
Optional cost: Replacement cost. Defaults to 1.
The typos to be replaced.
Source strings.
Optional continualTypoCost: The cost of continual typos. Defaults to 1.
A list of TypoDefinition objects that define typo generation rules.
A single user word to add.
Optional orig: The original morpheme of the morpheme to be added. If the morpheme to be added is a variant of a particular morpheme, the original morpheme can be passed as this argument. If there is no such original morpheme, it can be omitted.
Optional score: The weighted score of the morpheme to add. If there are multiple morpheme combinations that match the form, the word with the higher score will be prioritized. Defaults to 0.
Optional tag: Part-of-speech tag. Defaults to 'NNP'.
The word to add.
Interface that performs the actual morphological analysis. Same as Kiwi, but with all methods returning promises. This can be used when the original Kiwi object is constructed inside a Web Worker. It cannot be constructed directly.
A single file to be loaded. The key is the name of the file and the value is the file data. The file data can be a string representing a URL or an ArrayBufferView directly containing the file data.
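A sketch mixing the two forms; the fetch URL is an assumption about where the model files are hosted:

// Pass one file as raw data and the rest as URLs; both forms are accepted.
const knlmData = new Uint8Array(
    await (await fetch('/model/sj.knlm')).arrayBuffer(),
);
const kiwi = await builder.build({
    modelFiles: {
        'sj.knlm': knlmData,           // ArrayBufferView with the raw file data
        'sj.morph': '/model/sj.morph', // URL string
        // ...the remaining model files
    },
});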
Used to create Kiwi instances. This is the main entry point for the API. It is recommended to create the KiwiBuilder and Kiwi instances in a worker to prevent blocking the main thread.