Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added JavaScript/wasm target via emscripten #171

Merged
merged 23 commits into from
Jun 30, 2024
Merged

Conversation

RicBent
Copy link
Contributor

@RicBent RicBent commented Jun 13, 2024

This pull requests adds a build target for a JavaScript library, usable in any modern Browser.
For this the emscripten toolchain is utilized.

Solves #169

You can check out a running demo here: https://ricbent.github.io/Kiwi/demo/

Things that can be improved:

  • More of the base API functionality (multiple results, custom builder arguments, etc)
  • Implementation of wasm 128-bit SIMD functions (supported in all current browsers: https://caniuse.com/wasm-simd)
  • Addition of GitHub actions to build this new target
  • Creating a proper npm package (package.json, helpers like the WebWorker wrapper from the demo, TypeScript types, etc), deployed via GitHub actions
  • Usage of wizer to pre-initialize the wasm module with a dictionary. This would entirely bypass the slow Kiwi initialization times.
  • Multi-thread support via WebWorkers
  • Add Documentation

Because I haven't set up GitHub actions for this target yet, you need to manually build. This requires the emscripten toolchain to be installed:

mkdir build
cd build
emcmake cmake -DCMAKE_BUILD_TYPE=Release -DKIWI_USE_CPUINFO=OFF -DKIWI_BUILD_TEST=OFF -DKIWI_USE_MIMALLOC=OFF -DKIWI_BUILD_CLI=OFF -DKIWI_BUILD_EVALUATOR=OFF -DKIWI_BUILD_MODEL_BUILDER=OFF ../
make

This will generate kiwi-wasm.js and kiwi-wasm.wasm. The former can be directly used in any modern browser.

Sorry if I didn't follow some contribution guidelines correctly as I don't speak much Korean (yet).


Things changed since the creation of the PR:

  • Bindings implement the same functionality as the Java bindings, instead of just the tokenize/build functions
  • Typing for all api via TypeScript
  • Inline code documentation, HTML docs can be generated, just like the C++/C/Python bindings
  • npm package should be publishable by release workflow
  • demo project that uses the npm package is provided, replicating the linked demo, now with proper typing

@RicBent
Copy link
Contributor Author

RicBent commented Jun 13, 2024

Is this change correct? RicBent@3f0eb0c

Passing 0 for numThreads to prevent the ThreadPool creation in KiwiBuilder did not work as that triggered some assertion.

@RicBent
Copy link
Contributor Author

RicBent commented Jun 14, 2024

Attempted to implement the release workflow. Not sure how to test it properly without triggering a release.

I guess a workflow that triggers on new pull requests still needs to be added.

@bab2min
Copy link
Owner

bab2min commented Jun 15, 2024

@RicBent
Wow, it's amazing! Thank you for your contribution! The demo seems to work very well.
I'll review it as soon as possible.

@RicBent
Copy link
Contributor Author

RicBent commented Jun 17, 2024

Great!

The latest commit adds a wrapper package that could be directly imported in any npm project. Along with proper types for the entirety of the so far exposed API (https://github.com/bab2min/Kiwi/blob/2161cf137b996471383e7c7b65370e9f28981f14/bindings/wasm/package/src/kiwi.ts).

I'd be happy to implement the remaining API functionality to bring it to the same level as the Python library once you had a look at the PR.

@bab2min
Copy link
Owner

bab2min commented Jun 18, 2024

@RicBent
Oh, it looks good to me. 👍
I would really appreciate it if the rest of the functionality was completed as well.
If it is difficult for you to implement all the functionality, I think it is okay to implement only the core, merge and release them first, and then supplement the rest later.

@RicBent
Copy link
Contributor Author

RicBent commented Jun 19, 2024

I had a look at the Java bindings and it seems like the only missing API is the following:

KiwiBuilder:

  • Init with TypoTransformer
  • Add word
  • Add pre-analyzed word
  • Load additional dictionaries

Kiwi:

  • Analyze/tokenize with blocklist
  • Analyze/tokenize with pre-tokenized spans

I would also like to add documentation to the bindings. However I am not able to make them in Korean, only in English. Is that a problem @bab2min ?

Oh and we would also need an npm package name as 'kiwi' is already taken. I chose 'kiwi-nlp' for now, but I can change it if you have a better suggestion.

@bab2min
Copy link
Owner

bab2min commented Jun 21, 2024

@RicBent
It's okay to write the documentation in English. If you write the documentation in English, I can translate it for Korean documentation.
I think kiwi-nlp is a good name for the package, showing that it is a NLP library.
Thank you for your contribution!

@RicBent
Copy link
Contributor Author

RicBent commented Jun 24, 2024

Alright, I am almost there then. I already wrote pretty much all the required documentation and just the 4 points from KiwiBuilder are missing to match the Java API functionality.

I also amended the release workflow to automatically build and publish the resulting package to npm:

build-emscripten:
name: Emscripten
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
submodules: true
lfs: true
- uses: mymindstorm/setup-emsdk@v14
- name: Build
run: |
cd bindings/wasm
./build.sh
- uses: JS-DevTools/npm-publish@v3
with:
token: ${{ secrets.NPM_TOKEN }}
package: bindings/wasm/package

The workflow requires NPM_TOKEN to be added to the repo's secrets on your end when this gets merged.

To generate an npm token you need to do the following:

  1. Register an account on npm if you didn't already
  2. Follow this to create a token (the token will need write access to the kiwi-nlp package)

To add it to the repo's secrets you can follow this:
https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions

@RicBent
Copy link
Contributor Author

RicBent commented Jun 25, 2024

@bab2min everything on my end should be done now:

  • The TypeScript bindings implement the same functionality as the Java bindings
  • Inline code documentation is done, HTML docs can be generated, just like the Python bindings
  • npm package should be publishable by release workflow (needs token as described)
  • demo project that uses the npm package is provided, replicating this: https://ricbent.github.io/Kiwi/demo/

Let me know if anything needs to be improved/changed!

@bab2min
Copy link
Owner

bab2min commented Jun 27, 2024

@RicBent
Great!!! Thank you for your contribution.
I'll add documentations for Korean and tokens for release workflow.

@bab2min bab2min merged commit b2709dc into bab2min:main Jun 30, 2024
14 checks passed
@ghj7211
Copy link

ghj7211 commented Sep 6, 2024

@RicBent
If it's not too much a hassle, could you consider building it without eval syntax?
I'm trying to import your work into a chrome extension. While it works great on manifest v2, CSP update on manifest v3 made it impossible to run any evaluations of strings in the implemented code.
According to issue raised here emscripten-core/emscripten#20994
It seems possible to suppress eval usage in resulting js by providing [-sDYNAMIC_EXECUTION=0] and [-sEMBIND_AOT] options.
I'll try resolving it by myself, but would really appreciate your help!

@RicBent
Copy link
Contributor Author

RicBent commented Sep 6, 2024

@ghj7211 I actually did the WASM port to use this inside a MV3 extension I built for my personal use.

To make it work properly, also together with vite you need to following flags added to the current ones:
-s DYNAMIC_EXECUTION=0
-s ENVIRONMENT=worker
-s USE_ES6_IMPORT_META=0

When I made the PR I was unsure if there is a sane way to distribute different versions of the package and thus just went for the most general and minimal build flags for broadest compat and performance. Especially the dyn exection flag had a significant performance hit when I worked on something in the past. Only thing that I figured would be possible was multiple different published packages for different build flags, which I didn't bother to do considering the limited target audience and added complexity.

If you do not want to build this package yourself, I could provide you with the build I am using. Not sure if you want to reply on pre-built, unverifiable binaries though.

If you have any problems with your project, let me know. Chances are high that I had to deal with similar issues when making my tool :)

@ghj7211
Copy link

ghj7211 commented Sep 7, 2024

@RicBent Thank you for your reply!
I managed to build it with -s DYNAMIC_EXECUTION=0 and -s EMBIND_AOT, modifyingCMakeLists.txt.
It works well on MV3 extension service worker, and also with vite dev environment.
emscripten/src/settings.js

// Embind specific: If enabled, generate Embind's JavaScript invoker functions
// at compile time and include them in the JS output file. When used with
// DYNAMIC_EXECUTION=0 this allows exported bindings to be just as fast as
// DYNAMIC_EXECUTION=1 mode, but without the need for eval(). If there are many
// bindings the JS output size may be larger though.
var EMBIND_AOT = false;

EMBIND_AOT option is relatively new, since it was first implemented on last December. Using it I found no noticeable degradation in performance even with DYNAMIC_EXECUTION=0.

-s ENVIRONMENT=worker -s USE_ES6_IMPORT_META=0 seems to be optional for my use case. Not specifying ENVIRONMENT supported any environments, and extension service worker could handle import.meta keywords.

Since unsafe eval() scripts are going deprecated on web for security issues, I thought it would be nice for kiwi-nlp to be eval-free on default. I made a repo to check the difference between 2 versions of kiwi-wasm.js :
ghj7211/kiwi-nlp-noeval@bfc99c8

You can also git clone and run the demo to see it running smoothly.
If proven error proof, I believe it is worth an update?

@RicBent
Copy link
Contributor Author

RicBent commented Sep 8, 2024

Nice research! I totally agree that getting rid of eval by default is a good idea if the performance does not take a noticeable hit.

This also seems to work flawlessly in the few tests I did on my end. Seems like worth opening a PR for from your end then? (also those dist files should go onto the gitignore/be removed oops).

Might also be a nice idea to have automated JS side tests at some point to check not only for correctness but also performance. Especially considering the broad configurability emscripten has.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants