ws-irc3sp: Split value into sentences #5

Merged
merged 4 commits into from
Mar 24, 2023
1 change: 0 additions & 1 deletion applications/ws-irc3sp/Dockerfile
@@ -12,7 +12,6 @@ RUN apk add --update-cache --no-cache \
openssl-dev \
perl \
perl-json \
ca-certificates \
&& \
gunzip public/v1/CoL.txt.gz && \
mv package-app.json package.json && \
6 changes: 1 addition & 5 deletions applications/ws-irc3sp/README.md
@@ -1,4 +1,4 @@
# ws-irc3sp@1.0.0
# ws-irc3sp@1.1.0

IRC3: Indexation par Recherche et Comparaison de Chaînes de Caractères

@@ -8,10 +8,6 @@ See [original
program](https://gitbucket.inist.fr/scodex/IRC3/tree/master/IRC3sp) (French
description of IRC3sp).

> 💡 The treatment is much quicker when you send an array containing a tokenized
> text (sentence by sentence).
> The payload may be like `[{"id":1,"value":["sentence 1", "sentence 2"]}]`.

## Test

```bash
2 changes: 1 addition & 1 deletion applications/ws-irc3sp/package.json
@@ -1,7 +1,7 @@
{
"private": true,
"name": "ws-irc3sp",
"version": "1.0.0",
"version": "1.1.0",
"description": "Lodex workers for ws-irc3sp",
"repository": {
"type": "git",
2 changes: 1 addition & 1 deletion applications/ws-irc3sp/public/swagger.json
@@ -1,5 +1,5 @@
{
"info": {
"version": "1.0.0"
"version": "1.1.0"
}
}
6 changes: 4 additions & 2 deletions applications/ws-irc3sp/public/v1/irc3sp.ini
@@ -18,16 +18,18 @@ post.parameters.1.description = Indent or not the JSON Result
plugin = @ezs/spawn
# JSONParse
plugin = @ezs/basics
# sentences
plugin = ./v1/local.js

[JSONParse]
legacy = false
Contributor

@parmentf fast but dangerous...

Collaborator Author

Dangerous?
Because of the lack of testing?
Sure.

Contributor

A badly structured input causes a crash, cf. tsouza/yajs#5.

separator = $

[sentences]

[expand]
path = env('path', 'value')
size = 100
# A cache is not a good idea on long texts
# cacheName = irc3sp-post-v1-irc3sp

[expand/exec]
# command should be executable !
45 changes: 45 additions & 0 deletions applications/ws-irc3sp/public/v1/local.js
@@ -0,0 +1,45 @@
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const SENTENCE_ENDING = ".?!";

const sentences = (data, feed, ctx) => {
Contributor

could be moved to a generic package for reuse and documentation

Contributor

and tested ;-)

Collaborator Author

> could be moved to a generic package for reuse and documentation

Yes, but which one?
This is so specific that it can't fit into @ezs/basics.
I thought of @ezs/teeft for a while, but the real purpose of this is to reduce the size of the texts to be analyzed by irc3sp.
That is to say, it avoids splitting sentences that contain abbreviated species names, so that those names can still be retrieved by the IRC3sp algorithm...
Maybe we could start an @ezs/nlp package, but even there this sentence cutter would not really be in its place, because of its specificity.
🤔

Contributor

@ezs/basics contains TXTParser, which splits text into segments; we could imagine adding a TXTSentences statement that splits text into sentences.
Otherwise, you need to create a new package that will contain text-processing tools.

Collaborator Author

It was eventually integrated into @ezs/strings.

    if (ctx.isLast()) {
        return feed.close();
    }

    let value = data?.value;
    if (Array.isArray(value)) {
        if (value.length === 1) {
            value = value[0];
        }
    }
    if (typeof value !== 'string') {
        return feed.send({ ...data, value });
    }

    value = value.split("").reduce((a, c) => {
        const currentSentence = a.slice(-1);
        const [prev1, prev2] = a.slice(-1)[0].slice(-2);
        if (SENTENCE_ENDING.includes(c)) {
            if (c !== ".") {
                return [...a.slice(0, -1), currentSentence + c, " "];
            }

            if (prev1 !== " ") {
                return [...a.slice(0, -1), currentSentence + c, " "];
            }

            if (!LETTERS.includes(prev2)) {
                return [...a.slice(0, -1), currentSentence + c, " "];
            }
        }
        return [...a.slice(0, -1), currentSentence + c]
    },
    [" "])
        .filter(sentence => sentence !== " ")
        .map(s => s.trimStart());
    feed.send({ ...data, value });
};

module.exports = {
    sentences,
};