ws-irc3sp: Split value into sentences #5

parmentf · 2023-03-24T15:01:26Z

And respect initials of species names.

This speeds up the processing of long texts.

It is important not to split a sentence at the initial of a species name: after a first mention of Canis lupus, a later "C. lupus" should not be split into two sentences.
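
To make the rule concrete, here is a minimal sketch of a splitter that treats a lone capital letter followed by a dot as a species-name initial rather than a sentence end. It is not the code of this PR: `splitSentences`, `isInitialDot` and the exact rule (a sentence ends at `.`, `?` or `!` followed by a space or the end of the text) are assumptions made for illustration.

```javascript
// Hypothetical sketch: names and rules are assumptions, not the PR's code.
const UPPERCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const SENTENCE_ENDING = ".?!";

// True when the "." at index i directly follows a lone capital letter,
// as in "C. lupus" (the letter must start the text or follow a space).
const isInitialDot = (text, i) =>
    text[i] === "."
    && i >= 1
    && UPPERCASE.includes(text[i - 1])
    && (i === 1 || text[i - 2] === " ");

// Split text into trimmed sentences, never cutting after an initial.
const splitSentences = (text) => {
    const result = [];
    let current = "";
    for (let i = 0; i < text.length; i += 1) {
        const char = text[i];
        current += char;
        const atEnd = i === text.length - 1;
        const endsSentence = SENTENCE_ENDING.includes(char)
            && (atEnd || text[i + 1] === " ")
            && !isInitialDot(text, i);
        if (endsSentence || atEnd) {
            const trimmed = current.trim();
            if (trimmed) result.push(trimmed);
            current = "";
        }
    }
    return result;
};

const example = splitSentences(
    "Canis lupus is a wolf. C. lupus hunts in packs. Really?",
);
console.log(example.length); // 3
```

A known limitation of this rule: a sentence that really ends with a single capital letter ("rich in vitamin C. Next sentence") is not split there, which is the price paid for keeping "C. lupus" together.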


touv · 2023-03-24T22:12:56Z

applications/ws-irc3sp/public/v1/irc3sp.ini

@@ -18,16 +18,18 @@ post.parameters.1.description = Indent or not the JSON Result
 plugin = @ezs/spawn
 # JSONParse
 plugin = @ezs/basics
+# sentences
+plugin = ./v1/local.js

 [JSONParse]
 legacy = false


@parmentf fast but dangerous...

Dangerous?
Because of the lack of testing?
Sure.

A badly structured input causes a crash; cf. tsouza/yajs#5

touv · 2023-03-24T22:15:04Z

applications/ws-irc3sp/public/v1/local.js

+const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
+const SENTENCE_ENDING = ".?!";
+
+const sentences = (data, feed, ctx) => {


could be moved to a generic package for reuse and documentation

and tested ;-)

could be moved to a generic package for reuse and documentation

Yes, but which one?
This is so specific that it can't fit into @ezs/basics.
I thought of @ezs/teeft for a while, but the real purpose of this is to reduce the size of the texts to be analyzed by irc3sp.
That is to say, not splitting sentences that contain abbreviated species names, so that the IRC3sp algorithm can still retrieve them...
Maybe we could start an @ezs/nlp package, but even there this sentence cutter would not really belong, because of its specificity.
🤔

@ezs/basics contains TXTParser, which splits text into segments; we can imagine adding a TXTSentences that splits text into sentences.
Otherwise, you need to create a new package that will contain tools for text.
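
As an illustration of what such a generic statement might look like, here is a sketch that uses the `(data, feed, ctx)` shape seen in the local.js diff. Everything in it is an assumption: the name `txtSentencesSketch`, the choice to split the `value` field, the regular expression, and the use of `ctx.isLast()`, `feed.send()` and `feed.close()` in the way ezs statements usually use them. The `runOnce` helper is a hand-written stand-in for the ezs runtime, only there to make the example runnable.

```javascript
// Hypothetical statement, not the code that was merged into any @ezs package.
// Split after ".", "?" or "!" followed by whitespace, except when the dot
// belongs to a lone capital letter such as the "C." of "C. lupus".
const SENTENCE_BREAK = /(?<=[.?!])(?<!(?:^|\s)[A-Z]\.)\s+/;

const splitText = (text) => text
    .split(SENTENCE_BREAK)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);

// Assumed statement shape: (data, feed, ctx), as in the diff of this PR.
const txtSentencesSketch = (data, feed, ctx) => {
    if (ctx.isLast()) {
        return feed.close();
    }
    // Objects without a string `value` are passed through unchanged.
    if (!data || typeof data.value !== "string") {
        return feed.send(data);
    }
    return feed.send({ ...data, value: splitText(data.value) });
};

// Minimal stand-in for the ezs runtime (an assumption, for demonstration only).
const runOnce = (data) => {
    const sent = [];
    const feed = {
        send: (item) => sent.push(item),
        close: () => sent.push("closed"),
    };
    txtSentencesSketch(data, feed, { isLast: () => data === null });
    return sent;
};

console.log(runOnce({ id: 1, value: "A wolf howls. C. lupus hunts!" }));
```

Keeping the splitting logic in a plain function (`splitText`) separate from the statement makes it straightforward to unit-test without starting a pipeline, which also answers the "and tested" remark above.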

It was eventually integrated into @ezs/strings.

perf(ws-irc3sp): Split value into sentences

4edd7c7

And respect initials of species names. This accelerates the treatment for long texts.

parmentf added the enhancement New feature or request label Mar 24, 2023

parmentf added 3 commits March 24, 2023 16:11

refactor(ws-irc3sp): Factorize trimStart

0c31b1c

build(ws-irc3sp): Remove useless apk

8aa96f9

release [email protected]

162afe0

parmentf merged commit 4b59c93 into main Mar 24, 2023

parmentf deleted the ws-irc3sp-split-sentences branch March 24, 2023 15:58

touv reviewed Mar 24, 2023

parmentf mentioned this pull request Mar 27, 2023

ezs/basics: add txt-sentences Inist-CNRS/ezs#322

Merged
