ws-irc3sp: Split value into sentences #5
And respect the initials of species names. This speeds up the processing of long texts.
@@ -18,16 +18,18 @@ post.parameters.1.description = Indent or not the JSON Result
plugin = @ezs/spawn
# JSONParse
plugin = @ezs/basics
# sentences
plugin = ./v1/local.js

[JSONParse]
legacy = false
@parmentf fast but dangerous...
Dangerous? Because of the lack of testing? Sure.
A badly structured input causes a crash, cf. tsouza/yajs#5
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const SENTENCE_ENDING = ".?!";

const sentences = (data, feed, ctx) => {
could be moved to a generic package for reuse and documentation
and tested ;-)
"could be moved to a generic package for reuse and documentation"
Yes, but which one?
This is so specific that it can't fit into @ezs/basics.
I thought of @ezs/teeft for a while, but the real purpose of this is to reduce the size of the texts to be analyzed by irc3sp.
That is to say, not splitting sentences that contain an abbreviated species name, so that the names can still be retrieved by the IRC3sp algorithm...
Maybe we could start an @ezs/nlp package, but even there this sentence cutter would not really belong, because of its specificity.
🤔
@ezs/basics contains TXTParser, which splits text into segments; we could imagine adding a TXTSentences that splits text into sentences.
Otherwise, you need to create a new package that will contain text tools.
It was eventually integrated into @ezs/strings.
It's important not to split a sentence at the initial of a species name:
Canis lupus, .... C. lupus
should not be split into two sentences.
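To make the rule concrete, here is a minimal sketch of a splitter that does not cut at a one-letter initial. It is not the code from this pull request: `splitSentences` and `isInitial` are hypothetical names, and the sketch assumes that an initial is a single uppercase ASCII letter followed by a period, and that a sentence ends at `.`, `?` or `!` followed by whitespace or the end of the text.

```javascript
// Hypothetical sketch, not the implementation from this pull request.
const SENTENCE_ENDING = ".?!";

// True when the period at `index` closes a one-letter initial such as "C."
// (assumption: a single uppercase ASCII letter at the start of the text
// or right after whitespace).
const isInitial = (text, index) => {
    if (text[index] !== ".") return false;
    const letter = text[index - 1];
    if (letter === undefined || !/^[A-Z]$/.test(letter)) return false;
    const before = text[index - 2];
    return before === undefined || /\s/.test(before);
};

// Split `text` into trimmed sentences, never cutting right after an initial.
const splitSentences = (text) => {
    const sentences = [];
    let start = 0;
    for (let i = 0; i < text.length; i += 1) {
        const next = text[i + 1];
        const atBoundary = SENTENCE_ENDING.includes(text[i])
            && (next === undefined || /\s/.test(next))
            && !isInitial(text, i);
        if (atBoundary) {
            const sentence = text.slice(start, i + 1).trim();
            if (sentence) sentences.push(sentence);
            start = i + 1;
        }
    }
    // Keep any trailing text that has no closing punctuation.
    const rest = text.slice(start).trim();
    if (rest) sentences.push(rest);
    return sentences;
};

console.log(splitSentences("Canis lupus is a canid. C. lupus hunts in packs."));
```

With this rule, "C. lupus hunts in packs." stays in one piece. The trade-off is that a sentence that really ends with a single capital letter (for example "... vitamin C.") is merged with the next one, which seems acceptable when the goal is only to shorten the texts handed to irc3sp.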