ws-irc3sp: Split value into sentences #5

Merged: 4 commits, Mar 24, 2023

Collaborator

@parmentf parmentf commented Mar 24, 2023

Also respect the initials of species names.

This speeds up the processing of long texts.

It is important not to split a sentence at the initial of an abbreviated species name: Canis lupus, .... C. lupus should not be split into two sentences.
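
The rule described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `splitSentences` and `isInitial` are hypothetical names, and the heuristic (a period directly after a lone capital letter never ends a sentence) is an assumption about how the species initials are recognised.

```javascript
// Capital letters that may abbreviate a genus name, e.g. "C." in "C. lupus".
const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const SENTENCE_ENDING = ".?!";

// True when the "." at index i directly follows a lone capital letter
// (hypothetical helper, assumed heuristic for species initials).
const isInitial = (text, i) =>
    text[i] === "." &&
    i >= 1 &&
    LETTERS.includes(text[i - 1]) &&
    (i === 1 || /\s/.test(text[i - 2]));

// Split text into sentences without cutting after a species initial.
const splitSentences = (text) => {
    const result = [];
    let start = 0;
    for (let i = 0; i < text.length; i += 1) {
        const endsHere =
            SENTENCE_ENDING.includes(text[i]) &&
            !isInitial(text, i) &&
            // Only split when the punctuation is followed by whitespace or ends the text.
            (i + 1 === text.length || /\s/.test(text[i + 1]));
        if (endsHere) {
            const sentence = text.slice(start, i + 1).trim();
            if (sentence) result.push(sentence);
            start = i + 1;
        }
    }
    const rest = text.slice(start).trim();
    if (rest) result.push(rest);
    return result;
};

const example = splitSentences("Canis lupus is a canid. C. lupus hunts in packs! Really?");
// example is ["Canis lupus is a canid.", "C. lupus hunts in packs!", "Really?"]
```

The sketch makes a single pass over the text, which keeps the cost linear for long inputs. It has known limits: a sentence that genuinely ends with a single capital letter is not split, and abbreviations such as "e.g." still are.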

@parmentf parmentf added the enhancement New feature or request label Mar 24, 2023
@parmentf parmentf merged commit 4b59c93 into main Mar 24, 2023
@parmentf parmentf deleted the ws-irc3sp-split-sentences branch March 24, 2023 15:58
@@ -18,16 +18,18 @@ post.parameters.1.description = Indent or not the JSON Result
plugin = @ezs/spawn
# JSONParse
plugin = @ezs/basics
# sentences
plugin = ./v1/local.js

[JSONParse]
legacy = false
Contributor

@parmentf fast but dangerous...

Collaborator Author

Dangerous?
Because of the lack of testing?
Sure.

Contributor

A bad structure causes a crash, cf. tsouza/yajs#5

const LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
const SENTENCE_ENDING = ".?!";

const sentences = (data, feed, ctx) => {
Contributor

could be moved to a generic package for reuse and documentation

Contributor

and tested ;-)

Collaborator Author

> could be moved to a generic package for reuse and documentation

Yes, but which one?
This is so specific that it can't fit into @ezs/basics.
I thought of @ezs/teeft for a while, but the real purpose of this is to reduce the size of the texts to be analyzed by irc3sp.
That is to say, not splitting sentences that contain abbreviated species names, so that they can be retrieved by the IRC3sp algorithm...
Maybe we could start an @ezs/nlp package, but even there, this sentence cutter would not really be in its place, because of its specificity.
🤔

Contributor

@ezs/basics contains TXTParser, which splits text into segments; we could imagine adding a TXTSentences statement that splits text into sentences.
Otherwise, you need to create a new package that will contain text tools.
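
To make the packaging idea concrete, here is a hedged sketch of how a sentence splitter might be wrapped in a reusable statement. It reuses the `(data, feed, ctx)` shape visible in the diff; `makeSentencesStatement`, `naiveSplit`, the `id` field and the stub `feed` and context objects are all hypothetical, and the behaviour of `feed.send`, `feed.close` and `ctx.isLast` is assumed rather than taken from the ezs documentation.

```javascript
// Hypothetical factory: turns any splitter into a (data, feed, ctx) statement.
const makeSentencesStatement = (split) => (data, feed, ctx) => {
    // Assumption: ctx.isLast() tells the statement that the stream is finished.
    if (ctx.isLast()) {
        return feed.close();
    }
    // Only string values are split; anything else is passed through unchanged.
    const value = typeof data.value === "string" ? split(data.value) : data.value;
    return feed.send({ ...data, value });
};

// Placeholder splitter for illustration only: it ignores species initials.
const naiveSplit = (text) =>
    text.split(/(?<=[.?!])\s+/).filter((sentence) => sentence.length > 0);

// Minimal stand-ins for the feed and context (assumptions, not the real ezs objects).
const sent = [];
const feed = {
    send: (item) => sent.push(item),
    close: () => sent.push("closed"),
};

const statement = makeSentencesStatement(naiveSplit);
statement({ id: 1, value: "First sentence. Second one?" }, feed, { isLast: () => false });
statement(null, feed, { isLast: () => true });
// sent is now [{ id: 1, value: ["First sentence.", "Second one?"] }, "closed"]
```

Passing the splitter as a parameter keeps the species-specific rule out of the generic statement, which is one way to reconcile reuse with the specificity discussed above.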

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was eventually integrated into @ezs/strings.
