Greek stemmer returns an error #204

Open
subnix opened this issue Dec 30, 2024 · 2 comments
Comments

@subnix
subnix commented Dec 30, 2024

We use Snowball with Manticore Search and have hit an error (manticoresoftware/manticoresearch#2888) in the Greek stemmer:

$ echo -n "ισαισα" | ./stemwords -l el
Out of memory

Enabling debug output in runtime/utilities.c produces the following error message:

faulty slice operation:
{|ισα[}'

Tested with master (b72b71f), v2.2.0 (48a67a2), v2.1.0 (4764395), v2.0.0 (c70ed64).

@ojwb
Member

ojwb commented Dec 30, 2024

It's happening in "steps3" - this patch seems a plausible fix, but I've only done minimal testing so far and not checked against the paper to confirm if this is what's actually intended here:

diff --git a/algorithms/greek.sbl b/algorithms/greek.sbl
index df856ac9..06854119 100644
--- a/algorithms/greek.sbl
+++ b/algorithms/greek.sbl
@@ -214,7 +214,7 @@ backwardmode (
       '{y}{s}{a}' '{y}{s}{e}{s}' '{y}{s}{e}' '{y}{s}{a}{m}{e}' '{y}{s}{a}{t}{e}' '{y}{s}{a}{n}' '{y}{s}{a}{n}{e}' (
         delete
         unset test1
-        ('{y}{s}{a}' atlimit <- '{y}{s}') or
+        (['{y}{s}{a}'] atlimit <- '{y}{s}') or
         ([] substring atlimit among (
           '{a}{n}{a}{m}{p}{a}' '{a}{th}{r}{o}' '{e}{m}{p}{a}' '{e}{s}{e}' '{e}{s}{oo}{k}{l}{e}' '{e}{p}{a}' '{x}{a}{n}{a}{p}{a}' '{e}{p}{e}' '{p}{e}{r}{y}{p}{a}'
           '{s}{u}{n}{a}{th}{r}{o}' '{d}{a}{n}{e}' '{k}{l}{e}' '{ch}{a}{r}{t}{o}{p}{a}' '{e}{x}{a}{r}{ch}{a}' '{m}{e}{t}{e}{p}{e}' '{a}{p}{o}{k}{l}{e}'

It looks like this bug has been there since this stemmer was first added.
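
To make the effect of the added brackets concrete, here is a rough character-level model in Python. It is an illustration under assumptions, not code from the Snowball runtime: `Env`, `match_back`, `replace_slice` and `steps3_like` are hypothetical names, the real runtime works on bytes rather than characters, and the model assumes that `delete` leaves the old slice markers where they were, which is what the debug output above seems to show. It also assumes `{y}{s}{a}` spells ισα, as the failing input suggests.

```python
# Simplified model of a backward-mode slice. Assumption for illustration only;
# this is not taken from runtime/utilities.c.
class Env:
    def __init__(self, text):
        self.p = text
        self.l = len(text)  # limit: end of the string
        self.c = self.l     # cursor starts at the end in backward mode
        self.bra = 0        # start of the slice
        self.ket = self.l   # end of the slice

    def slice_ok(self):
        return 0 <= self.bra <= self.ket <= self.l

    def match_back(self, s):
        """Literal test in backward mode: step the cursor back over s."""
        if self.p[:self.c].endswith(s):
            self.c -= len(s)
            return True
        return False

    def replace_slice(self, s):
        """Replace p[bra:ket] with s; models both `delete` and `<-`."""
        if not self.slice_ok():
            raise ValueError("faulty slice operation")
        self.p = self.p[:self.bra] + s + self.p[self.ket:]
        adjustment = len(s) - (self.ket - self.bra)
        self.l += adjustment
        if self.c >= self.ket:
            self.c += adjustment
        elif self.c > self.bra:
            self.c = self.bra
        # bra and ket are left untouched (assumed behaviour, see lead-in)


def steps3_like(word, bracketed):
    """Mimic only the patched line of steps3 for the suffix 'ισα'."""
    env = Env(word)
    env.ket = env.c                  # [
    if not env.match_back("ισα"):    # substring ... among
        return env.p
    env.bra = env.c                  # ]
    env.replace_slice("")            # delete
    if bracketed:
        env.ket = env.c              # [ (added by the patch)
    if env.match_back("ισα"):
        if bracketed:
            env.bra = env.c          # ] (added by the patch)
        if env.c == 0:               # atlimit
            env.replace_slice("ισ")  # <- '{y}{s}'
    return env.p
```

In this model, `steps3_like("ισαισα", bracketed=False)` raises the error because the markers still describe the slice from before the `delete`, while `steps3_like("ισαισα", bracketed=True)` returns ισ.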

@ojwb
Member

ojwb commented Jan 2, 2025

A related issue I noticed is that ισα produces an empty stem. Producing an empty stem for a non-empty input is undesirable in general for a stemmer; in this case ισα seems to mean "same" and so probably stemming it to ισ would be good.

The essential cause of this is that the has_min_length test is only applied to the input, but removed suffixes can reduce the stem length below that, even removing the whole word as a suffix (or combination of suffixes). If I test tweaking the code to prevent the -ισα suffix removal then -α is removed instead giving the desired result. Probably this stemmer would benefit from a "minimum length of the remaining stem" check before suffix removal, like most of the other stemmers have. That wouldn't address the error reported here though.
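
The difference between checking the input length and checking the remaining stem length can be sketched with a toy suffix stripper in Python. This is only an illustration and not the Greek stemmer: the function names, the two-entry suffix list and both thresholds are hypothetical values chosen to mirror the ισα example.

```python
# Toy suffix stripper. Names, suffix list and thresholds are hypothetical.
MIN_INPUT_LENGTH = 3     # stands in for a has_min_length style test
SUFFIXES = ["ισα", "α"]  # longest suffix first


def strip_input_check_only(word):
    """Check the length of the input only, then strip the first match."""
    if len(word) < MIN_INPUT_LENGTH:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word


def strip_with_stem_check(word, min_stem=2):
    """Strip a suffix only if at least min_stem characters would remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)]
    return word
```

With these toy rules, `strip_input_check_only("ισα")` returns an empty string, whereas `strip_with_stem_check("ισα")` skips the -ισα suffix, removes -α instead and returns ισ.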
