Improving Swedish stemming of -ös #152
Comments
I've not tried this out myself yet, but it sounds sensible from your description. The four lines you suggest aren't a bad way to implement it (each But perhaps a better option here is to drop the
(Maybe there's a better name for the new grouping.) If you want to submit a patch, please do - otherwise I'm happy to look at this, but not until after we've landed the patch for #47. |
That's beautiful. Using this pattern I think we can probably add some more prefixes. Or maybe a negated prefix? My understanding so far is that very few words end with
I believe a perfect rule for
Can you come up with such a pattern for the |
You can use Note that only the last of your examples will actually be considered here anyway since this removal is restricted to suffixes which are entirely in region R1 - here R1 is the region after the first non-vowel following a vowel, or starts at least 3 characters from the start of the word if the first non-vowel following a vowel is before that. So
(This also means that if the |
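As a rough illustration of the R1 definition given above, here is a minimal Python sketch. It is not the generated Snowball code: the function names `r1_start` and `suffix_in_r1` are made up for this example, and the vowel set is assumed to be the one used by the Swedish stemmer (a, e, i, o, u, y, ä, å, ö).

```python
# Assumption: the Swedish stemmer's vowel set.
SWEDISH_VOWELS = set("aeiouyäåö")


def r1_start(word: str) -> int:
    """Return the index where R1 begins (len(word) if R1 is empty)."""
    seen_vowel = False
    for i, ch in enumerate(word):
        if ch in SWEDISH_VOWELS:
            seen_vowel = True
        elif seen_vowel:
            # R1 starts after the first non-vowel following a vowel,
            # but never fewer than 3 characters from the start.
            return min(len(word), max(i + 1, 3))
    return len(word)


def suffix_in_r1(word: str, suffix: str) -> bool:
    """True if word ends with suffix and the suffix lies entirely in R1."""
    return word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word)
```

With this sketch, `suffix_in_r1("röst", "öst")` is false because R1 of "röst" is only the final "t", while `suffix_in_r1("generöst", "öst")` is true.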
The only exceptions left to consider are these three words then:
These are the only words to consider outside of |
I think
|
That seems to be more complicated than expected, so I think we should actually try to merge this one first as it looks like that's more easily achievable. |
That does have slightly different semantics as it doesn't require the |
Reviewing the discussion we have a few suggested and inferred options which in perl-like regexp terms are:
(Possibly plus some exceptions to also apply this rule to cases that are too short by the R1 rule, but I'll worry about those separately.) I'd probably lean slightly towards an inclusive list, as perhaps less likely to have unintended effects on proper nouns, foreign words, obscure words, etc, but only weakly - e.g. if we had about half the alphabet to include or exclude I'd pick inclusive. This rule is applied as the final step after other suffixes have potentially been removed, so e.g. "senbösten" -> "senböst" by I looked at this for the current vocab by commenting out the call to
This only shows words where the stem being formed ("proto-stem") ends "ös" or "öst" at the point in the process where this rule would be applied (and I've excluded those ending "lös" or "löst" as those are already handled, or aren't handled because of the R1 rule). The Looking over this list, there are 22 different proto-stems ending "öst" at that point. The only cases where there's another word in the vocab we could conflate one of these with by removing the final "t" are:
There's also these which are too short by the R1 rule:
I'm currently crunching Swedish wikipedia data to get a larger wordlist and will try this on that too. |
You might want to add the very rare
|
Thanks for the feedback. My wikipedia data is still crunching (I think the Python module the script uses must have got significantly slower, but this Swedish wikipedia dump is also the largest I've run it on so far). |
Looking at the 98888 words which occur at least 36 times in Swedish wikipedia, and checking what gets presented to the final step of the stemmer (like I did above) I also found "generöst" (neuter singular of "generös" - like English "generous"), "poröst" (ditto for "porös" - English "porous") and "rigoröst" (ditto "rigorös" - English "rigorous"), which are all "r" cases. We don't want to conflate "rös"/"röst" but the R1 check means we wouldn't anyway. So "r" is a candidate if there aren't other problematic cases (but there aren't a huge number of "r" words, or indeed really of anything except the "l" we already handle). The only other words ending "-öst" don't have matching "-ös" words in the list, and I don't see any words ending "-töst" or "-köst" (or "-uöst"). Here are the counts for all (but not taking into account whether the R1 requirement excludes them):
(I checked what the "j" word is, and it's "sjöst", so it isn't in R1.) Reducing the frequency threshold to 20 (150940 words) finds me "incestuös"/"incestuöst" and "luxuös"/"luxuöst", but otherwise doesn't find any more words with both "-ös" and "-öst" forms. The only other letter we gain is "f" from "föste"/"föstes" which reach this rule as "föst", but are excluded from it by the R1 check anyway. Reducing the frequency threshold to 5 (429604 words) finds more "r" words: "fibrös"/"fibröst", "glamorös"/"glamoröst", but also "överöser" (which reaches this stage as "överös") vs "överöst"/"överöste"/"överöstes" which seem to not be forms of the same word as best I can make out, but they're already conflated by the current algorithm anyway. There's also a "p" word: "pompös"/"pompöst". And at last a "k" word: "viskös"/"visköst" (same example you gave); still no "t" but there are your examples and no counterexamples. The grouping test does a max/min check and then a bitmap test so there's no runtime overhead there from adding extra characters within the range we'd be testing anyway, so I think we might as well include "k", "p" and "t". Perhaps "r" too - I haven't spotted a problematic case for "-röst" but there are other words it affects so there's more to worry about, but also more cases and more common cases than "k", "p" or "t". |
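To make the point about the grouping test concrete, here is a small Python sketch of a range check followed by a bitmap lookup. It only mirrors the idea described above; `make_grouping` is a hypothetical helper, not part of Snowball or its generated code.

```python
def make_grouping(chars: str):
    """Build a membership test: min/max range check, then a bitmap test."""
    lo = min(ord(c) for c in chars)
    hi = max(ord(c) for c in chars)
    bitmap = 0
    for c in chars:
        bitmap |= 1 << (ord(c) - lo)

    def contains(ch: str) -> bool:
        code = ord(ch)
        if code < lo or code > hi:
            return False  # outside the range: rejected without touching the bitmap
        return (bitmap >> (code - lo)) & 1 == 1

    return contains
```

Going from "ilnuv" to "iklnptuv" leaves the tested range ("i" to "v") unchanged and only sets extra bits, which is why the added letters cost nothing at lookup time in a scheme like this.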
I was wrong about that - the I tested changing it to work how I thought it was now working and it's better for a few cases and worse for none, so let's do that too - it is more logical for "-ös" and doesn't affect the other suffixes handled by this step. |
Here's my proposed change without "r" (and with-"r" variant commented):
I've been working on a script to analyse stemmer output before and after a change. I tried the no-r and r variants on a huge wordlist of the 988946 words which occur twice or more in Swedish wikipedia and the analysis reported is:
284 words stemmed differently
So including "r" here looks good to me, but I'd appreciate feedback from someone fluent in Swedish. |
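A before/after comparison of this kind can be sketched in a few lines of Python. This is not the analysis script mentioned above; `changed_stems` and the two toy stemmers are hypothetical stand-ins, used only to show the shape of the comparison.

```python
from typing import Callable, Dict, Iterable, Tuple


def changed_stems(
    words: Iterable[str],
    before: Callable[[str], str],
    after: Callable[[str], str],
) -> Dict[str, Tuple[str, str]]:
    """Map each word whose stem differs to its (old stem, new stem) pair."""
    return {w: (before(w), after(w)) for w in words if before(w) != after(w)}


# Toy stand-ins for the two stemmer variants (assumptions, not the real stemmers).
def toy_before(word: str) -> str:
    return word


def toy_after(word: str) -> str:
    # Drops the final "t" of "-röst" only when the word is longer than "röst" itself.
    return word[:-1] if word.endswith("röst") and len(word) > 4 else word
```

Calling `len(changed_stems(wordlist, toy_before, toy_after))` gives a "words stemmed differently" count for a given wordlist.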
Looks good! But not entirely sure about "r". Words ending with "röst" could either be a "-rös" word or "röst" (voice). E.g.:
I guess you could say that the stem is e.g. "basrös" and "talrös". I would expect it to be unique, as it cannot collide with a "-rös" word. It's probably up to you to decide what you consider a stem. I.e. Would we be fine with "voice" being stemmed as "voi"? If yes, then "röst" may be stemmed as "rös". Except for the word "röst" itself! |
While the stems almost all look a lot like the word they're a stem of, and often actually are what you might think of as the linguistic root, we aren't actually aiming to produce the linguistic root, so "voice" to "voi" would be OK and in fact the English stemmer currently stems "voice" -> "voic" which kind of illustrates this point. (The underlying reason for this is that a final "e" is elided in English when adding some suffixes and when removing these we can for example easily reduce "voicing" -> "voic", but it's then hard to come up with a rule to know whether to append an "e" to the stem after removing "-ing", and much easier to write a rule to remove the "-e" when stemming "voice".) So the cases you highlight are fine as long as these stems don't collide with the stems of unrelated words, and even my overlarge wordlist didn't uncover any cases where they did. The only caveat there is that while the vocabulary of wikipedia should be fairly broad, it may be lacking words which are only used in particular regions or dialects, have fallen out of usage (but could be encountered while indexing a data set including older documents), etc.
Yes, and "röst" is excluded by the R1 requirement. Thanks for the feedback - I'll get this change merged. |
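The behaviour discussed in this thread can be sketched in Python as follows. This is an illustration under stated assumptions, not the merged Snowball code: the letter set is assembled from the letters mentioned in the discussion (including "r"), the vowel set is assumed to be the Swedish stemmer's, and `strip_ost` is a made-up name.

```python
VOWELS = set("aeiouyäåö")       # assumption: the Swedish stemmer's vowel set
OST_ENDING = set("iklnprtuv")   # assumption: letters collected from this thread


def r1_start(word: str) -> int:
    """Index where R1 begins, at least 3 characters into the word."""
    seen_vowel = False
    for i, ch in enumerate(word):
        if ch in VOWELS:
            seen_vowel = True
        elif seen_vowel:
            return min(len(word), max(i + 1, 3))
    return len(word)


def strip_ost(word: str) -> str:
    """Rewrite a final "-öst" to "-ös" when the suffix is in R1 and the
    preceding letter is in the inclusive list; otherwise return word unchanged."""
    if not word.endswith("öst"):
        return word
    cut = len(word) - 3
    if cut < r1_start(word):
        return word  # suffix not entirely in R1, e.g. "röst"
    if word[cut - 1] not in OST_ENDING:
        return word
    return word[:-1]
```

Under these assumptions "basröst" becomes "basrös" and "generöst" becomes "generös", while "röst" and "sjöst" are left alone by the R1 check.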
snowballstem/snowball#152 improves the handling of -öst endings. This change expands the test data to fully cover the changes, and updates the expected output.
This now reflects the -öst suffix changes from snowballstem/snowball#152
I have identified a group of words that are incorrectly stemmed. Please assist me. I don't know how to patch swedish.sbl.
I realize it's kind of hard to improve the stemmer without over-stemming. However, there are around 100 Swedish adjectives ending with "ös" (English equivalent "ous") that do not conflict with any "öst" words.
Suffix "t" not identified
These word endings are correctly stemmed with a suffix of "a", but not "t" (English equivalent "ly"). E.g.:
Now, this is almost handled in the existing stemmer. Recall the handling of "lös" at line 58.
Proposed improvement
Not sure if regex/ranges are supported, or what the syntax would be. But, this is a 100% verified improvement (no over-stemming):
Change 'l{o"}st' (<-'l{o"}s') to something where "l" is changed to [ilnuv]. Somehow...
Otherwise, we could simply add these four lines: