Searching for a phrase including english stopwords in the standard analyser returns no matches #23

Open
marcemarc opened this issue Nov 17, 2017 · 0 comments

I had a site using EZSearch, and the client reported a problem when searching for the phrase:

Can I dance with a bear if I'm lying to my fitness instructor

A page existed in Umbraco whose nodeName matched that text exactly, but EZSearch did not return it as a match!

This is because the index was using the Standard Analyzer, and the search phrase included English stopwords that the analyzer excludes from the index, e.g. "with", "if", "a" (https://stackoverflow.com/questions/17527741/what-is-the-default-list-of-stopwords-used-in-lucenes-stopfilter)

In EZSearch, a Lucene query is built up for each word in the search term, across each search field, here:

    // Ensure page contains all search terms in some way
    foreach (var term in model.SearchTerms)
    {
        var groupedOr = new StringBuilder();
        foreach (var searchField in model.SearchFields)
        {
            groupedOr.AppendFormat("{0}:{1}* ", searchField, term);
        }
        query.Append("+(" + groupedOr.ToString() + ") ");
    }

But you can't have '+if', because 'if' is a stopword and will never be in the index. Since every term is required, the entire phrase fails to match. If you leave the stopwords out of the phrase, it matches.
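
To make the failure concrete, here is a minimal standalone sketch of the same loop, printing the query string it produces. The class name `QuerySketch` and the field names `nodeName` and `bodyText` are assumptions for illustration, not necessarily what EZSearch is configured with.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuerySketch
{
    // Mirrors the loop above: each term becomes one required (+) group
    // containing a prefix clause per search field.
    public static string BuildQuery(IEnumerable<string> searchTerms, IEnumerable<string> searchFields)
    {
        var query = new StringBuilder();
        foreach (var term in searchTerms)
        {
            var groupedOr = new StringBuilder();
            foreach (var searchField in searchFields)
            {
                groupedOr.AppendFormat("{0}:{1}* ", searchField, term);
            }
            query.Append("+(" + groupedOr + ") ");
        }
        return query.ToString();
    }

    public static void Main()
    {
        // Hypothetical field names, chosen only for this example.
        var fields = new[] { "nodeName", "bodyText" };
        Console.WriteLine(BuildQuery(new[] { "bear", "if" }, fields));
        // prints: +(nodeName:bear* bodyText:bear* ) +(nodeName:if* bodyText:if* )
    }
}
```

The second group is required, but the stopword 'if' was never indexed, so that group finds nothing and the whole query returns no results.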

So I think it would be neat to ignore English stopwords in search terms, and also words of three characters or fewer, something like this:

    // Splits a string on spaces, except where enclosed in quotes,
    // then drops English stopwords and words of three characters or fewer
    public IEnumerable<string> Tokenize(string input)
    {
        var tokens = Regex.Matches(input, @"[\""].+?[\""]|[^ ]+")
            .Cast<Match>()
            .Select(m => m.Value.Trim('\"'))
            .ToList();
        tokens = tokens.Where(x => !StopAnalyzer.ENGLISH_STOP_WORDS_SET.Contains(x.ToLower()) && x.Length > 3).ToList();
        return tokens;
    }

Add this using directive at the top of the view:

@using Lucene.Net.Analysis

Now, if you search for the phrase: Can I dance with a bear if I'm lying to my fitness instructor
it's the same as searching for the keywords 'dance bear lying fitness instructor' (the length check also drops 'Can', 'I', 'I'm' and 'my'), and the matching article is found, because EZSearch no longer insists that the search include words that can't be in the index.
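
For anyone who wants to check the filtering without a Lucene reference, here is a rough standalone sketch of the same idea. It assumes a hard-coded copy of Lucene's default English stopword list in place of `StopAnalyzer.ENGLISH_STOP_WORDS_SET`, and the class name `TokenizeSketch` is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizeSketch
{
    // Assumption: a hard-coded copy of Lucene's default English stopwords,
    // standing in for StopAnalyzer.ENGLISH_STOP_WORDS_SET.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
            "such", "that", "the", "their", "then", "there", "these", "they",
            "this", "to", "was", "will", "with"
        };

    // Splits on spaces (keeping quoted phrases together), then drops
    // stopwords and any token of three characters or fewer.
    public static List<string> Tokenize(string input)
    {
        return Regex.Matches(input, "\".+?\"|[^ ]+")
            .Cast<Match>()
            .Select(m => m.Value.Trim('"'))
            .Where(t => !StopWords.Contains(t) && t.Length > 3)
            .ToList();
    }

    public static void Main()
    {
        var phrase = "Can I dance with a bear if I'm lying to my fitness instructor";
        Console.WriteLine(string.Join(" ", Tokenize(phrase)));
        // prints: dance bear lying fitness instructor
    }
}
```

Note that the `Length > 3` check removes short non-stopwords such as 'Can' and 'my' as well, so the threshold may need loosening if short keywords matter for a given site.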
