fix bm25 & keyword search #564

Open
wants to merge 1 commit into base: master

@devxpy (Member) commented Dec 15, 2024

Q/A checklist

  • If you add new dependencies, did you update the lock file?
poetry lock --no-update
  • Run tests
ulimit -n unlimited && ./scripts/run-tests.sh
  • Do a self code review of the changes - Read the diff at least twice.
  • Carefully think about the stuff that might break because of this change - this sounds obvious but it's easy to forget to do "Go to references" on each function you're changing and see if it's used in a way you didn't expect.
  • The relevant pages still run when you press submit
  • The API for those pages still works (API tab)
  • The public API interface doesn't change if you didn't want it to (check API tab > docs page)
  • Do your UI changes (if applicable) look acceptable on mobile?
  • Ensure you have not regressed the import time unless you have a good reason to do so.
    You can visualize this using tuna:
python3 -X importtime -c 'import server' 2> out.log && tuna out.log

To measure import time for a specific library:

$ time python -c 'import pandas'

________________________________________________________
Executed in    1.15 secs    fish           external
   usr time    2.22 secs   86.00 micros    2.22 secs
   sys time    0.72 secs  613.00 micros    0.72 secs

To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:

def my_function():
    import pandas as pd
    ...

Legal Boilerplate

Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.

@greptile-apps greptile-apps bot left a comment

PR Summary

Summary of the key changes to BM25 and keyword search in this pull request:

Added support for a separate keyword query in vector search, with control characters stripped from queries for Vespa compatibility. The changes include:

  • Added keyword_query field to DocSearchRequest in vector_search.py to support dedicated keyword-based searches
  • Simplified Vespa ranking profiles in setup_vespa_db.py by removing inheritance and standardizing query parameter names
  • Modified generate_final_search_query() in query_generator.py to prioritize request values over response/context data
  • Updated variables_input() in variables_widget.py to support excluding specific variables via new exclude parameter
  • Upgraded pyvespa dependency from 0.39.0 to 0.51.0 to leverage newer search capabilities

The changes improve search accuracy by separating semantic and keyword-based searches while maintaining backward compatibility.

9 file(s) reviewed, 3 comment(s)

Comment on lines 235 to 244
def query_vespa(
    search_query: str,
    keyword_query: str | list[str] | None,
    file_ids: list[str],
    limit: int,
    embedding_model: EmbeddingModels,
    semantic_weight: float = 1.0,
    threshold: float = 0.7,
    rerank_count: float = 1000,
) -> dict:

logic: The rerank_count parameter is defined as float but used for integer operations. Should be typed as int.
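
A minimal sketch of why the annotation matters, assuming rerank_count ends up in an integer-only operation such as slicing. The top_hits helper is hypothetical and only stands in for whatever query_vespa does with the value.

```python
def top_hits(hits: list[str], rerank_count: int = 1000) -> list[str]:
    # Hypothetical consumer of rerank_count: slice bounds must be integers.
    return hits[:rerank_count]


def float_count_fails() -> bool:
    # Passing a float where an int is required raises TypeError at runtime,
    # which a type checker would flag earlier if the parameter is typed as int.
    try:
        top_hits(["a", "b", "c"], 2.0)
    except TypeError:
        return True
    return False
```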

Comment on lines +991 to +993
def remove_control_characters(s):
    # from https://docs.vespa.ai/en/troubleshooting-encoding.html
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")

style: The remove_control_characters function could be more efficient using str.translate() with a translation table
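
One way the str.translate() suggestion could look, as a sketch rather than a drop-in replacement. The class and function names are made up for illustration. The table is a dict subclass that fills itself lazily through `__missing__`, which avoids precomputing an entry for every Unicode code point. Whether this is actually faster for the PR's inputs is untested here and worth checking with timeit.

```python
import unicodedata


class _ControlCharTable(dict):
    """Translation table filled lazily, one code point at a time."""

    def __missing__(self, codepoint: int):
        # None tells str.translate() to delete the character;
        # mapping a code point to itself keeps it unchanged.
        keep = unicodedata.category(chr(codepoint))[0] != "C"
        self[codepoint] = codepoint if keep else None
        return self[codepoint]


_TABLE = _ControlCharTable()


def remove_control_characters_translate(s: str) -> str:
    # Each distinct character is classified once, then served from the cache.
    return s.translate(_TABLE)


def remove_control_characters_join(s: str) -> str:
    # The version from the diff, kept for comparison.
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
```

Both versions also strip newlines and tabs, since those fall under category Cc.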

Comment on lines +79 to 80
inputs=[("query(queryEmbedding)", EMBEDDING_TYPE)],
first_phase="closeness(field, embedding)",

logic: The semantic profile now takes its input as queryEmbedding instead of q. Make sure all query code is updated to use the new parameter name.
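
For reference, a sketch of what a caller might send after the rename. The build_semantic_query_body helper, the "semantic" profile name, and the YQL string are assumptions for illustration. The embedding field and the queryEmbedding input name come from the excerpt above, and input.query(...) is Vespa's query-API syntax for passing ranking inputs.

```python
def build_semantic_query_body(embedding: list[float], limit: int = 10) -> dict:
    """Hypothetical helper: build a Vespa query body for the semantic profile."""
    return {
        # The tensor name in nearestNeighbor must match the profile's input.
        "yql": (
            "select * from sources * where "
            f"{{targetHits: {limit}}}nearestNeighbor(embedding, queryEmbedding)"
        ),
        "ranking": "semantic",  # assumed profile name
        "hits": limit,
        # Renamed input: previously this would have been input.query(q).
        "input.query(queryEmbedding)": embedding,
    }
```

A body like this could then be passed to the Vespa client's query call. Any remaining caller that still sends input.query(q) would no longer supply the input the profile declares.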
