fix bm25 & keyword search #564
base: master
PR Summary
Here's my summary of the key changes in this pull request focused on improving BM25 and keyword search functionality:
Added support for separate keyword queries in vector search, with proper control-character handling for Vespa compatibility. The changes include:
- Added keyword_query field to DocSearchRequest in vector_search.py to support dedicated keyword-based searches
- Simplified Vespa ranking profiles in setup_vespa_db.py by removing inheritance and standardizing query parameter names
- Modified generate_final_search_query() in query_generator.py to prioritize request values over response/context data
- Updated variables_input() in variables_widget.py to support excluding specific variables via a new exclude parameter
- Upgraded pyvespa dependency from 0.39.0 to 0.51.0 to leverage newer search capabilities
The changes improve search accuracy by separating semantic and keyword-based searches while maintaining backward compatibility.
9 file(s) reviewed, 3 comment(s)
def query_vespa(
    search_query: str,
    keyword_query: str | list[str] | None,
    file_ids: list[str],
    limit: int,
    embedding_model: EmbeddingModels,
    semantic_weight: float = 1.0,
    threshold: float = 0.7,
    rerank_count: float = 1000,
) -> dict:
logic: The rerank_count parameter is defined as float but used for integer operations. Should be typed as int.
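To illustrate why the annotation matters, here is a minimal sketch. The helper top_candidates is hypothetical and not part of this PR; it only assumes that rerank_count ends up somewhere an integer is required, such as a slice bound.

```python
def top_candidates(scores: list[float], rerank_count: int = 1000) -> list[float]:
    # Hypothetical helper: keep the highest-scoring `rerank_count` items.
    return sorted(scores, reverse=True)[:rerank_count]


scores = [0.2, 0.9, 0.5]
print(top_candidates(scores, rerank_count=2))  # an int works as a slice bound

try:
    top_candidates(scores, rerank_count=2.0)  # a float is rejected by slicing
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Typing the parameter as int lets a type checker flag such calls before they fail at runtime.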
def remove_control_characters(s):
    # from https://docs.vespa.ai/en/troubleshooting-encoding.html
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
style: The remove_control_characters function could be more efficient using str.translate() with a translation table
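One possible shape for the str.translate() variant is sketched below. The names _ControlCharTable and remove_control_characters_translate are illustrative, not from the PR. The table is a dict subclass that classifies each code point the first time it is seen, which avoids precomputing an entry for every Unicode code point. Whether this is actually faster depends on the input, so it is worth benchmarking before adopting.

```python
import unicodedata


class _ControlCharTable(dict):
    """Translation table filled lazily, one code point at a time."""

    def __missing__(self, codepoint: int):
        # Map every "C*" category (Cc, Cf, Cs, Co, Cn) to None, which tells
        # str.translate() to delete it; map anything else to itself.
        is_control = unicodedata.category(chr(codepoint)).startswith("C")
        value = None if is_control else codepoint
        self[codepoint] = value  # cache so later lookups skip __missing__
        return value


_TABLE = _ControlCharTable()


def remove_control_characters_translate(s: str) -> str:
    # Same filtering rule as the join-based version, applied via translate().
    return s.translate(_TABLE)


print(repr(remove_control_characters_translate("a\x00b\tc\u200bd")))
```

Like the original, this also strips tabs and newlines, since they fall in category Cc.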
inputs=[("query(queryEmbedding)", EMBEDDING_TYPE)],
first_phase="closeness(field, embedding)",
logic: The semantic profile now uses queryEmbedding instead of the q parameter. Make sure all query code is updated to use the new parameter name.
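As a sketch of what the caller side might look like after the rename: the function build_semantic_query_body, the YQL string, and the hits value are assumptions for illustration and are not taken from this PR. The point is only that the request key must match the input name declared in the rank profile.

```python
def build_semantic_query_body(query_embedding: list[float], limit: int = 10) -> dict:
    # Hypothetical request builder; only the input name is taken from the diff.
    return {
        "yql": (
            "select * from sources * where "
            f"{{targetHits: {limit}}}nearestNeighbor(embedding, queryEmbedding)"
        ),
        "ranking": "semantic",  # assumed rank-profile name
        "hits": limit,
        # Must match inputs=[("query(queryEmbedding)", ...)] in the profile.
        "input.query(queryEmbedding)": query_embedding,
    }


body = build_semantic_query_body([0.1, 0.2, 0.3], limit=5)
print(sorted(body))
```

A quick search of the codebase for the old input.query(q) key is a cheap way to catch callers that were missed.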
Q/A checklist
You can visualize this using tuna:
To measure import time for a specific library:
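A rough way to do this from Python itself, offered as a sketch rather than the project's own command. The helper measure_import_seconds is hypothetical, and colorsys is just a stand-in for whichever library is being investigated.

```python
import importlib
import time


def measure_import_seconds(module_name: str) -> float:
    # Times a single import. If the module is already cached in sys.modules,
    # the result will be close to zero, so run this in a fresh interpreter.
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


print(f"colorsys: {measure_import_seconds('colorsys'):.6f}s")
```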
To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:
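A minimal sketch of that pattern follows. Here decimal stands in for a slow third-party library, and parse_amount is a made-up function used only for illustration.

```python
def parse_amount(text: str):
    # Deferred import: the module is loaded on the first call rather than at
    # program start-up. Later calls reuse the cached module from sys.modules.
    from decimal import Decimal

    return Decimal(text)


print(parse_amount("1.50"))
```

The trade-off is that the first call pays the import cost, so this suits code paths that are not hit on every request.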
Legal Boilerplate
Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.