fix bm25 & keyword search #564

devxpy · 2024-12-15T13:37:46Z

Q/A checklist

If you add new dependencies, did you update the lock file?

poetry lock --no-update

Run tests

ulimit -n unlimited && ./scripts/run-tests.sh

Do a self code review of the changes - Read the diff at least twice.
Carefully think about what might break because of this change. This sounds obvious, but it's easy to forget to do "Go to references" on each function you're changing and check whether it's used in a way you didn't expect.
The relevant pages still run when you press submit
The API for those pages still works (API tab)
The public API interface doesn't change if you didn't want it to (check API tab > docs page)
Do your UI changes (if applicable) look acceptable on mobile?
Ensure you have not regressed the import time unless you have a good reason to do so.
You can visualize this using tuna:

python3 -X importtime -c 'import server' 2> out.log && tuna out.log

To measure import time for a specific library:

$ time python -c 'import pandas'

________________________________________________________
Executed in    1.15 secs    fish           external
   usr time    2.22 secs   86.00 micros    2.22 secs
   sys time    0.72 secs  613.00 micros    0.72 secs

To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:

def my_function():
    import pandas as pd
    ...

Legal Boilerplate

Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.

greptile-apps

PR Summary

This pull request fixes BM25 and keyword search. It adds support for a separate keyword query in vector search and strips control characters from queries for Vespa compatibility. The changes include:

Added keyword_query field to DocSearchRequest in vector_search.py to support dedicated keyword-based searches
Simplified Vespa ranking profiles in setup_vespa_db.py by removing inheritance and standardizing query parameter names
Modified generate_final_search_query() in query_generator.py to prioritize request values over response/context data
Updated variables_input() in variables_widget.py to support excluding specific variables via new exclude parameter
Upgraded pyvespa dependency from 0.39.0 to 0.51.0 to leverage newer search capabilities

The changes are intended to improve search accuracy by separating semantic and keyword-based searches while keeping existing callers working.

9 file(s) reviewed, 3 comment(s)

greptile-apps · 2024-12-23T03:42:56Z

daras_ai_v2/vector_search.py

 def query_vespa(
    search_query: str,
+    keyword_query: str | list[str] | None,
    file_ids: list[str],
    limit: int,
    embedding_model: EmbeddingModels,
    semantic_weight: float = 1.0,
+    threshold: float = 0.7,
+    rerank_count: float = 1000,
 ) -> dict:


logic: The rerank_count parameter is annotated as float but is given an integer default (1000) and is used in integer operations. It should be typed as int.
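
A minimal sketch of why the annotation matters, assuming rerank_count ends up in an integer-only operation such as slicing a list of hits. The PR's actual usage is not shown in this comment, and `top_hits` is a hypothetical helper, not a function from the codebase:

```python
def top_hits(hits: list[str], rerank_count: int = 1000) -> list[str]:
    # Hypothetical helper, not from the PR: keep the first rerank_count hits.
    return hits[:rerank_count]


hits = ["doc-a", "doc-b", "doc-c"]
print(top_hits(hits, rerank_count=2))  # an int count works

try:
    # A value that a `float` annotation would advertise as valid.
    top_hits(hits, rerank_count=2.0)
except TypeError as exc:
    print(f"float count rejected: {exc}")
```

With the parameter annotated as int, a type checker can flag the second call before it fails at runtime.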

greptile-apps · 2024-12-23T03:42:56Z

daras_ai_v2/vector_search.py

+def remove_control_characters(s):
+    # from https://docs.vespa.ai/en/troubleshooting-encoding.html
+    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


style: The remove_control_characters function could be made more efficient by using str.translate() with a precomputed translation table.
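
One way the suggestion could look, as a sketch rather than a drop-in replacement. `_control_char_table` is a hypothetical helper name; it is built lazily and cached so that module import time is not affected:

```python
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=1)
def _control_char_table() -> dict[int, None]:
    # Map every code point whose general category starts with "C"
    # (Cc, Cf, Cs, Co, Cn) to None, which makes str.translate() delete it.
    # Built on first use only, then reused for every later call.
    return dict.fromkeys(
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp))[0] == "C"
    )


def remove_control_characters(s: str) -> str:
    return s.translate(_control_char_table())


print(repr(remove_control_characters("a\x00b\u200bc")))  # 'abc'
```

The "C" categories include all unassigned and private-use code points, so the table is large and has a one-time build cost. It would be worth benchmarking against the generator version on realistic query lengths before adopting it.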

greptile-apps · 2024-12-23T03:43:35Z

scripts/setup_vespa_db.py

+                    inputs=[("query(queryEmbedding)", EMBEDDING_TYPE)],
                    first_phase="closeness(field, embedding)",


logic: The semantic profile now uses queryEmbedding instead of the q parameter. Make sure all query code is updated to use the new parameter name.
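
A sketch of what a matching query body might look like on the calling side. `build_semantic_query_body` is a hypothetical helper, and the profile name "semantic" is an assumption taken from the wording of this comment; the field name `embedding` comes from the snippet above:

```python
def build_semantic_query_body(query_embedding: list[float], limit: int) -> dict:
    # Hypothetical helper: assembles a Vespa query body for the semantic profile.
    return {
        # nearestNeighbor's second argument names the query tensor, so it must
        # match the input declared by the profile: query(queryEmbedding).
        "yql": (
            "select * from sources * where "
            f"{{targetHits: {limit}}}nearestNeighbor(embedding, queryEmbedding)"
        ),
        "ranking": "semantic",  # assumed profile name
        "hits": limit,
        # The tensor is passed under the new name rather than query(q).
        "input.query(queryEmbedding)": query_embedding,
    }


body = build_semantic_query_body([0.1, 0.2, 0.3], limit=5)
print(body["yql"])
```

Both the YQL operator and the `input.query(...)` key reference the tensor name, so a rename has to be applied in both places.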

fix bm25 & keyword search

ef5caf6
