ANN: Neural Search #6707
williamstein announced in Announcements
"Neural AI Search" is now live in CoCalc. Refresh your browser. The application right now is minimal compared to what it could be; I just want to get the backend foundations in place, and have content start getting indexed, before building a bunch of new frontend capabilities on top of this. Right now the only thing you can do is open the Find page in a project, click "Neural Search" off to the right, and do a search in that directory. It searches only Jupyter notebooks, task lists, chats, whiteboards, and slides that you have had open for at least 7.5 seconds since I made this live a few minutes ago, and it updates the backend search index as you edit them.
The potential with this is extensive, and this is just a VERY tiny step. E.g., the underlying machinery could work across many projects whether or not they are running, and of course it would also be extremely useful to search only within a specific file (like this chat). This also provides the foundation for making ChatGPT, when you interact with it, aware of content across your files and of relevant technical documentation (e.g., current SageMath docs instead of those from 2021).
Technical Architectural Remarks
The basic thing seems to work fine, and the design I finally settled on (after numerous painful iterations this week!) uses git-and-sync-like trickery to be, I think, very efficient and robust, at the expense of an $\varepsilon$ chance of a wrong answer (which hardly matters for search).
In admin settings there is a new box:
When this is "no", everything is disabled, including all backend APIs and frontend UI. When set to "yes", a person can put in the address and API key of a Qdrant server, e.g., from https://cloud.qdrant.io/ or one they run themselves, and then they automatically get neural network search working. This involves three tables:
Yes, this is all available in cocalc-docker.
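To make the overall flow concrete, here is a toy sketch of the embed-index-search loop the announcement describes. Everything in it is illustrative: `embed()` is a fake deterministic hash-based embedding (NOT a neural model), and `ToyIndex` is a stand-in for a Qdrant collection; the point is only the shape of the pipeline: embed content, upsert it with a payload, then search by vector similarity.

```python
# Toy sketch of the neural-search flow: embed -> index -> search.
# embed() is a FAKE embedding for illustration, not a real model.
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    # Deterministic stand-in: hash the text into `dim` floats, then normalize.
    h = hashlib.sha256(text.encode()).digest()
    v = [b / 255 for b in h[:dim]]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

class ToyIndex:
    """Minimal in-memory analogue of a vector collection."""
    def __init__(self):
        self.points = {}  # id -> (vector, payload)

    def upsert(self, point_id, text, payload):
        self.points[point_id] = (embed(text), payload)

    def search(self, query, limit=3):
        qv = embed(query)
        scored = sorted(
            ((cosine(qv, v), pid, meta) for pid, (v, meta) in self.points.items()),
            key=lambda t: t[0],
            reverse=True,
        )
        return scored[:limit]

index = ToyIndex()
index.upsert(1, "plot a sine wave in sagemath", {"path": "notebook.ipynb"})
index.upsert(2, "grocery list: milk, eggs", {"path": "tasks.tasks"})
hits = index.search("plot a sine wave in sagemath", limit=1)
print(hits[0][2]["path"])  # the exact-match document ranks first
```

In the real system the upserts would happen as you edit indexed documents, and the search would run server-side against Qdrant.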
The "robust" part of the design is that if you delete any data from any subset of the above tables, things just keep humming along fine: there is no hard dependence between them. Delete some of the cache and we just pay more (and things are a little slower); delete some of the vector database and you just get fewer search results. This is very different from my original design, which tightly coupled Qdrant and Postgres in such a way that it was very easy for one to break the other.
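The "no hard dependence" property can be sketched with a tiny cache that always falls through to recomputation on a miss; this is an illustration of the design idea, not CoCalc's actual code, and `fake_embed` stands in for a real (paid) embedding call.

```python
# Sketch of the decoupling idea: deleting cached data is always safe,
# because the lookup path never assumes the cache is complete.
class EmbeddingCache:
    def __init__(self):
        self._cache = {}
        self.misses = 0  # each miss models a paid recomputation

    def get_embedding(self, text, compute):
        # Fall through to `compute` on a miss; losing cache rows only
        # costs time and money, never correctness.
        if text not in self._cache:
            self.misses += 1
            self._cache[text] = compute(text)
        return self._cache[text]

cache = EmbeddingCache()
fake_embed = lambda t: [float(len(t))]  # stand-in for a real model call

cache.get_embedding("hello", fake_embed)
cache.get_embedding("hello", fake_embed)  # cache hit: no recompute
del cache._cache["hello"]                 # simulate deleting cached data
cache.get_embedding("hello", fake_embed)  # still works, just one more miss
print(cache.misses)  # 2 misses total: initial, then after deletion
```

The same pattern applies to the vector store itself: a deleted point simply stops appearing in results until the document is next edited and re-indexed.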
The data model for Qdrant uses a lot of techniques to ensure security and limited data access (similar to what we do with PostgreSQL), which is fairly easy to do with Qdrant but NOT with more basic vector databases. It also wouldn't have worked with Qdrant back in Nov 2022, since they have improved a lot recently.
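One way to picture the limited-data-access model is payload-based filtering, which Qdrant supports natively: every stored point carries ownership metadata, and every search is constrained to the caller's scope, so one user's query can never surface another user's vectors. This is a hypothetical pure-Python illustration of the pattern, not the actual schema.

```python
# Hypothetical sketch of payload-filtered vector search: points carry an
# account_id payload and only matching points are even scored.
def filtered_search(points, query_vector, account_id, limit=5):
    """points: list of (vector, payload) pairs; only payloads whose
    account_id matches the caller are considered at all."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    candidates = [
        (dot(query_vector, vec), payload)
        for vec, payload in points
        if payload["account_id"] == account_id
    ]
    candidates.sort(key=lambda t: t[0], reverse=True)
    return candidates[:limit]

points = [
    ([1.0, 0.0], {"account_id": "alice", "text": "alice's note"}),
    ([1.0, 0.0], {"account_id": "bob",   "text": "bob's note"}),
]
hits = filtered_search(points, [1.0, 0.0], "alice")
print([p["text"] for _, p in hits])  # only Alice's content is reachable
```

Because the filter is applied inside the database rather than in client code, a bug in the frontend cannot widen a user's access.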
The final piece of this whole puzzle is that for cocalc.com we run Qdrant itself in our Kubernetes cluster, with regular snapshots that we back up. Qdrant's design is very much NOT a resource hog: it's written in tight, memory-efficient Rust, and uses quantization to massively reduce the space needed to store vectors, so I think it will scale pretty well for us.
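The space savings from quantization can be seen with a back-of-envelope sketch of scalar quantization, roughly the idea behind Qdrant's int8 quantization: map each float32 component to a single byte, cutting vector storage about 4x for a small accuracy cost. This is an illustration of the concept, not Qdrant's implementation.

```python
# Scalar quantization sketch: float components -> one byte each (0..255).
def quantize(vec):
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0
    q = [round((x - lo) / scale) for x in vec]  # each entry fits in a byte
    return q, lo, scale

def dequantize(q, lo, scale):
    return [lo + x * scale for x in q]

vec = [0.12, -0.5, 0.33, 0.99]
q, lo, scale = quantize(vec)
approx = dequantize(q, lo, scale)
max_err = max(abs(a - b) for a, b in zip(vec, approx))
print(max_err)  # tiny reconstruction error
# Storage: 4 bytes per float32 component vs 1 byte per quantized
# component, i.e. roughly a 4x reduction.
```

The rounding error is bounded by half the quantization step, which for typical normalized embedding vectors barely moves similarity rankings.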
There's also the potential of providing this vector search capability via our API on a "pay for what you use" basis, and that could be of interest as its own product, since I developed a way to have a large number of independent, organized vector databases that are "multi-tenant", so the per-user cost is excellent. It's something to explore for "cocalc.ai", since it could be useful to a lot of people. It's actually already available (for free), just not documented.