
[bug]: Use shared embeddings for group data rather than mass duplicates #49

Closed
ga-it opened this issue Jun 7, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@ga-it

ga-it commented Jun 7, 2024

Describe the bug
Context chat allows questions without a specified file or folder context. To the extent that this answer is produced from existing embeddings / vector database, what is the risk that this answer will include information from a context the user does not have access to?

Showing referenced source documents does not mitigate this: if the references include sources the user does not have access to, the user can simply follow the link through to the document. And if contents from those documents are included in the answer itself, this clearly breaches security.

For example, if there is a group folder with salary information that is only accessible to finance staff, but embeddings are generated from it and then available for general query, this would breach folder security.

Further, if this is handled correctly, are changes to a user's permissions retroactively applied to previously generated embeddings?

I am uncertain if this is possible, but cannot see any documentation as to how this is handled.

To Reproduce
Steps to reproduce the behavior:

  1. Go to AI Assistant, Context Chat
  2. Leave "Selective context" unselected
  3. Under "Input" ask a question regarding information the user does not have access to.

Expected behavior
Context Chat answers should be based on a combination of the LLM used by Context Chat and only those embeddings that are consistent with the user's Nextcloud permissions. Technically I am unsure how this would be accomplished, but conceptually imagine that a user's permission matrix could be applied to the embeddings to block answers drawn from content they do not have access to.

One embeddings database with permissions applied seems to be what is outlined here (another application, not Nextcloud, by way of example):

https://www.osohq.com/post/authorizing-llm
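The permission-matrix idea above could be sketched as an answer-time filter: chunks retrieved from a shared vector store are checked against the querying user's access list before they ever reach the LLM. All names and structures below are hypothetical, for illustration only, not the Context Chat API:

```python
# Hypothetical sketch: filter retrieved chunks by the querying user's
# Nextcloud permissions before building the LLM prompt.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source_path: str
    allowed_users: set = field(default_factory=set)


def authorized_chunks(retrieved, user_id):
    """Drop any chunk the user cannot read in Nextcloud."""
    return [c for c in retrieved if user_id in c.allowed_users]


chunks = [
    Chunk("Q3 salaries...", "/GroupFolders/Finance/salaries.xlsx", {"alice"}),
    Chunk("Office hours...", "/GroupFolders/Public/handbook.md", {"alice", "bob"}),
]

# bob must never see finance data in an answer:
context_for_bob = authorized_chunks(chunks, "bob")
```

With such a filter, a question asked without "Selective context" could only ever be answered from chunks the user is allowed to read.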

Setup Details (please complete the following information):

  • Nextcloud Version: [28.0.5]
  • AppAPI Version: [2.6.0]
  • Context Chat PHP Version [8.2.18]
  • Context Chat Backend Version [2.1.0]
  • Nextcloud deployment method: [Nextcloud AIO]
  • Context Chat Backend deployment method: [manual remote server]


@ga-it ga-it added the bug Something isn't working label Jun 7, 2024
@marcelklehr
Member

marcelklehr commented Jun 7, 2024

Groupfolders' user access lists are respected when crawling for a user's data. What might not work yet, is adding a user to a groupfolder after the users files have been crawled as well as the other way around.

@ga-it
Author

ga-it commented Jun 7, 2024

Thanks @marcelklehr - does that imply that the embeddings databases are separately created for each user?

I am unsure of the overhead of that design on a large corporate installation with TBs of data (e.g. multiple embeddings databases and multiple crawls).

For that reason and changes in permissions, do you think the linked approach may work better - i.e. one embeddings database with permissions applied dynamically per user context?

@marcelklehr
Member

does that imply that the embeddings databases are separately created for each user?

Yes, that's how it is at the moment. We are aware that this is not ideal.

@ga-it
Author

ga-it commented Jun 7, 2024

Is a change on the roadmap or should I file the above as a feature request?

@marcelklehr
Member

It's on the roadmap, yes :)

@ga-it
Author

ga-it commented Jun 19, 2024

I cannot find a way to log this as a feature request on this repo.

As per chat with @marcelklehr

As I understand it, context chat currently creates embeddings for each user for the files they have access to.

The Context Chat backend should instead create one set of embeddings and overlay permissions on the LLM/RAG query.

We have a huge number of shared files in group folders - Nextcloud is our document management system - TBs.

If I understand the embedding process correctly, this multiplies the embedding work by the number of users with access to each file.

Our GPUs (RTX 4090) have been fully occupied for weeks by the Context Chat backend in its crawl process.

A further impact is that, as permissions change, the embeddings can fall out of sync with user permissions.

One set of content embeddings should be created and kept up to date for content in the Nextcloud instance.

A separate permissions matrix should be maintained and applied to LLM/RAG queries.

By way of example from another project (not Nextcloud), one embeddings database with permissions applied is outlined here:

https://www.osohq.com/post/authorizing-llm
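A minimal sketch of the proposed split, with purely hypothetical names: embeddings are stored once per document, while a separate ACL table is cheap to update on share changes and is consulted at query time, so a permission change never triggers re-embedding:

```python
# Illustrative only: one shared embeddings store plus a separate
# permissions matrix, instead of per-user embedding databases.
embeddings = {            # doc_id -> vector; indexed once, never per user
    "doc1": [0.1, 0.2],
    "doc2": [0.3, 0.4],
}
acl = {                   # doc_id -> principals with access; cheap to update
    "doc1": {"finance-team"},
    "doc2": {"finance-team", "everyone"},
}


def grant(doc_id, principal):
    acl.setdefault(doc_id, set()).add(principal)


def revoke(doc_id, principal):
    acl.get(doc_id, set()).discard(principal)


def visible_docs(principal):
    """Resolve which documents a principal may query, at query time."""
    return [d for d in embeddings if principal in acl.get(d, set())]


# A permission change touches only the ACL; the embeddings stay put.
revoke("doc2", "everyone")
```

Under this model the crawl cost is proportional to the corpus size, not corpus size times users.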

I understand this is on the roadmap.

The urgency for us - our servers and GPUs have been running for weeks non-stop on crawling and embedding current files.

If possible, when this is implemented provide a migration utilising current embeddings to a shared embeddings model to prevent yet another crawl!

@ga-it ga-it changed the title [bug]: How are group folder permissions handled in context chat answers? [bug]: Use shared embeddings for group data rather than mass duplicates Jun 19, 2024
@github-actions github-actions bot added the stale label Jul 20, 2024
@nextcloud nextcloud deleted a comment from github-actions bot Jul 20, 2024
@marcelklehr marcelklehr removed the stale label Jul 20, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@kyteinsky
Contributor

And for that reason I made the following point "If possible, when this is implemented provide a migration utilising current embeddings to a shared embeddings model to prevent yet another crawl!"

We planned to do this in the case of a schema change, but then we also changed the embedding model to support more languages than English, so it is almost the same as starting fresh; a db migration would just introduce points of failure.

On group folders containing TBs of documents and with many users, the crawl time projections are months with associated GPU and energy requirements.

Fulltextsearch has a similar issue - nextcloud/fulltextsearch#878

It seems critical that the architecture for both incorporates some sort of abstraction layer between the data and the users, such as role-based security through which users would access embeddings and indexes. This would also have the benefit that, on changes to permissions, no recrawl would be needed: user access would be turned off immediately by changing roles in the abstraction layer. Take that advice as from the layman from whom it comes, but the immediate prospect is that we lose months of embedding time for this change, and for any future change to the embedding model, which has a cost and functionality hit.

The upcoming feature would do access control on files with a per-document user list: each document would have a set of users associated with it who have access to it. If shares are made or revoked, only the user list changes. All documents are indexed only once.
We have not planned group-based access controls for the near future.
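The per-document user list described above could look like the following sketch (illustrative only, not the Context Chat implementation): each document is embedded once, carries its user list as metadata, and retrieval filters on that list, so a revoked share only edits metadata:

```python
# Minimal metadata-filtered vector search over a shared index.
# All data and field names are hypothetical.
import math

index = [
    {"id": "d1", "vec": [1.0, 0.0], "users": {"alice"}},
    {"id": "d2", "vec": [0.0, 1.0], "users": {"alice", "bob"}},
]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def search(query_vec, user, k=3):
    """Rank only the documents whose user list includes the querier."""
    candidates = [d for d in index if user in d["users"]]
    return sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]),
                  reverse=True)[:k]


# Revoking a share only mutates metadata; the embedding stays in place.
index[1]["users"].discard("bob")
```

This is why, in such a design, permission changes need no re-embedding at all.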

Fundamentally, even with an abstraction layer, future changes to the embedder are likely to cause the same issue. But with the duplicated embedding process, the cost is multiplied by the number of users with access to the group folders.

We always try to make the upgrade frictionless, but this time it was inevitable, so we're trying to rip the band-aid off at once and do all the breaking changes in the upcoming major release.
FWIW, indexing all the documents would now take (total time previously taken / no. of users). Alternatively, you can use "Selective context" for your queries.

@ga-it
Author

ga-it commented Oct 30, 2024

Thanks @kyteinsky

Do you have an ETA on that release?

@kyteinsky
Contributor

No, sorry, but it may take a few weeks.

@kyteinsky
Contributor

Fixed with the latest release.
