Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

245: add image comments index and image comments filter #267

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

lessej
Copy link
Collaborator

@lessej lessej commented Sep 17, 2024

Context

Implements tnc-ca-geo/animl-frontend#245

Add comments filter so users can filter images by the content of comments on the image. Implemented using the $search $text operator in MongoDB. Note: the $search filter needs to be part of the first $match of the pipeline.

https://www.mongodb.com/docs/manual/reference/operator/query/text/#mongodb-query-op.-text

Fixes
Noticed the LSP was showing a warning that await had no effect on MongoPaging.aggregate

Related PRs

animl-frontend

@lessej lessej added enhancement New feature or request In Progress PR is still in progress and should not be merged labels Sep 17, 2024
@lessej lessej removed the In Progress PR is still in progress and should not be merged label Sep 18, 2024
Copy link
Member

@nathanielrindlaub nathanielrindlaub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JesseLeung97, save that one comment I have in there about awaiting the Mongo aggregation, this code looks good and it's all behaving well in my tests. That said I started to dig into the docs a bit and I'm a bit concerned about adding a text search index in prod (it seems like it's really expensive and can impact write speed).

I should have done more homework before setting you off on this, but it got me curious about Atlas Search (we do use MongoDB Atlas, so that would be an option). It sounds far more robust and offers a lot more functionality than Mongo DB's $text search, but I'm also concerned and unsure of what the RAM implications are for the Lucene indexes it manages. I'm also concerned about locking us into Atlas Search as it's tied to Atlas and would prevent moving to a self-managed or local MongoDB instance down the road if that was ever desirable. It also just might be overkill for what we need.

At any rate, I think we should pause on this and meet to talk about this and some of the other tasks you're looking to take on. I'd be curious if you have an opinion on Atlas Search and whether you took a look at it while implementing the $text search. Here's one blog post that compares the difference but I'm sure there are more.

I'll reach out via email and try to get a meeting set up.

@@ -124,7 +124,7 @@ export class ImageModel {
hasNext: false,
} as AggregationOutput<ImageSchema>;
}
return await MongoPaging.aggregate(Image.collection, {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JesseLeung97, can you elaborate on the warning you were seeing and where it was coming from? It looks like the MongoPaging.aggregate() function is asynchronous, and this would make sense as it's a DB query, so I do think awaiting it is important.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what my lsp is saying.

Screenshot 2024-10-08 at 22 17 48

Here's the source code. It seems weird that it would be synchronous but the types.d.ts file isn't declaring it async.

// src/@types/mongo-cursor-pagination.d.ts

  const defaultExport: {
    mongoosePlugin: any;
    aggregate: <T extends Model>(
      model: T['schema'],
      input: AggregationInput,
    ) => AggregationOutput<T>;
  };

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I think that defaultExport.aggregate return type needs to be wrapped in a Promise type. mongo-cursor-pagination doesn't include Typescript types so we wrote these ones ourselves and this must have been an oversight on our part. I think the following should resolve that warning:

  const defaultExport: {
    mongoosePlugin: any;
    aggregate: <T extends Model>(
      model: T['schema'],
      input: AggregationInput,
    ) => Promise<AggregationOutput<T>>;
  };

@nathanielrindlaub
Copy link
Member

Perhaps not surprisingly, MongoDB pushes for using Atlas Search over $text and $regex in most cases unless you're self-hosting your DB. Their decision framework is helpful, although probably a bit biased.

Having read that I'm now leaning towards Atlas Search slightly because it does more performant and scalable:

When you add a database index in MongoDB, you should consider tradeoffs to write performance in cases where database write performance is important. Search Indexes don’t degrade cluster write performance.

and...

If you have lots of documents, your queries will linearly get slower. In Atlas Search, the inverted index enables fast document retrieval at very large scales.

The main downsides are (a) getting ourselves locked into Atlas and (b) not being able to create Atlas Search Indexes from code (e.g. w/ Mongoose) or infra as code. I'm not totally positive on this last point but I think you have to create indexes via the MongoDB Atlas user interface. That said, our MongoDB Atlas setup is already outside of our Serverless/Cloudformation config files so anyone trying to spin up Animl on their own would have to manually muck around with the AWS console and MongoDB Atlas app anyhow.

There also seem to be some indexing pitfalls with Atlas Search we should be aware of and try to avoid. Setting alerts would be a good idea too.

@lessej
Copy link
Collaborator Author

lessej commented Oct 9, 2024

I think I ended up reading a bunch of the same resources as you listed (docs, decision tree, some stack overflow threads). The text index stores an entry per word in the comment in RAM which may get expensive depending on the volume of comments you expect to see in the database. Looking at the metrics for the past two months, it seems like on average, the instances are using 25% - 40% of their configured 2gb of memory. I think there's space now for an index but I am also concerned because it seems like it will grow endlessly as more comments are added. The deciding factor is whether this growth would force us to scale the instances in 2 months or 2 years.

Like you mentioned, Atlas indexes to my understanding are far more feature-rich and scalable because they're hosted separately to the DB instances (on lucene instances?). Some things that would be nice to implement if we go with an atlas index are fuzzy finding and auto completion for searches. It's probably overkill to be honest, but because we already pay for Atlas, I think that's less of a factor in the decision.

I took a quick look around and I found some IAC examples of creating Atlas indexes using CDK or Terraform
https://github.com/mongodb/mongodbatlas-cloudformation-resources/blob/master/examples/search-index/searchIndex.json
It also seems like you can configure the index using JSON from within the Atlas UI so the config could at least be version controlled. In the worst case, there is an API available so we could hack together an IAC script.
https://www.mongodb.com/docs/atlas/reference/api-resources-spec/v2/#tag/Atlas-Search/operation/listAtlasSearchIndexesCluster

Overall, I am also leaning towards an Atlas index. For other teams who are considering using Animl but don't want to use Atlas, we could provide a small doc about how to update it to use the standard text index instead.

@nathanielrindlaub
Copy link
Member

@JesseLeung97 thanks for doing some reading on Atlas vs text search and weighing in; I think let's move forward with the Atlas indexes and see how it goes.

Good find with the IAC templates. We want to move away from Serverless to something else in the near future so this is great to flag. For now though, I like the idea of just using the Atlas UI with a JSON config for the index and committing it somewhere. In general I need to write up some documentation for standing up new Animl instances so I can better track the pieces that live outside of Serverless and steps to set them up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants