Demo / lightning talk for plankton image data flow #8

Closed
metazool opened this issue Jul 9, 2024 · 4 comments
Labels
documentation Improvements or additions to documentation

Comments

Collaborator

metazool commented Jul 9, 2024

Updated the issue title to reflect that this has grown some extra dimensions! Come back here after some shared discussion and outline what we'd like to show.

The work in #5 and #6 serves as a proof of concept of minimal-effort approaches to learning from image collections without undertaking model training or costly labelling. But it sits at the edge of what's meant to be a deeper investigation of pipelines and workflows that can apply to related projects, most immediately AMI-system. This Discussion on DataLabs computer vision needs for a combined physical sample / imaging field site shows likely demand.

Putting together a short show-and-tell / demo that can be presented to the Environmental Data Science group and the research group is a nice motivator to draw a line under the low-hanging ML parts and shift focus to architecture choices and cross-project common ground:

  1. model choice and overview
  2. image similarity search by vector embeddings
  3. unsupervised clustering approaches to the above

Of these, 2. needs expanding a bit to become more visually interesting and to probe for areas where the approach is weak. We haven't tried 3. at all; it got lost in the wash between pipeline/workflow #9 on the one hand and experimental model choice #10 on the other, but it should be quick to try (DBSCAN etc.).
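For anyone picking this up, item 2 boils down to a nearest-neighbour lookup over embedding vectors. A minimal sketch in plain Python, assuming cosine similarity as the metric; the three-dimensional vectors and file names are made up for illustration, and real embeddings would be far higher-dimensional:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, k=2):
    """Return the k (image_id, score) pairs most similar to the query vector."""
    scored = [
        (image_id, cosine_similarity(query, vector))
        for image_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings keyed by image name
embeddings = {
    "copepod_01.tif": [0.9, 0.1, 0.0],
    "copepod_02.tif": [0.8, 0.2, 0.1],
    "detritus_01.tif": [0.0, 0.1, 0.9],
}

results = most_similar([1.0, 0.0, 0.0], embeddings, k=2)
print([image_id for image_id, _ in results])
```

A vector database does the same ranking with an index instead of a full scan, so this is only useful for reasoning about what the search returns, not as a replacement.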

See also the section on transfer learning / feature extraction in this workshop paper:
https://aslopubs.onlinelibrary.wiley.com/doi/full/10.1002/lno.12101#lno12101-sec-0025-title

Collaborator Author

metazool commented Aug 5, 2024

Did a small rendering of k-means clustering of the plankton embeddings, which had visually similar outcomes to the similarity search. This is on the clustering_visualisation branch.
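For reference, a from-scratch sketch of what k-means does to the embeddings (Lloyd's algorithm). This is not the code on the branch; a library implementation such as scikit-learn's `KMeans` would be the usual choice. The farthest-first initialisation is a simplification to keep the sketch deterministic, and the two-dimensional points are made up:

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    # Farthest-first initialisation: start from the first point, then
    # repeatedly add the point farthest from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical toy "embeddings": two well-separated groups
points = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Unlike the similarity search, k has to be chosen up front, which is worth flagging when showing the output to researchers.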

It's outgrowing a notebook; wondering if streamlit is the right fit for this rather than shifting to JavaScript. @matthewcoole's demo of retrieval augmented generation document search has similar components (including chromadb): https://github.com/NERC-CEH/embeddings_app/ - either repurpose this or borrow from it.

Focus of this is to show naively-minimal output to plankton researchers and enlist their help either in finding flaws, or in working out which path is actually useful to them. Should be quite timeboxed, ideally no more than a day, max 2...

Collaborator Author

metazool commented Aug 5, 2024

Note to self that embeddings_app assumes some data that's generated by methods in discoverability.

This shows use of UMAP to do dimensionality reduction on embeddings, which is probably worth trying in the notebook to see if it helps DBSCAN not to see everything as noise.
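One way to see why DBSCAN labels everything as noise: a point only counts as a core point if at least `min_samples` points (itself included, in scikit-learn's convention) lie within `eps` of it. If `eps` is smaller than the typical distance to the nearest few neighbours, there are no core points and the whole set comes out as noise. A rough, dependency-free sketch for checking this on a sample of embeddings; the helper names and toy points are hypothetical:

```python
import math


def kth_neighbour_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    distances = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(others[k - 1])
    return sorted(distances)


def count_core_points(points, eps, min_samples):
    """Count points with at least min_samples points (itself included) within eps."""
    core = 0
    for p in points:
        within = sum(1 for q in points if math.dist(p, q) <= eps)
        if within >= min_samples:
            core += 1
    return core


# Hypothetical toy data: a tight square of four points plus one outlier
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]

print(kth_neighbour_distances(points, k=2))
print(count_core_points(points, eps=0.5, min_samples=3))  # eps too small
print(count_core_points(points, eps=1.0, min_samples=3))  # eps at the k-distance
```

Running the same check on the embeddings before and after the UMAP step would show whether the reduction actually makes a usable `eps` easier to find.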

Collaborator Author

metazool commented Aug 8, 2024

Another note to self that while it's not necessary now, the next visit to this should involve:

  • ease of pointing to a different image collection (it's already all driven from chromadb, which uses URLs of objects in s3 as identifiers)
  • ease of pointing to a collection of different embeddings for the same image sources (whether that's BioCLIP or the more recent model the Turing Inst folks are releasing with the paper from @noushineftekhari...)
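One possible shape for this, sketched under assumptions rather than taken from the current code: treat the image source and the embedding model as two independent settings and derive the collection name from both, so that switching either one means pointing at a different collection. `EmbeddingStoreConfig`, its fields, and the naming scheme are all hypothetical:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingStoreConfig:
    """Hypothetical settings: which images, embedded by which model."""

    image_collection: str  # a label for the image source, e.g. a bucket or prefix
    model_name: str  # the model that produced the embeddings, e.g. "BioCLIP"

    def collection_name(self) -> str:
        """Combine both settings into one name, keeping only conservative characters."""
        raw = f"{self.image_collection}-{self.model_name}".lower()
        return re.sub(r"[^a-z0-9_-]+", "_", raw)


config = EmbeddingStoreConfig(image_collection="plankton", model_name="BioCLIP")
print(config.collection_name())
```

The resulting string could then be passed wherever the collection name is currently hard-coded, keeping the s3 URLs as identifiers unchanged.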

metazool added the documentation label on Sep 11, 2024
metazool changed the title from "Demo / lightning talk for plankton feature search" to "Demo / lightning talk for plankton image data flow" on Sep 11, 2024
metazool moved this from Todo to Done in Plankton data pipelines on Dec 2, 2024
metazool closed this as completed by moving to Done in Plankton data pipelines on Dec 2, 2024
Collaborator Author

metazool commented Dec 2, 2024

There's a very quick walkthrough of the current state in the presentation directory now. I've not closed this though, because we'll keep extending it (and re-run it for a less research, more infrastructure audience).
