Can this work with clusters made by top2vec? #20
Comments
If I understand correctly, the columns of that DataFrame are the dimensions of the embedding and the rows are the individual documents. Assuming you have different versions of the clustering for different numbers of clusters, those can be used as labels.
If that means that the resulting cluster centers are not the mean/median of the values, then yes, it does have an influence. If you know the centers, you can use from_centers. If you can provide some minimal example, I can try to work it out. Also note that both from_data and from_centers need such labels.
I have indeed the cluster centers and will try to use the from_centers method. I think I could construct the cluster centers dictionary easily. Let's assume I have cluster centers which have 10 dimensions, and 1, 2 and 3 clusters. So the cluster centers dictionary should map each number of clusters to its list of centers, correct?
But I cannot see what the labels should be.
Assuming you have a similar option in top2vec to retrieve which cluster each document was assigned to, those are the labels.
The cluster centers dict above looks alright. You may just need to wrap each into a numpy array to get something like this:

centers = {
1: np.array([[0, 0]]),
2: np.array([[-1, -1], [1, 1]]),
3: np.array([[-1, -1], [1, 1], [0, 0]]),
}
By "label" you mean "which cluster" ? "In the situation of 2 clusters, observation_1 was in cluster 1, observation 2 in cluster 0, observation 3 in cluster 1" So the table has one row for each observation, correct ? |
Yes, precisely.
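Putting this together, a minimal sketch of how such centers and labels could be passed to clustergram. It assumes a recent clustergram release that provides a from_centers constructor and a pca_weighted plotting option; the exact names should be checked against the installed version:

import numpy as np
import pandas as pd
from clustergram import Clustergram

# cluster centers keyed by the number of clusters (as in the example above)
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}

# one row per observation, one column per number of clusters; each value is
# the cluster the observation belongs to in that clustering solution
labels = pd.DataFrame({
    1: [0, 0, 0],
    2: [0, 1, 1],
    3: [0, 1, 2],
})

cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)  # PCA-weighted plots would also need the original data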
Ok, I will give it a try. I use top2vec to cluster 55000 documents. The initial run of top2vec created 401 clusters, which I can "reduce to any size", which I would then do step by step and go from 401 to 0. So my labels table would be big: 55000 * 401. Do you think it makes any sense to create a clustergram as big as this?
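A rough sketch of how such a labels table might be assembled. The top2vec calls used here (hierarchical_topic_reduction, get_documents_topics) are assumptions based on its documented API, and the loop bounds are illustrative; both should be adjusted to the installed version and the actual number of topics:

import numpy as np
import pandas as pd
from top2vec import Top2Vec

documents = [...]  # the 55000 abstracts
model = Top2Vec(documents)
doc_ids = np.arange(len(documents))  # default integer ids when none were supplied

solutions = {}
for k in range(2, 401):  # every reduced number of topics to inspect
    model.hierarchical_topic_reduction(k)
    # get_documents_topics is assumed to return the topic number of each document first
    topic_nums, _, _, _ = model.get_documents_topics(doc_ids, reduced=True)
    solutions[k] = topic_nums

labels = pd.DataFrame(solutions, index=doc_ids)  # 55000 rows, one column per k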
Clustergram itself should deal with it, but keep in mind that you'll need to be able to interpret it. The new interactive exploration can help you with that, but still, that is a lot of options to look at. Is there really no assumption about the data? I.e., you normally know whether you're looking for 5, 25 or 150 clusters.
As we deal with text (55000 scientific paper abstracts) and word/paragraph vectors, we frankly don't have a clue how many clusters to expect. top2vec does something sensible and chooses a certain number of topics automatically by some internal criteria.
In that case, I'd suggest trying to get the maximum from …
@behrica did you manage to make it work, by any chance?
@behrica I have the same goal as you, but I'm using BERTopic... I would be interested in seeing what you did if you managed to use it.
@doubianimehdi Can you share a reproducible example of your problem, so I could try playing with that and figure out the solution?
I have an hdbscan clustering with the cluster information, but I don't know how to use it in clustergram...
If you can share the code and some sample data so I can reproduce what you're doing, I can have a look at the way of using the result within a clustergram. You can check this guide on how to prepare such an example - https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
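For reference, a sketch of one way HDBSCAN results could be wired into clustergram: run HDBSCAN with several settings, collect each observation's label per run, and let from_data compute the centers. The data here is random stand-in data, and the exact from_data signature should be checked against the clustergram documentation:

import hdbscan
import numpy as np
import pandas as pd
from clustergram import Clustergram

# stand-in for the real embeddings (e.g. the vectors BERTopic clusters); replace with your data
data = pd.DataFrame(np.random.default_rng(0).normal(size=(300, 5)))

solutions = {}
for min_size in (50, 25, 10, 5):  # coarser to finer clusterings
    lab = hdbscan.HDBSCAN(min_cluster_size=min_size).fit_predict(data)
    # HDBSCAN marks noise as -1, which clustergram will treat as just another cluster;
    # runs yielding the same number of clusters overwrite each other in this toy dict
    solutions[len(np.unique(lab))] = lab  # key by the number of clusters found

labels = pd.DataFrame(solutions, index=data.index)

# from_data computes each cluster center as the mean (or median) of its members
cgram = Clustergram.from_data(data, labels)
cgram.plot()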
@martinfleis @doubianimehdi In the meantime we found an implementation of a metric, so I did not explore the usage of clustergram further.
Thanks for your interesting package.
Do you think Clustergram could work with top2vec ?
https://github.com/ddangelov/Top2Vec
I saw that there is the option to create a clustergram from a DataFrame.
In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256 for example.
So I could indeed generate a data frame, like this:
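(The shapes and values below are made up for illustration: one row per document, one column per embedding dimension.)

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
doc_vectors = pd.DataFrame(
    rng.normal(size=(55000, 256)),  # 55000 documents, 256-dimensional embeddings
    columns=[f"dim_{i}" for i in range(256)],
)
doc_vectors.index.name = "document"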
Does Clustergram assume anything about the rows of this data frame?
I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.
With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?
top2vec also calculates the "topic vectors" as a mean of the "document vectors", I believe.
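A minimal sketch of what calling from_data could look like once the labels of each clustering run are collected into a DataFrame. The labels here are random placeholders, and the keyword names follow the clustergram documentation for from_data and should be verified against the installed version:

import numpy as np
import pandas as pd
from clustergram import Clustergram

rng = np.random.default_rng(0)
doc_vectors = pd.DataFrame(rng.normal(size=(100, 256)))  # toy document embeddings

# toy labels: one column per number of clusters, one row per document
labels = pd.DataFrame({k: rng.integers(0, k, size=100) for k in range(1, 6)})

# the centers are the per-cluster arithmetic mean (or median) of doc_vectors,
# i.e. computed in Euclidean space rather than with cosine distance
cgram = Clustergram.from_data(doc_vectors, labels, method="mean")
cgram.plot()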