Can this work with clusters made by top2vec? #20
Comments
If I understand correctly, the columns of that DataFrame are the dimensions of the embedding and the rows are the individual documents. Assuming you have different versions of the clustering for different numbers of clusters, those can be used as labels.
If that means that the resulting cluster centers are not the mean/median of the values, then yes, it does have an influence. If you know the centers, you can use from_centers. If you can provide some minimal example, I can try to work it out. Also note that both from_data and from_centers need such labels.
I have indeed the cluster centers and will try to use the from_centers method. I think I could construct the cluster centers dictionary easily. Let's assume I have cluster centers which have 10 dimensions, and 1, 2 and 3 clusters. So the cluster centers dictionary should map each number of clusters to its list of centers, correct?
But I cannot see what the labels should be.
Assuming you have a similar option in top2vec to retrieve which cluster each document was assigned to, those are the labels.
The cluster centers dict above looks alright. You may just need to wrap each into a numpy array to get something like this:

centers = {
1: np.array([[0, 0]]),
2: np.array([[-1, -1], [1, 1]]),
3: np.array([[-1, -1], [1, 1], [0, 0]]),
}
By "label" you mean "which cluster" ? "In the situation of 2 clusters, observation_1 was in cluster 1, observation 2 in cluster 0, observation 3 in cluster 1" So the table has one row for each observation, correct ? |
Yes, precisely.
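Putting this together, a minimal sketch of how such centers and labels could be passed to clustergram. It assumes a recent clustergram release that provides a from_centers constructor and a pca_weighted plotting option; the exact names should be checked against the installed version:

import numpy as np
import pandas as pd
from clustergram import Clustergram

# cluster centers keyed by the number of clusters (as in the example above)
centers = {
    1: np.array([[0, 0]]),
    2: np.array([[-1, -1], [1, 1]]),
    3: np.array([[-1, -1], [1, 1], [0, 0]]),
}

# one row per observation, one column per number of clusters; each value is
# the cluster the observation belongs to in that clustering solution
labels = pd.DataFrame({
    1: [0, 0, 0],
    2: [0, 1, 1],
    3: [0, 1, 2],
})

cgram = Clustergram.from_centers(centers, labels)
cgram.plot(pca_weighted=False)  # PCA-weighted plots would also need the original data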
Ok, I will give it a try. I use top2vec to cluster 55000 documents. The initial run of top2vec created 401 clusters, which I can "reduce to any size", which I would then do step by step and go from 401 to 0. So my labels table would be big: 55000 * 401. Do you think it makes any sense to create a clustergram as big as this?
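A rough sketch of how such a labels table might be assembled. The top2vec calls used here (hierarchical_topic_reduction, get_documents_topics) are assumptions based on its documented API, and the loop bounds are illustrative; both should be adjusted to the installed version and the actual number of topics:

import numpy as np
import pandas as pd
from top2vec import Top2Vec

documents = [...]  # the 55000 abstracts
model = Top2Vec(documents)
doc_ids = np.arange(len(documents))  # default integer ids when none were supplied

solutions = {}
for k in range(2, 401):  # every reduced number of topics to inspect
    model.hierarchical_topic_reduction(k)
    # get_documents_topics is assumed to return the topic number of each document first
    topic_nums, _, _, _ = model.get_documents_topics(doc_ids, reduced=True)
    solutions[k] = topic_nums

labels = pd.DataFrame(solutions, index=doc_ids)  # 55000 rows, one column per k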
Clustergram itself should deal with it, but keep in mind that you'll need to be able to interpret it. The new interactive exploration can help you with that, but still, that is a lot of options to look at. Is there really no assumption about the data? I.e., you normally know whether you're looking for 5, 25 or 150 clusters.
As we deal with text (55000 scientific paper abstracts) and word/paragraph vectors, we frankly don't have a clue how many clusters to expect. top2vec does something sensible and chooses a certain number of topics automatically by some internal criteria.
In that case, I'd suggest trying to get the maximum from …
@behrica did you manage to make it work, by any chance?
@behrica I have the same goal as you, but I'm using BERTopic... I would be interested in seeing what you did if you managed to use it.
@doubianimehdi Can you share a reproducible example of your problem, so I could try playing with that and figure out the solution?
I have an hdbscan clustering with the cluster information, but I don't know how to use it in clustergram...
If you can share the code and some sample data so I can reproduce what you're doing, I can have a look at the way of using the result within a clustergram. You can check this guide on how to prepare such an example - https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
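For reference, a sketch of one way HDBSCAN results could be wired into clustergram: run HDBSCAN with several settings, collect each observation's label per run, and let from_data compute the centers. The data here is random stand-in data, and the exact from_data signature should be checked against the clustergram documentation:

import hdbscan
import numpy as np
import pandas as pd
from clustergram import Clustergram

# stand-in for the real embeddings (e.g. the vectors BERTopic clusters); replace with your data
data = pd.DataFrame(np.random.default_rng(0).normal(size=(300, 5)))

solutions = {}
for min_size in (50, 25, 10, 5):  # coarser to finer clusterings
    lab = hdbscan.HDBSCAN(min_cluster_size=min_size).fit_predict(data)
    # HDBSCAN marks noise as -1, which clustergram will treat as just another cluster;
    # runs yielding the same number of clusters overwrite each other in this toy dict
    solutions[len(np.unique(lab))] = lab  # key by the number of clusters found

labels = pd.DataFrame(solutions, index=data.index)

# from_data computes each cluster center as the mean (or median) of its members
cgram = Clustergram.from_data(data, labels)
cgram.plot()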
@martinfleis @doubianimehdi In the meantime we found an implementation of a metric, so I did not explore the usage of clustergram further.
Thanks for your interesting package.
Do you think Clustergram could work with top2vec ?
https://github.com/ddangelov/Top2Vec
I saw that there is the option to create a clustergram from a DataFrame.
In top2vec, each "document" to cluster is represented as an embedding of a certain dimension, 256 for example.
So I could indeed generate a data frame, like this:
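(The shapes and values below are made up for illustration: one row per document, one column per embedding dimension.)

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
doc_vectors = pd.DataFrame(
    rng.normal(size=(55000, 256)),  # 55000 documents, 256-dimensional embeddings
    columns=[f"dim_{i}" for i in range(256)],
)
doc_vectors.index.name = "document"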
Does Clustergram assume anything about the rows of this data frame?
I saw that the from_data method takes either "mean" or "median" as the method to calculate the cluster centers.
With word vectors, we typically use the cosine distance to calculate distances between the vectors. Does this have any influence?
top2vec also calculates the "topic vectors" as a mean of the "document vectors", I believe.
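A minimal sketch of what calling from_data could look like once the labels of each clustering run are collected into a DataFrame. The labels here are random placeholders, and the keyword names follow the clustergram documentation for from_data and should be verified against the installed version:

import numpy as np
import pandas as pd
from clustergram import Clustergram

rng = np.random.default_rng(0)
doc_vectors = pd.DataFrame(rng.normal(size=(100, 256)))  # toy document embeddings

# toy labels: one column per number of clusters, one row per document
labels = pd.DataFrame({k: rng.integers(0, k, size=100) for k in range(1, 6)})

# the centers are the per-cluster arithmetic mean (or median) of doc_vectors,
# i.e. computed in Euclidean space rather than with cosine distance
cgram = Clustergram.from_data(doc_vectors, labels, method="mean")
cgram.plot()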