State of the cluster module #1105
-
Hey Max et al., sorry for my late reply.

First, to your scaling question: well, this is kinda tricky. For metric data, I agree it makes sense to normalize the data. The question then would be: how can we achieve this? For benchmarking datasets it is possible, as we can iterate over all data points and then calculate the mean and sd. However, in theory, streams are unbounded, and it is not feasible to iterate over all data points. Here we have to find some kind of heuristic to adjust the scaling over time. I am currently unaware of a concrete approach, but we should consider different options here. For clustering algorithms working on text data (e.g., tf-idf representations), I usually do not scale the vectors, as the cosine similarity already normalizes the vector length during comparison.

As for the creation of test suites, I 100% agree. Tbh, when we implemented the algorithms, we just used super simple, manually crafted test cases to see if they worked as intended. I am unfamiliar with the creation of unit tests, so I am a bit lost here (implying that I would need to learn this skill).

Clustering is indeed a super exciting topic. In contrast to supervised models, which can always be evaluated with performance metrics such as F1 or accuracy, evaluation is a little more complicated for unsupervised approaches. Of course, external evaluation metrics (e.g., the Rand index) are available, but they are only feasible if the underlying cluster membership is known, which is usually not the case in real-world settings. On the other hand, I think that this lack of labels is a blessing for "practitioners", as these algorithms can be directly applied to ANY information stream without the need for an annotated corpus. This advantage is what we should highlight within this module: show that we can hook it up to any input source we like (Twitter, Reddit, Twitch, etc.) and get some (more or less) meaningful clusters, which can and should then be analyzed in more detail (the human-in-the-loop part that you often have with unsupervised approaches).

Anyway, just my two cents on this and on how I think we should "advertise" the clustering module with all its pros and cons.
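The "heuristic to adjust the scaling over time" mentioned above can be sketched with running statistics: Welford's algorithm maintains a per-feature mean and variance incrementally, so an unbounded stream never needs a second pass. This is only an illustrative sketch (the class name and `learn_one`/`transform_one` method names mimic River's transformer conventions, but this is not the module's actual implementation):

```python
import math

class RunningStandardScaler:
    """Standardize features on an unbounded stream using a running
    mean/variance (Welford's algorithm). Illustrative sketch only."""

    def __init__(self):
        self.n = {}      # observation count per feature
        self.mean = {}   # running mean per feature
        self.m2 = {}     # running sum of squared deviations per feature

    def learn_one(self, x):
        # Welford update: numerically stable, one pass, O(1) per sample.
        for key, value in x.items():
            n = self.n.get(key, 0) + 1
            mean = self.mean.get(key, 0.0)
            delta = value - mean
            mean += delta / n
            self.n[key] = n
            self.mean[key] = mean
            self.m2[key] = self.m2.get(key, 0.0) + delta * (value - mean)
        return self

    def transform_one(self, x):
        out = {}
        for key, value in x.items():
            n = self.n.get(key, 0)
            m2 = self.m2.get(key, 0.0)
            # Fall back to std=1 until we have enough data to estimate it.
            std = math.sqrt(m2 / n) if n > 1 and m2 > 0 else 1.0
            out[key] = (value - self.mean.get(key, 0.0)) / std
        return out
```

A drifting stream would additionally need some forgetting mechanism (e.g., exponentially weighted statistics instead of plain counts), which is exactly the open design question raised above.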
-
Dear @Dennis1989 and @MaxHalford, first of all, Happy New Year! I am really sorry that I haven't been able to give you any updates sooner: I was away for most of December and then fell ill late in the month, only recovering a few days ago. After a discussion with Albert, my supervisor, regarding the continuation of the work, we have agreed that I will be working on the following points to improve the current state of the module:
This is the preliminary plan for now, but I believe there are many more things we can add to it. What do you think?
-
Hey @hoanganhngo610 and @Dennis1989 👋
I've just spent some time looking at the cluster module. I've never done a lot of clustering, so I guess you can consider me like a new user. The first thing I have to admit is that it's not easy to get started with the module. Apart from the k-means algorithm, I struggled to get a decent performance with all the other methods. In particular, I noticed the algorithms are very sensitive to the scale of the data. It helps a lot if the data is standard scaled. Yet, this is not documented anywhere. Neither is there any scaling done in the docstring examples. In your experience, is scaling an important component of clustering?
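To make the scale sensitivity concrete: distance-based clusterers compare points with something like Euclidean distance, so a feature measured in large units dominates the comparison, and standard scaling restores the balance. A minimal plain-Python illustration (the feature names and the mean/std values are made up for the example):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two dict-encoded points."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

# Income is in dollars, age in years: the raw distance is driven
# almost entirely by income, even though both features differ.
p = {"age": 25, "income": 40_000}
q = {"age": 55, "income": 42_000}
raw = euclidean(p, q)  # ~2000: the 30-year age gap is invisible

# Standard scaling with assumed population stats (mean, std per feature).
stats = {"age": (40, 10), "income": (41_000, 5_000)}

def scale(x):
    return {k: (v - stats[k][0]) / stats[k][1] for k, v in x.items()}

scaled = euclidean(scale(p), scale(q))  # ~3: age now contributes most
```

Any clusterer fed the raw points would effectively cluster on income alone, which is likely why the docstring examples behave poorly without scaling.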
Another point is that I don't feel comfortable refactoring the code as it stands. The issue is that we don't have unit tests set up for clustering models. Therefore, if I change something in the code, I don't know whether I broke something or not. For classification and regression models, we have several unit tests which are run on each model. It would be great to have this for clustering too. Is this something you have already done in the past? How did you check that your implementations were correct? I find this to be a difficult topic for clustering.
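Even without ground-truth labels, a generic "estimator check" can assert API invariants that every online clusterer should satisfy. A hedged sketch of what such a test could look like (the `check_clusterer` helper and the toy model are hypothetical, not River's actual test suite):

```python
def check_clusterer(model, stream, n_clusters):
    """Sanity checks one could run on any online clusterer:
    illustrative names and checks, not River's real test utilities."""
    for x in stream:
        out = model.learn_one(x)
        assert out is model, "learn_one should return self for chaining"
        label = model.predict_one(x)
        assert isinstance(label, int), "cluster labels should be ints"
        assert 0 <= label < n_clusters, "label must be a valid cluster index"

# A toy nearest-centroid clusterer with fixed centroids, just to show
# the checks running end to end.
class ToyClusterer:
    def __init__(self, centroids):
        self.centroids = centroids

    def learn_one(self, x):
        return self  # stateless toy; a real model would update centroids here

    def predict_one(self, x):
        dists = [sum((x[k] - c[k]) ** 2 for k in x) for c in self.centroids]
        return dists.index(min(dists))

model = ToyClusterer([{"x": 0.0}, {"x": 10.0}])
check_clusterer(model, [{"x": 1.0}, {"x": 9.0}], n_clusters=2)
```

Checks like these catch exactly the kind of regressions a refactor introduces (broken return values, invalid labels), without needing an external evaluation metric.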
Regarding TextClust, I have to admit I'm having a hard time diving into the code to refactor it. There are several small issues, such as the fact that `learn_one` doesn't return `self`, that it doesn't work if `t` isn't provided, that `print`s are used in the code, etc. Apart from me allocating more time, it would help a lot to have some tests/benchmarks available to work against.

Finally, I think we can really improve the documentation. Online clustering is a really interesting topic, and yet I feel we could largely improve how we present things. We seem to have great algorithms, but there is not a lot of documentation to get started.
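The API issues listed for TextClust all have mechanical fixes. A hedged sketch of what the `learn_one` signature could look like after the refactor (skeleton only; the class name is made up here and the real TextClust internals are far more involved):

```python
import itertools
import logging

logger = logging.getLogger(__name__)

class TextClustSketch:
    """Illustrative skeleton only: demonstrates the three API fixes
    (return self, optional t, logging instead of print), not the
    actual TextClust algorithm."""

    def __init__(self):
        self._timestep = itertools.count()  # fallback clock when t is missing

    def learn_one(self, x, t=None):
        if t is None:
            t = next(self._timestep)  # don't fail when t isn't provided
        logger.debug("processing document at t=%s", t)  # replaces print()
        # ... micro-cluster update logic would go here ...
        return self  # enables model.learn_one(a).learn_one(b) chaining
```

Returning `self` and swapping `print` for the `logging` module are purely behavioral-preserving changes, so they are safe to make even before a fuller test suite exists.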
Other than this, as authors of the cluster module, I wanted to ask: do you have any future plans for it?