State of the cluster module #1105
-
Hey Max et al., sorry for my late reply.

First, to your scaling question: well, this is kinda tricky. For metric data, I agree it makes sense to normalize the data. The question then would be: how can we achieve this? For benchmarking datasets it is possible, as we can iterate over all data points and then calculate the mean and sd. However, in theory, streams are unbounded, and it is not feasible to iterate over all data points. Here we have to find some kind of heuristic to adjust the scaling over time. I am currently unaware of a concrete approach, but we should consider different options here. For clustering algorithms working on text data (e.g., tf-idf representations), I usually do not scale the vectors, as the cosine similarity already normalizes the vector length during comparison.

As for the creation of test suites, I 100% agree. Tbh, when we implemented the algorithms, we just used super simple, manually crafted test cases to see if they worked as intended. I am unfamiliar with the creation of unit tests, so I am a bit lost here (implying that I would need to learn this skill).

Clustering is indeed a super exciting topic. In contrast to supervised models, which can always be evaluated with performance metrics such as F1 or accuracy, evaluation is a little more complicated for unsupervised approaches. Of course, external evaluation metrics (e.g., the Rand index) are available, but they are only feasible if the underlying cluster membership is known, which is usually not the case in real-world settings. On the other hand, I think that this lack of labels is a blessing for "practitioners", as these algorithms can be directly applied to ANY information stream without the need for an annotated corpus. This advantage is what we should highlight within this module: show that we can hook it up to any input source we like (Twitter, Reddit, Twitch, etc.) and get some (more or less) meaningful clusters, which can and should then be analyzed in more detail (the human-in-the-loop part that you often have with unsupervised approaches).

Anyway, just my two cents on this and on how I think we should "advertise" the clustering module with all its pros and cons.
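The "heuristic to adjust the scaling over time" mentioned above can be sketched with running statistics: Welford's algorithm maintains a per-feature mean and variance incrementally, so an unbounded stream never needs a second pass. This is only an illustrative sketch (the class name and `learn_one`/`transform_one` method names mimic River's transformer conventions, but this is not the module's actual implementation):

```python
import math

class RunningStandardScaler:
    """Standardize features on an unbounded stream using a running
    mean/variance (Welford's algorithm). Illustrative sketch only."""

    def __init__(self):
        self.n = {}      # observation count per feature
        self.mean = {}   # running mean per feature
        self.m2 = {}     # running sum of squared deviations per feature

    def learn_one(self, x):
        # Welford update: numerically stable, one pass, O(1) per sample.
        for key, value in x.items():
            n = self.n.get(key, 0) + 1
            mean = self.mean.get(key, 0.0)
            delta = value - mean
            mean += delta / n
            self.n[key] = n
            self.mean[key] = mean
            self.m2[key] = self.m2.get(key, 0.0) + delta * (value - mean)
        return self

    def transform_one(self, x):
        out = {}
        for key, value in x.items():
            n = self.n.get(key, 0)
            m2 = self.m2.get(key, 0.0)
            # Fall back to std=1 until we have enough data to estimate it.
            std = math.sqrt(m2 / n) if n > 1 and m2 > 0 else 1.0
            out[key] = (value - self.mean.get(key, 0.0)) / std
        return out
```

A drifting stream would additionally need some forgetting mechanism (e.g., exponentially weighted statistics instead of plain counts), which is exactly the open design question raised above.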
-
Dear @Dennis1989 and @MaxHalford, first of all, Happy New Year! I am really sorry that I haven't been able to give you any updates sooner: I was away for most of December and then fell ill late in the month, only recovering a few days ago. After a discussion with Albert, my supervisor, regarding the continuation of the work, we have agreed that I will be working on the following points to improve the current state of the module:
This is the preliminary plan for now, but I believe there are many more things we can add to it. What do you think?
-
Hey @hoanganhngo610 and @Dennis1989 👋
I've just spent some time looking at the cluster module. I've never done a lot of clustering, so I guess you can consider me like a new user. The first thing I have to admit is that it's not easy to get started with the module. Apart from the k-means algorithm, I struggled to get a decent performance with all the other methods. In particular, I noticed the algorithms are very sensitive to the scale of the data. It helps a lot if the data is standard scaled. Yet, this is not documented anywhere. Neither is there any scaling done in the docstring examples. In your experience, is scaling an important component of clustering?
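To make the scale sensitivity concrete: distance-based clusterers compare points with something like Euclidean distance, so a feature measured in large units dominates the comparison, and standard scaling restores the balance. A minimal plain-Python illustration (the feature names and the mean/std values are made up for the example):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two dict-encoded points."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

# Income is in dollars, age in years: the raw distance is driven
# almost entirely by income, even though both features differ.
p = {"age": 25, "income": 40_000}
q = {"age": 55, "income": 42_000}
raw = euclidean(p, q)  # ~2000: the 30-year age gap is invisible

# Standard scaling with assumed population stats (mean, std per feature).
stats = {"age": (40, 10), "income": (41_000, 5_000)}

def scale(x):
    return {k: (v - stats[k][0]) / stats[k][1] for k, v in x.items()}

scaled = euclidean(scale(p), scale(q))  # ~3: age now contributes most
```

Any clusterer fed the raw points would effectively cluster on income alone, which is likely why the docstring examples behave poorly without scaling.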
Another point is that I don't feel comfortable refactoring the code as it stands. The issue is that we don't have unit tests set up for clustering models. Therefore, if I change something in the code, I don't know whether I broke something or not. For classification and regression models, we have several unit tests which are run on each model. It would be great to have this for clustering too. Is this something you have already done in the past? How did you check that your implementations were correct? I find this to be a difficult topic for clustering.
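Even without ground-truth labels, a generic "estimator check" can assert API invariants that every online clusterer should satisfy. A hedged sketch of what such a test could look like (the `check_clusterer` helper and the toy model are hypothetical, not River's actual test suite):

```python
def check_clusterer(model, stream, n_clusters):
    """Sanity checks one could run on any online clusterer:
    illustrative names and checks, not River's real test utilities."""
    for x in stream:
        out = model.learn_one(x)
        assert out is model, "learn_one should return self for chaining"
        label = model.predict_one(x)
        assert isinstance(label, int), "cluster labels should be ints"
        assert 0 <= label < n_clusters, "label must be a valid cluster index"

# A toy nearest-centroid clusterer with fixed centroids, just to show
# the checks running end to end.
class ToyClusterer:
    def __init__(self, centroids):
        self.centroids = centroids

    def learn_one(self, x):
        return self  # stateless toy; a real model would update centroids here

    def predict_one(self, x):
        dists = [sum((x[k] - c[k]) ** 2 for k in x) for c in self.centroids]
        return dists.index(min(dists))

model = ToyClusterer([{"x": 0.0}, {"x": 10.0}])
check_clusterer(model, [{"x": 1.0}, {"x": 9.0}], n_clusters=2)
```

Checks like these catch exactly the kind of regressions a refactor introduces (broken return values, invalid labels), without needing an external evaluation metric.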
Regarding TextClust, I have to admit I'm having a hard time diving into the code to refactor it. There are several small issues, such as the fact that `learn_one` doesn't return `self`, that it doesn't work if `t` isn't provided, that `print`s are used in the code, etc. Apart from me allocating more time, it would help a lot to have some tests/benchmarks available to work against.

Finally, I think we can really improve the documentation. Online clustering is a really interesting topic, and yet I feel we could largely improve how we present things. We seem to have great algorithms, but there is not a lot of documentation to get started.
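The API issues listed for TextClust all have mechanical fixes. A hedged sketch of what the `learn_one` signature could look like after the refactor (skeleton only; the class name is made up here and the real TextClust internals are far more involved):

```python
import itertools
import logging

logger = logging.getLogger(__name__)

class TextClustSketch:
    """Illustrative skeleton only: demonstrates the three API fixes
    (return self, optional t, logging instead of print), not the
    actual TextClust algorithm."""

    def __init__(self):
        self._timestep = itertools.count()  # fallback clock when t is missing

    def learn_one(self, x, t=None):
        if t is None:
            t = next(self._timestep)  # don't fail when t isn't provided
        logger.debug("processing document at t=%s", t)  # replaces print()
        # ... micro-cluster update logic would go here ...
        return self  # enables model.learn_one(a).learn_one(b) chaining
```

Returning `self` and swapping `print` for the `logging` module are purely behavioral-preserving changes, so they are safe to make even before a fuller test suite exists.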
Other than this, as authors of the cluster module, I wanted to ask: do you have any future plans for it?