Consider how the internal data structure (an RB tree of centroids) grows with N and M, where

- N = number of input values
- M = range of values
If you look at the implementation or the paper, you'll see that, at a simplified level, an incoming data point that lands close enough to an existing centroid is absorbed into it (its weight is incremented); otherwise a new centroid is created, possibly followed by merging two existing centroids.
So basically the number of centroids scales with M, not with N. Given that M is fixed, you have roughly constant storage; the data points themselves are never stored individually.
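As a rough illustration of that absorb-or-create-then-merge idea, here is a toy sketch (this is not the actual t-digest rule, which sizes and merges centroids based on quantiles and a compression parameter; the names `SimpleDigest`, `add`, `threshold`, and `max_centroids` are all invented for this example):

```python
import bisect
import random
from dataclasses import dataclass, field


@dataclass
class Centroid:
    mean: float
    weight: int = 1


@dataclass
class SimpleDigest:
    """Toy digest: absorb points into a nearby centroid, else create a new one,
    and merge the closest pair whenever the centroid count exceeds a budget."""
    threshold: float = 1.0      # "close enough" distance for absorbing a point
    max_centroids: int = 100    # storage budget, independent of N
    centroids: list = field(default_factory=list)  # kept sorted by mean

    def add(self, x: float) -> None:
        means = [c.mean for c in self.centroids]
        i = bisect.bisect_left(means, x)
        # The nearest centroid is one of the (at most two) neighbours of x.
        nearest = min(
            self.centroids[max(i - 1, 0):i + 1],
            key=lambda c: abs(c.mean - x),
            default=None,
        )
        if nearest is not None and abs(nearest.mean - x) <= self.threshold:
            # Absorb: bump the weight and nudge the running mean -- no new storage.
            nearest.weight += 1
            nearest.mean += (x - nearest.mean) / nearest.weight
        else:
            # Create a new centroid at x ...
            self.centroids.insert(i, Centroid(mean=x))
            # ... and if that exceeds the budget, merge the two closest neighbours.
            if len(self.centroids) > self.max_centroids:
                j = min(
                    range(len(self.centroids) - 1),
                    key=lambda k: self.centroids[k + 1].mean - self.centroids[k].mean,
                )
                a, b = self.centroids[j], self.centroids.pop(j + 1)
                total = a.weight + b.weight
                a.mean = (a.mean * a.weight + b.mean * b.weight) / total
                a.weight = total


# 200,000 points drawn from a fixed range: N keeps growing, storage does not.
d = SimpleDigest(threshold=0.5, max_centroids=100)
for _ in range(200_000):
    d.add(random.uniform(0, 1000))
print(len(d.centroids))  # at most 100, regardless of N
```

The point of the sketch is only that adding a point either updates an existing centroid in place or triggers a bounded amount of new storage, so memory depends on the centroid budget rather than on N.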
You can always run a quick test to confirm your suspicions about whether a data structure holds up for large inputs ;)
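For a concrete check, something along these lines (a sketch against the Python `tdigest` package; `digest.C`, used here to count centroids, is an assumption about this implementation's internals and may need adjusting for your version):

```python
# Feed the digest an ever-growing stream drawn from a fixed range and watch
# whether the number of centroids keeps pace with N or levels off.
# NOTE: `digest.C` (the centroid tree) is an assumption about the library's
# internals -- substitute whatever attribute your version uses for centroids.
import random

from tdigest import TDigest

digest = TDigest()
n = 0
for target in (10_000, 100_000, 1_000_000):
    while n < target:
        digest.update(random.uniform(0, 1000))  # values confined to a fixed range M
        n += 1
    print(f"N = {n:>9,}  centroids = {len(digest.C)}")
```

If the centroid count levels off while N grows by orders of magnitude, storage is unlikely to be what stops you at 500 million values; the per-point cost of `update()` is the more likely bottleneck.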
It's also generally a safe bet that online algorithms (those that process streams of data) are pursued precisely because storage that scales with input size becomes a problem at scale, so they usually come with explicit space guarantees.
Does this apply to big data, e.g. 500 million values?