deduplicate function is stuck for a long time when applied to large dataset #839

kailashsp · 2023-11-25T11:58:57Z

I have been trying out the deduplicate feature. It works fine when I apply it to a subset of 100 items, but when I apply it to the whole dataset of 32,000 items it gets stuck. I have tried changing n_jobs, with no success.

Vincent-Maladiere (Maintainer) · 2023-11-27T10:34:02Z

Hey @kailashsp, thank you for opening this discussion!

The deduplicate function performs hierarchical clustering with scipy.cluster.hierarchy.linkage, whose time complexity is at least $O(N^2)$ in the number of items $N$ (and $O(N^3)$ for the centroid and median linkage methods), which probably explains the slowness you observe.
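To get a feel for the quadratic growth, here is a rough back-of-the-envelope sketch. It assumes that all 32,000 entries are distinct and that pairwise distances are stored as 64-bit floats in a condensed distance matrix; the helper names are made up for illustration and are not part of skrub or SciPy.

```python
def n_pairwise_distances(n_items):
    """Number of entries in a condensed distance matrix: n * (n - 1) / 2."""
    return n_items * (n_items - 1) // 2


def condensed_matrix_gb(n_items, bytes_per_value=8):
    """Approximate size in GB, assuming float64 distances (an assumption)."""
    return n_pairwise_distances(n_items) * bytes_per_value / 1e9


for n in (100, 32_000):
    print(f"N={n}: {n_pairwise_distances(n):,} distances, "
          f"about {condensed_matrix_gb(n):.4f} GB")
```

Going from 100 to 32,000 items multiplies the number of pairwise distances by roughly 100,000, so a run that feels instant on the subset can look stuck on the full data.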

In the long term, we want to address this using blocking methods or probabilistic data structures.

Before that, could you share a snippet of the code you used and a sample of the data? That would help us debug this.
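For illustration only, the blocking idea mentioned above can be sketched as follows. This is not how skrub implements or plans to implement it; the blocking key (first character, lowercased) and the helper names are assumptions picked to keep the example small.

```python
from collections import defaultdict


def block_by_key(values, key=lambda s: s[:1].lower()):
    """Group values into blocks that share the same (hypothetical) blocking key."""
    blocks = defaultdict(list)
    for value in values:
        blocks[key(value)].append(value)
    return dict(blocks)


def n_pairs(n):
    """Number of pairwise comparisons among n items."""
    return n * (n - 1) // 2


words = ["london", "londn", "lodnon", "paris", "pariss", "berlin"]
blocks = block_by_key(words)

# Comparing every word with every other word.
pairs_without_blocking = n_pairs(len(words))
# Comparing words only within their own block.
pairs_with_blocking = sum(n_pairs(len(block)) for block in blocks.values())

print(blocks)
print(pairs_without_blocking, pairs_with_blocking)
```

Clustering would then run separately inside each block. The trade-off is that two misspellings that differ in the blocking key itself (here, the first character) land in different blocks and are never compared.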
