[BUG]: Leiden clustering numbering is off #4791

Open
abs51295 opened this issue Nov 27, 2024 · 11 comments · May be fixed by #4845
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments

@abs51295

Version

24.12

Which installation method(s) does this occur on?

No response

Describe the bug.

Hey,

We ran Leiden clustering on our dataset after the recent fix #4730 and found that it randomly skips some cluster numbers. I wonder if it's just a labeling issue and not a problem with the algorithm itself.
[screenshot: cluster labels with gaps in the numbering]

Minimum reproducible example

Relevant log output

Environment details

Other/Misc.

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
abs51295 added the ? - Needs Triage and bug labels on Nov 27, 2024
@ChuckHastings
Collaborator

I believe this is a label numbering problem and not a clustering problem. Leiden is a hierarchical clustering algorithm. At each level we combine clusters, and each cluster is numbered by one of the vertices in the cluster. That assignment is arbitrary (based upon when the algorithm decides to move things). For example, if vertex 10 is being evaluated and it is determined it should merge into cluster 5, then it will be assigned to cluster 5. Cluster 10 would then be empty.

The vertices/clusters are renumbered when we move to a new level of the hierarchy, but I don't believe we renumber them otherwise.
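
A toy sketch of that effect (illustrative only, not cuGraph's internals): if each cluster is labeled by a representative vertex ID, then a cluster that gets merged away leaves its label unused, so the final set of labels has gaps.

# Illustrative sketch only, not cuGraph's implementation: clusters are labeled
# by a representative vertex ID, so a merged-away cluster leaves a gap.
assignment = {0: 0, 1: 0, 2: 5, 3: 5, 10: 5}  # vertex -> cluster label
labels = sorted(set(assignment.values()))
print(labels)  # [0, 5] -- the labels in between are simply unused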

@abs51295
Author

abs51295 commented Dec 3, 2024

Thanks @ChuckHastings for clearing this up. I wonder, though, why I wasn't able to see this in the previous version of RAPIDS (24.10); maybe it has something to do with the recent update.

@ChuckHastings
Collaborator

The PR you referenced that fixed Leiden corrected a bug where the convergence criterion was wrong and was causing the algorithm to abort early. It is likely that this bug was masking the effect.

Do you have a small example that you can share where this is occurring? I can try and recreate to get a better understanding of what you're seeing.

@abs51295
Author

abs51295 commented Dec 5, 2024

Hey Chuck,

Here's the file for the adjacency matrix: https://cedars.box.com/s/4mg82y2u0m77pi8c4i9izt3yzx52xq1c. I just use this function (from rapids-singlecell) to get a weighted graph

import warnings

import numpy as np
import cudf


def _create_graph(adjacency, use_weights=True):
    from cugraph import Graph

    # Build an edge list from the sparse adjacency matrix
    sources, targets = adjacency.nonzero()
    weights = adjacency[sources, targets]
    if isinstance(weights, np.matrix):
        weights = weights.A1
    df = cudf.DataFrame({"source": sources, "destination": targets, "weights": weights})
    g = Graph()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        if use_weights:
            g.from_cudf_edgelist(
                df, source="source", destination="destination", weight="weights"
            )
        else:
            g.from_cudf_edgelist(df, source="source", destination="destination")
    return g

and then run leiden using the following:

from cugraph import leiden as culeiden

leiden_parts, _ = culeiden(
    g,
    resolution=1,
    random_state=0,
    max_iter=100,
)

which generates the following output, where cluster number 18 is skipped:
[screenshot: sorted unique partition labels with 18 missing]
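
A quick way to see which labels are missing (a small sketch, assuming the leiden_parts result from the call above):

import numpy as np
import cudf

# Assumes `leiden_parts` is the DataFrame returned by culeiden() above.
labels = leiden_parts["partition"].drop_duplicates().sort_values().reset_index(drop=True)
expected = cudf.Series(np.arange(int(labels.max()) + 1))
missing = expected[~expected.isin(labels)]
print(missing.to_pandas().tolist())  # e.g. [18] for the run shown above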

Thanks for all your help.

@rlratzel
Contributor

Update: @jnke2016 is also taking a look.

@jnke2016
Contributor

jnke2016 commented Jan 2, 2025

I am looking at the issue, which I was able to reproduce on a smaller dataset with 2676 vertices and 20480 edges:

import cudf
import cugraph
from cugraph import leiden as culeiden
from cugraph.testing.mg_utils import generate_edgelist_rmat
from cudf.testing.testing import assert_series_equal

if __name__ == "__main__":
    # num_vertices = 2676, num_edges = 20480
    scale = 12
    edgefactor = 5
    seed = 2

    df = generate_edgelist_rmat(
        scale=scale, edgefactor=edgefactor, seed=seed, unweighted=True, mg=False
    )

    G = cugraph.Graph(directed=False)
    G.from_cudf_edgelist(df, source="src", destination="dst")

    leiden_parts, _ = culeiden(
        G,
        resolution=1,
        random_state=0,
        max_iter=100,
    )

    partition_col = (
        leiden_parts["partition"]
        .drop_duplicates()
        .sort_values(ascending=True)
        .reset_index(drop=True)
    )

    # The indices are numbered from 0 to len(partition_col) - 1
    idx_col = cudf.Series(partition_col.index)

    assert_series_equal(partition_col, idx_col, check_dtype=False, check_names=False)

@abs51295
Author

abs51295 commented Jan 2, 2025

@jnke2016 Do you still think it's just a label numbering issue? We are using this algorithm actively and would like to know if this is a bug that can affect downstream analysis. Thanks and Happy New Year!

jnke2016 linked a pull request (#4845) on Jan 3, 2025 that will close this issue
@jnke2016
Contributor

jnke2016 commented Jan 3, 2025

@abs51295 thanks and Happy New Year to you too.

Do you still think it's just a label numbering issue?

Yes, it is indeed a numbering issue. In fact, we already have an internal utility function that relabels the cluster IDs, but it is unused (perhaps for performance reasons; @ChuckHastings or @naimnv can provide more details here). I tested it locally and it resolved the issue on a single GPU. However, I am still debugging the multi-GPU case.
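
In the meantime, a possible user-side workaround (a sketch only; this is not cuGraph's internal relabel_cluster_ids utility) is to compact the returned partition labels into a contiguous 0..k-1 range. Only the label values change, not the grouping:

# Sketch of a user-side workaround (not cuGraph's internal relabel_cluster_ids):
# factorize() maps each distinct partition label to a dense code in 0..k-1,
# so the grouping is unchanged and only the label values differ.
codes, _ = leiden_parts["partition"].factorize()
leiden_parts["partition"] = codes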

@abs51295
Author

abs51295 commented Jan 3, 2025

Thanks @jnke2016 for the confirmation. I know this is a different request, but can you give an example (possibly based on my code example) of how to run Leiden on multiple GPUs (ideally more than 2)?

@naimnv
Contributor

naimnv commented Jan 3, 2025

Yes, it looks like a renumbering issue.
@jnke2016 I looked into the code, and it looks like calling relabel_cluster_ids would resolve the issue.

@jnke2016
Contributor

jnke2016 commented Jan 4, 2025

but can you give an example (possibly based on my code example) of how to run Leiden on multiple GPUs (ideally more than 2)?

@abs51295 Sure. If you are looking to run Leiden on a single compute node with more than 2 GPUs, then with respect to the example you referred to, it is as simple as the script below. The PR I linked to this issue also fixes a multi-GPU bug, which requires each GPU to provide a different random_state when calling Leiden. If you have any questions regarding the example below, please let me know. For multi-node runs, we are currently updating our documentation with a utility script supporting that.

import warnings

import numpy as np
import cudf
import dask_cudf
from cugraph.dask import leiden as culeiden
from cugraph.dask.common.mg_utils import get_visible_devices
from cugraph.testing.mg_utils import (
    start_dask_client,
    stop_dask_client,
)


def _create_graph(adjacency, use_weights=True):
    from cugraph import Graph

    # Build an edge list from the sparse adjacency matrix
    sources, targets = adjacency.nonzero()
    weights = adjacency[sources, targets]
    if isinstance(weights, np.matrix):
        weights = weights.A1
    df = cudf.DataFrame({"source": sources, "destination": targets, "weights": weights})

    # Convert the GPU dataframe to a distributed GPU dataframe
    # (one partition per visible GPU)
    df = dask_cudf.from_cudf(df, len(get_visible_devices()))
    g = Graph()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        if use_weights:
            g.from_dask_cudf_edgelist(
                df, source="source", destination="destination", weight="weights"
            )
        else:
            g.from_dask_cudf_edgelist(df, source="source", destination="destination")
    return g


if __name__ == "__main__":
    setup_objs = start_dask_client()

    # Build the distributed graph from your adjacency matrix
    # (load `adjacency` the same way as in your single-GPU workflow)
    g = _create_graph(adjacency)

    leiden_parts, _ = culeiden(
        g,
        resolution=1,
        random_state=0,
        max_iter=100,
    )

    stop_dask_client(*setup_objs)
