Clustering: your friends tend to know each other, and we form triangles.
In real networks, we form triangle and also larger scale structures called communities.
Proteins form protein complexes to do certain job. Each complex is a molecular machine that does something.
- How can we define these communities?
- How can we identify them?
- What is the significance of these communities in network science?
It is a fundamental method and aspect of many networks. Many analysis start with finding communities.
k-means clustering
k-means clustering is a general data science algorithm.
Clustering is a basic and fundamental unsupervised learning method that gives useful insight into a data set.
Usually the communities correspond to some units in the network that bears some significance. Functional groups can be identified by structural communities. Network communities can serve as really basic analysis, but tend to correspond to important functional units (in bio system) or some important units. In social systems, network community often correspond to meaningful groups: data science program, our school, faculty group, etc.
Early ideas:
- graph cuts: with a graph, and a defined flow, which cut can you make which breaks the network into two parts while minimizing the sum of the weights destroyed. How to cut a network into two pieces in a nice way.
- hierarchical clustering: define some measure of similarity between nodes
- betweenness: measures how a node participates in shortest path based communication in the network
- Communities should be separable: "Let's find cuts"
- divisive method: break the network into parts
- Communities should be dense: "let's find dense regions"
- agglomerative method: start from some seed, and grow clusters
- originally for the layout of digital circuits
- assume almost equal size groups
Objective: to "break" network into separable groups
- so that they may be stored in a distributed database (for size and scale)
- minimize crosstalk: we don't want database nodes to have to 'talk' with each other all the time...
Approach:
- start from almost equal size groups (two groups), randomly assigned
- swap random pair of nodes
- see whether there is a decrease in the sum of weight that goes across the clusters
Goal:
- find the configuration where the sum of weight is minimized
- fewer edges between communities
betweenness measures how a node participates in shortest path based communication in the network
Edge Betweenness: similar notion, how many times does an edge participate in the shortest path in all pairs of nodes compared to the total number of shortest paths.
Edges connecting communities (groups, clusters) tend to have large edge betweenness because they are between the groups.
Approach:
- find the edge with maximum edge betweenness
- cut that edge
- calculate edge betweenness again
- repeat
- eventually the communities will become clear
Perfect communities: cliques (the most dense structure in a network)
clique: given a set of nodes, a clique is a subgraph that has every possible edge between them; every node is connected to every other node in the subgraph
Idea: just find cliques in the graph, and there are the communities
Problem: cliques are not really prevalent in networks, and they will not cover the whole graph.
Extension of the Idea: expand to more imperfect cliques, can simply generalize imperfect clique to graph density
Given a set of nodes, and edges, we can calculate graph density:
L = number of edges
n nodes, "n choose 2" possibilities of edges
Density can be used as a guide to measure how close the subgraph is to a clique.
Start from a single node as its own community. Expand by finding what's the aggregation that you can get the most dense structure.
One Approach:
distance or similarity measure of nodes, based on overlap of neighbors, then did hierarchical clustering
Hierarchical Clustering |
---|
The starting point of hierarchical clustering is a similarity matrix whose elements |
Once we have Two different approaches: |
The dendrogram visualizes the order in which the nodes are assigned to specific communities. |
dendrogram: a tree diagram, especially one showing taxonomic relationships.
When to stop? How good is it?
We need to be able to measure how good the partition is!
Modularity is the degree to which a system's components may be separated and recombined. The meaning of the word, however, can vary somewhat by context.
- In the study of networks, modularity (networks) is a benefit function that measures the quality of a division of a network into groups or communities.
Modularity is one of the most basic measure of community quality. It can be used to find communities, it can be used to measure communities.
sum over all pair of nodes,
This is comparing the ACTUAL number of edges inside a community [which is done by
We want to maximize the gap between the ACTUAL density and the EXPECTED density given a partition.
Modularity value largely depends on network size, the total number of edges. As networks become larger, we tend to find larger communities. It cannot be solved below certain resolution.
Modularity cannot find small communities in larger networks. This is not necessarily always a drawback, because you may just want a "rough description" of the network. But it seems counterintuitive that you cannot even find perfect clique (fully connected subgraph) when a network becomes large.
Sever degeneracies: there are many different clustering configurations / community assignments (i.e., structural differences) that have very similar (or the same) modularity values. There is no clear global optima.
There are many configurations where modularity is similarly high while the structure is very different. By the nature of the network structure it is very hard to find global optimum.
Philosophical and fundamental problem is Modularity and many other methods assume non-overlapping and disjoint community structures. While in society we often encounter highly overlapping communities.
One nice alternative method which helps overcome some of the issues of Modularity is the infomap method which uses the quantity called Map equation.
Similar to Modularity, it measures the quality of a given clustering. It uses information theory to measure the best number of bits needed to encode a territories of random walkers.
nice qualities:
- robustness against resolution limit
- as well as other good properties
- tends to perform really nicely
The basic idea of the infomap algorithm is to try to encode a random walkers trajectory.
The random walker's trajectory is recorded as a sequence of encoded words representing the node to node or teleportation traverse of the network.
When we have string communities, it is better to have a hierarchical representation. First we encode the community: the identity (e.g., color). Once within a community, then use the codebook for that community as long as it stays within the community. There are special codes that represent exit from a community, then you identify the next community being entered.
A node is rarely confined to a single community.
Tamas Vicsek et al, Uncovering the overlapping community structure in complex networks in nature and society (2005, Nature)
The clique percolation algorithm (CFinder) views a community as the union of overlapping cliques.
- two k-cliques are considered adjacent if they share
$k - 1$ nodes - A k-clique community is the largest connected subgraph obtained by the union of all adjacent k-cliques
- k-cliques that cannot be reached from a particular k-clique belong to other k-clique communities
CFinder |
---|
The CFinder algorithm identifies all cliques and then builds an |
While nodes often belong to multiple communities, links tend to be community specific. They capture the precise relationship which defines a node's membership in a community.
- where
$n_{+}(i)$ is the list of neighbors of node$i$ , including itself. -
$S$ measures the relative number of common neighbors$i$ and$j$ have. -
$S = 1$ if$i$ and$j$ have the same neighbors.
The key idea of link communities is pervasive overlap. And pervasive overlap means that the overlap between communities is pervasive and not assuming that communities are almost/essentially disjoint.
In modularity we strive to find the "cut" which divides the network and identifies the separate communities. But in link communities it is argued that this is not possible.
The consequence of this pervasive overlap is: if you look at local structure, it may be rather simple. But when you assemble all of the small local structures, it creates a big tangle. (Example: word association network.)
Link similarity is defined between two edges, and we consider only those edges which share a node (if they do not share a node, they are separate and are likely not to be very close). Next, count the neighbors of the nodes and compute the Jaccard Coefficient.
Jaccard Coefficient: intersection divided by union
With the link similarity -- a matrix of similarity measures between edges where every row and column represents an edge -- perform hierarchical clustering.
Partition Density - essentially graph density for a given community. Calculate the density: the number of edges within the subgraph divided by the number of possible edges. And subtract the number of edges in the minimally connected subgraph with the same number of nodes.
With partition density, for each link community we can calculate all of the densities.
Algorithm
- define the similarity between all of the individual edges
- construct hierarchical tree: dendrogram
- calc partition density for each level
- produces a curve
- find the maximal partition density cut to show the final community structure
group/community membership is naturally given by the edges
intentionally choose the simplest way to obtain link community (define similarity, do hierarchical clustering)
POV: relationships between nodes is more fundamental than the nodes, and is more useful discovering complex structure in networks
(Newman, M.E.J.)
A network is assortative if a significant fraction of the edges in the network run between vertices of the same type. A simple way to quantify assortativity would be to measure that fraction. But that does not prove to be a good measure because it would be
What is desired is the measure to be large in non-trivial cases, and small in trivial cases.
A good measure instead is: find the fraction of edges that run between vertices of the same type, and then subtract from that the fraction of edges we would expect to find if edges were positioned at random without regard for vertex type.
Some math brings us to the formula for Modularity: the measure of the extent to which like is connected to like in a network.
assortative |əˈsôrtətiv|
adjective [ attrib. ]
denoting or involving the preferential mating of animals or marrying of people with similar characteristics.
If network vertices with similar values of a scalar characteristic (particularly ordered characteristics, like age or income) tend to be connected more often than those with different values then the network is considered assortatively mixed according to that characteristic.
Example: if people are friends with others around the same age as them, then the network is assortatively mixed by age. Sometimes this is called stratified by age, which means the same thing.
stratify |ˈstratəˌfī|
verb (stratifies, stratifying, stratified) [ with obj. ] (usu. as adj. stratified)
form or arrange into strata: socially stratified cities | [ no obj. ] : the residues have begun to stratify.
- arrange or classify: stratifying patients into well-defined risk groups.
- place (seeds) close together in layers in moist sand or peat to preserve them or to help them germinate.
- [ no obj. ] (of seeds) be germinated by stratifying.
A special case of assortative mixing by scalar quantity, degree. In a network that shows assortative mixing by degree the high-degree vertices will be preferentially connected to other high-degree vertices, and the low to low.
In an assortative network where high-degree nodes tend to stick together, we expect a "clump" or core of such high-degree nodes in the network surrounded by a less dense periphery of nodes with lower degree. This core/periphery structure is a common feature of social networks, many of which are found to be assortatively mixed by degree.
If a network is disassortatively mixed by degree, then high-degree vertices tend to be connected to low-degree vertices. This creates star-like features in the network which are often readily visible.
Question | Answer |
---|---|
Betweenness centrality can be used to cut important bridge edges and detect communities | True It was one of the earliest method. It works when the network has clear community structure but tends to fail in more tricky cases. |
Which of the following principles aren't consistent with finding communities: | Cutting a network based on distance from hubs. |
Inspecting all possible community partitions of a network is a feasible method for determining the best community structure because it can give us the truly best community structure. | False |
Which of the following explains the modularity approach to community detection? | Find the partitions that maximize the distance between the actual and expected density. |
The modularity approach is an efficient algorithm that works well across all scales. It can finds very small to very large communities. | False While the algorithm is very efficient, it has a severe limitation in resolution that causes it to merge distinct communities below a detection threshold that will depend upon properties of the network. |
Which of the following describes the modularity best? | Modularity is the difference between the actual number of edges within communities and the expected number of edges within communities |
The resolution limit of modularity depends on what two components: | Size of network and connectivity between modules |
Modularity has no intrinsic scale. | False (see reading: Resolution limit in community detection) |
Modularity tends to do well on graphs with power-law distributed community sizes. | False |
Modularity is usually okay for detecting communities when they are moderately sized with respect to the network. | True |
Modularity tends to be good for detecting very small communities on large graphs. | False |
Which best describes how infomap works: | Finds community partitions that minimize the encoding scheme. |
Infomap can readily be used with directed and weighted graphs without throwing out information. | True |
Aside from just detecting communities infomap also _____ | Detects the flow of information between communities. |
In the K-clique method why are the cliques compared with those of a random graph? | To determine if the communities could have been generated by chance. |
How does Link clustering identify nodes that belong to multiple communities? | Nodes with links that belong to different communities have multiple membership. |
Link communities can produce hierarchical memberships from which a dendrogram can be derived. | True |
Which is not one of the basic ideas of the network community structure? | --> (1) The size of a community should be comparable to the size of the network Communities do not need to be larger than certain threshold. Almost no method assumes size threshold. (2) Communities should be cohesive (dense), with many internal connections (3) Communities should be separable from the others. |
Modularity of a given community measures | The actual number of internal edges minus the expected number of internal edges. |
In society, we belong to many social groups. Can you name some of the categories of social groups? For instance, if you are working in a company, everyone in your team should know each other and thus will form a network community. You can also imagine larger groups such as departments also form network communities (you tend to know more people in the same department than other department, etc.). What are the other types of social groups that you participate?
Imagine other networks (e.g. biological, neural, ...). Will they exhibit network communities? Would they have relevance to the function of the whole network? Name some examples of network communities formed in networks other than the social network.
Now that you have been exposed to a whole range of community detection algorithms and have seen some of their advantages and disadvantages think about how they could be used in practice.
Describe some real-world scenarios where some of these algorithms would succeed or where they would fail.
How do you think these various algorithms do at capturing what it means to be a "community"?