Main references
- For the original idea and for the method applied to bipartite networks, see Tumminello, Miccichè, Lillo, Piilo, Mantegna (2011)
- For the application to unipartite networks, see Hatzopoulos, Iori, Mantegna, Miccichè, Tumminello (2015)
Take a network and filter out all links except the over-expressed (or under-expressed) ones with respect to a randomized null model
This is a theoretical and algorithmic methodology designed to filter a complex network down to its backbone structure. It can be applied to both unipartite and bipartite networks; in the bipartite case the method filters the projected network, on either the first or the second set of nodes. The filtering is done by statistically comparing the input network with a randomized version of it (the null model), obtained by fixing some properties of the real network (the strength/degree distribution) and letting links be drawn completely at random once conditioned on the imposed constraints. The statistical filter removes all the links that appear with high probability in the randomized version of the network: if a link is very likely to be present (or to have the same or a greater weight) in the null model, then the existence of that link (or the size of that weight) is just a statistical consequence of the general structure of the network (i.e. the degree distribution) and not a feature peculiar to that specific network.
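As a concrete illustration of the test, the following Python sketch (the released implementation is in R; all numbers here are made up, and the hypergeometric null fixing node strengths is one common choice for weighted directed networks) computes the over-expression p-value of a single link:

```python
from fractions import Fraction
from math import comb

def hypergeom_sf(w, W, n, N):
    """P(X >= w) for X ~ Hypergeometric(population W, successes n, draws N),
    computed exactly with integer arithmetic (math.comb returns 0 when the
    second argument exceeds the first, which handles the support bounds)."""
    num = sum(comb(n, k) * comb(W - n, N - k) for k in range(w, min(n, N) + 1))
    return float(Fraction(num, comb(W, N)))

# Illustrative numbers (not from the Tuscany network):
W = 10_000       # total weight of all links in the network
s_out_i = 400    # out-strength of the source node i
s_in_j = 300     # in-strength of the target node j
w_ij = 35        # observed weight of the link i -> j

# Under the null, the weight of i -> j is hypergeometric with mean
# s_out_i * s_in_j / W = 12; observing a weight of 35 or more is then
# very unlikely, so this link would be validated as over-expressed.
p_over = hypergeom_sf(w_ij, W, s_in_j, s_out_i)
```

A link is kept if this p-value survives the multiple-testing correction discussed below.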
Below you see an example of a unipartite weighted network. The graph represents the traffic fluxes in the Tuscany region in Italy. The color of the nodes identifies different Administrative areas; edges take the color of their target node. Data were collected from GPS vehicle tracking over a time interval of roughly a month and a half. The network has 29569 links and 287 nodes. Node size is proportional to the PageRank score of the node. As you may see, the network is pretty dense, with many long-distance links as well. However, colors roughly cluster in connected regions, meaning that intra-area traffic dominates over inter-area traffic, as expected.
Different statistical methods can be applied for the validation: some are stricter, typically removing more links, and control the Family-Wise Error Rate (Bonferroni correction); others are less restrictive and control the False Discovery Rate (Benjamini-Hochberg-Yekutieli correction).
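The difference between the two corrections can be sketched in a few lines of Python (toy p-values, not from the traffic data):

```python
# One p-value per link (illustrative numbers)
pvals = [0.0001, 0.004, 0.006, 0.008, 0.2, 0.41, 0.67, 0.81]
alpha = 0.05
m = len(pvals)

# Bonferroni: controls the Family-Wise Error Rate; reject p <= alpha / m
bonferroni = [p <= alpha / m for p in pvals]

# Benjamini-Hochberg-Yekutieli: controls the False Discovery Rate under
# arbitrary dependence; reject the k smallest p-values, where k is the
# largest rank with p_(k) <= k * alpha / (m * c_m), c_m = sum_{i=1}^m 1/i
c_m = sum(1.0 / i for i in range(1, m + 1))
order = sorted(range(m), key=lambda i: pvals[i])
k = max((rank for rank, i in enumerate(order, 1)
         if pvals[i] <= rank * alpha / (m * c_m)), default=0)
by = [False] * m
for rank, i in enumerate(order, 1):
    by[i] = rank <= k

print(sum(bonferroni), sum(by))  # prints "3 4": BY retains one more link
```

On these toy numbers Bonferroni validates 3 links and BY validates 4, matching the pattern in the table below, where the BY-filtered network keeps more links than the Bonferroni-filtered one.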
Below you see some examples of filtering via different multiple testing corrections and you also have a comparison between communities detected (via modularity) in the BY-filtered network and actual Administrative areas of Tuscany.
Color of nodes corresponds to Administrative areas of Tuscany
Color of nodes corresponds to Administrative areas of Tuscany
Color of nodes corresponds to Administrative areas of Tuscany; communities (artificially exploded in the layout) are detected via modularity maximization with the Louvain algorithm (resolution 1) on the BY-filtered network. Administrative areas are well preserved, with some exceptions (for example the merging of the Pisa and Livorno areas) that are nonetheless in good agreement with Tuscan geography.
Here is the same input network after the application of a statistical validation that identifies the under-expressed links: the links whose weights are significantly lower than expected under the randomized null model. As you may notice, the resulting network is extremely dense: since the random null model knows nothing about geography, most of the long-distance links are expected to be under-expressed.
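The under-expression test is the left-tail counterpart of the over-expression test. A minimal Python sketch, again with invented numbers and the same hypergeometric null assumption as above:

```python
from fractions import Fraction
from math import comb

def hypergeom_cdf(w, W, n, N):
    """P(X <= w) for X ~ Hypergeometric(population W, successes n, draws N),
    computed exactly with integer arithmetic."""
    num = sum(comb(n, k) * comb(W - n, N - k)
              for k in range(0, min(w, n, N) + 1))
    return float(Fraction(num, comb(W, N)))

# Illustrative numbers: under the null the link weight has mean
# s_out * s_in / W = 400 * 300 / 10_000 = 12, so observing a weight of
# only 2 or less is unlikely -> the link is validated as under-expressed.
p_under = hypergeom_cdf(2, 10_000, 300, 400)
```

The multiple-testing corrections are then applied to these left-tail p-values exactly as in the over-expression case.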
network | number of links | density (%) | intra-area links (%) |
---|---|---|---|
original | 29570 | 35.8 | 25.5 |
over-bonferroni | 3290 | 4.0 | 82.7 |
over-BY | 3524 | 4.2 | 81.8 |
under-BY | 43297 | 52.3 | 6.4 |
As the above Table clearly shows, the filtered networks are far sparser than the original input one, and the statistical validation for over-expression is quite efficient in highlighting the links connecting nodes in the same Administrative area: fewer than 20% of the retained links connect nodes in two different Administrative areas.
The method consists of the following steps (for over-expression):
For under-expression the procedure follows the same steps, mutatis mutandis.
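The procedure can be sketched end to end in a few lines. This is a Python toy (the downloadable implementation is in R, and the tiny network below is invented): fix the null-model constraints, compute a hypergeometric p-value for every link, then keep only the links that survive the multiple-testing correction.

```python
from fractions import Fraction
from math import comb

def p_over(w, W, n, N):
    """Over-expression p-value: P(X >= w) under the hypergeometric null."""
    num = sum(comb(n, k) * comb(W - n, N - k) for k in range(w, min(n, N) + 1))
    return float(Fraction(num, comb(W, N)))

# Toy directed weighted network: (source, target) -> link weight
edges = {("a", "b"): 30, ("a", "c"): 5, ("b", "c"): 12, ("c", "a"): 3}

# Fix the constraints of the null model: node strengths and total weight
W = sum(edges.values())
s_out, s_in = {}, {}
for (i, j), w in edges.items():
    s_out[i] = s_out.get(i, 0) + w
    s_in[j] = s_in.get(j, 0) + w

# p-value of every observed link weight under the null
pvals = {e: p_over(w, W, s_out[e[0]], s_in[e[1]]) for e, w in edges.items()}

# Multiple-testing correction (Bonferroni here, BY works the same way)
alpha = 0.05
backbone = [e for e, p in pvals.items() if p <= alpha / len(edges)]
```

Here the link ("a", "c") carries much less weight than its endpoints' strengths would suggest under the null, so it is the only one dropped from the over-expressed backbone.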
Here you can download a compressed file containing the R code implementing the Statistical Validation, together with a workflow example script guiding the user through the filtering of a bipartite network.
Download source code and example
You can also access the Statistical Validation method through the SoBigData platform. Have a look at the method and data catalogue here –> SoBigData Catalog.