The method in one line

Take a network and filter out all links except those that are over-expressed (or under-expressed) with respect to a randomized null model.

Main references

The basic idea

This is a theoretical and algorithmic methodology designed to filter a complex network down to its backbone structure. It can be applied to both unipartite and bipartite networks. In the bipartite case the method filters the projected network, either on the first or on the second module (set of nodes). The filtering is done by statistically comparing the input network with a randomized version of the same network (the null model), obtained by fixing some properties of the real network (strength/degree distribution) and by letting links be drawn completely at random, conditioned on the imposed constraints. The statistical filter removes all the links that appear with high probability in the randomized version of the network. Namely, if a link is very likely to be present (or to carry the same or a greater weight) in the null model, then the existence of that link (or the size of that weight) is just a statistical consequence of the general structure of the network (i.e. of the degree distribution) and not a feature peculiar to that specific network.
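To make the idea concrete, here is a minimal R sketch (not the downloadable implementation; function and variable names are ours, for illustration only) of the kind of test used in the bipartite case: the observed number of common neighbours of two nodes is compared with a hypergeometric null model that preserves the degrees of the two nodes.

# Over-expression p-value for the co-occurrence of two nodes A and B in a
# bipartite projection, under a hypergeometric null that fixes their degrees.
# N  : number of nodes in the opposite set
# dA : degree of node A (number of neighbours in the opposite set)
# dB : degree of node B
# x  : observed number of common neighbours of A and B
cooccurrence_pvalue <- function(x, dA, dB, N) {
  # probability of observing x or more common neighbours under the null model
  phyper(x - 1, m = dB, n = N - dB, k = dA, lower.tail = FALSE)
}

# Toy example: 200 nodes in the opposite set, degrees 30 and 25, 12 common neighbours
cooccurrence_pvalue(12, dA = 30, dB = 25, N = 200)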

An example on Tuscan Mobility

Below you see an example of a unipartite weighted network. The graph represents the traffic fluxes in the Tuscany region in Italy. The color of a node identifies its Administrative area; edges take the color of their target node. Data are collected from GPS vehicle tracking over a time interval of roughly one and a half months. The network has 29569 links and 287 nodes. Node size is proportional to the PageRank score of the node. As you may see, the network is pretty dense, with many long-distance links as well. However, colors roughly cluster into connected regions, meaning that intra-area traffic dominates inter-area traffic, as expected.

Different statistical corrections can be applied for validation: some are stricter and typically remove a lot of links, controlling the Family-Wise Error Rate (Bonferroni correction), while others are milder and retain more links, controlling the False Discovery Rate (Benjamini-Hochberg-Yekutieli correction).
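In R both corrections are available through the built-in p.adjust function. The sketch below (with purely illustrative p-values and significance threshold) shows how the two choices translate into kept links.

alpha <- 0.01                                   # illustrative significance level
pvals <- c(1e-6, 5e-4, 1.5e-3, 2.5e-3, 0.3)     # toy raw p-values, one per link

keep_bonferroni <- p.adjust(pvals, method = "bonferroni") <= alpha  # FWER control
keep_by         <- p.adjust(pvals, method = "BY") <= alpha          # FDR control (Benjamini-Yekutieli)

sum(keep_bonferroni)   # stricter: fewer links pass the validation
sum(keep_by)           # milder: more links pass the validation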

Below you see some examples of filtering via different multiple-testing corrections, together with a comparison between the communities detected (via modularity) in the BY-filtered network and the actual Administrative areas of Tuscany.

Bonferroni

Color of nodes corresponds to Administrative areas of Tuscany

BY-FDR

Color of nodes corresponds to Administrative areas of Tuscany

BY-FDR exploded communities

Color of nodes corresponds to Administrative areas of Tuscany; communities (artificially exploded in the layout) are detected via modularity maximization with the Louvain algorithm (resolution 1) on the BY-filtered network. Administrative areas are well preserved, with some exceptions (for example the merging of the Pisa and Livorno areas) that are nonetheless in good agreement with Tuscan geography.
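For reference, a minimal sketch of this last step, assuming the BY-filtered network has already been loaded as a weighted igraph object called g_filtered (an illustrative name):

library(igraph)

# Louvain modularity maximization on the filtered, weighted network
# (resolution 1 is the default of cluster_louvain)
comm <- cluster_louvain(g_filtered, weights = E(g_filtered)$weight)
membership(comm)   # community label of each node, to be compared with the areas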

BY-FDR under-expressed

Here is the same input network after the application of a statistical validation that identifies the under-expressed links: the links whose weights are significantly lower than expected under the randomized null model. As you may notice, the resulting network is extremely dense: indeed, since the random null model knows nothing about geography, one expects most of the long-distance links to be under-expressed.
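Mirroring the hypergeometric sketch given above for over-expression, the under-expression p-value is simply the lower-tail probability (again an illustrative sketch, not the downloadable code):

# Under-expression p-value: probability of observing x or fewer common
# neighbours (or, for a weighted link, an equal or smaller weight) under the null
underexpression_pvalue <- function(x, dA, dB, N) {
  phyper(x, m = dB, n = N - dB, k = dA, lower.tail = TRUE)
}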

The workflow

The method goes along the following steps (for over-expression):

1. Choose the null model, i.e. the constraints of the real network (degree/strength distribution) that the randomized network must preserve.
2. For each link, compute the p-value: the probability that, in the null model, the link carries a weight equal to or larger than the observed one.
3. Correct the p-values for multiple testing (Bonferroni for the Family-Wise Error Rate, Benjamini-Hochberg-Yekutieli for the False Discovery Rate).
4. Keep only the links whose corrected p-value falls below the chosen significance level: these are the over-expressed links.

For under-expression the procedure goes along the same steps, mutatis mutandis.
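Putting the steps together, here is a minimal end-to-end sketch for a weighted (origin-destination) network such as the Tuscan mobility one. It is not the downloadable implementation: the null model below, a hypergeometric test that fixes the out-strength of the origin and the in-strength of the destination, is just one simple choice used to illustrate the workflow, and all names (validate_overexpressed, W, alpha) are ours.

# W      : weighted adjacency matrix of integer counts (e.g. trips from i to j)
# alpha  : significance level; method : multiple-testing correction of p.adjust
validate_overexpressed <- function(W, alpha = 0.01, method = "BY") {
  Wtot  <- sum(W)         # total weight in the network
  s_out <- rowSums(W)     # out-strength of every node
  s_in  <- colSums(W)     # in-strength of every node

  links <- which(W > 0, arr.ind = TRUE)    # test existing links only
  pvals <- apply(links, 1, function(idx) {
    i <- idx[1]; j <- idx[2]
    # p-value: P(weight >= observed) under a null fixing s_out[i] and s_in[j]
    phyper(W[i, j] - 1, m = s_in[j], n = Wtot - s_in[j],
           k = s_out[i], lower.tail = FALSE)
  })

  # correct for multiple testing and keep only the significant (over-expressed) links
  keep <- p.adjust(pvals, method = method) <= alpha
  data.frame(from = links[, 1], to = links[, 2],
             weight = W[links], p.value = pvals)[keep, , drop = FALSE]
}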

Reproducible example and R code

Here you can download a compressed file containing the R code implementing the Statistical Validation, together with a workflow example script that guides the user through the filtering of a bipartite network.

Download source code and example

SoBigData e-Infrastructure

You can also access the Statistical Validation method through the SoBigData platform. Have a look at the method and data catalogue here –> SoBigData Catalog.