Inferring vertex properties from
topology in large networks
Janne Aukia*, Juuso Parkkinen†, Samuel Kaski†, and Janne Sinkkonen*
* Xtract Ltd, Helsinki, Finland ,
† Adaptive Informatics Research Centre
and Helsinki Institute for Information
Technology, Helsinki University of
Technology, Finland
Contacts:
janne.sinkkonen@xtract.com
janne.aukia@xtract.com
http://www.cis.hut.fi/projects/mi/papers/mlg07-13-sinkkonen-final.pdf
The conditional probability for an edge cluster z0 con-
ditioned on all the other edges:
k�, n�, and E� denote counts as if the link connecting
nodes i0 and j0 would not exist at all.
Optimized implementation:
• Uses sparse data structures (hash tables, binary-tree
representation of probabilities).
• When E and N are in the same order of magnitude,
the average running time is essentially
O(IE log C). That is, excluding the number of
iterations I , the running time scales linearly in the
number of edges and logarithmically in the number
of components.
• The memory consumption scales as O(L + C).
Results
An example of components found with the model is
shown in Figure 1. The data set is jazz musicians from
different cities in the US. The algorithm divided musi-
cian correcly into two cities. One of the cities further
subdivided into two groups.
Figure 3 shows the clustering of a large friendship net-
work of 147,610 Last.fm users claiming to be from the
US, crawled from the Last.fm web site.
• For each user: demographics (age, country, sex) and
music taste (artists).
•
In addition, tags for over 188,565 artists were crawled.
• Clusters found on the basis of topology differ in
terms of music tastes of the users.
Conclusions
• Method is computationally efficient, however, sub-
optimal hierarchical clustering methods are even
faster.
• Provides information on the confidence of the
clustering results.
Bibliography
P. Gleiser and L. Danon, Adv. Complex Syst. 6, 565
(2003). Data at: http://deim.urv.cat/~aarenas/data/
welcome.htm
J. Sinkkonen, J.