# Learning Social Network Embeddings for Predicting Information Diffusion - WSDM 2014

### Full Paper : http://dl.acm.org/citation.cfm?doid=2556195.2556216

Slides (french): http://www-connex.lip6.fr/~denoyer/slides_isis.pdf

Source Code: https://github.com/ludc/social_network_diffusion_embeddings

### Abstract

Analyzing and modeling the temporal diffusion of information on social media has mainly been treated as a diffusion process on known graphs or proximity structures. The underlying phenomenon results however from the interactions of several actors and media and is more complex than what these models can account for and cannot be explained using such limiting assumptions.
We introduce here a new approach to this problem whose goal is to learn a mapping of the observed temporal dynamic onto a continuous space. Nodes participating to diffusion cascades are projected in a latent representation space in such a way that information diffusion can be modeled efficiently using a heat diffusion process. This amounts to learning a diffusion kernel for which the proximity of nodes in the projection space reflects the proximity of their infection time in cascades. The proposed approach possesses several unique characteristics compared to existing ones. Since its parameters are directly learned from cascade samples without requiring any additional information, it does not rely on any pre-existing diffusion structure.
Because the solution to the diffusion equation can be expressed in a closed form in the projection space, the inference time for predicting the diffusion of a new piece of information is greatly reduced compared to discrete models.
Experiments and comparisons with baselines and alternative models have been performed on both synthetic networks and real datasets. They show the effectiveness of the proposed method both in terms of prediction quality and of inference speed.

### Notations

Traditionally, diffusion on networks is represented with the notion of cascade. A cascade is a sequence of users infected by some information (for instance, it could be the list of users who "liked" a specific YouTube video). A cascade describes to whom and when an item spreads through the network, but not how diffusion happens: while it is easy to know when a user got infected by some content, it is usually not possible to know who infected him.

Given a social network composed of a set of $N$ users $\mathcal{U}=(u_1,....,u_N)$,
cascades correspond to sets of users infected by the propagated information. Depending on the kind of network and the task in concern, the propagated information can for instance correspond to a given topic, a particular url, a specific tag, etc.
In the following, we consider $\mathcal{C}$ to be a set of cascades over a given network, divided in two subsets of distinct cascades: $\mathcal{C}_\ell \subseteq \mathcal{C}$ the set of training cascades and $\mathcal{C}_t \subseteq \mathcal{C}$ the set of testing cascades.  A cascade $c \in \mathcal{C}$ is defined as:

• A source $s^c \in \mathcal{U}$ which is the user at the source of the cascade - i.e, the first user that published the item concerned by the diffusion.
• A set of contaminated users $S^c \subset \mathcal{U}$ such that $u_i \in S^c$ means that $u_i$ has participated to the cascade $c$ - i.e., the user has performed some action (retweet, like, comment...) that is considered as an infection by $c$ (liking a video, publishing a tweet with a specific hashtag...); $\bar{S^c}$ is the set of users who have not participated in $c$.
• A contamination timestamp function defined over $S^c$ such that $t^c(u_i)$ corresponds to the timestamp at which $u_i \in S^c$ has first participated in the cascade. We consider that the contamination timestamp of the source is equal to $0$.
• A feature vector $q_c \in \mathbb{R}^Q$  that characterizes the content of the cascade $c$, with $Q$ the size of the content features space - For example, when dealing with textual information, the feature vector $q_c$ may be a tf-idf vector. This features vector is usually defined as the content of the publication.

### Principles of the proposed approach

The proposed models aim at predicting information diffusion. The central idea of these models is to map the observed information diffusion process into a heat diffusion process in a continuous (euclidean) space. To perform this, we learn diffusion kernels that capture the dynamics of diffusion from a set of training cascades. Let us denote $\mathcal{Z} = \mathbb{R}^n$ an euclidean space of $n$ dimensions - also called latent space. Learning such a diffusion kernel comes down in our case to learn a mapping of each node of the network to a particular location in $\mathcal{Z}$ such that, for a given metric, the latent space explains the contamination timestamps observed in the training cascades.

•   The learning problem is mapped to a continuous optimization problem that can easily be solved using classical optimization methods.
• The propagation model is learned directly from the observations, without the need of a graph structure, and without making strong assumptions about how information propagates.
• The inference of the diffusion can be performed  in the continuous space, which allows a very fast computation of the prediction without the need of an explicit discrete simulation. Simulation has an expensive processing cost and may lead to unreliable results, with a high variance between results of successive simulations of the same diffusion.
• Finally, the approach allows an easy integration of the content information by making the geometry of the latent space dependent on the information that spreads.

### Conclusion

We have presented a new family of information diffusion models based on the heat diffusion kernel. Their originality is to formulate diffusion as a process in a continuous space, built using an embedding of the nodes learned from observed cascades.
These models have some interesting characteristics 1) they learn directly from the observations and therefore do not require a predefined diffusion graph structure which is often not available for social applications. 2) they run much faster (1 or 2 orders of magnitude) than classical discrete models due to the continuous context and 3) they allow an easy integration of the content by modifying the geometry of the latent space. Performance obtained on real-world and artificial datasets show the ability of these methods to model information spread, and to take into account the content information in the diffusion process. They are competitive with and sometimes better than state of the art reference models.

Two research directions are currently considered: the first one consists in developing alternative models for a better use of the content information. Particularly, we are studying metric learning paradigms that should offer new possibilities to incorporate this information in the geometry of the latent space. A second direction consists in applying these methods to other diffusion tasks not restricted to social networks.