# Deep Sequential Neural Networks - Workshop DL NIPS 2014

### Abstract

Neural Networks sequentially build high-level features through their successive layers. We propose here a new neural network model where each layer is associated with a set of candidate mappings. When an input is processed, at each layer, one mapping among these candidates is selected  according to a sequential decision process. The resulting model is structured according to a DAG like architecture, so that a path from the root to a leaf node defines a sequence of transformations. Instead of considering global transformations, like in classical multilayer networks, this model allows us for learning a set of local transformations. It is thus able to process data with different characteristics through specific sequences of such local transformations, increasing the expression power of this model w.r.t a classical multilayered network. The learning algorithm is inspired from policy gradient techniques coming from the reinforcement learning domain and is used here instead of the classical back-propagation based gradient descent techniques. Experiments on different datasets show the relevance of this approach.

### Formalism

Let us consider $\mathcal{X} = \mathbb{R}^X$ the input space, and $\mathcal{Y}=\mathbb{R}^Y$ the output space, $X$ and $Y$ being respectively the dimension of the input and output spaces. We denote $\{(x_1,y_1),...,(x_\ell,y_\ell)\}$ the set of labeled training instances such that $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. $\{(x_{\ell+1},y_{\ell+1}),...,(x_T,y_T)\}$ will denote the set of testing examples.\linebreak

The DSNN model has a DAG-structure defined as follow:

• Each node $n$ is in $\{n_1,...,n_N\}$ where $N$ is the total number of nodes of the DAG
• The root node is $n_1$, $n_1$ does not have any parent node.
• $c_{n,i}$ corresponds to the $i$-th child of node $n$ and $\#n$ is the number of children of $n$ so, in that case, $i$ is a value be between $1$ and $\#n$.
• $leaf(n)$ is true if node $n$ is a leaf of the DAG - i.e a node without children.
• Each node is associated to a particular representation space $\mathbb{R}^{dim(n)}$ where $dim(n)$ is the dimension associated to this space. Nodes play the same role than layers in classical neural networks.
• $dim(n_1) = X$ i.e the dimension of the root node is the dimension of the input of the model.
• For any node $n$, $dim(n) = Y$ if $leaf(n)=true$ i.e the dimension of the leaf nodes is the output space dimension.
• We consider mapping functions $f_{n,n'} : \mathbb{R}^{dim(n)} \rightarrow \mathbb{R}^{dim(n')}$ which are functions associated with edge $(n,n')$. $f_{n,n'}$ computes a new representation of the input $x$ in node $n'$ given the representation of $x$ in node $n$. The output produced by the model is a sequence of $f$-transformation applied to the input like in a neural network.
• In addition, each node is also associated with a \textit{selection function} denoted $p_n : \mathbb{R}^{dim(n)} \rightarrow \mathbb{R}^{\#n}$ able, given an input in $\mathbb{R}^{dim(n)}$, to compute a score for each child of node $n$. This function defines a probability distribution over the children nodes of $n$ such as, given a vector $z \in \mathbb{R}^{dim(n)}$ : $P(c_{n,i} | z) = \frac{e^{p^i_n(z)}}{\sum\limits_{j=1}^{\#n} e^{p^j_n(z)}}$. Selection functions aim at selecting which $f$-functions to use by choosing a path in the DAG from the root node to a leaf node.

### Learning in DSNN

The training procedure we propose aims at simultaneously learning both the \textit{mapping functions} $f_{i,j}$ and the selection functions $p_i$ in order to minimize a given learning loss denoted $\Delta$. Our learning algorithm is based on an extension of policy gradient techniques inspired from the Reinforcement Learning literature. See full paper......

### Experiments

Experiments have been made on UCI datasets, MNIST and Checkerboard. See full paper...

### Conclusion

We have proposed a new family of model called \textit{Deep Sequential Neural Networks} which differ from neural networks since, instead of always applying the same set of transformations, they are able to choose which transformation to apply depending on the input. The learning algorithm is based on the computation of the gradient which is obtained by an extension of policy-gradient techniques. In its simplest shape, DSNNs are equivalent to DNNs. Experiments on different datasets have shown the effectiveness of these models.