An additional layer may be added at any position in a layered feed-forward neural network in a way that guarantees no degradation in performance compared to the original network. Combined with Incremental Development, this technique makes it possible to efficiently build and train an arbitrarily deep neural network with arbitrarily accurate performance on any designated training set.
The process consists of growing the network in a succession of steps, adding a new layer in each step. The new layer may be inserted between any two layers of the previous network, including between the input layer and the first layer. Each new network is initialized to perform the same computation as the previous network, so before additional training its performance is already as good as that of the previous network. This initialization is in contrast to the usual procedure for growing deeper networks, which is to train each new network from scratch.
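The initialization step can be illustrated with a minimal NumPy sketch. It is not the authors' construction, only one simple way to make the deeper network compute the same function. It assumes ReLU hidden layers and a linear output layer, and it inserts the new layer only after an existing hidden layer, where activations are already non-negative, so identity weights and a zero bias pass them through unchanged. Insertion directly after the input layer, where values may be negative, would need a different construction and is not covered here. The names `forward` and `insert_identity_layer` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(layers, x):
    """Run x through a list of (W, b) layers: ReLU on hidden layers, linear output."""
    h = x
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:  # every layer except the last is hidden
            h = relu(h)
    return h

def insert_identity_layer(layers, position):
    """Return a new layer list with an identity-initialized layer at `position`.

    Assumption of this sketch: 1 <= position <= len(layers) - 1, so the new
    layer sits directly after a hidden ReLU layer and receives non-negative
    inputs. Then relu(h @ I + 0) == h and the network output is unchanged.
    """
    if not 1 <= position <= len(layers) - 1:
        raise ValueError("this sketch only inserts after a hidden ReLU layer")
    width = layers[position - 1][0].shape[1]  # output width of the preceding layer
    new_layer = (np.eye(width), np.zeros(width))
    return layers[:position] + [new_layer] + layers[position:]

# Usage: grow a 2-layer network into a 3-layer one with identical outputs.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 5)), rng.normal(size=5)),  # hidden layer
    (rng.normal(size=(5, 3)), rng.normal(size=3)),  # output layer
]
x = rng.normal(size=(10, 4))
deeper = insert_identity_layer(layers, 1)
print(np.allclose(forward(layers, x), forward(deeper, x)))
```

Starting from this point, further training moves the new layer's weights away from the identity, while the starting performance matches that of the previous network.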
From this initialization, the new network is trained by stochastic gradient descent until a stopping criterion is met. Because of the initialization, this training improves the performance of the new network over that of the previous network, even if the previous network has been successfully trained to its global optimum.
It is recommended that, in the course of this training, performance be tested on new data that has not been used in training or testing during the development of the previous networks in the succession of steps. This development test data may be used for tuning hyperparameters and adjusting regularization to achieve good performance on the new development test data as well as on the training data.
The process of training the new network may be enhanced by using Imitation Learning to help the new network remember anything useful that was learned by the previous network. In addition, because the new network contains copies of all the nodes in the previous network, a node-to-node Knowledge Sharing link may be established between any node in the previous network and the corresponding node in the new network. The relationship associated with a knowledge sharing link may be data-dependent. Optionally, each knowledge sharing link from the previous network may apply its relationship only on data for which the previous network does not make an error.
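One possible reading of a data-dependent knowledge sharing link is sketched below. The abstract does not specify the relationship a link enforces, so this sketch assumes an "is equal to" relationship, implemented as a squared-difference penalty between the activations of linked nodes and switched off on examples where the previous network made an error. The function name `knowledge_sharing_penalty` and the `strength` parameter are hypothetical.

```python
import numpy as np

def knowledge_sharing_penalty(prev_act, new_act, prev_correct, strength=0.1):
    """Regularization term for node-to-node links (assumed 'is equal to' relation).

    prev_act, new_act: arrays of shape (n_examples, n_linked_nodes) holding the
        activations of linked nodes in the previous and the new network.
    prev_correct: boolean array of shape (n_examples,), True where the previous
        network made no error; the link is applied only on those examples.
    Returns the squared activation gap, summed over linked nodes and averaged
    over the examples on which the link is active, scaled by `strength`.
    """
    prev_act = np.asarray(prev_act, dtype=float)
    new_act = np.asarray(new_act, dtype=float)
    mask = np.asarray(prev_correct, dtype=float)[:, None]
    sq_gap = mask * (new_act - prev_act) ** 2
    n_active = max(mask.sum(), 1.0)  # avoid division by zero if no link is active
    return strength * sq_gap.sum() / n_active

# Usage: the second example is ignored because the previous network erred on it.
prev = [[1.0, 2.0], [3.0, 4.0]]
new = [[1.0, 2.0], [0.0, 0.0]]
print(knowledge_sharing_penalty(prev, new, prev_correct=[True, False]))
```

In training, a term of this kind would be added to the new network's loss, so it acts as a regularizer that pulls linked nodes toward the previous network's behavior without adding any connections.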
In addition to the knowledge sharing links from the previous network, the new network may have knowledge sharing links of its own. Knowledge sharing links increase regularization and decrease the effective number of degrees of freedom, which is the opposite of the effect of new connections. Thus, adding knowledge sharing links will tend to reduce variance rather than increase it.
As the number of layers is increased, it is beneficial to add vertical knowledge sharing links that skip over many layers. Not only do these links improve regularization, they also provide a means for implicit coordination across those layers without adding all the connections that would be needed for direct connections.
by James K Baker and Bradley J Baker