In lifelong or continual learning, a network may encounter a datum or a set of data that represents a new category, or that represents an existing category but differs greatly from previous examples of that category. In these cases, the network may be grown by adding a subnetwork that is a detector for the new datum. For example, the subnetwork may model the new data as produced by a simple parametric probability distribution such as a multivariate Gaussian with a diagonal covariance matrix. If only a single new datum is available, the means of the probability model may be set to the values of the datum, and the variances may be initially set to values determined by the system designer or by a human + AI learning management system. The addition to the network may be made without degradation in prior performance by initially setting the weights of the outgoing connections to zero. With zero outgoing weights, the new subnetwork contributes nothing to any downstream node, so the outputs of the existing network are unchanged until training adjusts those weights.
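As a minimal sketch of why zero outgoing weights preserve prior behavior, the following example appends a one-shot detector to a toy linear output node. The function names, the toy activations, and the unit variances are illustrative assumptions, not part of the original description.

```python
def network_output(hidden, weights, bias=0.0):
    """Linear output node: bias plus the weighted sum of hidden activations."""
    return bias + sum(w * h for w, h in zip(weights, hidden))


def detector_activation(x, mean, var):
    """Detector score: negative half of the variance-scaled squared distance."""
    return -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))


# Existing network state for some input (hypothetical numbers).
hidden = [0.3, -1.2, 0.8]
weights = [0.5, 0.1, -0.7]
before = network_output(hidden, weights)

# One-shot detector: means copied from the single new datum,
# variances chosen by the designer (assumed here to be 1.0).
new_datum = [2.0, -1.0]
mean = list(new_datum)
var = [1.0, 1.0]

# Grow the network: append the detector with an outgoing weight of zero.
x = [1.5, 0.0]  # some later input presented to the grown network
grown_hidden = hidden + [detector_activation(x, mean, var)]
grown_weights = weights + [0.0]
after = network_output(grown_hidden, grown_weights)
```

Here `after` equals `before` for any input `x`, because the new term is multiplied by zero. Subsequent training is then free to move that weight away from zero.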
A Gaussian distribution with diagonal covariance may be modeled with a small network. The activation for each node in the first layer is the square of the difference between its one input and the node bias. Other exponential families may be represented with a different activation function. For example, for a bilateral exponential (Laplace) distribution, the activation function is the absolute value. In general, the activation function will be non-monotonic with a single minimum. The node bias is the mean of the input variable. The input connection weight may be permanently set to one or may be a scaling of the input variable. The output connection weight is the reciprocal of the variance (for the score to equal a Gaussian log density exactly, the weight is negative one half of that reciprocal, and the bias of the combining node supplies the normalizing constant). The second layer is a score-combining node. For the score to represent the logarithm of a probability density function, the combining node is linear. For the score to represent a probability density, the activation function of the combining node is the exponential function. The exponential function may be used if the next operation is a sum of probabilities or a softmax operation.
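The two-layer structure described above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function names are hypothetical, the input weight defaults to one, and the output weights are taken to be negative one half of the reciprocal variance so that the linear score is exactly a Gaussian log density.

```python
import math


def first_layer(x, mean, input_weight=1.0):
    """One node per input: square of (weighted input minus the node bias)."""
    return [(input_weight * xi - m) ** 2 for xi, m in zip(x, mean)]


def log_density_score(x, mean, var):
    """Linear combining node: the score is the log of the Gaussian density."""
    squared = first_layer(x, mean)
    # Output connection weights (assumed scaling: -1 / (2 * variance)).
    out_weights = [-0.5 / v for v in var]
    # The combining node's bias holds the normalizing constant.
    bias = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
    return bias + sum(w * s for w, s in zip(out_weights, squared))


def density_score(x, mean, var):
    """Exponential combining node: the score is the density itself."""
    return math.exp(log_density_score(x, mean, var))
```

For a bilateral exponential model, the squaring in `first_layer` would be replaced by an absolute value, with the output weights and bias changed to match that density.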
A Gaussian detector, or a detector for another unimodal probability distribution, is trained only on data that are representative of the event or category being detected. As more data become available, a Gaussian model may be updated by computing standard statistical estimates such as the maximum likelihood estimator or the unbiased estimator.
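One way to update the estimates as data arrive is an incremental (Welford-style) computation of the per-dimension mean and variance. The sketch below is an assumption about how such an update might be organized; the class name `RunningGaussian` and its methods are hypothetical.

```python
class RunningGaussian:
    """Incremental per-dimension mean and variance for a diagonal Gaussian."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, x):
        """Fold one new example of the detected category into the estimates."""
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def variance(self, unbiased=False):
        """Maximum likelihood estimate (divide by n) or unbiased (n - 1)."""
        denom = self.n - 1 if unbiased else self.n
        if denom <= 0:
            raise ValueError("not enough data for this variance estimate")
        return [m / denom for m in self.m2]


model = RunningGaussian(dim=1)
for value in [1.0, 2.0, 3.0, 4.0]:
    model.update([value])
```

With a single example the maximum likelihood variance is zero and the unbiased estimate is undefined, which is consistent with setting the initial variances by design rather than by estimation.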
To correct an error, a discriminator node may be created. A simple discriminator may be built from a single example of an error plus an example of the category with which the error datum is confused. The discriminator may be a single node whose activation function, for example a sigmoid or hyperbolic tangent, has its inflection point where the weighted sum of the inputs plus the node bias is zero. The input connection weights and the node bias may be set so that this zero value occurs on the perpendicular bisector of the two examples.
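The perpendicular-bisector construction can be sketched directly. For two examples a and b, choosing weights w = a - b and bias -(|a|^2 - |b|^2) / 2 makes the weighted sum zero exactly at points equidistant from a and b. The function names and the choice of the hyperbolic tangent as the activation are illustrative assumptions.

```python
import math


def bisector_discriminator(a, b):
    """Weights and bias whose zero set is the perpendicular bisector of a, b."""
    w = [ai - bi for ai, bi in zip(a, b)]
    bias = -0.5 * (sum(ai * ai for ai in a) - sum(bi * bi for bi in b))
    return w, bias


def discriminate(x, w, bias):
    """Single node; tanh has its inflection point at a weighted sum of zero."""
    return math.tanh(bias + sum(wi * xi for wi, xi in zip(w, x)))


error_example = [2.0, 0.0]     # hypothetical datum that was misclassified
category_example = [0.0, 0.0]  # hypothetical example it was confused with
w, bias = bisector_discriminator(error_example, category_example)
```

The output is positive for inputs closer to `error_example`, negative for inputs closer to `category_example`, and zero on the bisector.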
More generally, with more examples and for adaptive training, a small discriminator network may be built from a combining node whose activation is the difference between the activations of two detector nodes. The activation of each detector node may be an estimate of the log probability of a parametric probability distribution such as a Gaussian distribution or a mixture of Gaussians.
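A minimal sketch of such a discriminator, assuming each detector is a single diagonal-covariance Gaussian (a mixture would need a log-sum over components), follows. The function names and the `(mean, var)` tuple layout are hypothetical.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log density of a diagonal-covariance Gaussian detector."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def discriminator_score(x, detector_a, detector_b):
    """Combining node: difference of the two detectors' log densities."""
    mean_a, var_a = detector_a
    mean_b, var_b = detector_b
    return gaussian_log_density(x, mean_a, var_a) - gaussian_log_density(
        x, mean_b, var_b
    )


# Two hypothetical detectors, each given as (mean, variance) per dimension.
detector_a = ([0.0, 0.0], [1.0, 1.0])
detector_b = ([2.0, 0.0], [1.0, 1.0])
```

A positive score favors `detector_a` and a negative score favors `detector_b`. When the two detectors share the same variances, the score is linear in the input and its zero set is the perpendicular bisector of the two means; unequal variances give a curved boundary.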
A separate network that is trained by one-shot learning may be merged into a primary network by creating connections between the two networks. A human + AI learning management system may decide which connections to make by evaluating the usefulness of each potential connection.
by James K Baker