For the purpose of this page, an "experimenter" is anyone trying out new ideas that have a high chance of failure. For example, a student who is new to deep learning, but who doesn't want to just follow recipes or duplicate what has already been done, is an experimenter. A senior researcher, with years of experience in AI, who is trying out a radical new idea is an experimenter.
On the other hand, most research proceeds step-by-step, making incremental improvements on what already works. That characteristic is true for AI research as much as for other fields. It is the mainstream and will remain so. Incremental improvement has been the pathway to many of the dramatic successes of AI in recent years, but it is not the subject of this page. The founding principle of Human-Assisted Training for Artificial Intelligence is radical. HAT-AI is necessarily somewhat experimental. It cannot be advanced by incremental research alone. In addition, one of the purposes of this website is to attract students who can generate new ideas and who are willing to take chances.
Experimenters, this page is for you!
Yes, we know that most of the suggestions on this page break the rules. Specifically, they break rule #3 on the Breaking Rules page. Some of them break additional rules.
Human-assisted training is an especially rich field for an AI researcher who wants to experiment with novel and radical ideas. The defining characteristic of human-assisted training is that humans participate in the training process in ways that violate the "hands off during training" rule. Therefore, ideas that take advantage of this difference are more likely to lead to new discoveries than ideas in other areas of AI research. The degree of novelty is further enhanced if the novel idea is targeted at one or more of the criteria or goals for the future of machine learning: sensibility, Socratic wisdom, interpretability, continual learning, insightful generalization or adaptability.
For example, in fitting a known function that is representable with the family of models that you are training, the optimum is known: a perfect fit. A source of complex function-fitting tasks is Imitation Learning, which involves training a machine learning system to imitate a reference system as well as possible. Specific examples of imitation learning tasks will be mentioned in other suggestions.
First try your novel idea on a task of fitting a deterministic function or imitating a neural network with no random elements. For example, you can imitate a standard neural network classifier. You also can fit the inverse of a generator, such as a Generative Adversarial Network (GAN), a Variational Autoencoder (VAE) or a Stochastic Categorical Autoencoder Network (SCAN), as will be explained in suggestion 5 below.
You will eventually need to experiment with fitting random functions, but you should make sure your idea works for deterministic functions first.
When you are experimenting with a novel idea, many things can go wrong. You have to expect the unexpected. Therefore, it is recommended that you construct experiments that allow you to focus on one challenge at a time.
You can design an imitation task to test an idea with a specific goal:
4.1: An easy way to generate imitation tasks is discussed in suggestion 6 below.
4.2: For any ensemble, it is possible to build a (large) single network that performs as well as the ensemble by Joint Optimization. As a suggested imitation task, see how closely you can match the performance of an ensemble with a smaller single network. The architecture of the single network does not need to be related to the architecture of the members of the ensemble. This is also an example of an imitation task with a deterministic function.
4.3: Experiment with architectures that are easier to interpret or that have other desirable properties compared to the conventional network you are imitating. In this experiment, you get to decide what is to be novel.
4.4: An example of a complex function is the parity function. The input to the parity function is a bit vector with n bits. The value of the parity function is 1 if the number of bits with the value 1 is odd. The value of the parity function is 0 if the number of bits with the value 1 is even. Training a neural network to learn the parity function is notorious for requiring a large amount of training when n is large. Start with a low value for n and see how large you can make n. Try networks of various architectures and various sizes, or try to build a network incrementally.
Note: The parity function and approximations to it create problems of slow convergence because their error functions have a very large number of saddle points.
4.4 (cont.): Also try to fit moderate deviations from the parity function. For example, for values of n > 8, for a moderate value of k, say k=6, select n - k bit positions. Select a (n-k)-bit pattern. Whenever the input matches the selected pattern, flip the output. That is, set the output value to be the opposite of the value of the parity function. Train a network to fit the resulting function. Under some circumstances, training the fit to the deviation function is faster than fitting the parity function itself because there are fewer saddle points in the error function. However, the system has to learn the pattern for the deviation as well as the pattern for the underlying parity function. Experiment with various values of n and k. Also experiment with applying more than one set of flipped bits. Try to think of your own complex functions and experiment with imitating them.
4.5: From a purely mathematical point of view, the mapping of a classifier from its input values to its output is a complex deterministic function, regardless of the machine learning method that is used to train the classifier. Thus, the imitation of other machine learning classifiers that are trained with different algorithms is a plentiful source of complex imitation tasks.
4.6: One way to build a very deep network is to build it Layer-by-Layer, adding each layer in a way that guarantees no degradation in performance on training data. Successful ways to facilitate the training of such a deep network, while improving its interpretability, are discussed in Knowledge Sharing for Experimenters. Other ways to build very deep neural networks are discussed throughout this website.
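The parity task in 4.4 and its deviations can be generated in a few lines. The following is a minimal sketch in Python using only the standard library. The names parity and make_deviation, the random seed, and the choice n = 10, k = 6 are illustrative assumptions, not part of any existing tool.

```python
import itertools
import random

def parity(bits):
    """Return 1 if the number of 1 bits is odd, otherwise 0."""
    return sum(bits) % 2

def make_deviation(n, k, rng):
    """Select n - k bit positions and an (n - k)-bit pattern.

    The returned function equals parity, except that its output is
    flipped whenever the input matches the pattern at those positions.
    """
    positions = sorted(rng.sample(range(n), n - k))
    pattern = [rng.randint(0, 1) for _ in positions]

    def deviated(bits):
        matches = all(bits[p] == b for p, b in zip(positions, pattern))
        value = parity(bits)
        return 1 - value if matches else value

    return deviated, positions, pattern

n, k = 10, 6                      # illustrative values with n > 8
rng = random.Random(0)            # fixed seed so the task is reproducible
deviated, positions, pattern = make_deviation(n, k, rng)

# Enumerate the whole domain; feasible only for small n.
inputs = list(itertools.product([0, 1], repeat=n))
targets = [deviated(x) for x in inputs]

# The k unconstrained bits are free, so 2**k inputs have a flipped output.
flipped = sum(deviated(x) != parity(x) for x in inputs)
print(flipped)
```

The pairs (inputs, targets) can then be fed to whatever network you are testing. To apply more than one set of flipped bits, call make_deviation several times and flip the output once for each pattern that matches.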
For many generators, such as GANs, VAEs and SCANs, the input to the decoder is random. Furthermore, the mapping is not one-to-one, so the inverse is not well defined. Nonetheless, fitting the inverse of such a generator may be treated essentially as an instance of suggestion 2(a), imitating a deterministic system. Although multiple random inputs to the decoder may map to the same output, an inverse imitation network that finds any one of these inputs from the output will be regarded as correct. For any input to the inverse imitation network, that is, for every output from the generator, backpropagate from the output of the generator to find the target input for training the inverse imitator, but judge it correct if it finds any other generator input that produces the same generator output. The deterministic imitator network is not a generator. It does not learn the probability distribution produced by the generator. It does not learn all the inputs that produce the same output. It solves a simpler, but still challenging, problem. The deterministic inverse function will typically be very complex, which gives you an opportunity to demonstrate the superior performance of your novel idea for imitating complex functions. Like other imitation learning problems, it allows you to use an arbitrary amount of data for training and development testing. Even if the main goal of your idea is to reduce the amount of training data required, the ability to generate an unlimited amount of training data gives you a baseline for your experiments. For this or any other task, having an unlimited amount of data for development testing allows you to run many experiments.
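The rule that any generator input reproducing the same output counts as correct can be sketched as follows. This is only an illustration: toy_generator is a hypothetical many-to-one function standing in for a real decoder, candidate_inverse stands in for a trained inverse imitator, and the tolerance is an arbitrary assumption.

```python
import math
import random

def toy_generator(z):
    """Hypothetical many-to-one decoder: z and (-z[0], z[1]) give the same output."""
    return (z[0] * z[0], z[1] + abs(z[0]))

def is_correct_inverse(x, z_pred, tol=1e-6):
    """Accept z_pred if the generator maps it back to x, whichever input produced x."""
    y = toy_generator(z_pred)
    return all(abs(a - b) <= tol for a, b in zip(x, y))

def candidate_inverse(x):
    """Stand-in for a trained inverse imitator; always picks the non-negative root."""
    z0 = math.sqrt(x[0])
    return (z0, x[1] - z0)

rng = random.Random(0)
latents = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
outputs = [toy_generator(z) for z in latents]

# Score the inverse by re-generation, not by comparing with the original latent.
accuracy = sum(is_correct_inverse(x, candidate_inverse(x)) for x in outputs) / len(outputs)
print(accuracy)
```

Note that candidate_inverse disagrees with the original latent whenever z[0] is negative, yet it is still judged correct, because the score depends only on the regenerated output.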
Train a set of classifiers, each trained just on a small set of training data. Actually, any set of trained classifiers will do. Training each of the classifiers on a small amount of training data is suggested to reduce the amount of computation and to make it easy to train many classifiers that will be different from each other. The classifiers may all have the same network architecture, or they may be completely different from each other. In addition to the set of individual classifiers, optionally create ensembles from random subsets of the set of classifiers. Each ensemble is exposed to the union of the sets of training data for its member classifiers. You may choose whether or not to train the ensemble jointly, and what data to use for that training. You can have exponentially many different systems, each trained on different data.
Create imitation tasks by selecting data on which the classifiers have not been trained. The performance of the imitating system is based on whether its output matches the system being imitated, not on whether the classification is correct. The data for developing the imitation system and testing its performance does not even need to be labeled as to the correct category. For pure experimentation, try out a wide variety of architectures for the network that is trying to imitate another system. For testing a novel idea, compare its performance on an imitation task with the performance of one or more conventional systems.
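Scoring by agreement rather than correctness might look like the sketch below. The functions reference and imitator are hypothetical stand-ins for trained classifiers, and the uniformly sampled inputs stand in for unlabeled held-out data; only agreement_rate carries the idea.

```python
import random

def agreement_rate(reference, imitator, inputs):
    """Fraction of inputs on which the imitator's output matches the reference's."""
    matches = sum(reference(x) == imitator(x) for x in inputs)
    return matches / len(inputs)

# Hypothetical stand-ins for a trained reference system and its imitator.
def reference(x):
    return int(x[0] + x[1] > 0.0)

def imitator(x):
    return int(x[0] > 0.0)

rng = random.Random(0)
# No correct-category labels are needed: the reference supplies the targets.
unlabeled = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(2000)]

score = agreement_rate(reference, imitator, unlabeled)
print(score)
```

The same function can compare a novel system and a conventional one against the same reference, which gives the side-by-side comparison suggested above.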
Although an imitation task does not simulate all the problems that occur with real data, an imitation task can be arbitrarily difficult. This combination of characteristics allows you to create a task that is too difficult for some conventional system, so that you can show that your novel idea makes an improvement even if your novel system is not yet ready to handle all the complexities of real data.
To make it easy to specify an arbitrary deterministic function, let the function be defined by a table of values. For example, let the domain of the function be integer-valued vectors with the integers in a limited interval, such as 1, 2, ... n. Make a table specifying the value for each vector in the domain. For purposes of illustration, assume the value of the function is in a space of dimension m. Multiple input vectors may be mapped to the same output vector. The noise added in the next step will create an infinite number of possible values and an arbitrary number of examples for training and testing.
For each of the finite set of output values of the deterministic function, select one or more continuous-valued probability distributions for each of a finite set of target categories. Each probability distribution represents a cluster of data within its category. Use each probability distribution to generate random noise to add to the output of the deterministic function. An arbitrary number of data examples with different samples from the probability distribution may be generated for each cluster associated with each vector value in the range of the deterministic function.
The deterministic function merely allows you to create complicated geometry for the spacing and overlap of the distributions of the clusters and categories in m-space. The cluster distributions may be used to generate an m-vector of input data for a classification task in which the target classification for each input vector is the category associated with the cluster whose distribution generated the example input data. The cluster distributions overlap, so different instances of the same input value may be generated from clusters from different categories. You may make the classification task as difficult as you choose.
Start with simpler tasks, with a simpler deterministic function, simpler distributions and less overlap among the cluster distributions. Progress to more challenging tasks when you want to compare the performance of systems that all perform well on the simpler tasks. For example, you may start with each output vector from the deterministic function being associated with only one category and with only one cluster within that category. Even in this simple case, you can make the variances large enough so that the distributions overlap. As simpler probability distributions, you may use multi-variate Gaussian distributions with diagonal covariance matrices. You may vary the values on the diagonal of the covariance matrix to change the amount of overlap and to change the directions of the axes of the contours of each distribution.
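The simplest version of this construction, with one category and one cluster per output vector and Gaussian noise with a diagonal covariance matrix, can be sketched as follows. The function names, the input dimension d, the range used for the table values, and the random assignment of categories are all assumptions made for illustration.

```python
import itertools
import random

def build_table(n, d, m, num_categories, rng):
    """Tabulate a deterministic function from vectors over 1..n to m-space.

    Each table entry also receives one category, which gives one cluster
    per output vector (the simple starting case).
    """
    table = {}
    for v in itertools.product(range(1, n + 1), repeat=d):
        centre = tuple(rng.uniform(-5.0, 5.0) for _ in range(m))  # arbitrary range
        table[v] = (centre, rng.randrange(num_categories))
    return table

def sample_examples(table, sigmas, count, rng):
    """Draw (input m-vector, target category) pairs from the cluster Gaussians."""
    entries = list(table.values())
    examples = []
    for _ in range(count):
        centre, category = rng.choice(entries)
        # Diagonal covariance: independent noise with its own sigma on each axis.
        x = tuple(rng.gauss(c, s) for c, s in zip(centre, sigmas))
        examples.append((x, category))
    return examples

rng = random.Random(0)
table = build_table(n=3, d=2, m=2, num_categories=2, rng=rng)
examples = sample_examples(table, sigmas=(1.0, 2.0), count=500, rng=rng)
print(len(table), len(examples))
```

Increasing the values in sigmas increases the overlap among clusters, and calling sample_examples again yields as many fresh training or test examples as you need.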
This suggestion is a combination of suggestion 6 and a simplified version of random task example 2 above. In any imitation task, for example, a task created in suggestion 6, simulate errors by randomly changing the response of the network being imitated for a small fraction of its input data. Measure the performance of the new network on imitating the data that has not been changed. As a simplified version of random task 2 above, you may ignore the error rate of the new network on the data that has been changed. This test is checking the ability of the network to ignore the simulated errors, rather than to try to correct the errors.
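One way to set up this test is sketched below, assuming the responses of the imitated network are class indices. The function names, the 5% error fraction, and the random stand-in responses are illustrative assumptions.

```python
import random

def corrupt_responses(responses, num_classes, fraction, rng):
    """Replace a random fraction of responses with a different class.

    Returns the corrupted list and the set of indices that were changed.
    """
    n = len(responses)
    changed = set(rng.sample(range(n), int(fraction * n)))
    noisy = list(responses)
    for i in changed:
        alternatives = [c for c in range(num_classes) if c != responses[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy, changed

def clean_agreement(predictions, responses, changed):
    """Agreement with the imitated network, measured only on unchanged data."""
    kept = [i for i in range(len(responses)) if i not in changed]
    return sum(predictions[i] == responses[i] for i in kept) / len(kept)

rng = random.Random(0)
responses = [rng.randrange(3) for _ in range(1000)]   # stand-in for network outputs
noisy, changed = corrupt_responses(responses, num_classes=3, fraction=0.05, rng=rng)

# Train the new network on `noisy`; here a perfect imitator is simulated instead.
predictions = list(responses)
print(clean_agreement(predictions, responses, changed))
```

Keeping the set of changed indices also lets you score the harder variant of the task, in which the new system must detect which responses were altered.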
As a more challenging task, develop a system that not only learns to ignore the random errors, but that explicitly detects them and attempts to correct the error. For this task, it is recommended that, in choosing the system to be imitated, you choose a system that, other than the simulated errors, has reasonable and sensible decision boundaries.
by James K Baker