Kaj Sotala recently asked for an update on how MOSES selects a “species” to be “mutated”, when it is searching for the fittest program tree. I have some on-going, unfinished research in this area, but perhaps this is a good time to explain why.
To recap: MOSES is a program-learning system. That is, given some input data, MOSES attempts to learn a computer program that reproduces the data. It does so by applying a mixture of evolutionary algorithms: an “inner loop” and an “outer loop”. The inner loop explores all of the mutations of a “species” (a “deme”, in MOSES terminology), while the outer loop chooses the next deme to explore. (Each deme is a “program tree”, that is, a program written in a certain Lisp-like programming language.)
So: the outer loop selects some program tree, whose mutations will be explored by the inner loop. The question becomes, “which program tree should be selected next?” Now, nature gets to evolve many different species in parallel; but here, where CPU cycles are expensive, it’s important to pick a tree whose mutations are “most likely to result in an even fitter program”. This is a bit challenging.
MOSES works from a pool of candidate trees, of various fitnesses. With each iteration of the inner loop, the pool is expanded: when some reasonably fit mutations are found, they are added to the pool. Think of this pool as a collection of “species”: some similar, some not; some fit, some not so much. To iterate the outer loop, it seems plausible to take the fittest candidate in the pool, and mutate it, looking for improvements. If none are found, then in the next go-around, the second-most-fit program is explored, etc. (Terminology: in MOSES, the pool is called the “metapopulation”.)
It turns out (experimentally) that this results in a very slow algorithm. A much better approach is to pick randomly from the highest scorers: one has a much better chance of getting lucky this way. But how to pick randomly? Each candidate in the pool is given a probability p ~ exp(score/T), so in fact, the highest-scoring have the highest probability of being picked, but the poorly-scoring have a chance too. This distribution is the “Gibbs measure”, aka the “Boltzmann distribution”. (T is a kind of “temperature”; it provides a scale, and it’s held constant in the current algorithms.) I’m guessing that this is the right measure to apply here, and can do some deep theoretical handwaving, but haven’t really worked this out in detail. Experimentally, it works well; there even seems to be a preferred temperature that works well for most or all problems (but this is not exactly clear).
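To make the weighting concrete, here is a minimal Python sketch of picking a candidate with probability proportional to exp(score/T). This is not the MOSES code; the function names, the example scores and the temperature value are all made up for illustration.

```python
import math
import random

def boltzmann_weights(scores, temperature):
    """Return normalized probabilities p_i proportional to exp(score_i / T)."""
    # Subtracting the top score before exponentiating avoids overflow;
    # it cancels out in the normalization, so the probabilities are unchanged.
    top = max(scores)
    raw = [math.exp((s - top) / temperature) for s in scores]
    total = sum(raw)
    return [w / total for w in raw]

def pick_index(scores, temperature, rng=random):
    """Sample the index of one candidate according to the Boltzmann weights."""
    weights = boltzmann_weights(scores, temperature)
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

# Hypothetical pool scores (higher is better) and an illustrative temperature.
pool_scores = [0.0, -1.0, -5.0]
chosen = pick_index(pool_scores, temperature=1.0)
```

A low temperature makes this behave almost like “always take the fittest”, while a high temperature flattens it toward a uniform random pick.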
One can do even better. Instead of using the raw score, a blend of score minus program tree complexity works better; again, this is experimentally verified. Nil added this back when, and his theoretical justification was to call it “Solomonoff complexity”, and turn it into a ‘Bayesian prior’. From an engineering viewpoint, it’s basically saying that, to create a good design suitable for some use, it’s better to start with a simple design and modify it, than to start with a complex design and modify it. In MOSES terminology, it’s better to pick an initial low-complexity but poorly scoring deme, and mutate it, than to start with something of high complexity, high score, and mutate that. Exactly what the blending ratio (between score and complexity) is, and how to interpret it, is an interesting question.
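As a rough illustration of the blend, here is a small Python sketch. The linear form, the `weight` parameter, and the use of node count as the complexity measure are all assumptions made for the example, not a description of what MOSES actually computes.

```python
def penalized_score(score, complexity, weight):
    """Blend the raw score with tree complexity.

    `weight` is a hypothetical blending ratio: how much score one
    unit of complexity costs.
    """
    return score - weight * complexity

# Hypothetical candidates: (name, raw score, complexity as a node count).
candidates = [
    ("simple", -4.0, 5),
    ("complex", -3.0, 40),
]

weight = 0.1  # illustrative value only

# Rank by the blended score, best first.
ranked = sorted(
    candidates,
    key=lambda c: penalized_score(c[1], c[2], weight),
    reverse=True,
)
```

With these made-up numbers the simple tree ranks first, even though its raw score is worse, which is the “start with a simple design” preference described above.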
Experimentally, I see another interesting behaviour, that I am trying to “fix”. I see a very classic “flight of the swallow” learning curve, dating back to the earliest measurements of the speed of telegraph operators at the end of the 19th century. At first, learning is fast, and then it stalls, until there is a break-through; then learning is again fast (for a very brief time: weeks for telegraph operators) and then stalls (years or a decade for telegraph operators). In MOSES, it goes like this: at first, one picks a deme, almost any deme, and almost any mutation will improve upon it. This goes on for a while, and then plateaus. Then there’s a long dry spell: picking deme after deme, mutating it, and finding very little or no improvement. This goes on for a long time (say, thousands of demes, hours of CPU time), when suddenly there is a break-through: dozens of different mutations to some very specific deme all improve the score by some large amount. The Boltzmann weighting above causes these to be explored in the next go-around, and mutations of these, in turn, all yield improvements too. This lasts for maybe 10-20 steps, and then the scores plateau again. Exactly like the signalling rate of 19th century telegraph operators 🙂 Or the ability of guitar players. Or sportsmen. All of these have been measured in various social-science studies, and have shown the “flight of the swallow” curve.
(Can someone PLEASE fix the horribly deficient Wikipedia article on “learning curve”? It totally fails to cite any of the seminal research and breakthroughs on this topic. Check out Google Images for examples of fast learning, followed by a long plateau.)
All these curves raise the question: why is Google finding only the highly stylized ones, and not showing any for raw, actual data? Has the learning curve turned into an urban legend??
Recently, I have been trying to shorten the plateau, by trying to make sure that the next deme I pick for exploration is one that is least similar to the last one explored. The rationale here is that the metapopulation gets filled with lots of very, very similar species, all of which are almost equally fit, and all of which are “genetically” very similar. Trying to pick among these, to find the magic one, the one whose mutations will yield a break-through, seems to be a losing strategy. So, instead, add a diversity penalty: explore those “species” that are as different as possible from the current one (but still have about the same fitness score). So far, this experiment is inconclusive; I wasn’t rewarded with instant success, but more work needs to be done. It’s actually fairly tedious to take the data…
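One simple way to read the idea above, sketched in Python: among candidates whose score is within some tolerance of the best, take the one farthest from the last deme explored. This is a hard-filter variant rather than a true penalty term, and everything in it is an assumption made for illustration: the fixed-length string “genomes”, the Hamming distance as the similarity measure, and the names `pick_most_different` and `score_tolerance`. Real program trees would need some tree-aware distance.

```python
def hamming_distance(a, b):
    """Count positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def pick_most_different(pool, last_genome, score_tolerance):
    """Pick the near-best candidate that is least similar to the last one.

    pool: list of (score, genome) pairs, higher score is better.
    Only candidates within `score_tolerance` of the best score are
    considered; among those, the one farthest from `last_genome` wins.
    """
    best = max(score for score, _ in pool)
    near_best = [(s, g) for s, g in pool if best - s <= score_tolerance]
    return max(near_best, key=lambda c: hamming_distance(c[1], last_genome))

# Hypothetical pool: two near-equal scorers and one poor but very different one.
pool = [(-1.0, "0001"), (-1.1, "1110"), (-9.0, "1111")]
next_deme = pick_most_different(pool, last_genome="0000", score_tolerance=0.5)
```

Here the candidate "1110" is chosen: it scores about as well as "0001" but is much less similar to the last deme, while "1111" is excluded because its score is too far below the best.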