The MOSES Metapopulation

Kaj Sotala recently asked for an update on how MOSES selects a “species” to be “mutated” when it is searching for the fittest program tree. I have some ongoing, unfinished research in this area, but perhaps this is a good time to explain where things currently stand.

To recap: MOSES is a program-learning system. That is, given some input data, MOSES attempts to learn a computer program that reproduces the data. It does so by applying a mixture of evolutionary algorithms: an “inner loop” and an “outer loop”. The inner loop explores all of the mutations of a “species” (a “deme”, in MOSES terminology), while the outer loop chooses the next deme to explore. (Each deme is a “program tree”, that is, a program written in a certain lisp-like programming language).

So: the outer loop selects some program tree, whose mutations will be explored by the inner loop. The question becomes: “which program tree should be selected next?” Now, nature gets to evolve many different species in parallel; but here, where CPU cycles are expensive, it's important to pick a tree whose mutations are “most likely to result in an even fitter program”. This is a bit challenging.

MOSES works from a pool of candidate trees, of various fitnesses. With each iteration of the inner loop, the pool is expanded: when some reasonably fit mutations are found, they are added to the pool. Think of this pool as a collection of “species”, some similar, some not, some fit, some not so much. To iterate the outer loop, it seems plausible to take the fittest candidate in the pool and mutate it, looking for improvements. If none are found, then in the next go-around, the second-most-fit program is explored, and so on. (Terminology: in MOSES, the pool is called the “metapopulation”.)
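For concreteness, here is a toy sketch (in Python, not the actual MOSES C++ code) of the loop structure and the naive “fittest-first” strategy just described. The bit-string “programs”, the single-bit-flip mutations, and the bit-matching score are stand-ins for real combo program trees, chosen only so the shape of the algorithm is runnable.

```python
# Toy sketch of the outer/inner loop with a naive "fittest-first" pool.
# Bit strings stand in for program trees; fitness is the number of bits
# that match a fixed target. None of this is the real MOSES representation.

TARGET = "1011001110"

def score(tree):
    return sum(a == b for a, b in zip(tree, TARGET))

def mutations(tree):
    # The inner loop's neighbourhood: all single-bit flips of the exemplar.
    for i in range(len(tree)):
        yield tree[:i] + ("1" if tree[i] == "0" else "0") + tree[i + 1:]

metapopulation = {"0000000000": score("0000000000")}   # the pool of "species"

for _ in range(20):                                         # outer loop
    exemplar = max(metapopulation, key=metapopulation.get)  # fittest deme first
    for mutant in mutations(exemplar):                      # inner loop
        s = score(mutant)
        if s >= metapopulation[exemplar]:                   # keep reasonably fit mutants
            metapopulation[mutant] = s

best = max(metapopulation, key=metapopulation.get)
print(best, metapopulation[best])
```

With this fittest-first rule the pool quickly collapses onto one lineage, which is exactly the slowness problem the next paragraph turns to.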

It turns out (experimentally) that this results in a very slow algorithm. A much better approach is to pick randomly from the highest scorers: one has a much better chance of getting lucky this way. But how to pick randomly? The high scorers are each given a probability p ∝ exp(score / T), so the highest-scoring trees have the highest probability of being picked, but the poorly-scoring ones have a chance too. This distribution is the “Gibbs measure”, aka the “Boltzmann distribution”; T is a kind of “temperature”, providing a scale, and it is held constant in the current algorithms. I'm guessing that this is the right measure to apply here, and can do some deep theoretical hand-waving, but haven't really worked this out in detail. Experimentally, it works well; there even seems to be a preferred temperature that works well for most or all of the different problems (though this is not entirely clear).
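Here is a minimal sketch of that Boltzmann-weighted pick over the metapopulation; the candidate scores and the temperature are made-up illustrative numbers, not MOSES defaults.

```python
import math
import random

def boltzmann_pick(candidates, temperature=1.0):
    """candidates: list of (tree, score) pairs; returns one tree,
    chosen with probability proportional to exp(score / T)."""
    # Subtracting the max score before exponentiating avoids overflow
    # and does not change the relative probabilities.
    top = max(score for _, score in candidates)
    weights = [math.exp((score - top) / temperature) for _, score in candidates]
    trees = [tree for tree, _ in candidates]
    return random.choices(trees, weights=weights, k=1)[0]

metapopulation = [("tree-A", -10.0), ("tree-B", -12.5), ("tree-C", -30.0)]
print(boltzmann_pick(metapopulation, temperature=2.0))
```

Lowering the temperature concentrates nearly all of the probability on the top scorer; raising it flattens the choice toward uniform.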

One can do even better. Instead of using the raw score, a blend of score minus program-tree complexity works better; again, this is experimentally verified. Nil added this a while back, and his theoretical justification was to call it “Solomonoff complexity” and turn it into a “Bayesian prior”. From an engineering viewpoint, it basically says that, to create a good design suitable for some use, it is better to start with a simple design and modify it than to start with a complex design and modify it. In MOSES terminology, it is better to pick an initial low-complexity but poorly-scoring deme and mutate it, than to start with something of high complexity and high score and mutate that. Exactly what the blending ratio between score and complexity should be, and how to interpret it, is an interesting question.
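As a sketch, assuming the blend is a simple linear trade-off (the `penalized_score` name, the `complexity_ratio` parameter, and the numbers below are illustrative, not the actual MOSES tunables):

```python
def penalized_score(raw_score, tree_complexity, complexity_ratio=0.5):
    # Penalize larger program trees, so a simple deme with a slightly worse
    # raw score can out-rank a complex deme with a slightly better one.
    return raw_score - complexity_ratio * tree_complexity

# A complex high scorer vs. a simple lower scorer:
print(penalized_score(raw_score=-10.0, tree_complexity=40))  # -30.0
print(penalized_score(raw_score=-14.0, tree_complexity=5))   # -16.5
```

Feeding this penalized score into the exp(score / T) weighting above gives the combined selection rule.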

Experimentally, I see another interesting behaviour that I am trying to “fix”. I see a very classic “flight of the swallow” learning curve, dating back to the earliest measurements of the speed of telegraph operators in the late 19th century. At first, learning is fast, and then it stalls, until there is a break-through; then learning is again fast (for a very brief time: weeks, for telegraph operators) and then stalls again (for years or a decade, for telegraph operators). In MOSES it is the same: at first, one picks a deme, almost any deme, and almost any mutation will improve upon it. This goes on for a while, and then plateaus. Then there’s a long dry spell: picking deme after deme, mutating it, and finding very little or no improvement. This goes on for a long time (say, thousands of demes, hours of CPU time), when suddenly there is a break-through: dozens of different mutations to some very specific deme all improve the score by some large amount. The Boltzmann weighting above causes these to be explored in the next go-around, and mutations of these, in turn, all yield improvements too. This lasts for maybe 10-20 steps, and then the scores plateau again. Exactly like the signalling rate of 19th-century telegraph operators :-) Or the ability of guitar players. Or sportsmen: all of these have been measured in various social-science studies, and all show the “flight of the swallow” curve.

(Can someone PLEASE fix the horribly deficient Wikipedia article on “learning curve”? It totally fails to cite any of the seminal research and breakthroughs on this topic. Check out Google Images for examples of fast learning, followed by a long plateau.

So, for example:

[Images: two stylized “learning curve” figures, and one captioned “Learning curve. In real life. For Salesmen.”]

Actual MOSES curves look more like this: rapid progress followed by stagnant plateaus, punctuated by rapid progress again. Except the plateaus are much flatter and much longer, and the upward curves are much sharper and faster.

All these curves raise the question: why does Google find only the highly stylized ones, and not show any drawn from raw, actual data? Has the learning curve turned into an urban legend?

Here's a real-life learning curve taken from MOSES, using real data (the "bank" dataset) from a previous OpenCog blog post on MOSES, although this curve shows a combination of the inner and outer loops, and so, strictly speaking, does not represent exactly what I'm discussing here.

).

Recently, I have been trying to shorten the plateau by trying to make sure that the next deme I pick for exploration is one that is least similar to the last one explored. The rationale here is that the metapopulation gets filled with lots of very similar species, all of which are almost equally fit, and all of which are “genetically” very similar. Trying to pick among these, to find the magic one whose mutations will yield a break-through, seems to be a losing strategy. So, instead, add a diversity penalty: explore those “species” that are as different as possible from the current one (but still have about the same fitness score). So far, this experiment is inconclusive; I wasn't rewarded with instant success, but more work needs to be done. It's actually fairly tedious to take the data…
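As a sketch, here is one way such a diversity penalty could be folded into the Boltzmann-weighted pick described earlier; the distance measure, the additive penalty form, and every constant here are my illustrative assumptions, not the actual MOSES diversity code:

```python
import math
import random

def tree_distance(tree_a, tree_b):
    # Stand-in metric: fraction of differing characters between two
    # flattened trees. Real code would use a tree-edit or behavioural distance.
    n = max(len(tree_a), len(tree_b))
    return sum(a != b for a, b in zip(tree_a.ljust(n), tree_b.ljust(n))) / n

def diverse_boltzmann_pick(candidates, last_explored,
                           temperature=1.0, diversity_weight=5.0):
    """candidates: list of (tree, score) pairs; favour trees that are
    both fit and dissimilar to the deme explored last."""
    def effective(tree, score):
        return score + diversity_weight * tree_distance(tree, last_explored)
    top = max(effective(t, s) for t, s in candidates)
    weights = [math.exp((effective(t, s) - top) / temperature)
               for t, s in candidates]
    return random.choices([t for t, _ in candidates], weights=weights, k=1)[0]

pool = [("and(x y z)", -10.0), ("and(x y w)", -10.1), ("or(not(x) z)", -10.4)]
print(diverse_boltzmann_pick(pool, last_explored="and(x y z)"))
```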

  • Linas Vepstas

    The appearance of the punctuated learning curve is very clear in MOSES, and I presume that it is seen in other genetic-algorithm systems, though I can't say I've seen any reports. I've vaguely heard of “punctuated evolutionary periods”, but again, have no references I can think of. The learning curve in humans is clear, and well-studied.

    I have no doubt that all of these phenomena are due to exactly one and the same underlying mathematical structure: a high-dimensional space being searched for a better solution. Finding an improvement is like a blind man searching for the hole on a golf green: very hard, almost accidental. But when it's found, a whole new vista opens up. I presume that the underlying mathematical system would be similar: a broadly flat fitness landscape, with a few very deep holes which, after going through them, open up onto whole new vistas of entirely new golf courses. Has anyone read anything to this effect?

  • tlh24

     Linas,

    I have not read anything to that effect, though I really like your golf analogy.

    My personal intuition is that the structure of the space to be searched has complexity controlled by the lifetime, fidelity, and quantity of memory that the system to be evolved has.  IMHO, this is why it is easy to fit FIR adaptive filters (through LMS) while IIR adaptive filters are harder (but not impossible — solutions are on a manifold determined by the structure of the feedback); programs can have unlimited (?) feedback and memory (or contingencies), which makes the golf-green complicated as you suggest.

    Very thought provoking.  Keep up the good work!

    Tim

  • The Krio

    Learning to learn, where learning time is used for scoring, might ultimately encode the data into your learning algorithm. Your data should be different on each pass; that way it would encode only metadata.

  • Christine

    Can you force the inner loop to find solutions that are not like any others, by blocking off, via parameters, anything that will evolve toward solutions that already exist? Assuming, of course, that what has already been found is not efficient and has reached a plateau that is heading toward inefficiency. That is, at a certain point of non-evolutionary gains, put all resources into creating a new species by artificially defining a crisis, creating the environment in which innovation not only can occur, but must occur. Or, to follow along the lines of your golf analogy, tell it to go nowhere but down the rabbit hole. I know almost nothing about the actual programming you're doing, but I do develop a lot of algorithms for research problems, and have experienced the crisis that leads to the evolution of new solutions. If your goal is to follow the human learning curve, look to people or entities with fast learning curves, not the mean. A crisis of resources is usually what provides the necessary incentive to learn. Not more time, and not more solutions to choose from, but precisely the opposite: less time and fewer possible resources and solutions.

  • Christine

    I also wonder about the target output: whether there is any devolution (?) of the intended results of any evolved program. That is, given a statistical spread of the output, rather than making it a hard target, put randomness at the tail end rather than the front end (or in combination), and categorize near-hits vs. exact hits; a sort of collateral-damage approach. A near hit at a moving target could very well be a direct hit when randomness over time comes into play. A different sort of fine tuning, but probably one you've already taken into account. I really appreciate your explanations of how MOSES works, by the way.

  • Linas Vepstas

    A recently added feature is a “diversity pressure”, which forces the algorithm to spend more time on less fit, but new, novel solutions. The hope is that, in this way, local maxima can be evaded. I haven't yet examined just how effective this is.