The MOSES Metapopulation

Kaj Sotala recently asked for an update on how MOSES selects a “species” to be “mutated”, when it is searching for the fittest program tree. I have some on-going, unfinished research in this area, but perhaps this is a good time to explain how selection currently works, and why it remains an open question.

To recap: MOSES is a program-learning system. That is, given some input data, MOSES attempts to learn a computer program that reproduces the data. It does so by applying a mixture of evolutionary algorithms: an “inner loop” and an “outer loop”. The inner loop explores all of the mutations of a “species” (a “deme”, in MOSES terminology), while the outer loop chooses the next deme to explore. (Each deme is a “program tree”, that is, a program written in a certain Lisp-like programming language.)

So: the outer loop selects some program tree, whose mutations will be explored by the inner loop. The question becomes: “which program tree should be selected next?” Nature gets to evolve many different species in parallel; but here, where CPU cycles are expensive, it’s important to pick a tree whose mutations are most likely to result in an even fitter program. This is a bit challenging.

MOSES works from a pool of candidate trees, of various fitnesses. With each iteration of the inner loop, the pool is expanded: when some reasonably fit mutations are found, they are added to the pool. Think of this pool as a collection of “species”: some similar, some not; some fit, some not so much. To iterate the outer loop, it seems plausible to take the fittest candidate in the pool, and mutate it, looking for improvements. If none are found, then on the next go-around, the second-most-fit program is explored, etc. (Terminology: in MOSES, this pool is called the “metapopulation”.)

It turns out (experimentally) that this results in a very slow algorithm. A much better approach is to pick randomly from among the highest scorers: one has a much better chance of getting lucky this way. But how to pick randomly? The highest scorers are given a probability p ~ exp(score / T), so the highest-scoring have the highest probability of being picked, but the poorly-scoring have a chance too. This distribution is the “Gibbs measure”, aka the “Boltzmann distribution”. (T is a kind of “temperature”: it provides a scale, and it’s held constant in the current algos.) I’m guessing that this is the right measure to apply here, and can do some deep theoretical handwaving, but haven’t really worked this out in detail. Experimentally, it works well; there even seems to be a preferred temperature that works well for most or all different problems (but this is not exactly clear).

One can do even better. Instead of using the score alone, a blend of score minus program-tree complexity works better; again, this is experimentally verified. Nil added this a while back; his theoretical justification was to call it “Solomonoff complexity”, and turn it into a “Bayesian prior”. From an engineering viewpoint, it’s basically saying that, to create a good design suitable for some use, it’s better to start with a simple design and modify it, than to start with a complex design and modify it. In MOSES terminology: it’s better to pick an initial low-complexity but poorly-scoring deme, and mutate it, than to start with something of high complexity and high score, and mutate that. Exactly what the blending ratio between score and complexity should be, and how to interpret it, is an interesting question.
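To make this concrete, here is a minimal sketch of the selection rule just described: Boltzmann-weighted choice over the metapopulation, with the complexity penalty blended in. This is illustrative Python, not the actual C++ MOSES code; the Candidate container and the default temp and weight values are assumptions made for this example.

import math
import random
from collections import namedtuple

# A metapopulation member: a program tree plus its score and complexity.
# (Hypothetical container; the real MOSES data structures differ.)
Candidate = namedtuple("Candidate", ["tree", "score", "complexity"])

def select_exemplar(metapop, temp=1.0, weight=0.25):
    # Pick the next deme to explore, with probability proportional to
    # exp((score - weight * complexity) / temp).  Sketch only.
    blended = [c.score - weight * c.complexity for c in metapop]
    hi = max(blended)  # subtract the max before exp(), for numerical stability
    boltzmann = [math.exp((b - hi) / temp) for b in blended]
    return random.choices(metapop, weights=boltzmann, k=1)[0]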

Experimentally, I see another interesting behaviour that I am trying to “fix”. I see a very classic “flight of the swallow” learning curve, dating back to the earliest measurements of the speed of telegraph operators in the late 19th century. At first, learning is fast, then it stalls, until there is a break-through; then learning is again fast (for a very brief time — weeks, for telegraph operators) and then stalls again (for years or a decade, for telegraph operators). In MOSES, at first, one picks a deme, almost any deme, and almost any mutation will improve upon it. This goes on for a while, and then plateaus. Then there’s a long dry spell: picking deme after deme, mutating it, and finding very little or no improvement. This goes on for a long time (say, thousands of demes, hours of CPU time), when suddenly there is a break-through: dozens of different mutations to some very specific deme all improve the score by some large amount. The Boltzmann weighting above causes these to be explored in the next go-around, and mutations of these, in turn, all yield improvements too. This lasts for maybe 10-20 steps, and then the scores plateau again. Exactly like the signalling rate of 19th-century telegraph operators 🙂 Or the ability of guitar players. Or of sportsmen; all of these have been measured in various social-science studies, and all show the “flight of the swallow” curve.

(Can someone PLEASE fix the horribly deficient Wikipedia article on “learning curve”? It totally fails to cite any of the seminal research and breakthroughs on this topic. Check out Google Images for examples of fast learning, followed by a long plateau.

So, for example:

Learning curve (stylized)

Learning curve. In real life. For salesmen.

Actual MOSES curves look more like this, with rapid progress followed by stagnant plateaus, punctuated with rapid progress, again; except that the plateaus are much flatter and much longer, and the upward curves are much sharper and faster.

All these curves raise the question: why is Google finding only the highly stylized ones, and not showing any for raw, actual data? Has the learning curve turned into an urban legend?

Here's a real-life learning curve, taken from MOSES, using real data (the "bank" dataset) from a previous OpenCog blog post on MOSES; although, strictly speaking, this curve shows a combination of the inner and outer loops, and so does not represent exactly what I'm discussing here.

).

Recently, I have been trying to shorten the plateau, by trying to make sure that the next deme I pick for exploration is one that is least similar to the last one explored. The rationale here is that the metapopulation gets filled with lots of very, very similar species, all of which are almost equally fit, and all of which are “genetically” very similar. Trying to pick among these, to find the magic one, the one whose mutations will yield a break-through, seems to be a losing strategy. So, instead, add a diversity penalty: explore those “species” that are as different as possible from the current one (but still have about the same fitness score). So far, this experiment is inconclusive; I wasn’t rewarded with instant success, but more work needs to be done. It’s actually fairly tedious to take the data…
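For concreteness, here is one way such a diversity penalty could look. Again, this is an illustrative sketch, not the actual experimental code; the distance() measure on program trees (some tree-edit or behavioural distance, say) and the score window are hypothetical stand-ins.

def pick_diverse(metapop, last, distance, window=1.0):
    # Among candidates scoring within `window` of the best, pick the one
    # most dissimilar to the deme explored last.  Sketch only.
    best = max(c.score for c in metapop)
    near_best = [c for c in metapop if c.score >= best - window]
    return max(near_best, key=lambda c: distance(c.tree, last.tree))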


Fishgram: Frequent Interesting Subhypergraph Mining for OpenCog

One of the tools OpenCog has needed for a long time is something that can relatively quickly scan an Atomspace and find the interesting patterns in it.  “Interesting” may be defined in a variety of ways, such as “frequent”, or “surprising” (as measured by information theory), etc.  This capability has often been referred to in OpenCog documents as “pattern mining.”

Jade O’Neill (fka Jared Wigmore) implemented Python software doing this for the Atomspace some time ago — Fishgram, the Frequent Interesting SubHyperGRaph Miner.   Fishgram has been used to recognize patterns in Atomspaces resulting from OpenCog’s “perception” of a Unity3D-based virtual world.

Now,  a wiki page has been created, covering some details of Fishgram — including pseudocode, an explanation of the algorithm, and some indication of which software classes carry out which parts of the algorithm…

http://wiki.opencog.org/w/Fishgram

Plenty more work needs to be done with Fishgram; yet it does currently work, and can extract some interesting patterns from Atomspaces.

Some simple examples have also been done, feeding patterns output via Fishgram into PLN…

I think this is a very valuable tool that could be used for a lot of different OpenCog applications, and it would be great to see others jump onto it and help with development.

The current version of Fishgram looks for frequent subhypergraphs (i.e. frequent subhypergraph patterns, which may contain  multiple variables).  One thing that Jade and I have talked about a lot is extending Fishgram to search for “surprising” subhypergraphs, where surprisingness may be measured using interaction information or synergy, as described in these papers:

http://www.rni.org/bell/nara4.pdf

http://arxiv.org/abs/1004.2515/

Those who like logic may also enjoy this paper, which connects interaction information with the logic of questions:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.6110

It seems that a good implementation of a good measure of surprisingness will be a valuable thing to have in OpenCog generally, not just in Fishgram.   If we want “novelty seeking” to be one of the top-level goals of a young AGI or proto-AGI system (which I think we do), then having a nice way to measure novelty seems like a good thing — and the interaction information and the informational synergy, as described in these papers, seem a good approach.
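For the curious, here is a rough illustration (not the Fishgram code) of McGill’s three-variable interaction information, computed from joint counts via I(X;Y;Z) = I(X;Y|Z) − I(X;Y); note that sign conventions for this quantity vary across the literature.

import math
from collections import Counter

def entropy(samples, indices):
    # Shannon entropy (in bits) of the joint distribution of the given
    # variable indices, estimated from a list of sample tuples.
    counts = Counter(tuple(s[i] for i in indices) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def interaction_information(samples):
    # McGill's I(X;Y;Z) = I(X;Y|Z) - I(X;Y), expanded into joint entropies.
    H = lambda idx: entropy(samples, idx)
    return (H((0, 1)) + H((0, 2)) + H((1, 2))
            - H((0,)) - H((1,)) - H((2,)) - H((0, 1, 2)))

# XOR is the classic example of synergy: Z = X xor Y.
data = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
print(interaction_information(data))  # 1.0: one full bit of synergy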

Onward and upward 😉

Ben G


Genetic Crossover in MOSES

MOSES is a system for learning programs from input data.  Given a table of input values, and a column of outputs, MOSES tries to learn a program, the simplest program that can reproduce the output given the input values. The programs that it learns are in the form of a “program tree” —  a nested concatenation of operators, such as addition or multiplication, boolean ANDs or ORs, if-statements, and the like, taking the inputs as arguments.  To learn a program, it starts by guessing a new random program.  More precisely, it generates a new, random program tree, with as-yet unspecified operators at the nodes of the tree. So, for example, an arithmetic node may be addition, subtraction, multiplication, or division, or it may be entirely absent.  It hasn’t yet been decided which.   In MOSES, each such undecided node is termed a “knob”, and program learning is done by “turning the knobs” until a reasonable program is found.  But things don’t stop there: once a “reasonable” program is found, a new, random program tree is created by decorating this “most reasonable” program with a new set of knobs.  The process then repeats: knobs are turned until an even better program is found.

Thus, MOSES is a “metalearning” system: it consists of an outer loop, that creates trees and knobs, and an inner loop, that finds optimal knob settings.  Both loops “learn” or “optimize”; it is the nesting of these that garners the name “metalearning”. Each loop can use completely different optimization algorithms in its search for optimal results.

The rest of this post concerns this inner loop, and making sure that it finds optimal knob settings as quickly and efficiently as possible. The space of all possible knob settings is large: if, for example, each knob has 5 possible settings, and there are 100 knobs, then there is a total of 5^100 possible different settings: a combinatorial explosion. Such spaces are hard to search. There are a variety of different algorithms for exploring such a space. One very simple, very traditional algorithm is “hillclimbing”. This algo starts somewhere in this space, at a single point, say, the one with all the knobs set to zero. It then searches the entire local neighborhood of this point: each knob is varied, one at a time, and a score is computed. Of these scores, one will be best. The corresponding knob setting is then picked as the new center, and the process repeats; it repeats until there is no improvement: until one can’t “climb up this hill” any further. At this point, the inner loop is done; the “best possible” program has been found, and control is returned to the outer loop.
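In rough Python (a sketch of the idea, not the MOSES implementation; score() and neighbors() stand in for MOSES’s scoring and knob-twiddling machinery):

def hillclimb(start, score, neighbors):
    # Exhaustive single-knob hillclimbing.  neighbors(inst) yields every
    # instance differing from inst by exactly one knob setting.
    center, center_score = start, score(start)
    while True:
        best, best_score = center, center_score
        for nbr in neighbors(center):        # scan the whole neighborhood
            s = score(nbr)
            if s > best_score:
                best, best_score = nbr, s    # remember the top scorer
        if best is center:
            return center, center_score      # no improvement: local maximum
        center, center_score = best, best_score  # recenter and repeat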

Hill-climbing is a rather stupid algorithm: most knob settings will result in terrible scores, and are pointless to explore, but the hill-climber does so anyway, as it has no clue as to where the “good knobs” lie. It does an exhaustive search of the local neighborhood of single-knob twists.  One can do much better by using estimation-of-distribution algorithms, such as the Bayesian Optimization Algorithm.  The basic premise is that knob settings are correlated: good settings are near other good settings.  By collecting statistics and computing probabilities, one can make informed, competent guesses at which knob settings might actually be good.  The downside to such algorithms is that they are complex:  the code is hard to write, hard to debug, and slow to run: there is a performance penalty for computing those “educated guesses”.

This post explores a middle ground: a genetic cross-over algorithm that improves on simple hill-climbing simply by blindly assuming that good knob settings really are “near each other”, without bothering to compute any probabilities to support this rash assumption.  The algorithm works; headway can be made by exploring only the small set of knob settings that correlate with previous good knob settings.

To explain this, it is time to take a look at some typical “real-life” data. In what follows, a dataset was collected from a customer-satisfaction survey; the goal is to predict satisfaction from a set of customer responses.  The dataset is a table; the outer loop has generated a program decorated with a set of knobs.  Starting with some initial knob setting, we vary each knob in turn, and compute the score. The first graph below shows what a  typical “nearest neighborhood” looks like.  The term “nearest neighborhood” simply means that, starting with the initial knob setting, the nearest neighbors are those that differ from it by exactly one knob setting, and no more.  There is also a distance=2 neighborhood: those instances that differ by exactly two knob settings from the “center” instance.  Likewise, there is a distance=3 neighborhood, differing by 3 knob settings, etc. The size of each neighborhood gets combinatorially larger.  So, if there are 100 knobs, and each knob has five settings, then there are 5 × 100=500 nearest neighbors. There are 500 × 499 / 2 = 125K next-nearest neighbors, and 500 × 499 × 498 / (2 × 3) = 21M instances at distance=3. In general, this is the binomial coefficient: (500 choose k) for distance k. Different knobs, however, may have more or fewer than just 5 settings, so the above is just a rough example.
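A quick sanity check of those neighborhood sizes (Python 3.8+, for math.comb):

import math

nearest = 500  # ~5 settings x 100 knobs, per the rough example above
for k in (1, 2, 3):
    print(k, math.comb(nearest, k))
# prints:
# 1 500
# 2 124750      (~125K next-nearest neighbors)
# 3 20708500    (~21M instances at distance 3)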

Nearest Neighbor Scores

The above graph shows the distribution of nearest neighbor scores, for a “typical” neighborhood. The score of the center instance (the center of the neighborhood) is indicated by the solid green line running across the graph, labelled “previous high score”.  All of the other instances differ by exactly one knob setting from this center.  They’ve been scored and ranked, so that the highest-scoring neighbors are to the left.  As can be seen, there are maybe 15 instances with higher scores than the center, another 5 that seem to tie.  A slow decline is followed by a precipitous drop; there are another 80 instances with scores so bad that they are not shown in this figure.  The hill-climbing algo merely picks the highest scorer, declares it to be the new center, and repeats the process.

All of the other neighborhoods look substantially similar. The graph below shows an average over many generations (here, each iteration of the inner loop is one generation). The jaggedness above is smoothed out by averaging.

Nearest Neighbor Score Change

Rather than searching the entire neighborhood, one would like to test only those knob settings likely to yield good scores. But which might these be?  For nearest neighbors, there is no way to tell, without going through the bother of collecting statistics, and running them through some or another Bayesian estimation algorithm.

However, for more distant neighbors, there is a way of guessing and getting lucky: perform genetic cross-overs.  That is, take the highest and next-highest scoring instances, and create a new instance that differs from the center by two knob settings: the two knobs associated with the two high scorers.  In fact, this new instance will very often be quite good, beating both of its parents.   The graph below shows what happens when we cross the highest scorer with each one of the next 70 highest. The label “1-simplex” simply reminds us that these instances differ by exactly two knob settings from the center.  More on simplexes later.  The green zero line is located at the highest-scoring single-knob change.  The graph shows that starting here, and twiddling the next-most-promising knob, can often be a win. Not always: in the graph below, only 4 different knobs showed improvement. However, we explored relatively few instances to find these four; for this dataset, most exemplars have thousands of knobs.

Average Score Change, 1-simplex

The take-away lesson here is that we can avoid exhaustive searches by simply crossing the 10 or 20 or 30 best instances, and hoping for the best. In fact, we get lucky with these guesses quite often. What happens if, instead of just crossing two, we cross three of the top scorers?  This is the “2-simplex”, below:

Average Score Change, 2-simplex

Notice that there are now even more excellent candidates!  How far can we go?  The 3-simplex graph below shows the average score change from crossing over four high-scoring instances:

Average Score Change, 3-simplex

The term “crossover” suggests some sort of “sexual genetic reproduction”. While this is correct, it is somewhat misleading.   The starting population is genetically very uniform, with little “genetic variation”.  The algorithm starts with one single “grandparent”, and produces a population of “parents”, each of which differs from the grandparent by exactly one knob setting. In the “nearest neighborhood” terminology, the “grandparent” is the “center”, and each “parent” is exactly one step away from this center. Any two “parents”, arbitrarily chosen, will always differ from one another by exactly two knob settings. Thus, crossing over two parents will produce a child that differs by exactly one knob setting from each parent, and by two from the grandparent. In the “neighborhood” model, this child is at distance=2 from the grandparent.   For the case of three parents, the child is at distance=3 from the grandparent, and so on: four parents produce a child that is at distance=4 from the grandparent.  Thus, while “sexual reproduction” is a sexy term, it loses its punch with the rather stark uniformity of the parent population; thinking in terms of “neighbors” and “distance” provides a more accurate mental model of what is happening here.

The term “simplex” used above refers to the shape of the iteration over the ranked instances: a 1-simplex is a straight line segment, a 2-simplex is a right triangle, a 3-simplex is a right tetrahedron. The iteration is performed with 1, 2 or 3 nested loops that cross over 1, 2 or 3 instances against the highest. It is important to notice that the loops do not run over the entire range of nearest neighbors, but only over the top scoring ones. So, for example, crossing over the 7 highest-scoring instances for the 3-simplex generates 6!/(6-3)! = 6 × 5 × 4 = 120 candidates. Scoring a mere 120 candidates can be very quick, as compared to an exhaustive search of many thousands of nearest neighbors. Add to this the fact that most of the 120 are likely to score quite well, whereas only a tiny handful of the thousands of nearest neighbors will show any improvement, and the advantage of this guessing game is quite clear.
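Here is a sketch of that candidate generation, representing an instance as a map from knobs to settings (an assumption made for illustration; the real MOSES representation differs). Note that this sketch uses unordered combinations, so crossing the 7 top-scoring moves at the 3-simplex level yields C(6,3) = 20 distinct instances, since the order in which the knobs are turned does not change the result.

from itertools import combinations

def simplex_candidates(center, top_moves, d):
    # Each "move" is a (knob, setting) pair that scored well when applied
    # alone to the center.  A d-simplex candidate applies the best move
    # plus d of the runners-up: distance d+1 from the center.  Sketch only.
    best = top_moves[0]
    for combo in combinations(top_moves[1:], d):
        inst = dict(center)              # copy the center's knob settings
        for knob, setting in (best,) + combo:
            inst[knob] = setting         # turn each promising knob
        yield inst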

So what is it like, after we put it all together? The graph below shows the score as a function of runtime.

Score as function of time

In the above graph, each tick mark represents one generation. The long horizontal stretches between tick marks shows the time taken to perform an exhaustive nearest-neighborhood search. For the first 100 seconds or so, the exemplar has very few knobs in it (a few hundred), and so an exhaustive search is quick and easy. After this point, the exemplars get dramatically more complex, and consist of thousands of knobs. At this point, an exhaustive neighborhood search becomes expensive: about 100 seconds or so, judging from the graph. While the exhaustive search is always finding an improvement for this dataset, it is clear that performing some optimistic guessing can improve the score a good bit faster. As can be seen from this graph, the algorithm falls back to an exhaustive search when the optimistic simplex-based guessing fails to show improvement; it then resumes with guessing.

To conclude: for many kinds of datasets, a very simple genetic-crossover algorithm combined with hillclimbing can prove a simple but effective search algorithm.

Nota bene: the above only works for some problem types; thus it is not (currently) enabled by default. To turn it on, specify the -Z1 flag when invoking moses.

Appendix

Just to keep things honest, and to show some of the difficulty of algorithm tuning, below is a graph of some intermediate results taken during the work.  I won’t explain what they all are, but do note one curious feature:  the algos which advance the fastest initially seem to have trouble advancing later on.  This suggests a somewhat “deceptive” scoring landscape: the strong early advancers get trapped in local maxima that they can’t escape.   The weak early advancers somehow avoid these traps.  Note also that results have some fair dependence on the random number generator seed; different algos effectively work with different random sequences, and so confuse direct comparison by some fair bit.

Many Different Algorithms


Tuning Metalearning in MOSES

I’ve been studying MOSES recently, with an eye towards performance tuning it. Turns out optimization algorithms don’t always behave the way you think they do, and certainly not the way you want them to.

Given a table of values, MOSES will automatically learn a program that reproduces those values. That is, MOSES performs table regression: given N columns of “input” values, and one column of “output”, MOSES will create a program that outputs the output, given the inputs.  MOSES can deal with both floating point and boolean inputs, and thus can learn, for example, expressions such as ((x<2) AND b) OR (x*(y+1) >3).  MOSES programs are real “programs”: it can even learn branches and loops, although I won’t explore that here. For performance tuning, I studied the 4-parity problem: given 4 input bits, compute the parity bit.  Written out in terms of just AND, OR and NOT, this is a fairly complex expression, and is rather non-trivial to learn.

MOSES performs learning by keeping a “metapopulation” of example programs, or “exemplars”.  These are graded on how well they match the output, given the inputs. For the 4-parity problem, there are 2^4 = 16 different possible inputs; a given program may get any number of these correct.  For example, there are 16 ways to get one answer wrong; 16×15/2 = 120 ways to get two wrong; 16×15×14/6 = 560 ways to get three wrong, etc. In general, this is the binomial coefficient: (16 choose k) ways to get k answers wrong. But this doesn’t mean that there are only 16 different programs that get one answer wrong: there are zillions, some simple, some very very complex.

As MOSES iterates, it accumulates a metapopulation of programs that best fit the data. As soon as it finds a program that gets more correct answers than the others, the old metapopulation is wiped out; but then, it starts growing again, as new programs with equal score are found.  This is shown in the following graph:

Metapopulation size as function of generation number.

The red line shows the metapopulation size (divided by 50), as a function of the generation number (that is, the iteration count).  It can be seen to collapse every time the score improves. Here, the “minus score”, in green, is the number of wrong answers: a perfect score has zero wrong answers; the program stops when a perfect score is reached.

In blue, the complexity of the program — actually, the complexity of the least complex program that produces the given score. Computing the parity requires a fairly complex combination of ANDs, ORs and NOTs; there is a minimum amount of complexity such a program can have.  Here, for example, are two different programs that compute the parity perfectly, a short one:

and(or(and(or(and(!#1 !#2) and(!#3 #4)) or(!#2 !#3))
   and(#1 #2) and(#3 !#4))
   or(and(!#1 #2) and(#1 !#2) and(!#3 !#4) and(#3 #4)))

and a longer one:

or(and(or(and(or(!#1 !#3) #4) and(or(#1 !#2) !#3 !#4)
   and(or(#3 #4) #2)) or(and(or(!#1 !#4) #2 !#3)
   and(or(!#2 #3) #1 #4) and(!#1 !#4) and(!#2 #3)))
   and(#1 !#2 #3 !#4))

More on complexity later.

But first: how long does it take for MOSES to find a solution to 4-parity? It turns out that this depends strongly on the random-number sequence.  MOSES makes heavy use of a random number generator to explore the problem space.  Each run can be started with a different seed value, to seed the random number generator.  Some runs find the correct solution, some take a surprisingly long amount of time. Amazingly so: the distribution appears to follow a logarithmic distribution, as in the following graph:

Runtime, showing temperature dependence

On the vertical axis, the amount of time, in seconds, to find a solution. On the horizontal axis, the order in which a solution was found, out of 20 random attempts. The way to read this graph is as follows: there is a probability Pr=1/20 chance of finding a solution in about 10 seconds. There is a Pr=2/20 chance of finding a solution in about 20 seconds, etc. Continuing: about a Pr=6/20 chance of finding a solution in less than about 100 seconds, and a Pr=17/20 chance of finding a solution in less than about 1000 seconds.

The shape of this graph indicates that there is a serious problem with the current algorithm. To see this, consider running two instances of the  algorithm for 300 seconds each. Per the above graph, there is a 50-50 chance that each one will finish, or a 75% chance that at least one of them will finish.  That is, we have a 75% chance of having an answer after 600 CPU-seconds.  This is better than running a single instance, which requires about 900 seconds before it has a 75% chance of finding an answer!  This is bad.  It appears that, in many cases, the algorithm is getting stuck in a region far away from the best solution.

Can we do better? Yes. Write p = Pr(t&lt;T) for the probability that a single instance will find a solution in less time than T.  Then, from the complexity point of view, it would be nice to have an algorithm for which two instances, each run for time T, do NOT do better than a single instance run for time 2T; that is, an algorithm for which

Pr(t&lt;2T) ≤ p^2 + 2p(1-p)

The first term, p^2, is the probability that both instances finished.  The second term is the probability that one instance finished, and the other one did not (times two, as there are two ways this could happen).   More generally, for n instances, we sum the probability that all n finished, the probability that n-1 finished and one did not (n different ways), and so on, with binomial coefficients counting the ways:

Pr(t&lt;nT) ≤ p^n + n p^(n-1) (1-p) + [n(n-1)/2] p^(n-2) (1-p)^2 + … + n p (1-p)^(n-1)

(The right-hand side is just 1 - (1-p)^n: the probability that at least one of the n instances has finished.)

This inequality, this desired bound on performance, has a simple solution, given by the exponential decay of probability:

Pr(t<T) = 1-exp(-T/m)

As before,  Pr(t<T) is the probability of finding a solution in less than time T, and m is the mean time to finding a solution (the expectation value). To better compare the measured performance to this desired bound, we need to graph the data differently:

Showing the exponential bound

This graph shows the same data as before, but graphed differently: the probability of not yet having found a solution is shown on the horizontal axis. Note that this axis is logarithmic, so that the exponential decay bound becomes a straight line.  Here, the straight purple line shows the bound for a 500 second decay constant; ideally, we’d like an algorithm that generates points below this line.
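To reproduce this style of plot from a list of measured solve times, one can compute the empirical survival probabilities and the exponential reference line directly. A small sketch (the 500-second mean is the decay constant quoted above):

import math

def survival_points(times):
    # The data as graphed above: the k-th fastest of n measured runtimes
    # plots at probability 1 - k/n of not yet having found a solution.
    times = sorted(times)
    n = len(times)
    return [(1.0 - (k + 1) / n, t) for k, t in enumerate(times)]

def exponential_bound(p_unsolved, mean=500.0):
    # The straight purple line: if Pr(t<T) = 1 - exp(-T/m), then the time
    # at which probability p_unsolved remains unsolved is T = -m ln(p_unsolved).
    return -mean * math.log(p_unsolved)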

Before continuing, a short aside on the label “temp”, which we haven’t explained yet. During the search, MOSES typically picks one of the simplest possible programs out of the current metapopulation, and explores variations of it; that is, it explores its local neighborhood.  If it cannot find a better program, it picks another simple exemplar out of the metapopulation, and tries with that, and so on.   It occurred to me that perhaps MOSES was being too conservative in always picking from among the least complex exemplars.  Perhaps it should be more adventurous, and occasionally pick a complex exemplar, and explore variations on that.   The results are shown in the green and blue lines in the graph above.  The select_exemplar() function uses a Boltzmann distribution to pick the next exemplar to explore. That is, the probability of picking an exemplar of complexity C as a starting point is

exp(-C/temp)

where temp is the “temperature” of the distribution. The original MOSES algorithm used temp=1, which appears to be a bit too cold; a temperature of 2 seems about right.  With luck, this new, improved code will be checked into BZR by the time you read this.

There is another issue: the unbounded size of the metapopulation. When MOSES stalls, grinding away and having trouble finding a better solution, the size of the metapopulation tends to grow without bounds, linearly over time. It can get truly huge: sometimes up to a million, after a few thousand generations.  Maintaining such a large metapopulation is costly: it takes up storage, and eats up CPU time to keep it sorted in order of complexity.  Realistically, with a metapopulation that large, there is only a tiny chance (exponentially small!) that one of the high-complexity programs will be selected for the next round. The obvious fix is to clamp down on the population size, getting rid of the unlikely, high-complexity members.   I like the results so far:

Runtime, using a limited population size.

Clamping the population size clearly improves performance — by a factor of two or more, as compared to before.  However, the troublesome behavior, with some solutions being hard to discover, remains.
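The clamping itself can be as simple as the following sketch (the cap value here is made up for illustration): drop the high-complexity members first, since with Boltzmann selection over complexity their chance of ever being picked is exponentially small.

import heapq

def clamp_metapop(metapop, cap=1000):
    # Keep only the `cap` least-complex members of the metapopulation.
    if len(metapop) <= cap:
        return metapop
    return heapq.nsmallest(cap, metapop, key=lambda c: c.complexity)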

Now, to attack the main issue: let’s hypothesize what might be happening that causes the exceptionally long runtimes.  Perhaps the algorithm is getting stuck at a local maximum?  Due to the knob-insertion/tweaking nature of the algorithm, there are no “true” local maxima, but some may just have very narrow exits.  The standard solution is to apply a simulated-annealing-type trick, to bounce the solver out of the local maximum.  But we are already using a Boltzmann factor, as described above, so what’s wrong?

The answer seems to be that the algorithm was discarding the “dominated” exemplars, keeping only those with the best score, at varying levels of complexity; it applied the Boltzmann factor only to the complexity.  What if, instead, we applied the Boltzmann factor to a mixture of score and complexity?  Specifically, let’s try this:

exp(-(C - S·W) / temp)

Here, C is the complexity, as before, while S is the score, and W is a weight.  That is, some of the time, the algorithm will select exemplars with a poor score, thus bouncing out of the local maximum.  Setting W to zero regains the old behavior, where only the highest-scoring exemplars are explored.  So… does this work? Yes! Bingo! Look at this graph:

Score-weighted Annealing

Two sets of data points, those for W=1/4 and 1/3, look very good.  It’s somewhat strange and confusing that other W values do so poorly.   I’m somewhat worried that the W=1/4 value is “magical”: take a look again at the very first graph in this post.  Notice that every time a better solution is found, the complexity jumps by about 4.  Is the W=1/4 value special to the 4-parity problem? Will other problems behave similarly, or not?

I’m continuing to experiment. Collecting data takes a long time. More later… The above was obtained with the code in bzr revision 6573, with constant values for “temp” and “weight” hand-edited as per graphs. Later revisions have refinements that fundamentally alter some loops, including that in select_exemplar(), thus altering the range of reasonable values, and the meaning/effect of some of these parameters. Sorry 🙂

I do hope this post offers some insight into how MOSES actually works.  A general overview of MOSES can be found on the MOSES wiki, as well as a detailed description of the MOSES algorithm. But even so, the actual behavior, above, wasn’t obvious, at least to me, until I did the experiments.

Appendix: Parallelizability

A short footnote about the generic and fundamental nature of the exponential decay of time-to-solution in search problems. Earlier in this post, there is a derivation of exponential decay as the result of running N instances in parallel.   How should this be understood, intuitively?

Search algorithms are, by nature, highly parallelizable: there are many paths (aka exemplars) to explore; some lead to a solution, some do not.  (An exemplar is like a point on a path: from it, there are many other paths leading away).  A serial search algorithm must implement a chooser: which exemplar to explore next? If this chooser  is unlucky/unwise, it will waste effort exploring exemplars that don’t lead to a solution, before it finally gets around to the ones that do.  By contrast, if one runs N instances in parallel (N large), then the chooser doesn’t matter, as the N-1 ‘bad’ exemplars don’t matter: the one good one that leads to a solution will end the show.

Thus, we conclude: if a serial search algorithm follows the exponential decay curve, then it has an optimal chooser for the next exemplar to explore.  If it is “worse” than exponential, then the chooser is poorly designed or incapable.  If it is “better” than exponential, then that means that there is a fixed startup cost associated with each parallel instance: cycles that each instance must  pay, to solve the problem, but do not directly advance towards a solution.  Ideal algorithms avoid/minimize such startup costs.  Thus, the perfect, optimal algorithm, when run in serial mode, will exhibit exponential solution-time decay.

The current MOSES algorithm very nearly achieves this for 4-parity, as shown in this last figure, which compares the original chooser to the current one (bzr revno 6579):

Runtime, tuned chooser


Preview of a virtual learning environment

It’s been a while since the last update, but be assured we’ve been very busy working away on the embodiment code and developing our virtual learning environment. With AGI-11 now in progress at the Googleplex, we’ve put together a few videos to give an outline of what we’re working on. There isn’t any overly advanced learning going on yet, but it gives you a feel for where the project is going.

First up is a video demonstrating a human controlled player navigating and interacting with the world. This world is built in Unity3D, so eventually we’ll be able to put this environment online, make it an iPad app, or whatever else, and let you guys interact and teach OpenCog directly.

The things to note are that it’s based on a Minecraft-like environment, which means the player and AI will be able to modify the terrain and build things. Other objects can also be moved around and interacted with. We’ve got a very flexible action system that allows new action types to easily be added, and OpenCog will be able to learn the causal effect of executing these previously unknown actions.

Next is a demonstration of 3D pathfinding (implemented by Troy Deheng), with the satisfaction of a single “demand” for energy by consuming the battery object. In addition, it also shows simple planning: OpenCog asks the player for a battery if it can’t find one in the world. After being asked, the player spawns a new battery with a switch, and OpenCog efficiently detects the correlation between using the switch and new batteries appearing: in essence, learning a new behaviour to satisfy its goals.

Jared Wigmore has been working on Fishgram, which is a frequent sub-hypergraph miner that detects patterns in the AtomSpace (and so by extension, also in the virtual world). This is the component used to detect that pushing a switch creates a new battery.

Last is a shorter video showing demands beyond just energy. These are analogous to goals, but not quite: they are slightly different in that they impel the system towards certain goals. The new demand in this video is one for integrity, which is roughly analogous to fatigue/health. In the video, the house is known to satisfy this demand and increase integrity, but the system oscillates between which demand is most important: Integrity, then Energy, then back again. Zhenhua Cai has already added a couple more demands: Competence and Certainty.

Thanks to Cord Krohn for putting the videos together, as well as doing environment and character design, and thanks to Michael Jia for 3D modelling and art assets.


OpenCog Recap – March to May 2011

Wow – it’s been a while since the last OpenCog Recap. I think it’s time to rectify that, as there’s been a lot happening since then!

In early April, Ben visited Hong Kong to give guidance on the long-term plan for the HK project. We had many interesting discussions, and made plans for how to tackle a number of problems, including resolving issues in the language pipeline, frequent sub-graph mining, and recognizing event boundaries.

I implemented an initial workable version of OpenCog Python bindings, with support for MindAgents and more recently CogServer requests both written in Python.

Troy and Zhenhua developed the OpenCog embodiment workbench. The OpenCog workbench is designed to monitor the OpenPsi system, and other aspects of a running instance of OpenCog. Currently, the workbench has three modules: Psi monitor, Dialog system logger, and Psi Emotion Space. The Psi monitor is usable as it stands while the other two are under development. It’s written in Python and uses PyQt for the GUI library.

Jared has been doing a lot of background reading and has been experimenting with frequent sub-graph mining. He has currently decided on using SUBDUE to detect embodiment patterns, as it has a number of benefits over other packages.

Troy integrated minepackage, to provide a dynamic environment (with block destruction and creation) for OpenCog embodiment. Minepackage is an open-source package for creating cubic-type games, such as Minecraft and Infiniminer. The full package supports (or will support): terrain generation, fluid physics, decoration objects and lighting systems. However, our research project will only need some of these features, and will also be extending the game world significantly to provide an effective learning environment.

On this note, Cord Krohn has been working on the design side of things, and has put together some nice concept sheets for the characters in the demos. He is also planning how the learning we wish to demonstrate will best be represented in the Minecraft-like environment.

Zhenhua is currently replacing the OpenPsi equations with more suitable expressions based on AtomSpace contents and feedback from the world. Zhenhua also started work on the Dialogue System, which is related to the Perception/Action system that OpenPsi will inform.

Together with Michel, Shuo Chen and others, Ben Goertzel worked out a detailed design for integrating DeSTIN (a variant of HTM) with OpenCog. In addition, a quite extensive document on Michel’s work to port DeSTIN to CUDA is expected by the end of June as a result of Michel’s thesis.

Ben Goertzel and Nil Geisweiller have designed and implemented a framework for “feature metalearning” (transferring information on feature quality from one categorization problem to another), integrated it with MOSES, and are currently testing it on a large database of text categorization problems.

Matt Ikle’ prototyped a simple, non-scalable version of information geometry based economic attention allocation in OpenCog, and found dramatic intelligence improvements over the standard ECAN version for these simple cases. A draft paper is available here.

So, progress on a number of fronts, hopefully by the next recap we’ll have a video of embodied learning to show 😉


DeSTIN vision development

Late in 2010, a working group was formed to turn the research code for DeSTIN, originally implemented and developed by Tom Karnowski as part of his PhD, into a quality open-source project.

There have been a number of contributors so far, including:

  • Ted Sanders took the lead with tidying things up and getting things working.
  • The Xiamen BLISS lab provided a collection of gesture videos to train and test gesture recognition with DeSTIN.
  • Michel Drenthe, also located at the BLISS lab, is planning to port DeSTIN to take advantage of CUDA on NVIDIA graphics cards.
  • Amritpal Singh has been developing a wrapper around a SIFT library to do pre-processing of images to extract keypoints and do rudimentary object detection before feeding the annotated image into DeSTIN.

And because a picture is worth a thousand words (or perhaps a thousand keypoints?), here are a couple of images with Amritpal’s SIFT object recognition successfully detecting an apropos textbook.


OpenCog Hong Kong Project

I thought I should write a post to let everyone know where we are at with the Hong Kong AI project.

We’ve got ourselves established in the M-Lab currently, although Gino may want to relocate us to the School of Design on the PolyU campus. We’ve submitted an order to the University requisition people for 3 reasonably specc’ed workstations; we’ll no doubt order more for the rest of the team when they arrive (see below). In the meantime, there are older computers that are functional enough for now, and we have our personal laptops, which we’ve been using.

After a review of available game engines, and based on the expertise of the team, we’ve decided that Unity3D will be the best engine to use for the project. While Lucid2 was initially proposed, it only supports an old version of DirectX and the original development team isn’t available for round the clock support. Unity3D has an active community and also supports cross-platform development, including iOS (although no Linux support ;-( ).

Initially we were focused on speech interaction in Unity3D, and we managed to get pocketsphinx, the speech recognition library, working as a DLL plugin for Unity3D and controlling a robot character with very basic commands (“go left”, “go right”, “explode”, “stop”, etc.). I started playing with festival as a speech synthesis plugin, but then we realised that these aspects should probably not be a primary focus at the beginning, due to the trickiness of getting speech working right. Instead, in the meantime, we want to just use text input/output. We can build on that once the core dynamics are working correctly (or we have more people involved).

Thus, I am focused on creating a network interface for connecting Unity3D to OpenCog. This has involved understanding the existing network architecture of the embodiment system. At this stage it seems it will first involve making a NetworkElement class in C#, then generating the right XML messages. I think this can be improved with zeromq and protocol buffers, but we’ll wait until the existing pipeline works and I/we better understand the embodiment system.

Lester has been getting more familiar with Unity3D development, importing game characters from another in-house game project so that we have models to prototype with (until we get some graphic artists/modellers). He’ll also help with the Unity side of the OpenCog-Unity connection.

Cord is busy with game design. Loosely, we’ve been thinking about a game around the idea of The Incredible Machine meets Creatures (well, very roughly at least). There will be levels to solve, where you have to teach the AI character(s) to do certain behaviors, and can interact with them by talking… and, to a limited extent, by “playing god” (limited, because “playing god” uses “magic”/”psi”, of which you’ve only got a certain amount). In each level, your AI characters will remember the stuff you taught them earlier, and you’ll be able to tell them to interact with new objects they haven’t seen before, by telling them that the object is similar to another one they’ve already seen. This will be more fully detailed by the end of the week.

Progress is happening, and on a personal note it feels good to be spending all my time and mental energy on OpenCog rather than on distracting contracts to pay the bills 😉

I’ll endeavour to post regular updates, and perhaps a video or two, once we start having a functional demo.


OpenCog navigating Nao robot

Recently the team at Xiamen University have been working on integrating OpenCog with Nao robots. This recent video shows them using voice commands to tell the robot to walk from one object to the next. The robot hears, parses, and understands the command, and then uses pathfinding to get to the destination.

The team involved were: Professor Min Jiang and students Huang Deheng, Ye Yang, and Shuo Chen.


The AGI Summer School 2009

In the middle of last year, Xiamen University hosted the first international summer school on Artificial General Intelligence. While several of the core OpenCog developers and Ben Goertzel were there to teach, it passed by somewhat quietly on our blog, as we waited for the videos and presentations to be put together by independent film-maker Raj Dye.

Raj actually completed these in early 2010, but due to myself and others being very busy at the time, we didn’t post them here. However, we continue to get many requests for an easy introduction and a tutorial to OpenCog. Fortunately, one of the videos is an introduction to the software framework:

The OpenCog Software Framework – presented by Joel Pitt (me) and Ben Goertzel.

Of course, this is the high-level overview; the other videos available on the AGI Summer School site focus on more specific aspects of OpenCog, such as:

…among many others!
