Catalog of Current OpenCog Atom Types

Alex van der Peet (of the OpenCog Hong Kong team) has been cataloguing, on the wiki site, all Atom types currently in use in the OpenCog code.

This page lists them all, with a page for each one:

http://wiki.opencog.org/w/Category:Atom_Types

A few of the pages still don’t have any information on them.

To all OpenCog developers: If you’re working heavily with a certain set of Atom types, please check out the corresponding wiki page, and think about adding some comments or examples.

Posted in Uncategorized

The Viterbi Parser

I’ve recently made some good progress on something that I’m calling “the Viterbi decoder”, a new parser for the Link Grammar natural language parser.  So I guess that means it’s time to talk a bit about the why and how of this parser.

The goal of providing this decoder is to present a flexible, powerful interface for implementing high-level semantic algorithms on top of the low-level link-grammar syntactic parser, and, in particular, for steering the parse based on high-level semantic knowledge. This allows the parser to move beyond being merely a syntactic parser, and to become fully integrated with general semantic artificial intelligence.

A less abstract list of expected benefits includes:

  • Incremental parsing: the ability to obtain partial results after providing partial sentences, a word at a time.
  • Less sensitivity to sentence boundaries, allowing longer, run-on sentences to be parsed far more quickly.
  • Mitigation of the combinatorial explosion of parses.
  • Allow grammatically broken/incorrect chat dialog to be parsed; in general, to do better with slang and hip-speak.
  • Enable co-reference resolution and anaphora resolution across sentences (resolve pronouns, etc.)
  • Enable annotation of the parse graph with word-sense data, entity markers.
  • Allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternative reference resolutions.
  • Allow a plug-in architecture, so that plugins employing higher-level semantic (AGI) algorithms can provide parse guidance and parse disambiguation.
  • Eliminate many of the hard-coded array sizes in the code.

The data structures used to implement this resemble those of the OpenCog AtomSpace. All data classes inherit from a class called Atom (which is an atomic predicate, in the sense of mathematical logic). Atoms are typed; the two core types are Links and Nodes. Thus, all data is represented in the form of a “term algebra” (aka the “Free Theory”, in the sense of model theory). This structure allows all data to be represented as (hyper-)graphs, which in turn makes graph algorithms easier to implement. All these theoretical considerations provide a natural setting for storing Viterbi state information. Put differently, this provides a generic, uniform way of holding the various partly-finished parses, and effecting state transformations on them.
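
To make this concrete, here is a minimal sketch, in C++, of what such an atom hierarchy looks like. This is illustrative only, not the actual class declarations from the code:

#include <memory>
#include <string>
#include <vector>

// Every data item is an Atom; atoms are typed.
enum AtomType { WORD, CONNECTOR, WORD_DISJ, LING, LING_TYPE, SEQ };

class Atom {
  public:
    explicit Atom(AtomType t) : _type(t) {}
    virtual ~Atom() {}
    AtomType get_type() const { return _type; }
  private:
    AtomType _type;
};

// A Node is a typed leaf carrying a name, e.g. the word "this.p".
class Node : public Atom {
  public:
    Node(AtomType t, const std::string& name) : Atom(t), _name(name) {}
    const std::string& get_name() const { return _name; }
  private:
    std::string _name;
};

// A Link holds an ordered list of other atoms (its "outgoing set");
// nesting links and nodes yields the (hyper-)graphs described above.
class Link : public Atom {
  public:
    Link(AtomType t, const std::vector<std::shared_ptr<Atom>>& oset)
        : Atom(t), _outgoing(oset) {}
    const std::vector<std::shared_ptr<Atom>>& get_outgoing() const
        { return _outgoing; }
  private:
    std::vector<std::shared_ptr<Atom>> _outgoing;
};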

Since all of the data is represented dynamically (at run-time) by these (hyper-)graphs composed of atoms, developing custom algorithms to manipulate the parse becomes easy: there are no strange compile-time structures to master.  All algorithms can access the data in a uniform, common way.

Making the internal state directly visible allows low-level syntactic algorithms, as well as high-level semantic algorithms, to control parsing. In other words, the intended use of the Viterbi decoder is to provide a framework for parsing that should make it possible to integrate tightly (and cleanly) with high-level semantic analysis algorithms. Thus, reference and anaphora resolution can be done using the same graph structure as used for parsing; it should also allow graphical transformations, such as those currently implemented in RelEx.

One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found via the news story “Language Use is Simpler Than Previously Thought”, per Morten Christiansen, Cornell professor of psychology.

Currently, the parser can correctly parse many short sentences. It currently runs very slowly, as no pruning algorithms have yet been implemented. Instructions for turning it on can be found in the viterbi/README file. The code is not in the 4.7.10 tarball; you need something newer: i.e. pull from the svn source tree. It will be in 4.7.11, whenever that comes out.

Here’s an example parse of “this is a test”. First, the usual link-parser output:

         +--Ost--+
   +-Ss*b+  +-Ds-+
   |     |  |    |
this.p is.v a test.n

or, with the wall words:

    +---------------RW--------------+
    |              +--Ost--+        |
    +---Wd---+-Ss*b+  +-Ds-+        |
    |        |     |  |    |        |
LEFT-WALL this.p is.v a test.n RIGHT-WALL

The output of viterbi, with some explanatory comments,  is this:

SEQ :                  # a sequence, an ordered set
  LING :               # a link-grammar link; naming conflict with opencog link.
    LING_TYPE : Wd     # the type of the link connecting two words.
    WORD_DISJ :        # holds the word and the connector used
      WORD : LEFT-WALL # all sentences begin with the left-wall.
      CONNECTOR : Wd+  # + means "connect to the right". - means left
    WORD_DISJ :
      WORD : this.p    # word with suffix as it appears in link-grammar dictionary
      CONNECTOR : Wd-
  LING :
    LING_TYPE : Ss*b   # and so on ...
    WORD_DISJ :
      WORD : this.p
      CONNECTOR : Ss*b+
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : Ss-
  LING :
    LING_TYPE : Ds
    WORD_DISJ :
      WORD : a
      CONNECTOR : Ds+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Ds-
  LING :
    LING_TYPE : Ost
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : O*t+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Os-

Oh, and I suppose it’s appropriate to answer the question “why is it called the Viterbi parser”?  I’m calling it that because it is inspired by (and vaguely resembles) the Viterbi algorithm famous from signal processing. A characteristic feature of that algorithm is that it maintains a set of states in parallel. As each new bit is received, some of the states become inherently inconsistent (e.g. because some checksum is violated), while other new states become possible. Once a certain number of bits has been received, the ones that can be consistently interpreted with the checksum constraints can be output. The process then repeats with each new bit streaming in.

In link-grammar, a “disjunct” can be thought of as a puzzle piece with a word printed on it. There are many different puzzle pieces with the same word on them. As each word comes in, one tries to find a piece that fits (this is like the Viterbi checksum). Sometimes, more than one fits, so one has multiple ‘alternatives’ (this is like the Viterbi state-vector). The algo keeps a set of these alternatives (of assembled pieces), and, as words come in, alternatives are either discarded (because nothing fits) or are elaborated on.
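
In code, the alternatives loop is roughly this (a sketch with invented names, not the actual decoder source):

#include <string>
#include <vector>

// One alternative: a partial parse, i.e. an assembly of puzzle pieces
// with some connectors still dangling.
struct PartialParse { /* ... */ };

// Try every disjunct of 'word' against one alternative; each fit yields
// a new, extended alternative.  An empty result kills this alternative.
std::vector<PartialParse> extend(const PartialParse& alt,
                                 const std::string& word)
{
    std::vector<PartialParse> grown;
    // ... for each disjunct of 'word', if its connectors mate with the
    // dangling connectors of 'alt', push back the extended parse ...
    return grown;
}

// Maintain the set of alternatives as words stream in, one at a time.
std::vector<PartialParse> parse_stream(const std::vector<std::string>& words)
{
    std::vector<PartialParse> alternatives(1);  // the empty initial parse
    for (const std::string& word : words) {
        std::vector<PartialParse> next;
        for (const PartialParse& alt : alternatives) {
            std::vector<PartialParse> grown = extend(alt, word);
            next.insert(next.end(), grown.begin(), grown.end());
        }
        alternatives.swap(next);  // discarded, or elaborated upon
    }
    return alternatives;
}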

Unlike the Viterbi algorithm, in natural language processing it is useful to keep some of these alternatives or ambiguities around until much later stages of processing, when the disambiguation can finally be performed. As a famous example: “I saw the man with the telescope” has two valid syntactic parses, and two valid semantic interpretations.  Who was holding the telescope: me, or the man? Resolving this would be like applying a checksum to two different paths very late in the Viterbi game.

I like this analogy because it is vaguely biological as well: or perhaps I should say “neural net-ish”. The multiple, provisional states that are kept around are sort of like the activation states of a feed-forward artificial neural network. But this is not very deep: the feed-forward neural net looks like a Hidden Markov Model (HMM), and the Viterbi algorithm is essentially an HMM algorithm. No surprise!

But all this talk of algorithms hides the true reason for this work. The above algo is not strong enough to reproduce the old parser behavior: it can create islands; it ignores post-processing. The original algorithm uses integer-valued “cost” to rank parses; I want to replace this by floating point values (probabilities! maximum entropy!).

I also want to implement an “algorithm plug-in” API — basically, a way of offering “here’s the current state, go and modify it” — to have ‘mind-agents’ in OpenCog terminology. The above puzzle-piece assembly algo would be the first to run, but clearly, others are needed to prevent islands, or to re-order states by probability/likelihood.  Some of these may be clearly distinct algos; others may end up as tangled balls of complexity.  Factorization into distinct algos is clearly possible: RelEx already had a list of algos that were applied in sequential order.  First, some POS tagging was done, then some head-word verb extraction, then some entity extraction, etc. Algorithms can be layered.
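
As a sketch, such a plug-in API might look like this (hypothetical names, not an existing interface):

#include <vector>

class ParseState;  // the current set of partial parses, held as atoms

// "Here's the current state, go and modify it."  Each agent may prune,
// re-rank, annotate, or elaborate the parse state.
class ParseAgent {
  public:
    virtual ~ParseAgent() {}
    virtual void run(ParseState& state) = 0;
};

// Agents are layered, RelEx-style: applied one after another.
void apply_agents(const std::vector<ParseAgent*>& agents, ParseState& state)
{
    for (ParseAgent* agent : agents)
        agent->run(state);
}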

So really, the core issue I’m hoping to solve here is that of having a uniform development environment: link-grammar is in C, has no probability (besides cost), and no internal API. RelEx is in Java, is explicitly graphical, but is not a hypergraph, has no probabilities, and can’t provide parse feedback to control link-grammar. RelEx output was pumped into OpenCog, which is in C++; it cannot feed back into RelEx or Link Grammar.  The Link Grammar dictionaries are files: how can an automated system learn a new word, and stick it into a file?

At the moment, there aren’t really any new or novel algorithms: I’ve day-dreamed some in the past, but the fractured approach halts progress. All these boundaries are barriers; the hope here is to get past all of these barriers.  The work is really an architectural re-design, and not a new whiz-bang algo.

Posted in Design, Development, Theory

The MOSES Metapopulation

Kaj Sotala recently asked for an update on how MOSES selects a “species” to be “mutated”, when it is searching for the fittest program tree. I have some on-going, unfinished research in this area, so perhaps this is a good time to explain how selection currently works, and why it remains a research topic.

To recap: MOSES is a program-learning system. That is, given some input data, MOSES attempts to learn a computer program that reproduces the data. It does so by applying a mixture of evolutionary algorithms: an “inner loop” and an “outer loop”. The inner loop explores all of the mutations of a “species” (a “deme”, in MOSES terminology), while the outer loop chooses the next deme to explore. (Each deme is a “program tree”, that is, a program written in a certain lisp-like programming language).

So: the outer loop selects some program tree, whose mutations will be explored by the inner loop. The question becomes, “which program tree should be selected next?” Now, nature gets to evolve many different species in parallel; but here, where CPU cycles are expensive, it’s important to pick a tree whose mutations are “most likely to result in an even fitter program”. This is a bit challenging.

MOSES works from a pool of candidate trees, of various fitnesses. With each iteration of the inner loop, the pool is expanded: when some reasonably fit mutations are found, they are added to the pool. Think of this pool as a collection of “species”, some similar, some not, some fit, some not so much. To iterate the outer loop, it seems plausible to take the fittest candidate in the pool, and mutate it, looking for improvements. If none are found, then in the next go-around, the second-most-fit program is explored, etc. (Terminology: in MOSES, the pool is called the “metapopulation”.)

It turns out (experimentally) that this results in a very slow algorithm. A much better approach is to pick randomly from the highest scorers: one has a much better chance of getting lucky this way. But how to pick randomly? The highest scorers are given a probability p ~ exp(score/T), so in fact the highest-scoring have the highest probability of being picked, but the poorly-scoring have a chance too. This distribution is the “Gibbs measure”, aka the “Boltzmann distribution”. (T is a kind of “temperature”; it provides a scale, and it’s held constant in the current algos.) I’m guessing that this is the right measure to apply here, and can do some deep theoretical handwaving, but haven’t really worked this out in detail. Experimentally, it works well; there even seems to be a preferred temperature that works well for most/all different problems (but this is not exactly clear).

One can do even better. Instead of using the score alone, a blend of score minus program-tree complexity works better; again, this is experimentally verified.  Nil added this a while back, and his theoretical justification was to call it “Solomonoff complexity”, and turn it into a ‘Bayesian prior’. From an engineering viewpoint, it’s basically saying that, to create a good design suitable for some use, it’s better to start with a simple design and modify it, than to start with a complex design and modify it. In MOSES terminology, it’s better to pick an initial low-complexity but poorly scoring deme, and mutate it, than to start with something of high complexity, high score, and mutate that. Exactly what the blending ratio (between score and complexity) is, and how to interpret it, is an interesting question.
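
Putting the last two paragraphs together, deme selection amounts to something like the following sketch (the names and the blending constant are illustrative, not the actual MOSES code):

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct Candidate {
    double score;       // raw fitness
    double complexity;  // program-tree complexity
};

// Pick an index into the metapopulation, Boltzmann-style: weight each
// candidate by exp(blended/T), where the blended score discounts
// complexity, so simpler trees are preferred.
std::size_t pick_deme(const std::vector<Candidate>& pool,
                      double T, double complexity_ratio)
{
    std::vector<double> weights;
    for (const Candidate& c : pool) {
        double blended = c.score - complexity_ratio * c.complexity;
        weights.push_back(std::exp(blended / T));
    }
    static std::mt19937 rng(20121205);  // arbitrary fixed seed
    std::discrete_distribution<std::size_t> pick(weights.begin(),
                                                 weights.end());
    return pick(rng);
}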

Experimentally, I see another interesting behaviour that I am trying to “fix”.  I see a very classic “flight of the swallow” learning curve, dating back to the earliest measurements of the speed of telegraph operators at the turn of the 19th century. At first, learning is fast, and then it stalls, until there is a break-through; then learning is again fast (for a very brief time — weeks for telegraph operators) and then stalls (years or a decade for telegraph operators). In MOSES: at first, one picks a deme, almost any deme, and almost any mutation will improve upon it. This goes on for a while, and then plateaus. Then there’s a long dry spell — picking deme after deme, mutating it, and finding very little or no improvement. This goes on for a long time (say, thousands of demes, hours of cpu time), until suddenly there is a break-through: dozens of different mutations to some very specific deme all improve the score by some large amount. The Boltzmann weighting above causes these to be explored in the next go-around, and mutations of these, in turn, all yield improvements too. This lasts for maybe 10-20 steps, and then the scores plateau again. Exactly like the signalling rate of 19th century telegraph operators 🙂 Or the ability of guitar players. Or sportsmen. All of these have been measured in various social-science studies, and all show the “flight of the swallow” curve.

(Can someone PLEASE fix the horribly deficient Wikipedia article on “learning curve”? It totally fails to cite any of the seminal research and breakthroughs on this topic. Check out Google Images for examples of fast learning, followed by a long plateau. So, e.g.:

Learning curve. In real life. For salesmen.

Actual MOSES curves look more like this, with rapid progress followed by stagnant plateaus, punctuated with rapid progress, again. Except the plateaus are much flatter and much longer, and the upward curves are much sharper and faster.

All these curves beg the question: why is Google finding only the highly stylized ones, and not showing any for raw, actual data? Has the learning curve turned into an urban legend?

Here’s a real-life learning curve, taken from MOSES, using real data (the “bank” dataset) from a previous OpenCog blog post on MOSES. Although this learning curve shows a combination of the inner and outer loops, and so, strictly speaking, does not represent what I’m discussing here.)

Recently, I have been trying to shorten the plateau, by trying to make sure that the next deme I pick for exploration is one that is least similar to the last one explored. The rationale here is that the metapopulation gets filled with lots of very, very similar species, all of which are almost equally fit, all of which are “genetically” very similar. Trying to pick among these, to find the magic one, the one whose mutations will yield a break-through, seems to be a losing strategy. So, instead, add a diversity penalty: explore those “species” that are as different as possible from the current one (but still have about the same fitness score). So far, this experiment is inconclusive; I wasn’t rewarded with instant success, but more work needs to be done. It’s actually fairly tedious to take the data…
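
As a sketch (the score tolerance and the distance measure are placeholders, not the actual implementation), the diversity-penalized pick might look like this:

#include <vector>

struct Deme {
    double score;
    /* program tree, knob settings, ... */
};

// Some "genetic" dissimilarity between two demes; in practice this
// might be a tree edit distance or a knob-setting Hamming distance.
double distance(const Deme& a, const Deme& b)
{
    return 0.0;  // stub
}

// Among demes scoring within 'tolerance' of the last one explored,
// pick the one least similar to it.
const Deme* pick_most_diverse(const std::vector<Deme>& pool,
                              const Deme& last, double tolerance)
{
    const Deme* best = nullptr;
    double best_dist = -1.0;
    for (const Deme& d : pool) {
        if (d.score < last.score - tolerance) continue;  // too unfit
        double dist = distance(d, last);
        if (dist > best_dist) { best_dist = dist; best = &d; }
    }
    return best;
}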

Posted in Design, Documentation, Theory

Fishgram: Frequent Interesting Subhypergraph Mining for OpenCog

One of the tools OpenCog has needed for a long time is something that can relatively quickly scan an Atomspace and find the interesting patterns in it.  “Interesting” may be defined in a variety of ways, such as “frequent”, or “surprising” (as measured by information theory), etc.  This capability has often been referred to in OpenCog documents as “pattern mining.”

Jade O’Neill (formerly known as Jared Wigmore) implemented python software doing this for the Atomspace some time ago — Fishgram, the Frequent Interesting SubHyperGRaph Miner.   Fishgram has been used to recognize patterns in Atomspaces resulting from OpenCog’s “perception” of a Unity3D-based virtual world.

Now, a wiki page has been created, covering some details of Fishgram — including pseudocode, an explanation of the algorithm, and some indication of which software classes carry out which parts of the algorithm…

http://wiki.opencog.org/w/Fishgram

Plenty more work needs to be done with Fishgram; yet it does currently work, and can extract some interesting patterns from Atomspaces…

Some simple examples have also been done, feeding patterns output by Fishgram into PLN…

I think this is a very valuable tool that could be used for a lot of different OpenCog applications, and it would be great to see others jump onto it and help with development.

The current version of Fishgram looks for frequent subhypergraphs (i.e. frequent subhypergraph patterns, which may contain  multiple variables).  One thing that Jade and I have talked about a lot is extending Fishgram to search for “surprising” subhypergraphs, where surprisingness may be measured using interaction information or synergy, as described in these papers:

http://www.rni.org/bell/nara4.pdf

http://arxiv.org/abs/1004.2515/
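
For reference, the three-variable version of the interaction information discussed in these papers can be written (sign conventions vary from author to author) as

    I(X;Y;Z) = I(X;Y|Z) − I(X;Y)

that is, the amount by which conditioning on Z changes the mutual information between X and Y; a markedly nonzero value signals a genuinely three-way dependency that no pairwise statistic captures, which is one way of making “surprisingness” precise.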

Those who like logic may also enjoy this paper, which connects interaction information with the logic of questions:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.6110

It seems that a good implementation of a good measure of surprisingness will be a valuable thing to have in OpenCog generally, not just in Fishgram.   If we want “novelty seeking” to be one of the top-level goals of a young AGI or proto-AGI system (which I think we do), then having a nice way to measure novelty seems like a good thing — and the interaction information and the informational synergy, as described in these papers, seem like a good approach.

Onward and upward 😉

Ben G

Posted in Uncategorized

Genetic Crossover in MOSES

MOSES is a system for learning programs from input data.  Given a table of input values, and a column of outputs, MOSES tries to learn a program, the simplest program that can reproduce the output given the input values. The programs that it learns are in the form of a “program tree” — a nested concatenation of operators, such as addition or multiplication, boolean AND’s or OR’s, if-statements, and the like, taking the inputs as arguments.  To learn a program, it starts by guessing a new random program.  More precisely, it generates a new, random program tree, with as-yet unspecified operators at the nodes of the tree. So, for example, an arithmetic node may be addition, subtraction, multiplication, or division, or it may be entirely absent.  It hasn’t yet been decided which.   In MOSES, each such undecided node is termed a “knob”, and program learning is done by “turning the knobs” until a reasonable program is found.  But things don’t stop there: once a “reasonable” program is found, a new, random program tree is created by decorating this “most reasonable” program with a new set of knobs.  The process then repeats: knobs are turned until an even better program is found.
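
As a sketch (invented types, not the actual MOSES classes), a knob is just an undecided slot with a small, discrete set of settings, and a candidate program is one choice of setting for every knob:

#include <vector>

// The possible settings of one arithmetic knob: the node may be any
// of these operators, or absent altogether.
enum ArithOp { ABSENT, ADD, SUB, MUL, DIV };

// A knob: an undecided slot in the program tree.
struct Knob {
    int num_settings;  // e.g. 5 for the arithmetic knob above
};

// A candidate program ("instance"): one chosen setting per knob.
// "Turning the knobs" means changing entries of this vector.
typedef std::vector<int> Instance;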

Thus, MOSES is a “metalearning” system: it consists of an outer loop, that creates trees and knobs, and an inner loop, that finds optimal knob settings.  Both loops “learn” or “optimize”; it is the nesting of these that garners the name “metalearning”. Each loop can use completely different optimization algorithms in its search for optimal results.

The rest of this post concerns this inner loop, and making sure that it finds optimal knob settings as quickly and efficiently as possible. The space of all possible knob settings is large: if, for example, each knob has 5 possible settings, and there are 100 knobs, then there is a total of 5^100 possible different settings: a combinatorial explosion. Such spaces are hard to search. There are a variety of different algorithms for exploring such a space. One very simple, very traditional algorithm is “hillclimbing”. This algo starts somewhere in this space, at a single point, say, the one with all the knobs set to zero. It then searches the entire local neighborhood of this point: each knob is varied, one at a time, and a score is computed. Of these scores, one will be best. The corresponding knob setting is then picked as the new center, and the process then repeats; it repeats until there is no improvement: until one can’t “climb up this hill” any further. At this point, the inner loop is done; the “best possible” program has been found, and control is returned to the outer loop.
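
In code, the hillclimber looks something like this sketch, continuing the invented types above; score() stands in for evaluating the fully-decorated program tree against the input table:

#include <cstddef>
#include <vector>

typedef std::vector<int> Instance;  // one chosen setting per knob

// Stand-in for scoring: decorate the tree with these settings and
// measure how well it reproduces the output column.
double score(const Instance& inst)
{
    return 0.0;  // stub
}

// Exhaustive search of the distance=1 neighborhood, repeated until no
// single-knob change improves the score.
Instance hillclimb(Instance center, const std::vector<int>& num_settings)
{
    double best = score(center);
    bool improved = true;
    while (improved) {
        improved = false;
        Instance best_nbr = center;
        for (std::size_t k = 0; k < center.size(); k++) {
            Instance nbr = center;  // vary one knob, hold the rest fixed
            for (int s = 0; s < num_settings[k]; s++) {
                if (s == center[k]) continue;  // skip the current setting
                nbr[k] = s;
                double sc = score(nbr);
                if (sc > best) { best = sc; best_nbr = nbr; improved = true; }
            }
        }
        center = best_nbr;  // climb to the best neighbor and repeat
    }
    return center;
}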

Hill-climbing is a rather stupid algorithm: most knob settings will result in terrible scores, and are pointless to explore, but the hill-climber does so anyway, as it has no clue as to where the “good knobs” lie. It does an exhaustive search of the local neighborhood of single-knob twists.  One can do much better by using estimation-of-distribution algorithms, such as the Bayesian Optimization Algorithm.  The basic premise is that knob settings are correlated: good settings are near other good settings.  By collecting statistics and computing probabilities, one can make informed, competent guesses at which knob settings might actually be good.  The downside to such algorithms is that they are complex:  the code is hard to write, hard to debug, and slow to run: there is a performance penalty for computing those “educated guesses”.

This post explores a middle ground: a genetic cross-over algorithm that improves on simple hill-climbing simply by blindly assuming that good knob settings really are “near each other”, without bothering to compute any probabilities to support this rash assumption.  The algorithm works; headway can be made by exploring only the small set of knob settings that correlate with previous good knob settings.

To explain this, it is time to take a look at some typical “real-life” data. In what follows, a dataset was collected from a customer-satisfaction survey; the goal is to predict satisfaction from a set of customer responses.  The dataset is a table; the outer loop has generated a program decorated with a set of knobs.  Starting with some initial knob setting, we vary each knob in turn, and compute the score. The first graph below shows what a typical “nearest neighborhood” looks like.  The term “nearest neighborhood” simply means that, starting with the initial knob setting, the nearest neighbors are those that differ from it by exactly one knob setting, and no more.  There is also a distance=2 neighborhood: those instances that differ by exactly two knob settings from the “center” instance.  Likewise, there is a distance=3 neighborhood, differing by 3 knob settings, etc. The size of each neighborhood gets combinatorially larger.  So, if there are 100 knobs, and each knob has five settings, then there are 5 × 100 = 500 nearest neighbors. There are 500 × 499 / 2 = 125K next-nearest neighbors, and 500 × 499 × 498 / (2 × 3) = 21M instances at distance=3. In general, this is the binomial coefficient: (500 choose k) for distance k. Different knobs, however, may have more or fewer than just 5 settings, so the above is just a rough example.

Nearest Neighbor Scores

The above graph shows the distribution of nearest neighbor scores, for a “typical” neighborhood. The score of the center instance (the center of the neighborhood) is indicated by the solid green line running across the graph, labelled “previous high score”.  All of the other instances differ by exactly one knob setting from this center.  They’ve been scored and ranked, so that the highest-scoring neighbors are to the left.  As can be seen, there are maybe 15 instances with higher scores than the center, another 5 that seem to tie.  A slow decline is followed by a precipitous drop; there are another 80 instances with scores so bad that they are not shown in this figure.  The hill-climbing algo merely picks the highest scorer, declares it to be the new center, and repeats the process.

All of the other neighborhoods look substantially similar. The graph below shows an average over many generations (here, each iteration of the inner loop is one generation). The jaggedness above is smoothed out by averaging.

Nearest Neighbor Score Change

Rather than searching the entire neighborhood, one would like to test only those knob settings likely to yield good scores. But which might these be?  For nearest neighbors, there is no way to tell, without going through the bother of collecting statistics, and running them through one or another Bayesian estimation algorithm.

However, for more distant neighbors, there is a way of guessing and getting lucky: perform genetic cross-overs.  That is, take the highest and next-highest scoring instances, and create a new instance that differs from the center by two knob-settings, the two knobs associated with the two high scorers.  In fact, this new instance will very often be quite good, beating both of its parents.   The graph below shows what happens when we cross the highest scorer with each one of the next 70 highest. The label “1-simplex” simply reminds us that these instances differ by exactly two knob settings from the center.  More on simplexes later.  The green zero line is located at the highest-scoring single-knob change.  The graph shows that starting here and twiddling the next-most-promising knob can often be a win. Not always: in the graph below, only 4 different knobs showed improvement. However, we explored relatively few instances to find these four; for this dataset, most exemplars have thousands of knobs.

Average Score Change, 1-simplex

The take-away lesson here is that we can avoid exhaustive searches by simply crossing the 10 or 20 or 30 best instances, and hoping for the best. In fact, we get lucky with these guesses quite often. What happens if, instead of just crossing two, we cross three of the top scorers?  This is the “2-simplex”, below:

Average Score Change, 2-simplex

Notice that there are now even more excellent candidates!  How far can we go?  The 3-simplex graph below shows the average score change from crossing over four high-scoring instances:

Average Score Change, 3-simplex

The term “crossover” suggests some sort of “sexual genetic reproduction”. While this is correct, it is somewhat misleading.   The starting population is genetically very uniform, with little “genetic variation”.  The algorithm starts with one single “grandparent”, and produces a population of “parents”, each of which differs from the grandparent by exactly one knob setting. In the “nearest neighborhood” terminology, the “grandparent” is the “center”, and each “parent” is exactly one step away from this center. Any two “parents”, arbitrarily chosen, will always differ from one another by exactly two knob settings. Thus, crossing over two parents will produce a child that differs by exactly one knob setting from each parent, and by two from the grandparent. In the “neighborhood” model, this child is at distance=2 from the grandparent.   For the case of three parents, the child is at distance=3 from the grandparent, and so on: four parents produce a child that is at distance=4 from the grandparent.  Thus, while “sexual reproduction” is a sexy term, it loses its punch with the rather stark uniformity of the parent population; thinking in terms of “neighbors” and “distance” provides a more accurate mental model of what is happening here.

The term “simplex” used above refers to the shape of the iteration over the ranked instances: a 1-simplex is a straight line segment, a 2-simplex is a right triangle, a 3-simplex is a right tetrahedron. The iteration is performed with 1, 2 or 3 nested loops that cross over 1, 2 or 3 instances against the highest. It is important to notice that the loops do not run over the entire range of nearest neighbors, but only over the top scoring ones. So, for example, crossing over the 7 highest-scoring instances for the 3-simplex generates 6!/(6-3)! = 6 × 5 × 4 = 120 candidates. Scoring a mere 120 candidates can be very quick, as compared to an exhaustive search of many thousands of nearest neighbors. Add to this the fact that most of the 120 are likely to score quite well, whereas only a tiny handful of the thousands of nearest neighbors will show any improvement, and the advantage of this guessing game is quite clear.
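
As a sketch of the 2-simplex case (illustrative code, not the actual MOSES implementation): ranked holds the scored nearest neighbors, best first, and the triangular pair of loops runs only over the top n of them:

#include <cstddef>
#include <vector>

typedef std::vector<int> Instance;  // one chosen setting per knob

// Copy into 'child' the one knob on which 'parent' differs from 'center'.
void apply_diff(Instance& child, const Instance& parent,
                const Instance& center)
{
    for (std::size_t k = 0; k < center.size(); k++)
        if (parent[k] != center[k]) child[k] = parent[k];
}

// Cross the top scorer with every pair drawn from the next n ranked
// instances; each child is at distance=3 from the center.  Assumes
// ranked.size() > n.
std::vector<Instance> cross_2_simplex(const std::vector<Instance>& ranked,
                                      const Instance& center, std::size_t n)
{
    std::vector<Instance> children;
    for (std::size_t i = 1; i <= n; i++) {
        for (std::size_t j = i + 1; j <= n; j++) {
            Instance child = center;
            apply_diff(child, ranked[0], center);  // the best scorer
            apply_diff(child, ranked[i], center);
            apply_diff(child, ranked[j], center);
            children.push_back(child);
        }
    }
    return children;
}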

So what is it like, after we put it all together? The graph below shows the score as a function of runtime.

Score as function of time

In the above graph, each tick mark represents one generation. The long horizontal stretches between tick marks shows the time taken to perform an exhaustive nearest-neighborhood search. For the first 100 seconds or so, the exemplar has very few knobs in it (a few hundred), and so an exhaustive search is quick and easy. After this point, the exemplars get dramatically more complex, and consist of thousands of knobs. At this point, an exhaustive neighborhood search becomes expensive: about 100 seconds or so, judging from the graph. While the exhaustive search is always finding an improvement for this dataset, it is clear that performing some optimistic guessing can improve the score a good bit faster. As can be seen from this graph, the algorithm falls back to an exhaustive search when the optimistic simplex-based guessing fails to show improvement; it then resumes with guessing.

To conclude: for many kinds of datasets, a very simple genetic-crossover algorithm combined with hillclimbing can prove a simple but effective search algorithm.

Nota bene: the above only works for some problem types; thus it is not (currently) enabled by default. To turn it on, specify the -Z1 flag when invoking moses.

Appendix

Just to keep things honest, and to show some of the difficulty of algorithm tuning, below is a graph of some intermediate results taken during the work.  I won’t explain what they all are, but do note one curious feature:  the algos which advance the fastest initially seem to have trouble advancing later on.  This suggests a somewhat “deceptive” scoring landscape: the strong early advancers get trapped in local maxima that they can’t escape.   The weak early advancers somehow avoid these traps.  Note also that results have some fair dependence on the random number generator seed; different algos effectively work with different random sequences, and so confuse direct comparison by some fair bit.

Many Different Algorithms

Posted in Design, Documentation, Introduction, Theory