What is consciousness?

… and can we implement it in OpenCog?  I think we can.  It might not even be that hard!   Consciousness isn’t this magical pixie dust that it’s often made out to be.  I’d like to provide a sketch.

In order for machine intelligence to perform in the real world, it needs to create an internal model of the external world. This can be as simple as the model of a chessboard that a chess-playing algo maintains.  As information flows in from the senses, that model is updated; the current model is used to create future plans (e.g. the next move, for a chess-playing computer).

Another important part of an effective machine algo is “attentional focus”: so, for a chess-playing computer, it is focusing compute resources on exploring those chess-board positions that seem most likely to improve the score, instead of somewhere else. Insert favorite score-maximizing algo here.

Self-aware systems are those that have an internal model of self. Conscious systems are those that have an internal model of attentional focus.   I’m conscious because I maintain an internal model of what I am thinking about, and I can think about that, if I so choose. I can ask myself what I’m thinking about, and get an answer to that question, much in the same way that I can ask myself what my teenage son is doing, and sort-of get an answer to that (I imagine, in my mind’s eye, that he is sitting in his room, doing his homework. I might be wrong.)    I can steer my attention the way I steer my limbs, but this is only possible because I have that internal model (of my focus, of my limbs), and I can use that model to plan, to adjust, to control.
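
To make this less abstract, here is a tiny, purely illustrative sketch (Python; every name in it is hypothetical) of an agent that keeps a model of its own attentional focus alongside its model of the world, and can query and steer that self-model the same way it queries the world model:

    # Purely illustrative: an agent whose internal model includes a model
    # of its own attentional focus, queryable just like its world model.
    class Agent:
        def __init__(self):
            self.world_model = {}     # model of the external world
            self.focus_trace = []     # model of what "I" am attending to

        def perceive(self, item, facts):
            self.world_model[item] = facts

        def attend(self, item):
            # steer attention, and record that steering in the self-model
            self.focus_trace.append(item)

        def what_am_i_thinking_about(self):
            # introspection: querying the self-model
            return self.focus_trace[-1] if self.focus_trace else None

    agent = Agent()
    agent.perceive("chessboard", {"to_move": "white"})
    agent.attend("chessboard")
    print(agent.what_am_i_thinking_about())   # -> chessboard

Obviously this toy does no prediction, planning or control of its focus; the claim above is only that those capabilities have to be built on top of a self-model of roughly this sort.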

So, can we use this to build an AGI?

Well, we already have machines that can add numbers together better than us, can play chess better than us, and apparently, can drive cars better than us.  Only the last can be said to have any inkling of self-awareness, and that is fairly minimal: just enough to locate itself in the middle of the road, and maintain a safe distance between it and obstacles.

I am not aware of any system that maintains an internal model of its own attentional focus (and then uses that model to perform prediction, planning and control of that focus). This, in itself, might not be that hard to do, if one set out to explicitly accomplish just that. I don’t believe anyone has ever tried it. The fun begins when you give such a system senses and a body to play with. It gets serious when you provide it with linguistic abilities.

I admit I’m not entirely clear on how to create a model of attentional focus when language is involved; I plan to think heavily on this topic in the coming weeks/months/years. At any rate, I suspect it’s doable.

I believe that if someone builds such a device, they will have the fabled conscious, self-aware system of sci-fi. It’s likely to be flawed, stupid, and psychotic: common-sense reasoning algorithms are in a very primitive state (among (many) other technical issues).  But I figure that we will notice, and agree that it’s self-aware, long before it’s intelligent enough to self-augment itself out of its pathetic state: I’m thinking it will behave a bit like a rabid talking dog: not a charming personality, but certainly “conscious”, self-aware, intelligent, unpredictable, and dangerous.

To be charming, one must develop a very detailed model of humans, and what humans like, and how they respond to situations. This could prove to be quite hard.  Most humans can’t do it very well. For an AGI to self-augment itself, it would have to convince its human masters to let it tinker with itself.  Given that charm just might be a pre-requisite, that would be a significant challenge, even for a rather smart AGI.  Never mind that self-augmentation can be fatal, as anyone who’s overdosed on heroin might fail to point out.

I’m sure the military and certain darker political forces would have considerable interest in building a charming personality, especially if it’s really, really smart.  We already know that people can be charming and psychotic all at the same time; ethics or lack thereof is not somehow mutually exclusive of intelligence. That kind of a machine, unleashed on the world, would be … an existential threat.   Could end well, could end badly.

Anyway, I think that’s the outline of a valid course of research.  It leaves open some huge questions, but it does narrow the range of the project to some concrete and achievable goals.


The Relationship Between PLN Inference and Gibbs Sampling (Some Thought-Experiments)

This post describes some new thought-experiments regarding PLN, which have not yet been tested nor worked out mathematically in detail… Reader beware — there could be some mistakes here! But I think the ideas are interesting enough to be worth sharing….

These ideas are part of the same train of thought as the New PLN Design, currently being implemented bit-by-bit (and with interesting variations and deviations from the rough spec I just linked to) by Jade O’Neill and Ramin Barati. But this blog post contains new ideas not contained on that page.

Actually, I am unsure if I will end up recommending the ideas outlined here for implementation or not.   But even if not, I think they are interesting for the light they shed on what is going on with PLN conceptually and mathematically.

For one thing, on the theoretical side, I will outline here an argument why inference trails are ultimately unnecessary in PLN.   (They are needed in Pei Wang’s NARS system, from which PLN originally borrowed them; but this is because NARS is not probabilistic, so that the sorts of Gibbs sampling based arguments I outline here can’t be applied to NARS.)

Rough Summary / Prelude

Basically: in this post I will describe how to reformulate PLN inference so as (very broadly speaking) to make use of Gibbs sampling.   As Gibbs sampling is used in the standard approach to Markov Logic Networks, this also serves (among other more practical purposes) to make clearer the relationship between PLN and MLN.

Broadly speaking, the idea here is to have two different, interlocking levels of PLN inference, with different truth values and different dynamics associated with them:

  • a Gibbs sampling based layer, corresponding very roughly to shallow, massively parallel, “unconscious” inference (more like inference based “activation spreading”, to use a neural net metaphor)
  • a forward/backward chaining based layer, corresponding very roughly to “conscious”, deliberative inference

It seems possible that doing this might speed the convergence of a PLN network toward maximally intelligent conclusions based on the knowledge implicit in it.

Consideration of this possibility leads to an understanding of the relation between PLN dynamics and Gibbs sampling, which leads to an argument (at this stage, a sketch of a proof rather than a proof) that inference trails are not really needed in PLN.

Two preliminary notes before getting started:

  • The ideas given here are related, though far from identical, to the work by myself and Cassio Pennachin, reported in Section 3.1 of the paper “PLN and the Brain” from the proceedings of AGI-08:  ….
  • These ideas will make the most sense to the reader who knows the basic ideas of Gibbs sampling, and will make even more sense to readers who know about Markov Logic Networks.  Advanced knowledge of all the details and variations of these topics is not necessary, though.

Without further ado, I will now present two thought-experiments in PLN design: one fairly extreme, the other less so.

Thought-Experiment #1: PLN Inference via Gibbs Sampling on Distributional Truth Values

In this section I’ll describe a hypothetical way of doing PLN inference via Gibbs sampling.

Suppose that, instead of a single truth value, we let each PLN Atom have two truth values:

  • the current truth value (which we may call the “primary truth value”)
  • a new entity called the “instantaneous truth value,” which consists of a series of K values called the “sample distribution”

The sample distribution consists of a series of values that define the shape of a distribution.    For example, the template sample distribution might comprise K=5 values corresponding to the intervals [0, .2], [.2, .4], [.4, .6], [.6, .8], [.8, 1].  The values would be viewed as a step-function approximation to an underlying first-order probability distribution.

Next, the instantaneous truth values would be updated via Gibbs sampling. By this I mean a process in which the Atoms in the Atomspace are looped through, and when each Atom X is visited, its sampled strengths are replaced with the result of the following Gibbs-type Update Rule:

  1. Find all inference rules R that, in a single step from some set of premise Atoms existing in the Atomspace currently, would result in an estimate for the truth value of X
  2. Execute all the (rule, premise-set) pairs found in Step 1.   That is,
    1. for each pair, repeat the following process some number N of times: choose a specific value from the distribution comprising the instantaneous truth value for each premise, and draw a conclusion from these specific values.  This produces a truth value distribution for the conclusion.
    2. merge these distributions via revision (weighted averaging), obtaining an overall truth value distribution for the conclusion
  3. Replace the existing instantaneous truth value of X with (a discretized version of) the result of Step 2
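
A minimal sketch of this update loop, in Python (the Atomspace accessor, rule interface and merging step here are hypothetical placeholders, and the merge is plain pooling rather than true PLN revision):

    import random

    K = 5
    BINS = [(i / K, (i + 1) / K) for i in range(K)]      # [0,.2), [.2,.4), ...

    def sample_from(dist):
        # draw one strength value from a discretized (step-function) distribution
        lo, hi = random.choices(BINS, weights=dist)[0]
        return random.uniform(lo, hi)

    def discretize(samples):
        dist = [0.0] * K
        for s in samples:
            dist[min(int(s * K), K - 1)] += 1.0
        return [d / sum(dist) for d in dist]

    def gibbs_update(atom, atomspace, n_draws=20):
        # Step 1: find all (rule, premise-set) pairs concluding something about 'atom'
        pairs = atomspace.rules_concluding(atom)           # hypothetical accessor
        conclusions = []
        for rule, premises in pairs:
            # Step 2a: sample premise strengths and draw a conclusion, n_draws times
            for _ in range(n_draws):
                strengths = [sample_from(p.instantaneous_tv) for p in premises]
                conclusions.append(rule.apply(strengths))  # hypothetical rule interface
        # Steps 2b and 3: merge the conclusion samples and replace the old value
        if conclusions:
            atom.instantaneous_tv = discretize(conclusions)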

The instantaneous truth value would then impact the primary truth value as follows:

Periodically (every N cycles), the primary truth value of each Atom A is revised with the instantaneous truth value of A

(i.e. the primary truth value is replaced with a weighted average of itself & the instantaneous truth value)

Note that one could vary on this process in multiple ways — e.g. via making the instantaneous truth value an imprecise or indefinite probability, or a second order probability distribution.   The above procedure is given as it is, more out of a desire for relative simplicity of presentation, than because it necessarily seems the best approach.

If nothing else besides this updating happened with the primary truth values of logical Atoms (and if the various logical relations in the Atomspace all possessed a consistent probabilistic interpretation in terms of some grounding) — then according to the theory of Gibbs sampling, each Atom would get a primary strength approximating its correct strength according to the joint distribution implicit in all the logical Atoms in the Atomspace.

The above description, involved as it is, still finesses a bit of mathematical fancy footwork.   It’s important to remember that, in spite of the Gibbs sampling, the PLN heuristic inference rules (which are derived using probability theory, but also various other heuristics) are being used to define the relationships between the variables (i.e. the truth value strengths of Atoms) in the network.

So the Gibbs sampling must be viewed as taking place, not on the variables (the Atom strengths) themselves, but on propositions of the form “the strength of Atom A lies in interval [x,y]”.   One can thus view the sampling as happening on a second-order probability distribution defined over the main probability distribution of strengths.

So the joint distribution on the truth value strength distributions in the PLN network, has to be calculated consistently with the results of the PLN probabilistic/heuristic inference rules.   If the PLN inference rules deviated far from probability theory, then the Gibbs sampling would result in a network that didn’t make sense as a probabilistic model of the world to which the variables in the network refer, but did make sense as a model of the relationship between the variables according to the PLN  inference rules.

This is pretty different from an MLN, because in an MLN the Gibbs sampling just has to find a distribution consistent with certain propositional logic relations, not consistent with certain heuristic uncertain truth value estimation functions.

Anyway: this sort of subtlety is the reason that the idea presented here is not “obvious” and hasn’t emerged in PLN theory before.

So then, if this were the only kind of inference dynamic happening in PLN, we could view PLN as something vaguely analogous to a second-order Markov Logic Network incorporating a wider variety of logical constructs (more general quantifier logic, intensional inference, etc.) via heuristic formulas.

However, the thought-experiment I am outlining in this section is not to have this kind of sampling be the only thing happening in PLN.   My suggestion is that in any new PLN, just like in the current and prior PLN, primary strengths may also be modified via forward and backward chaining inference. These inference methods do something different than the Gibbs-type updating mentioned above, because they add new logical links (and in some cases nodes) to the network.

This is vaguely comparable to how, in some cases, Gibbs sampling or message-passing in Markov Logic Networks has been coupled with Inductive Logic Programming.  ILP, vaguely similarly to PLN forward and backward inference, adds new logical links to a network. I.e., to use MLN / Bayes Nets terminology, both ILP and PLN chaining are concerned with structure building, whereas Gibbs sampling, message-passing and other comparable methods of probabilistic inference are concerned with calculating probabilities based on a given network structure.

Also note: If there is information coming into the system from outside PLN, then this information should be revised into the instantaneous truth values as well as the primary ones.  (This point was raised by Abram Demski in response to an earlier version of this post.) ….  And this leads to the interesting question of when, and to what extent, it is useful to revise the primary truth values back into the instantaneous truth values, based on the modifications of the primary truth values due to regular PLN forward and backward inference.

If we do both the Gibbs sampling suggested above and the traditional PLN chaining on the same network, what we have is a probabilistic network that is constantly adapting its structure (and a subset of its truth values) based on chains of inference rules, and constantly updating its truth values based on its structure according to Gibbs type (and vaguely MLN-ish) methods.

Note that the Gibbs sampling forms a consistent model of the joint distribution of all the Atoms in the Atomspace, without needing a trail-like mechanism. Clearly the Gibbs-type approach is much more like what could be realized in a brain-like system (though OpenCog is not really a brain-like system in any strong sense).

Inference trails would still be useful for chaining-based inferences, in the suggested framework. However, if the trail mechanism screws up in some cases and we get truth values that handle dependencies incorrectly — in the medium run, this won’t matter so much, because the Gibbs sampling mechanism will eventually find more correct versions for those truth values, which will be revised into the truth values. Note that incorrect truth values gotten by inadequate use of trails will still affect the results of the sampling, because they will weight some of the links used in the sampling-based inference — but the sampling-based inference will “merge” these incorrect truth values with the truth values of the relations embodying the dependencies they ignore, muting the influence of the incorrect values.

Also: one problem I’ve noted before with MLN and related ideas is that they assume a fully consistent interpretation of all the links in their network.    But a complex knowledge network reflecting the world-understanding of an AGI system, is not going to be fully consistent.  However, I believe the approach described here would inherit PLN’s robustness with regard to inconsistency.   The PLN heuristic inference rules are designed to dampen inconsistencies via locally ignoring them (e.g. if the premises of the PLN deduction rule are wildly inconsistent so that the rule gives a truth value strength outside [0,1], the resultant inference will simply not be revised into the truth value of the conclusion Atom).   In the current proposal, this sort of mechanism would be used both in the Gibbs sampling and the chaining control mechanisms.

Revision versus Gibbs Sampling

Now — if anyone is still following me by this point — I want to take the discussion in a slightly different direction.   I’m going to use the above ideas to make an argument why inference trails are unnecessary in PLN even without Gibbs sampling.

Reading through Thought Experiment #1 above, one might wonder why bother to maintain two truth values, an instantaneous and a primary one.  Why is this better than the traditional PLN approach, where you do the updating directly on the primary truth values, but instead of (as in Gibbs sampling) replacing the old truth value with the new one at each step, just revise the new truth value with the old one?

The answer seems to be: In the long run, if one assumes a fixed set of knowledge in the inference network during the learning process, both approaches amount to the same thing.  So in this somewhat artificial “fixed knowledge” setting, it’s really mainly a matter of convergence rates.   (Which means it’s a matter of the speed of coming to modestly intelligent conclusions, since in a real-world system in a dynamic environment, there is no hope of an inference network converging to a fully coherent conclusion based on its existing data before new data comes in and disrupts things).

Viewed at a sufficient level of abstraction, the Gibbs sampling approach corresponds to taking a Markov matrix M and iterating it: one takes the power M^n as n goes to infinity, until (M^n x), where x is the initial condition, converges to a stationary distribution.

Specifically, in the approach outlined above, one can think about a long vector, each entry of which refers to a “truth value state” of the PLN system as a whole.   The k’th truth value state corresponds to a proposition of the form “Truth value of Atom 1 lies in interval I_k(1), AND truth value of Atom 2 lies in interval I_k(2), AND … truth value of Atom n lies in interval I_k(n).”   So this is a very high dimensional vector.  Given the specific set of inference rules and truth value formulas in a PLN system, if one iterates PLN using parallel forward chaining (i.e. executing all possible single-step forward inferences at the same time, and revising the results together), then PLN execution corresponds to multiplying by a large Markov matrix M.

On the other hand, the standard PLN approach with only one truth value for each Atom and a fixed weight c in the revision rule, corresponds roughly to taking the limit of the power ( c I + (1-c) M )^n as n goes to infinity.   The latter approach will generally take significantly longer to converge to the stationary distribution, because the ratio (second largest eigenvalue) / (largest eigenvalue) will be closer to 1.
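
A small numerical illustration of this convergence-rate point (a toy 3-state Markov chain in numpy; nothing here is PLN-specific, and the matrix is made up):

    import numpy as np

    # A toy column-stochastic Markov matrix M, and the "lazy" mixture cI + (1-c)M.
    M = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.4],
                  [0.2, 0.2, 0.5]])
    c = 0.9
    lazy = c * np.eye(3) + (1 - c) * M

    def eigenvalue_ratio(A):
        # (second largest eigenvalue magnitude) / (largest): closer to 1 => slower mixing
        ev = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
        return ev[1] / ev[0]

    print(eigenvalue_ratio(M))      # ≈ 0.30 for this M
    print(eigenvalue_ratio(lazy))   # ≈ 0.93: much closer to 1, so convergence is much slower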

Actually it’s a bit subtler than that, because the revision weight c isn’t a constant in PLN. Rather, as the system accumulates more evidence, c gets larger, so that the existing evidence is weighted more and the new evidence is weighted less.

But for each fixed value of c, the iteration would converge to the same stationary distribution as the Gibbs sampling approach (under reasonable assumptions, for a network with fixed knowledge).   And we may assume that as the network learns, eventually c will reach some maximum short of 1 (c=.9999 or whatever).   Under this assumption, it seems PLN iteration with adaptive revision weight will converge to the stationary distribution — eventually.

So the apparent conclusion of this somewhat sketchy mathematical thinking (if all the details work out!) is that, if one makes the (unrealistic) assumption of a fixed body of knowledge in the system,

  • The current PLN revision-based approach will get to the same place as the hypothetical Gibbs Sampling based approach outlined in Thought-Experiment #1 above
  • In this setting, we don’t need trails.  Dependencies will take care of themselves eventually as the network iterates.  (i.e., since Gibbs sampling doesn’t need trails, and the standard PLN approach is equivalent to Gibbs sampling on second-order distributions in the long run, the standard PLN approach also doesn’t need trails)

Now, it may be that trails are still useful in the short run.   On the other hand, there seem to be other ways to handle the matter.  For instance: If one has a sub-network of tightly interlinked Atoms, then one can do a lot of inference on these Atoms, thereby accelerating the iterative sampling process as regards the relationships between these Atoms.  In this way the mutual dependencies among those Atoms will get resolved faster, much as if one were using trails.

Thought-Experiment #2

Finally, I’ll present a less extreme thought-experiment, which I think has a greater likelihood of actually being useful for PLN in OpenCog.

Instead of having two truth values per Atom — one the primary, traditional PLN truth value and the other an instantaneous truth value used for Gibbs sampling — what if one had two truth values, both updated via the standard PLN approach, but with widely differing default revision weights?

The standard default revision weight in PLN now is driven by the confidence factor

c = n/(n+k)

where n is a number of observations, and k is the “personality parameter.”  But layered on top of this (in the PLN theory, though not currently in the code), is a “confidence decay factor”, which decays confidence values over time.

One possibility would be to have two different truth values associated with each Atom: one conservative and one adventurous.   The two would differ in their personality parameters.  The conservative truth value would get updated with a small value of k, meaning that it would tend to weight its past experience highly and its new conclusions not so much.   The adventurous truth value would get updated with a large value of k, meaning that it would weight its new conclusions much more than its past experience.
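
A toy sketch of this two-truth-value idea (Python; the revision here is bare strength averaging weighted by c = n/(n+k), ignoring confidence intervals, decay, and everything else real PLN revision does):

    class TwoSpeedTV:
        def __init__(self, strength=0.5, k_conservative=2.0, k_adventurous=200.0):
            self.conservative = strength
            self.adventurous = strength
            self.n = 0.0                      # evidence accumulated so far
            self.ks = {"conservative": k_conservative, "adventurous": k_adventurous}

        def revise(self, new_strength, new_n=1.0):
            # c = n/(n+k): small k -> old evidence dominates; large k -> new dominates
            for attr, k in self.ks.items():
                c = self.n / (self.n + k)
                setattr(self, attr, c * getattr(self, attr) + (1.0 - c) * new_strength)
            self.n += new_n

    tv = TwoSpeedTV()
    for s in [0.9, 0.8, 0.95, 0.1]:           # a noisy stream of new conclusions
        tv.revise(s)
    print(tv.conservative, tv.adventurous)    # the adventurous value chases recent evidence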

What Thought Experiment #1 teaches us is that: As k goes to infinity, if one follows a simple inference control strategy as outlined there, the adventurous truth value will basically be getting updated according to Gibbs sampling (on second order probability distributions).

We have seen that both the adventurous and conservative truth values will converge to the same stationary distribution in the long run, under unrealistic assumptions of fixed knowledge in the network.  But so what?  Under realistic conditions they will behave quite differently.

There is much to experiment with here.   My point in this post has merely been to suggest some new experiments, and indicate some theoretical connections between PLN, sampling theory, and other probabilistic inference methods like MLN.

OK, that’s a rough summary of my train of thought on these topics at the moment. Feedback from folks with knowledge of PLN, MLNs and sampling would be valued. Am I thinking about this stuff in a sensible way? What do you think?

The current version of this post owes something to a critique of the previous version by Abram Demski.


Why Hypergraphs?

OpenCog uses hypergraphs to represent knowledge.  Why?  I don’t think this is clearly, succinctly explained anywhere, so I will try to do so here.  This is a very important point: I can’t begin to tell you how many times I went searching for some whiz-bang logic programming system, or inference engine, or theorem-prover, or some graph re-writing engine, or some probabilistic programming system, only to throw up my hands and realize that, after many wasted hours, none of them do what I want.  If you’re interested in AGI, then let me assure you: they don’t do what you want, either.  So, what do I want them to do, and why?

Well, let’s begin easy: with graph re-writing systems.  These days, almost everyone agrees that a great way to represent knowledge is with graphs.  The structure IsA(Cat, Animal) looks like a graph with two vertexes, Cat and Animal, and a labelled edge, IsA, between them.  If I also know that IsA(Binky, Cat), then, in principle, I should be able to deduce that IsA(Binky, Animal).  This is a simple transitive relationship, and the act of logical deduction, for this example, is a simple graph re-write rule: If you see two IsA edges in a row, you should draw a third IsA edge between the first and the last vertex.  Easy, right?
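
Spelled out as a toy rewrite rule (a Python sketch over a plain set of labelled edges, not any particular graph library):

    # A graph as a set of labelled edges; the deduction step is a rewrite rule
    # that adds an IsA edge whenever it sees two IsA edges "in a row".
    edges = {("IsA", "Binky", "Cat"), ("IsA", "Cat", "Animal")}

    def isa_transitive_rewrite(edges):
        new = set()
        for (l1, a, b) in edges:
            for (l2, b2, c) in edges:
                if l1 == l2 == "IsA" and b == b2:
                    new.add(("IsA", a, c))
        return edges | new

    print(isa_transitive_rewrite(edges))
    # now also contains ("IsA", "Binky", "Animal")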

So perhaps you’d think that all logic induction and reasoning engines have graph rewrite systems at their core, right? So you’d think. In fact, almost none of them do.  And those that do, do it in some internal, ad hoc, non-public, undocumented way: there’s no API, it’s not exposed externally; it’s not an ‘official’ part of the system for you to use or tinker with.

OK, so why do I need a graph re-write system? Well, I’ve been working on a natural language parser, a so-called Viterbi decoder for Link Grammar.  My initial graph is a string of words: a sentence.  The vertexes are words, and the edges are arrows called “next word”. Real simple. To parse this sentence, I want to apply a certain set of simple graph-rewrite rules: for example, if word X is a noun, then create an arrow, called ‘part-of-speech’ (POS),  from word X to the special vertex ‘noun’.  If the word immediately before word X is an adjective (i.e. if  it has a POS arrow pointing to ‘adjective’), then create a new arrow, called ‘noun modifier’, pointing from X to this word before it.   This kind of graph markup is called ‘dependency parsing’, and is a very popular way of doing natural language parsing.  So you’d think that all dependency parsers have a graph re-write system at their core, right?  Hardly. In fact, just about none of them do.  And if they do, they’re walled off, hacked up, undocumented … you get the idea.

The only dependency parser that I know of that has an explicit graph-rewriting system in it, that is open for tinkering, and is documented (!) is RelEx.  And that’s great.  Wow!  Although RelEx invented and used its own, custom, graph rewrite system, I suppose that, in principle, it could have used some other, pre-existing system to do this (Well, it couldn’t, because in 2005, there weren’t any robust, open-source graph rewriting systems. Whatever).

What else do I want to do? Well, I daydream about using a machine-learning system to learn new rules!   I mean, this is the goal of AGI, right? Have a machine that can learn new things?  Well, to learn new rules, let’s see, I need to have some simple syntax for representing rules.  Basically a simple graph language.  So you’d think that all graph re-writing systems have some simple, easy-to-use graph language, right?  No. Emphatically, hell no. With maybe one exception, you have to program in Java or C++ or C#.net. Unfortunately, my machine learning system doesn’t yet know how to program in those languages.

Here’s the leap of faith, the leap of logic that I’ve come to: It would be convenient if I could express graph re-write rules as graphs themselves.  It would be convenient if I could express logical implications as graphs themselves.  It would be convenient if my graphical programming language itself could be written as a graph. Well, it can be. In fact, it is easiest to do this if the graph is actually a hypergraph. I’ll explain why in the next section below.  If I had a hypergraph re-writing system, then I would have a place where I could unify natural language processing, logical reasoning and machine learning, all in one place.  So you’d think that anyone who was trying to build an AGI system would be writing its foundations on a hypergraph rewriting system, right? No, you’d be wrong. Apparently, OpenCog is the only system that does this.  Now, the OpenCog implementation has many design warts and architectural flaws. It’s hard to understand and hard to use.  But now, perhaps, now you see why I’ve pledged allegiance to it, instead of running off with some other reasoning system or parser or Bayesian network or whatever.

Mathematical Foundations

In this section, I will try to put all of the above comments on a solid mathematical footing, invoking model theory, category theory, (and even n-categories!), and type theory.  The upshot of all of this will be that the easiest way to represent data structures so that machine learning algorithms can learn them, and then apply them both to natural-language parsing, and to logical reasoning, is to represent the data structures as hypergraphs.

From model theory and computer science, we have the concept of a signature: a set of functions which take some number of arguments and return some value (just like a signature in Java or C++).  If one ignores types for a moment (which is what lisp and scheme do), then, in principle, one can pass any value in any position of any function, and stack these up arbitrarily, recursively, even.  This is called a term algebra, or more precisely a free term algebra or ‘free theory’.  If the functions don’t have names, but are anonymous, then one has the lambda calculus.

One way to envision a member of a term algebra is as a directed tree graph.  So, if we have two functions f(x,y) and g(x,y) and three constants a,b,c, then f(a, g(b,c)) is a binary tree, with f at the root node, g as one of its children, and a, b and c as the leaves. A term algebra is then just the collection of all such trees. Nothing more, nothing less.

To do useful programming, one also needs predicates or relations: things that have truth values, and that relate or order terms. Thus, ‘greater than’ is a relation, and ‘a>b’ is either true or false.  Relations can also be things like IsA, HasA, BelongsTo, LivesIn, EmployedAt. The last two examples should make clear that relational algebras form the foundation of databases, whether SQL or noSQL.  Relations are combined with logical operators (employee X LivesIn city Y AND ReportsTo dept Z is a textbook example).

In general, one combines both term algebras and relational algebras, so that one writes things like 3<f(x,y) where f(x,y) is a term, < is a relation, 3 is a constant.  Add to this the special free-variable binding operators ForAll and ThereExists, and one gets a first-order logic. So, for example, ForAll x ThereExists y such that 3<f(x,y).

A special case of a relation is a term re-write rule.  This is a relation that takes a term, and replaces it with a different term: for example, ab->c, which says ‘whenever you see the string ab, replace it with c’. The BNF notation of computer languages is just a collection of term re-writing relations. One uses a term rewriting system to parse a (formal) language. Graph rewriting is just a variation of this: whenever you see a graph x, replace it with a graph y.

So far, I’ve avoided the issue of types.  In programming, types allow type safety.  Types make code more readable: f(string, int) is less mysterious than f(x,y). Types solve certain abstract recursion problems in lambda calculus.  A re-write rule in BNF notation is a typed rewrite rule: a substitution a->bc holds not just for any a, but specifically, only when a is a web page, or an IP address or a URL.  A graph re-write rule that says ‘whenever you see x, replace it with y’ implicitly demands that x be typed: x can’t be just anything, it has to be a specific kind of graph, having a specific shape and linkage.  The rule applies for all graphs that have this shape, that are of this kind or type.  So a re-write rule x->y is really a rule (type x)->(type y). Graphically, it’s still two points x and y, with a directed edge -> in between them. Oh, wait, x and y aren’t points, x and y are graphs.  What kind of a graph has graphs as points?  What kind of graph has edges between graphs? A hypergraph!

And that is the main Ah-HA! moment.  Once you see that, you start seeing hypergraphs everywhere. Sure, you can visualize Set(a,b,c) as a tree-graph, with Set as the parent node, and three children a,b,c.  Or you can visualize this as a hypergraph: Set as a ‘link’ (a ‘hyper-edge’ with 3 endpoints, not 2), and the points a,b,c as the nodes contained in the link.  In fact, all hypergraphs are dual to these directed trees; if you have one, you can have the other.  Hypergraphs are just a convenient notation.

Let’s take a moment to look back on what just happened: a function f(x,y,z) is just a hyperedge f connecting three nodes x,y,z. A boolean expression a AND b AND c can be written as AND(a,b,c), which shows a specific example of a hypergraph equivalence. It can be written as a reduction rule: (a AND b AND c) -> AND(a,b,c) which is itself just a hypergraph of the form x->y with x and y being hypergraphs.  The first-order logic constructs ‘for-all’ and ‘there-exists’ are just special cases of the lambda-calculus binding operation lambda, which binds free variables in an expression. Again, hypergraphs: lambda is just a hyperlink that binds a variable x in an expression y, and y was just a term, ahem, hypergraph!
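
As a bare-bones data-structure sketch (plain Python, emphatically not the actual OpenCog Atomspace API): a Link is a labelled hyperedge whose endpoints may be nodes or other Links, which is all it takes to write a rule whose two endpoints are themselves graphs:

    # Nodes are plain strings; a Link is a labelled hyperedge whose endpoints
    # may be nodes *or other links* -- that is what makes this a hypergraph.
    class Link:
        def __init__(self, label, *outgoing):
            self.label = label
            self.outgoing = outgoing          # any mix of strings (nodes) and Links

        def __repr__(self):
            return f"{self.label}({', '.join(map(repr, self.outgoing))})"

    # Set(a,b,c): one hyperedge with three endpoints, not a binary tree.
    s = Link("Set", "a", "b", "c")

    # A rewrite rule x -> y where x and y are themselves (hyper)graphs:
    rule = Link("Rewrite",
                Link("And", "a", Link("And", "b", "c")),    # the pattern x
                Link("And", "a", "b", "c"))                  # the rewritten form y
    print(s, rule)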

I mentioned categories and n-categories, and I suppose I should justify this mention. Insofar as category theory is the theory of points and arrows, then a rewrite rule between graphs is a morphism in the category of small diagrams.  A subtle but important point about category theory that is almost never discussed in intro-to-cat-theory texts, is that all objects are implicitly typed. In the category of Sets, the objects are all of the same kind: they are sets.  It’s not mentioned because in a given category, all objects are of the same type; types change only when a functor maps from one to another.   So, to understand the category-theoretic equivalent of types in computer science, we must think of functors.  But, as we just saw, a graph rewriting rule is a morphism between functors.  So you could say that graph re-writing is just the category Cat of small categories.  Or you could slide down this slope in a different direction, and start calling it a 2-category. Whatever.  Perhaps it’s useful to point out that graph rewriting algorithms are sometimes expressed as being single-pushouts or as being double-pushouts, with a pushout being a certain category-theoretic concept. Notable, for graph rewriting, is that any category with pushouts and equalizers has all (co-)limits. Except that, as we just saw, we want hyper-graph rewriting systems, not graph rewriting systems. So there.

What else are they good for?

In OpenCog, the Link and Node types inherit from the type Atom. This naming convention is intentionally suggestive: ‘Atom’ is meant to invoke the notion of an ‘atomic formula’ from model theory or first-order logic: that is, a formula that has no variables in it (it’s fully grounded), and that does have a truth value (it’s not composed of boolean connectives, and has no quantifiers in it).  This suggestive naming helps establish the intended use of OpenCog hypergraphs with regards to first-order logic.

The truth value is a different matter. The default (simplest) OpenCog truth value is a pair of floating point numbers: a probability and a confidence. These numbers allow several other AI concepts to be mapped into hypergraphs: Bayesian networks, Markov networks, and artificial neural networks. All three of these are graphs: directed graphs, at that. They differ in how they assign and propagate floating-point probabilities, entropies, activations. Ideas such as Markov logic networks, which implement maximum entropy principles (aka the Boltzmann partition function) on a network of first-order logic expressions, can be represented with OpenCog hypergraphs.  Oh, and I should mention PLN (Probabilistic Logic Networks), which is what the atomspace was originally designed for. That’s what I like about the OpenCog hypergraph atomspace: it has a tremendously powerful ability to succinctly and easily represent complex modern AI concepts.

The good, the bad and the ugly.

You’ve heard about the good.  Now for the bad and the ugly.  First, the OpenCog atomspace implementation is slow and inefficient, over-complicated, badly architected, weakly-distributed, non-scalable, single-threaded. But let’s not go there.  All this might be fixable, after a lot of programming effort (and deep, architectural thinking). It’s been hotly debated in the past. Someday, maybe it’ll get fixed.

The bad thing about the OpenCog atomspace is that almost no one understands that, ahem, it is a programming language. Let me be very clear on this: OpenCog implements graph re-writing rules with the ImplicationLink. A sequence of ImplicationLinks can be used to compute things. In that sense, it is somewhat like the language Graph Programs, except that OpenCog allows fractional truth values, and logic programming and other good things.  If we stick to using ImplicationLinks with crisp truth values (T/F), then the resulting system is essentially Prolog. Of course you know that Prolog is popular for AI programming projects, because it’s fairly easy to write reasoning engines and expert systems and the like in Prolog.  What you may not know is that closely related to Prolog is Answer-Set Programming (ASP). In fact, ASP uses exactly the same notation as Prolog does. It differs in two important ways: first, when you run a Prolog program, you get one answer. With ASP, you get all of the answers!  It’s dramatically more powerful, and the reason for this is that modern-day ASP solvers are built on top of modern-day Boolean SAT solvers. Which means that they are stunningly efficient and effective.

So what does this have to do with OpenCog? Well, here we have a system that, using ImplicationLinks, is essentially Prolog, more or less, when run in crisp-logic mode. Or, you could say, it’s like typed lambda calculus. But do we have a simple, easy-to-use syntax like Prolog for it? No we don’t. That’s bad. Can we take an existing Prolog program, run a tool on it, and convert it to ImplicationLinks? No we can’t.  Would it run fast? No it wouldn’t: it would probably be slower than the slowest Prolog ever: Borland prolog running on a 5MHz IBM PC AT in 1986.  And forget an ASP solver for OpenCog.  For the special case where all OpenCog truth values are crisp T/F values, we do not have a Boolean SAT solver to find solutions for our graphs of ImplicationLinks.  This is bad, Really Bad. But I think that this is because very few people seem to understand that the OpenCog Atomspace really is a petri dish for programming languages.

Heck, we don’t even have anything equivalent to the RelEx Sentence Algorithms for OpenCog, even though RelEx is OpenCog-like. This absence is slowing down my efforts to continue work on the Link-Grammar parser, and to move natural language processing out of its stand-alone arena, into a general, flexible framework.

(And we’ve barely scratched the surface. In order to make implication and pattern mining run quickly in the atomspace, we need to implement something like the concept of ‘memoization’ from lisp/scheme. But it turns out that memoization is really just a relational algebra: it is a database of short expressions that stand in for long ones. The OpenCog Atomspace is also, among other things, a relational database that can store and query not only flat tables or key-value pairs, but full-blown hypergraphs. And this isn’t a day-dream; it’s crucial for performance (and it’s partially implemented)).

Why don’t we have these things? Well, it’s hard. It’s just not easy. We don’t have the infrastructure to make it easy, and we don’t have the users who demand these tools.   I don’t think most users are even aware of what the atomspace could even do.   Almost no one is thinking about ‘how to program in the language of OpenCog’ even though it has the potential of far surpassing any of the existing probabilistic programming languages out there.  It’s time to change all this, but it will take someone smart and dedicated to do this. Many someones. This could be you.


Catalog of Current OpenCog Atom Types

Alex van der Peet (of the OpenCog Hong Kong team) has been working on cataloguing all Atom types currently in use in the OpenCog code on the wiki site.

This page lists them all, with a page for each one:

http://wiki.opencog.org/w/Category:Atom_Types

A few of the pages still don’t have any information on them.

To all OpenCog developers: If you’re working heavily with a certain set of Atom types, please check out the corresponding wiki page, and think about adding some comments or examples.


The Viterbi Parser

I’ve recently made some good progress on something that I’m calling “the Viterbi decoder”, a new parser for the Link Grammar natural language parser.  So I guess that means it’s time to talk a bit about the why and how of this parser.

The goal of providing this decoder is to present a flexible, powerful interface for implementing high-level semantic algorithms on top of the low-level link-grammar syntactic parser, and, in particular, for steering the parse based on high-level semantic knowledge. This allows the parser to move beyond being merely a syntactic parser, and to become fully integrated with general semantic artificial intelligence.

A less abstract list of expected benefits includes:

  • Incremental parsing: the ability to obtain partial results after providing partial sentences, a word at a time.
  • Less sensitivity to sentence boundaries, allowing longer, run-on sentences to be parsed far more quickly.
  • Mitigation of the combinatorial explosion of parses.
  • Allow grammatically broken/incorrect chat dialog to be parsed; in general, to do better with slang, hip-speak.
  • Enable co-reference resolution and anaphora resolution across sentences (resolve pronouns, etc.)
  • Enable annotation of the parse graph with word-sense data, entity markers.
  • Allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternative reference resolutions.
  • Allow a plug-in architecture, so that plugins employing higher-level semantic (AGI) algorithms can provide parse guidance and parse disambiguation.
  • Eliminate many of the hard-coded array sizes in the code.

The data structures used to implement this resemble those of the OpenCog AtomSpace. All data classes inherit from a class called Atom (which is an atomic predicate, in the sense of mathematical logic). Atoms are typed; the two core types are Links and Nodes. Thus, all data is represented in the form of a “term algebra” (aka the “Free Theory”, in the sense of model theory). This structure allows all data to be represented as (hyper-)graphs, which in turn makes graph algorithms easier to implement. All these theoretical considerations provide a natural setting for storing Viterbi state information. Put differently, this provides a generic, uniform way of holding the various partly-finished parses, and effecting state transformations on them.

Since all of the data is represented dynamically (at run-time) by these (hyper-)graphs composed of atoms, developing custom algorithms to manipulate the parse becomes easy: there are no strange compile-time structures to master.  All algorithms can access the data in a uniform, common way.

Making the internal state directly visible allows low-level syntactic algorithms, as well as high-level, semantic algorithms to control parsing. In other words, the intended use of the Viterbi decoder is to provide a framework for parsing that should make it possible to integrate tightly (and cleanly) with high-level semantic analysis algorithms. Thus, reference and anaphora resolution can be done using the same graph structure as used for parsing; it should also allow graphical transformations, such as those currently implemented in RelEx.

One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found via the news story “Language Use is Simpler Than Previously Thought”, per Morten Christiansen, Cornell professor of psychology.

Currently, the parser can correctly parse many short sentences. It currently runs very slowly, as no pruning algorithms have yet been implemented. Instructions for turning it on can be found in the viterbi/README file. The code is not in the 4.7.10 tarball; you need something newer: i.e. pull from the svn source tree. It will be in 4.7.11, whenever that comes out.

Here’s an example parse of “this is a test”. First, the usual link-parser output:

         +--Ost--+
   +-Ss*b+  +-Ds-+
   |     |  |    |
this.p is.v a test.n

or, with the wall words:

    +---------------RW--------------+
    |              +--Ost--+        |
    +---Wd---+-Ss*b+  +-Ds-+        |
    |        |     |  |    |        |
LEFT-WALL this.p is.v a test.n RIGHT-WALL

The output of viterbi, with some explanatory comments,  is this:

SEQ :                  # a sequence, an ordered set
  LING :               # a link-grammar link; naming conflict with opencog link.
    LING_TYPE : Wd     # the type of the link connecting two words.
    WORD_DISJ :        # holds the word and the connector used
      WORD : LEFT-WALL # all sentences begin with the left-wall.
      CONNECTOR : Wd+  # + means "connect to the right". - means left
    WORD_DISJ :
      WORD : this.p    # word with suffix as it appears in link-grammar dictionary
      CONNECTOR : Wd-
  LING :
    LING_TYPE : Ss*b   # and so on ...
    WORD_DISJ :
      WORD : this.p
      CONNECTOR : Ss*b+
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : Ss-
  LING :
    LING_TYPE : Ds
    WORD_DISJ :
      WORD : a
      CONNECTOR : Ds+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Ds-
  LING :
    LING_TYPE : Ost
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : O*t+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Os-

Oh, and I suppose it’s appropriate to answer the question “why is it called the Viterbi parser”?  I’m calling it that because it is inspired by (and vaguely resembles) the Viterbi algorithm famous from signal processing. A characteristic feature of that algorithm is that it maintains a set of states in parallel. As each new bit is received, some of the states become inherently inconsistent (e.g. because some checksum is violated), while other new states become possible. Once some certain number of bits have been received, the ones that can be consistently interpreted with the checksum constraints can be output. The process then repeats with each new bit streaming in.

In link-grammar, a “disjunct” can be thought of as a puzzle piece with a word printed on it. There are many different puzzle pieces with the same word on it. As each word comes in, one tries to find a piece that fits (this is like the viterbi checksum). Sometimes, more than one fits, so one has multiple ‘alternatives’ (this is like the viterbi state-vector). The algo keeps a set of these alternatives (of assembled pieces), and, as words come in, alternatives are either discarded (because nothing fits) or are elaborated on.
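
A toy sketch of this streaming state-set idea (Python; the mini-dictionary and connector matching here are made up for illustration and are far simpler than real link-grammar disjuncts):

    # Each "puzzle piece" (disjunct) for a word: the left-pointing connectors it
    # must attach with, and the right-pointing connectors it offers.
    TOY_PIECES = {
        "this": [(set(),          {"Ss+"})],
        "is":   [({"Ss-"},        {"O+"})],
        "a":    [(set(),          {"D+"})],
        "test": [({"O-", "D-"},   set())],
    }

    def mates(left_needed, open_right):
        # every required left connector must find a matching open right connector
        return all(c.replace("-", "+") in open_right for c in left_needed)

    def parse_stream(words):
        alternatives = [([], set())]      # (pieces used so far, open right connectors)
        for word in words:
            next_alts = []
            for used, open_right in alternatives:
                for left_needed, right_offered in TOY_PIECES[word]:
                    if mates(left_needed, open_right):
                        still_open = open_right - {c.replace("-", "+") for c in left_needed}
                        next_alts.append((used + [word], still_open | right_offered))
            alternatives = next_alts      # alternatives that fit nothing are discarded here
        return alternatives

    print(parse_stream(["this", "is", "a", "test"]))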

Unlike the viterbi algorithm, in natural language processing, it is useful to keep some of these alternatives or ambiguities around until much later stages of processing, when the disambiguation can finally be performed. As a famous example: “I saw the man with the telescope” has two valid syntactic parses, and two valid semantic interpretations.  Who was holding the telescope, me, or the man? Resolving this would be like applying a checksum to two different paths very late in the Viterbi game.

I like this analogy because it is vaguely biological as well: or perhaps I should say “neural net-ish”. The multiple, provisional states that are kept around are sort of like the activation states of a feed-forward artificial neural network. But this is not very deep: the feed-forward neural net looks like a Hidden Markov Model (HMM), and the Viterbi algorithm is essentially an HMM algorithm. No surprise!

But all this talk of algorithms hides the true reason for this work. The above algo is not strong enough to reproduce the old parser behavior: it can create islands; it ignores post-processing. The original algorithm uses integer-valued “cost” to rank parses; I want to replace this by floating point values (probabilities! maximum entropy!).

I also want to implement an “algorithm plug-in” API — basically, a way of offering “here’s the current state, go and modify it.” — to have ‘mind-agents’ in OpenCog terminology. The above puzzle-piece assembly algo would be the first to run, but clearly, others are needed to prevent islands, or to re-order states by probability/likelihood.   Some of these may be clearly distinct algos; others may end up as tangled balls of complexity.  Factorization into distinct algos is clearly possible: RelEx already had a list of algos that were applied in sequential order.  First, some POS tagging was done, then some head-word verb extraction, then some entity extraction, etc. Algorithms can be layered.

So really, the core issue I’m hoping to solve here is that of having a uniform development environment: link-grammar is in C, has no probability (besides cost), and no internal API. RelEx is in Java, is explicitly graphical, but is not a hypergraph, has no probabilities, and can’t provide parse feed-back to control link-grammar. RelEx output was pumped into OpenCog, which is in C++; it cannot feedback into RelEx or Link Grammar.  The Link Grammar dictionaries are files: how can an automated system learn a new word, and stick it into a file?

At the moment, there aren’t really any new or novel algorithms: I’ve day-dreamed some in the past, but the fractured approach halts progress. All these boundaries are barriers; the hope here is to get past all of these barriers.  The work is really an architectural re-design, and not a new whiz-bang algo.


The MOSES Metapopulation

Kaj Sotala recently asked for an update on how MOSES selects a “species” to be “mutated”, when it is searching for the fittest program tree. I have some on-going, unfinished research in this area, but perhaps this is a good time to explain how that selection currently works, and why it is still a subject of research.

To recap: MOSES is a program-learning system. That is, given some input data, MOSES attempts to learn a computer program that reproduces the data. It does so by applying a mixture of evolutionary algorithms: an “inner loop” and an “outer loop”. The inner loop explores all of the mutations of a “species” (a “deme”, in MOSES terminology), while the outer loop chooses the next deme to explore. (Each deme is a “program tree”, that is, a program written in a certain lisp-like programming language).

So: the outer loop selects some program tree, whose mutations will be explored by the inner loop. The question becomes, “which program tree should be selected next?” Now, nature gets to evolve many different species in parallel; but here, where CPU cycles are expensive, it’s important to pick a tree whose mutations are “most likely to result in an even fitter program”. This is a bit challenging.

MOSES works from a pool of candidate trees, of various fitnesses. With each iteration of the inner loop, the pool is expanded: when some reasonably fit mutations are found, they are added to the pool. Think of this pool as a collection of “species”, some similar, some not, some fit, some not so much. To iterate the outer loop, it seems plausible to take the fittest candidate in the pool, and mutate it, looking for improvements. If none are found, then in the next go-around, the second-most-fit program is explored, etc. (terminology: in moses, the pool is called the “metapopulation”).

It turns out (experimentally) that this results in a very slow algorithm. A much better approach is to pick randomly from the highest scorers: one has a much better chance of getting lucky this way. But how to pick randomly? The highest scorers are given a probability: p ~ exp(score/T), so in fact, the highest scoring have the highest probability of being picked, but the poorly-scoring have a chance too. This distribution is the “Gibbs measure” aka “Boltzmann distribution”; (T is a kind of “temperature”, it provides a scale; it’s held constant in the current algos.)  I’m guessing that this is the right measure to apply here, and can do some deep theoretical handwaving, but haven’t really worked this out in detail. Experimentally, it works well; there even seems to be a preferred temperature that seems to work well for most/all different problems (but this is not exactly clear).

One can do even better. Instead of using the score, a blend of score minus program tree complexity works better; again, this is experimentally verified.   Nil added this back when, and his theoretical justification was to call it “Solomonoff complexity”, and turn it into a ‘Bayesian prior’. From an engineering viewpoint, it’s basically saying that, to create a good design suitable for some use, it’s better to start with a simple design and modify it, than to start with a complex design and modify it. In MOSES terminology, it’s better to pick an initial low-complexity but poorly scoring deme, and mutate it, than to start with something of high complexity, high score, and mutate that. Exactly what the blending ratio (between high score, and high complexity) is, and how to interpret it, is an interesting question.
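
A sketch of this selection step (Python; the blending formula, the weights and the data are illustrative, not the actual MOSES code):

    import math, random

    def pick_exemplar(metapop, T=1.0, complexity_weight=0.1):
        # metapop: list of (score, complexity, tree); higher score is better.
        # Blend raw score with a complexity penalty, then sample from the
        # Boltzmann / Gibbs distribution over the blended scores.
        blended = [s - complexity_weight * cx for (s, cx, _) in metapop]
        m = max(blended)                      # subtract the max for numerical stability
        weights = [math.exp((b - m) / T) for b in blended]
        return random.choices(metapop, weights=weights, k=1)[0][2]

    # toy usage: the "trees" here are just placeholder strings
    metapop = [(0.90, 12, "tree-A"), (0.88, 3, "tree-B"), (0.50, 2, "tree-C")]
    print(pick_exemplar(metapop))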

Experimentally, I see another interesting behaviour that I am trying to “fix”.   I see a very classic “flight of the swallow” learning curve, dating back to the earliest measurements of the speed of telegraph operators at the end of the 19th century. At first, learning is fast, and then it stalls, until there is a break-through; then learning is again fast (for a very brief time — weeks for telegraph operators) and then stalls (years or a decade for telegraph operators). In MOSES, at first, one picks a deme, almost any deme, and almost any mutation will improve upon it. This goes on for a while, and then plateaus. Then there’s a long dry spell — picking deme after deme, mutating it, and finding very little or no improvement. This goes on for a long time (say, thousands of demes, hours of cpu time), when suddenly there is a break-through: dozens of different mutations to some very specific deme all improve the score by some large amount. The Boltzmann weighting above causes these to be explored in the next go-around, and mutations of these, in turn, all yield improvements too. This lasts for maybe 10-20 steps, and then the scores plateau again. Exactly like the signalling rate of 19th century telegraph operators :-) Or the ability of guitar players. Or sportsmen, all of which have been measured in various social-science studies, and have shown the “flight of the swallow” curve on them.

(Can someone PLEASE fix the horribly deficient Wikipedia article on “learning curve”? It totally fails to cite any of the seminal research and breakthroughs on this topic. Check out google images for examples of fast learning, followed by long plateau.

[Figures: a few stylized “learning curve” images from a google image search, plus one real-life learning curve, for salesmen.]

Actual MOSES curves look more like this, with rapid progress followed by stagnant plateaus, punctuated with rapid progress, again. Except the plateaus are much flatter and much longer, and the upward curves are much sharper and faster.

All these curves beg the question: why is google finding only the highly stylized ones, and not showing any for raw, actual data? Has the learning curve turned into an urban legend??

[Figure: a real-life learning curve, taken from MOSES, using real data (the “bank” dataset) from a previous OpenCog blog post on MOSES. Although this learning curve shows a combination of the inner and outer loops, and so, strictly speaking, does not represent what I’m discussing here.]

).

Recently, I have been trying to shorten the plateau, by trying to make sure that the next deme I pick for exploration is one that is least similar to the last one explored. The rationale here is that the metapopulation gets filled with lots of very very similar species, all of which are almost equally fit, all of which are “genetically” very similar. Trying to pick among these, to find the magic one, the one whose mutations will yield a break-through, seems to be a losing strategy. So, instead, add a diversity penalty: explore these “species” that are as different as possible from the current one (but still have about the same fitness score). So far, this experiment is inconclusive; I wasn’t rewarded with instant success, but more work needs to be done. It’s actually fairly tedious to take the data…
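
One way to bolt such a diversity penalty onto the earlier selection sketch (again purely illustrative; the real similarity measure would compare program trees, not strings):

    import difflib, math, random

    def pick_diverse_exemplar(metapop, last_tree, similarity, T=1.0,
                              complexity_weight=0.1, diversity_weight=0.5):
        # Penalize candidates that are very similar to the last deme explored.
        blended = [s - complexity_weight * cx - diversity_weight * similarity(t, last_tree)
                   for (s, cx, t) in metapop]
        m = max(blended)
        weights = [math.exp((b - m) / T) for b in blended]
        return random.choices(metapop, weights=weights, k=1)[0][2]

    # toy usage: a stand-in string similarity; real candidates would be program trees
    sim = lambda a, b: difflib.SequenceMatcher(None, a, b).ratio()
    metapop = [(0.90, 12, "and(x1 or(x2 x3))"), (0.89, 11, "and(x1 or(x2 x4))"),
               (0.85, 9, "or(x3 not(x5))")]
    print(pick_diverse_exemplar(metapop, "and(x1 or(x2 x3))", sim))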


Fishgram: Frequent Interesting Subhypergraph Mining for OpenCog

One of the tools OpenCog has needed for a long time is something that can relatively quickly scan an Atomspace and find the interesting patterns in it.  “Interesting” may be defined in a variety of ways, such as “frequent”, or “surprising” (as measured by information theory), etc.  This capability has often been referred to in OpenCog documents as “pattern mining.”

Jade O’Neill (fka Jared Wigmore) implemented python software doing this for the Atomspace some time ago — Fishgram, the Frequent Interesting SubHyperGRaph Miner.   Fishgram has been used to recognize patterns in Atomspaces resultant from OpenCog’s “perception” of a Unity3D based virtual world.

Now,  a wiki page has been created, covering some details of Fishgram – including pseudocode, an explanation of the algorithm, and some indication of which software classes carry out which parts of the algorithm…

http://wiki.opencog.org/w/Fishgram

Plenty more work needs to be done with Fishgram; yet, it does currently work, and can extract some interesting patterns from Atomspaces…

Some simple examples have also been done, feeding patterns output by Fishgram into PLN…

I think this is a very valuable tool that could be used for a lot of different OpenCog applications, and it would be great to see others jump onto it and help with development.

The current version of Fishgram looks for frequent subhypergraphs (i.e. frequent subhypergraph patterns, which may contain  multiple variables).  One thing that Jade and I have talked about a lot is extending Fishgram to search for “surprising” subhypergraphs, where surprisingness may be measured using interaction information or synergy, as described in these papers:

http://www.rni.org/bell/nara4.pdf

http://arxiv.org/abs/1004.2515/

Those who like logic may also enjoy this paper, which connects interaction information with the logic of questions:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.6110

It seems that a good implementation of a good measure of surprisingness will be a valuable thing to have in OpenCog generally, not just in Fishgram.   If we want “novelty seeking” to be one of the top-level goals of a young AGI or proto-AGI system (which I think we do), then having a nice way to measure novelty seems like a good thing; and the interaction information and the informational synergy, as described in these papers, seem like a good approach.

Onward and upward ;)

Ben G


Genetic Crossover in MOSES

MOSES is a system for learning programs from input data.  Given a table of input values, and a column of outputs, MOSES tries to learn a program, the simplest program that can reproduce the output given the input values. The programs that it learns are in the form of a “program tree” —  a nested concatenation of operators, such as addition or multiplication, boolean AND’s or OR’s, if-statements, and the like, taking the inputs as arguments.  To learn a program, it starts by guessing a new random program.  More precisely, it generates a new, random program tree, with as-yet unspecified operators at the nodes of the tree. So, for example, an arithmetic node may be addition, subtraction, multiplication, or division, or it may be entirely absent.  It hasn’t yet been decided which.   In MOSES, each such undecided node is termed a “knob”, and program learning is done by “turning the knobs” until a reasonable program is found.  But things don’t stop there: once a “reasonable” program is found, a new, random program tree is created by decorating this “most reasonable” program with a new set of knobs.  The process then repeats: knobs are turned until an even better program is found.

Thus, MOSES is a “metalearning” system: it consists of an outer loop, that creates trees and knobs, and an inner loop, that finds optimal knob settings.  Both loops “learn” or “optimize”; it is the nesting of these that garners the name “metalearning”. Each loop can use completely different optimization algorithms in its search for optimal results.

The rest of this post concerns this inner loop, and making sure that it finds optimal knob settings as quickly and efficiently as possible. The space of all possible knob settings is large: if, for example, each knob has 5 possible settings, and there are 100 knobs, then there is a total of 5^100 possible different settings: a combinatorial explosion. Such spaces are hard to search. There are a variety of different algorithms for exploring such a space. One very simple, very traditional algorithm is “hillclimbing”. This algo starts somewhere in this space, at a single point, say, the one with all the knobs set to zero. It then searches the entire local neighborhood of this point: each knob is varied, one at a time, and a score is computed. Of these scores, one will be best. The corresponding knob setting is then picked as the new center, and the process then repeats; it repeats until there is no improvement: until one can’t “climb up this hill” any further. At this point, the inner loop is done; the “best possible” program has been found, and control is returned to the outer loop.
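
A toy version of that inner loop might look as follows; the tuple-of-knob-settings representation and the scorer are invented for illustration, and the real MOSES code (in C++) is considerably more involved:

# Toy hillclimber, sketching the inner loop described above.  Instances are
# tuples of knob settings; the scorer below is a stand-in for running the
# candidate program against the data.
def neighbors(instance, settings_per_knob):
    # All instances differing from `instance` by exactly one knob setting.
    for knob, current in enumerate(instance):
        for setting in range(settings_per_knob):
            if setting != current:
                yield instance[:knob] + (setting,) + instance[knob + 1:]

def hillclimb(start, score, settings_per_knob=5):
    center, best = start, score(start)
    while True:
        improved = False
        for cand in neighbors(center, settings_per_knob):
            s = score(cand)
            if s > best:
                center, best, improved = cand, s, True
        if not improved:
            return center, best   # no neighbor improves: hand control back to the outer loop

# Toy scorer: pretend the best program has every knob set to 3.
print(hillclimb((0, 0, 0, 0), score=lambda inst: -sum(abs(k - 3) for k in inst)))
# ((3, 3, 3, 3), 0)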

Hill-climbing is a rather stupid algorithm: most knob settings will result in terrible scores, and are pointless to explore, but the hill-climber does so anyway, as it has no clue as to where the “good knobs” lie. It does an exhaustive search of the local neighborhood of single-knob twists.  One can do much better by using estimation-of-distribution algorithms, such as the Bayesian Optimization Algorithm.  The basic premise is that knob settings are correlated: good settings are near other good settings.  By collecting statistics and computing probabilities, one can make informed, competent guesses at which knob settings might actually be good.  The downside to such algorithms is that they are complex:  the code is hard to write, hard to debug, and slow to run: there is a performance penalty for computing those “educated guesses”.

This post explores a middle ground: a genetic cross-over algorithm that improves on simple hill-climbing simply by blindly assuming that good knob settings really are “near each other”, without bothering to compute any probabilities to support this rash assumption.  The algorithm works; headway can be made by exploring only the small set of knob settings that correlate with previous good knob settings.

To explain this, it is time to take a look at some typical “real-life” data. In what follows, a dataset was collected from a customer-satisfaction survey; the goal is to predict satisfaction from a set of customer responses.  The dataset is a table; the outer loop has generated a program decorated with a set of knobs.  Starting with some initial knob setting, we vary each knob in turn, and compute the score. The first graph below shows what a  typical “nearest neighborhood” looks like.  The term “nearest neighborhood” simply means that, starting with the initial knob setting, the nearest neighbors are those that differ from it by exactly one knob setting, and no more.  There is also a distance=2 neighborhood: those instances that differ by exactly two knob settings from the “center” instance.  Likewise, there is a distance=3 neighborhood, differing by 3 knob settings, etc. The size of each neighborhood gets combinatorially larger.  So, if there are 100 knobs, and each knob has five settings, then there are 5 × 100=500 nearest neighbors. There are 500 × 499 / 2 = 125K next-nearest neighbors, and 500 × 499 × 498 / (2 × 3) = 21M instances at distance=3. In general, this is the binomial coefficient: (500 choose k) for distance k. Different knobs, however, may have more or fewer than just 5 settings, so the above is just a rough example.
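
For concreteness, here is the arithmetic behind those rough counts, treating the 500 single-knob changes as the unit, as in the text:

# Neighborhood sizes grow as binomial coefficients (rough counts, as above).
from math import comb
for k in (1, 2, 3):
    print(k, comb(500, k))
# 1 500
# 2 124750      (about 125K)
# 3 20708500    (about 21M)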

Nearest Neighbor Scores

The above graph shows the distribution of nearest neighbor scores, for a “typical” neighborhood. The score of the center instance (the center of the neighborhood) is indicated by the solid green line running across the graph, labelled “previous high score”.  All of the other instances differ by exactly one knob setting from this center.  They’ve been scored and ranked, so that the highest-scoring neighbors are to the left.  As can be seen, there are maybe 15 instances with higher scores than the center, another 5 that seem to tie.  A slow decline is followed by a precipitous drop; there are another 80 instances with scores so bad that they are not shown in this figure.  The hill-climbing algo merely picks the highest scorer, declares it to be the new center, and repeats the process.

All of the other neighborhoods look substantially similar. The graph below shows an average over many generations (here, each iteration of the inner loop is one generation). The jaggedness above is smoothed out by averaging.

Nearest Neighbor Score Change

Rather than searching the entire neighborhood, one would like to test only those knob settings likely to yield good scores. But which might these be?  For nearest neighbors, there is no way to tell, without going through the bother of collecting statistics, and running them through some or another Bayesian estimation algorithm.

However, for more distant neighbors, there is a way of guessing and getting lucky: perform genetic cross-overs.  That is, take the highest and next-highest scoring instances, and create a new instance that differs from the center by two knob-settings: the two knobs associated with the two high scorers.  In fact, this new instance will very often be quite good, beating both of its parents.   The graph below shows what happens when we cross the highest scorer with each one of the next 70 highest. The label “1-simplex” simply reminds us that these instances differ by exactly two knob settings from the center.  More on simplexes later.  The green zero line is located at the highest-scoring single-knob change.  The graph shows that starting here, and twiddling the next-most-promising knob, can often be a win. Not always: in the graph below, only 4 different knobs showed improvement. However, we explored relatively few instances to find these four; for this dataset, most exemplars have thousands of knobs.

Average Score Change, 1-simplex

The take-away lesson here is that we can avoid exhaustive searches by simply crossing the 10 or 20 or 30 best instances, and hoping for the best. In fact, we get lucky with these guesses quite often. What happens if, instead of just crossing two, we cross three of the top scorers?  This is the “2-simplex”, below:

Average Score Change, 2-simplex

Notice that there are now even more excellent candidates!  How far can we go?  The 3-simplex graph below shows the average score change from crossing over four high-scoring instances:

Average Score Change, 3-simplex

The term “crossover” suggests some sort of “sexual genetic reproduction”. While this is correct, it is somewhat misleading.   The starting population is genetically very uniform, with little “genetic variation”.  The algorithm starts with one single “grandparent”, and produces a population of “parents”, each of which differ from the grandparent by exactly one knob setting. In the “nearest neighborhood” terminology, the “grandparent” is the “center”, and each “parent” is exactly one step away from this center. Any two “parents”, arbitrarily chosen, will always differ from one-another by exactly two knob settings. Thus, crossing over two parents will produce a child that differs by exactly one knob setting from each parent, and by two from the grandparent. In the “neighborhood” model, this child is a distance=2 from the grandparent.   For the case of  three parents, the child is at distance=3 from the grandparent, and so on: four parents produce a child that is distance=4 from the grandparent.  Thus, while “sexual reproduction” is a sexy term, it loses its punch with the rather stark uniformity of the parent population; thinking in terms of “neighbors” and “distance” provides a more accurate mental model of what is happening here.

The term “simplex” used above refers to the shape of the iteration over the ranked instances: a 1-simplex is a straight line segment, a 2-simplex is a right triangle, a 3-simplex is a right tetrahedron. The iteration is performed with 1, 2 or 3 nested loops that cross over 1, 2 or 3 instances against the highest. It is important to notice that the loops do not run over the entire range of nearest neighbors, but only over the top scoring ones. So, for example, crossing over the 7 highest-scoring instances for the 3-simplex generates 6!/(6-3)! = 6 × 5 × 4 = 120 candidates. Scoring a mere 120 candidates can be very quick, as compared to an exhaustive search of many thousands of nearest neighbors. Add to this the fact that most of the 120 are likely to score quite well, whereas only a tiny handful of the thousands of nearest neighbors will show any improvement, and the advantage of this guessing game is quite clear.
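
In code, the shape of that iteration looks roughly like the sketch below. It uses unordered combinations rather than the ordered nested loops counted above, so the candidate counts come out somewhat smaller, but the idea is the same; the instance representation and helper names are mine, not the MOSES sources:

# Sketch of simplex-style crossover: fold the single-knob changes of several
# top-ranked neighbors into one multi-knob candidate.  Instances are tuples
# of knob settings; everything here is illustrative, not the MOSES internals.
from itertools import combinations

def cross(center, parents):
    # Apply, to the center, the one knob setting each parent changed.
    child = list(center)
    for parent in parents:
        for knob, (c, p) in enumerate(zip(center, parent)):
            if c != p:
                child[knob] = p
    return tuple(child)

def simplex_candidates(center, ranked, n_cross, top_n):
    # Cross the best neighbor with every choice of n_cross others from the top_n.
    best, rest = ranked[0], ranked[1:top_n]
    for others in combinations(rest, n_cross):
        yield cross(center, (best,) + others)

center = (0, 0, 0, 0)
ranked = [(1, 0, 0, 0), (0, 2, 0, 0), (0, 0, 3, 0), (0, 0, 0, 4)]
print(list(simplex_candidates(center, ranked, n_cross=1, top_n=4)))
# [(1, 2, 0, 0), (1, 0, 3, 0), (1, 0, 0, 4)]  -- the "1-simplex" candidates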

So what is it like, after we put it all together? The graph below shows the score as a function of runtime.

Score as function of time

In the above graph, each tick mark represents one generation. The long horizontal stretches between tick marks shows the time taken to perform an exhaustive nearest-neighborhood search. For the first 100 seconds or so, the exemplar has very few knobs in it (a few hundred), and so an exhaustive search is quick and easy. After this point, the exemplars get dramatically more complex, and consist of thousands of knobs. At this point, an exhaustive neighborhood search becomes expensive: about 100 seconds or so, judging from the graph. While the exhaustive search is always finding an improvement for this dataset, it is clear that performing some optimistic guessing can improve the score a good bit faster. As can be seen from this graph, the algorithm falls back to an exhaustive search when the optimistic simplex-based guessing fails to show improvement; it then resumes with guessing.

To conclude: for many kinds of datasets, a very simple genetic-crossover algorithm, combined with hillclimbing, can prove to be an effective search strategy.

Nota bene: the above only works for some problem types; thus it is not (currently) enabled by default. To turn it on, specify the -Z1 flag when invoking moses.

Appendix

Just to keep things honest, and to show some of the difficulty of algorithm tuning, below is a graph of some intermediate results taken during the work.  I won’t explain what they all are, but do note one curious feature:  the algos which advance the fastest initially seem to have trouble advancing later on.  This suggests a somewhat “deceptive” scoring landscape: the strong early advancers get trapped in local maxima that they can’t escape.   The weak early advancers somehow avoid these traps.  Note also that the results depend a fair bit on the random-number-generator seed; different algos effectively work with different random sequences, and this somewhat confounds direct comparison.

Many Different Algorithms


Tuning Metalearning in MOSES

I’ve been studying MOSES recently, with an eye towards performance tuning it. Turns out optimization algorithms don’t always behave the way you think they do, and certainly not the way you want them to.

Given a table of values, MOSES will automatically learn a program that reproduces those values. That is, MOSES performs table regression: given N columns of “input” values, and one column of “output”, MOSES will create a program that outputs the output, given the inputs.  MOSES can deal with both floating point and boolean inputs, and thus can learn, for example, expressions such as ((x<2) AND b) OR (x*(y+1) >3).  MOSES programs are real “programs”: it can even learn branches and loops, although I won’t explore that here. For performance tuning, I studied the 4-parity problem: given 4 input bits, compute the parity bit.  Written out in terms of just AND, OR and NOT, this is a fairly complex expression, and is rather non-trivial to learn.

MOSES performs learning by keeping a “metapopulation” of example programs, or “exemplars”.  These are graded on how well they match the output, given the inputs. For the 4-parity problem, there are 2^4 = 16 different possible inputs; a given program may get any number of these correct.  For example, there are 16 ways to get one answer wrong; 16×15/2 ways to get two wrong, 16×15×14/6 ways to get three wrong, etc. This is the binomial distribution: (16 choose k) ways to get k answers wrong, in general. But this doesn’t mean that there are only 16 different programs that get one answer wrong: there are zillions: some simple, some very very complex.
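
For intuition, the 4-parity scoring can be written out in a few lines of Python; the candidate here is an ordinary function standing in for a combo program tree, and “parity” is taken to be the XOR of the four inputs:

# Count wrong answers over all 2^4 = 16 input rows (a stand-in for the real scorer).
from itertools import product

def parity(bits):
    return sum(bits) % 2          # XOR of the four input bits

def wrong_answers(candidate):
    return sum(candidate(bits) != parity(bits) for bits in product((0, 1), repeat=4))

print(wrong_answers(parity))                    # 0: a perfect program
print(wrong_answers(lambda bits: bits[0]))      # 8: no better than chance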

As MOSES iterates, it accumulates a metapopulation of programs that best fit the data. As soon as it finds a program that gets more correct answers than the others, the old metapopulation is wiped out; but then, it starts growing again, as new programs with equal score are found.  This is shown in the following graph:

Metapopulation size as function of generation number.

The red line shows the metapopulation size (divided by 50), as a function of the generation number (that is, the iteration count).  It can be seen to collapse every time the score improves; here, the “minus score”, in green, is the number of wrong answers: a perfect score has zero wrong answers; the program stops when a perfect score is reached.

In blue, the complexity of the program — actually, the complexity of the least complex program that produces the given score. Computing the parity requires a fairly complex combination of AND’s, OR’s and NOT’s; there is a minimum amount of complexity that such a program can have.  Here, for example, are two different programs that compute the parity perfectly, a short one:

and(or(and(or(and(!#1 !#2) and(!#3 #4)) or(!#2 !#3))
   and(#1 #2) and(#3 !#4))
   or(and(!#1 #2) and(#1 !#2) and(!#3 !#4) and(#3 #4)))

and a longer one:

or(and(or(and(or(!#1 !#3) #4) and(or(#1 !#2) !#3 !#4)
   and(or(#3 #4) #2)) or(and(or(!#1 !#4) #2 !#3)
   and(or(!#2 #3) #1 #4) and(!#1 !#4) and(!#2 #3)))
   and(#1 !#2 #3 !#4))

More on complexity later.

But first: how long does it take for MOSES to find a solution to 4-parity? It turns out that this depends strongly on the random-number sequence.  MOSES makes heavy use of a random number generator to explore the problem space.  Each run can be started with a different seed value, to seed the random number generator.  Some runs find the correct solution quickly; some take a surprisingly long time. Amazingly so: the distribution of solution times appears to be logarithmic, as in the following graph:

Runtime, showing temperature dependence

On the vertical axis, the amount of time, in seconds, to find a solution. On the horizontal axis, the order in which a solution was found, out of 20 random attempts. The way to read this graph is as follows: there is a Pr=1/20 chance of finding a solution in about 10 seconds. There is a Pr=2/20 chance of finding a solution in about 20 seconds, etc. Continuing: about a Pr=6/20 chance of finding a solution in less than about 100 seconds, and a Pr=17/20 chance of finding a solution in less than about 1000 seconds.

The shape of this graph indicates that there is a serious problem with the current algorithm. To see this, consider running two instances of the  algorithm for 300 seconds each. Per the above graph, there is a 50-50 chance that each one will finish, or a 75% chance that at least one of them will finish.  That is, we have a 75% chance of having an answer after 600 CPU-seconds.  This is better than running a single instance, which requires about 900 seconds before it has a 75% chance of finding an answer!  This is bad.  It appears that, in many cases, the algorithm is getting stuck in a region far away from the best solution.

Can we do better? Yes. Write p = Pr(t<T) for the probability that a single instance will find a solution in less time than T.  Then, from the complexity point of view, it would be nice to have an algorithm where two instances do NOT run faster than a single instance taking twice as long; that is, where

Pr(t<2T) ≥ p^2 + 2p(1-p)

The first term, p^2, is the probability that both instances finished.  The second term is the probability that one instance finished, and the other one did not (times two, as there are two ways this could happen).   More generally, for n instances,  we sum the probability that all n finished, with the probability that n-1 finished, and one did not (n different ways), etc.:

Pr(t<nT) ≥ p^n + n p^(n-1) (1-p) + [n(n-1)/2] p^(n-2) (1-p)^2 + … + n p (1-p)^(n-1)
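
As a quick sanity check: the sum on the right-hand side is just 1 - (1-p)^n, the chance that at least one of the n instances finishes. A throwaway sketch, reusing the two-instance, 300-second example from above:

# Verify that the binomial sum equals 1 - (1-p)^n.
from math import comb

def at_least_one(p, n):
    # Sum of the binomial terms above: probability that at least one of n
    # independent instances finds a solution within its time budget.
    return sum(comb(n, k) * p**(n - k) * (1 - p)**k for k in range(n))

print(at_least_one(0.5, 2), 1 - (1 - 0.5)**2)   # 0.75 0.75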

This inequality, this desired bound on performance, has a simple solution, given by the exponential decay of probability:

Pr(t<T) = 1-exp(-T/m)

As before,  Pr(t<T) is the probability of finding a solution in less than time T, and m is the mean time to finding a solution (the expectation value). To better compare the measured performance to this desired bound, we need to graph the data differently:

Showing the exponential bound

This graph shows the same data as before, but graphed differently: the probability of not yet having found a solution is shown on the horizontal axis. Note that this axis is logarithmic, so that the exponential decay bound becomes a straight line.  Here, the straight purple line shows the bound for a 500 second decay constant; ideally, we’d like an algorithm that generates points below this line.

Before continuing, a short aside on the label “temp”, which we haven’t explained yet. During the search, MOSES typically picks one of the simplest possible programs out of the current metapopulation, and explores variations of it; that is, it explores its local neighborhood.  If it cannot find a better program, it picks another simple exemplar out of the metapopulation, and tries with that, and so on.   It occurred to me that perhaps MOSES was being too conservative in always picking from among the least complex exemplars.  Perhaps it should be more adventurous, and occasionally pick a complex exemplar, and explore variations on that.   The results are shown in the green and blue lines in the graph above.  The select_exemplar() function uses a Boltzmann distribution to pick the next exemplar to explore. That is, the probability of picking an exemplar of complexity C as a starting point is

exp(-C/temp)

where temp is the “temperature” of the distribution. The original MOSES algorithm used temp=1, which appears to be a bit too cold; a temperature of 2 seems about right.  With luck, this new, improved code will be checked into BZR by the time you read this.

There is another issue: the unbounded size of the metapopulation. When MOSES stalls, grinding away and having trouble finding a better solution, the size of the metapopulation tends to grow without bounds, linearly over time. It can get truly huge: sometimes up to a million, after a few thousand generations.  Maintaining such a large metapopulation is costly: it takes up storage, and eats up CPU time to keep it sorted in order of complexity.  Realistically, with a metapopulation that large, there is only a tiny chance (exponentially small!) that one of the high-complexity programs will be selected for the next round. The obvious fix is to clamp down on the population size, getting rid of the unlikely, high-complexity members.   I like the results so far:

Runtime, using a limited population size.

Clamping the population size clearly improves performance — by a factor of two or more, as compared to before.  However, the troublesome behavior, with some solutions being hard to discover, remains.

Now, to attack the main issue: let’s hypothesize about what might be happening to cause the exceptionally long runtimes.  Perhaps the algorithm is getting stuck at a local maximum?  Due to the knob-insertion/tweaking nature of the algorithm, there are no “true” local maxima, but some may just have very narrow exits.  The standard solution is to apply a simulated-annealing-type trick, to bounce the solver out of the local maximum.  But we are already using a Boltzmann factor, as described above, so what’s wrong?

The answer seems to be that the algorithm was discarding the “dominated” exemplars, and was keeping only those with the best score, at varying levels of complexity. It only applied the Boltzmann factor to the complexity.  What if, instead, we applied the Boltzmann factor to a mixture of score and complexity?  Specifically, let’s try this:

exp(-(C - S·W) / temp)

Here, C is the complexity, as before, while S is the score, and W a weight.  That is, some of the time, the algorithm will select exemplars with a poor score, thus bouncing out of the local maximum.  Setting W to zero regains the old behavior, where only the highest-scoring exemplars are explored.  So .. does this work? Yes! Bingo! Look at this graph:

Score-weighted Annealing

Two sets of data points, those for W=1/4 and 1/3, look very good.  It’s somewhat strange and confusing that other W values do so poorly.   I’m somewhat worried that the W=1/4 value is “magical”: take a look again at the very first graph in this post.  Notice that every time a better solution is found, the complexity jumps by about 4.  Is the W=1/4 value special to the 4-parity problem? Will other problems behave similarly, or not?

I’m continuing to experiment. Collecting data takes a long time. More later… The above was obtained with the code in bzr revision 6573, with constant values for “temp” and “weight” hand-edited as per graphs. Later revisions have refinements that fundamentally alter some loops, including that in select_exemplar(), thus altering the range of reasonable values, and the meaning/effect of some of these parameters. Sorry :-)
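
For intuition only, here is roughly what such a score-and-complexity-weighted selection might look like in Python. The data layout and the use of random.choices are my own simplifications, and, as just noted, the real select_exemplar() has since been reworked, so this is a sketch of the idea rather than the actual code:

# Sketch of Boltzmann selection over the metapopulation, with the score
# folded into the exponent.  Setting score_weight=0 recovers the original,
# complexity-only behaviour.  Illustrative only; not the actual MOSES code.
import math
import random

def select_exemplar(metapop, temp=2.0, score_weight=0.25):
    # metapop: list of (score, complexity, program) tuples; score is "minus
    # the number of wrong answers", so higher (closer to zero) is better.
    weights = [math.exp(-(c - score_weight * s) / temp) for s, c, _ in metapop]
    return random.choices(metapop, weights=weights, k=1)[0]

metapop = [(-4, 3, "simple but weak"), (-1, 12, "complex but strong")]
picks = [select_exemplar(metapop)[2] for _ in range(10000)]
print(picks.count("complex but strong") / len(picks))   # small, but not zero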

I do hope that this post offers some insight into how MOSES actually works.  A general overview of MOSES can be found on the MOSES wiki, as well as a detailed description of the MOSES algorithm. But even so, the actual behavior described above wasn’t obvious, at least to me, until I did the experiments.

Appendix: Parallelizability

A short footnote about the generic and fundamental nature of the exponential decay of time-to-solution in search problems. Earlier in this post, there is a derivation of exponential decay as the result of running N instances in parallel.   How should this be understood, intuitively?

Search algorithms are, by nature, highly parallelizable: there are many paths (aka exemplars) to explore; some lead to a solution, some do not.  (An exemplar is like a point on a path: from it, there are many other paths leading away).  A serial search algorithm must implement a chooser: which exemplar to explore next? If this chooser  is unlucky/unwise, it will waste effort exploring exemplars that don’t lead to a solution, before it finally gets around to the ones that do.  By contrast, if one runs N instances in parallel (N large), then the chooser doesn’t matter, as the N-1 ‘bad’ exemplars don’t matter: the one good one that leads to a solution will end the show.

Thus, we conclude: if a serial search algorithm follows the exponential decay curve, then it has an optimal chooser for the next exemplar to explore.  If it is “worse” than exponential, then the chooser is poorly designed or incapable.  If it is “better” than exponential, then that means that there is a fixed startup cost associated with each parallel instance: cycles that each instance must  pay, to solve the problem, but do not directly advance towards a solution.  Ideal algorithms avoid/minimize such startup costs.  Thus, the perfect, optimal algorithm, when run in serial mode, will exhibit exponential solution-time decay.

The current MOSES algorithm very nearly achieves this for 4-parity, as shown in this last figure, which compares the original chooser to the current one (bzr revno 6579).

runtime, tuned chooser


Preview of a virtual learning environment

It’s been a while since the last update, but be assured we’ve been very busy working away on the embodiment code and developing our virtual learning environment. With AGI-11 now in progress at the Googleplex, we’ve put together a few videos to give an outline of what we’re working on. There isn’t any overly advanced learning going on yet, but it gives you a feel for where the project is going.

First up is a video demonstrating a human-controlled player navigating and interacting with the world. This world is built in Unity3D, so eventually we’ll be able to put this environment online, make it an iPad app, or whatever else, and let you guys interact and teach OpenCog directly.

The things to note are that it’s based on a Minecraft-like environment, which means the player and AI will be able to modify the terrain and build things. Other objects can also be moved around and interacted with. We’ve got a very flexible action system that allows new action types to be easily added, and OpenCog will be able to learn the causal effect of executing these previously unknown actions.

Next is a demonstration of 3D pathfinding (implemented by Troy Deheng), with the satisfaction of a single “demand” for energy by consuming the battery object. In addition, it also shows simple planning: OpenCog asks the player for a battery if it can’t find one in the world. After being asked, the player spawns a new battery with a switch, and OpenCog efficiently detects this correlation between using the switch and new batteries appearing, in essence learning a new behaviour to satisfy its goals.

Jared Wigmore has been working on Fishgram, which is a frequent sub-hypergraph miner that detects patterns in the AtomSpace (and so by extension, also in the virtual world). This is the component used to detect that pushing a switch creates a new battery.

Last is a shorter video showing demands beyond just energy. These are analogous to goals, but not quite the same; they differ in that they impel the system towards certain goals. The new demand in this video is one for integrity, which is roughly analogous to fatigue/health. In the video, the house is known to satisfy this demand and increase integrity, but the system oscillates between which demand is most important: Integrity, then Energy, then back again. Zhenhua Cai has already added a couple more demands: Competence and Certainty.

Thanks to Cord Krohn for putting the videos together, as well as doing environment and character design, and thanks to Michael Jia for 3D modelling and art assets.
