Why Hypergraphs?

OpenCog uses hypergraphs to represent knowledge.  Why?  I don’t think this is clearly, succinctly explained anywhere, so I will try to do so here.  This is a very important point: I can’t begin to tell you how many times I went searching for some whiz-bang logic programming system, or inference engine, or theorem-prover, or some graph re-writing engine, or some probabilistic programming system, only to throw up my hands and realize that, after many wasted hours, none of them do what I want.  If you’re interested in AGI, then let me assure you: they don’t do what you want, either.  So, what do I want them to do, and why?

Well, let’s begin easy: with graph re-writing systems.  These days, almost everyone agrees that a great way to represent knowledge is with graphs.  The structure IsA(Cat, Animal) looks like a graph with two vertexes, Cat and Animal, and a labelled edge, IsA, between them.  If I also know that IsA(Binky, Cat), then, in principle, I should be able to deduce that IsA(Binky, Animal).  This is a simple transitive relationship, and the act of logical deduction, for this example, is a simple graph re-write rule: If you see two IsA edges in a row, you should draw a third IsA edge between the first and the last vertex.  Easy, right?
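To make that concrete, here is a minimal sketch in Python (my own toy representation, not the OpenCog API): edges are stored as (relation, source, target) triples, and the transitive-IsA deduction is just a re-write that adds a new edge wherever two IsA edges line up head-to-tail.

edges = {("IsA", "Binky", "Cat"), ("IsA", "Cat", "Animal")}

def apply_isa_transitivity(edges):
    """Return the new IsA edges implied by chaining existing IsA edges."""
    new_edges = set()
    for (rel1, a, b) in edges:
        for (rel2, c, d) in edges:
            if rel1 == rel2 == "IsA" and b == c and ("IsA", a, d) not in edges:
                new_edges.add(("IsA", a, d))
    return new_edges

print(apply_isa_transitivity(edges))   # {('IsA', 'Binky', 'Animal')}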

So perhaps you’d think that all logic, induction and reasoning engines have graph rewrite systems at their core, right? So you’d think. In fact, almost none of them do.  And those that do, do it in some internal, ad hoc, non-public, undocumented way: there’s no API, it’s not exposed externally; it’s not an ‘official’ part of the system for you to use or tinker with.

OK, so why do I need a graph re-write system? Well, I’ve been working on a natural language parser, a so-called Viterbi decoder for Link Grammar.  My initial graph is a string of words: a sentence.  The vertexes are words, and the edges are arrows called “next word”. Real simple. To parse this sentence, I want to apply a certain set of simple graph-rewrite rules: for example, if word X is a noun, then create an arrow, called ‘part-of-speech’ (POS), from word X to the special vertex ‘noun’.  If the word immediately before word X is an adjective (i.e. if it has a POS arrow pointing to ‘adjective’), then create a new arrow, called ‘noun modifier’, pointing from X to this word before it.  This kind of graph markup is called ‘dependency parsing’, and is a very popular way of doing natural language parsing.  So you’d think that all dependency parsers have a graph re-write system at their core, right?  Hardly. In fact, just about none of them do.  And if they do, they’re walled off, hacked up, undocumented … you get the idea.
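Here’s a toy sketch of that kind of markup rule (my own simplification in Python, not the actual decoder): the sentence is a list of words, and each rule adds labelled arrows to a growing set of edges.

ADJECTIVES = {"big"}     # stand-ins for a real dictionary lookup
NOUNS = {"dog"}

def markup(words):
    # the initial graph: a chain of "next-word" arrows
    edges = [("next-word", words[i], words[i + 1]) for i in range(len(words) - 1)]
    for i, w in enumerate(words):
        if w in NOUNS:
            edges.append(("part-of-speech", w, "noun"))
            # if the previous word is an adjective, link it as a modifier
            if i > 0 and words[i - 1] in ADJECTIVES:
                edges.append(("noun-modifier", w, words[i - 1]))
    return edges

print(markup(["the", "big", "dog"]))
# [..., ('part-of-speech', 'dog', 'noun'), ('noun-modifier', 'dog', 'big')]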

The only dependency parser that I know of that has an explicit graph-rewriting system in it, that is open for tinkering, and is documented (!) is RelEx.  And that’s great.  Wow!  Although RelEx invented and used its own, custom, graph rewrite system, I suppose that, in principle, it could have used some other, pre-existing system to do this (Well, it couldn’t, because in 2005, there weren’t any robust, open-source graph rewriting systems. Whatever).

What else do I want to do? Well, I daydream about using a machine-learning system to learn new rules!   I mean, this is the goal of AGI, right? Have a machine that can learn new things?  Well, to learn new rules, let’s see, I need to have some simple syntax for representing rules.  Basically a simple graph language.  So you’d think that all graph re-writing systems have some simple, easy-to-use graph language, right?  No. Emphatically, hell no. With maybe one exception, you have to program in Java or C++ or C#.net. Unfortunately, my machine learning system doesn’t yet know how to program in those languages.

Here’s the leap of faith, the leap of logic that I’ve come to: It would be convenient if I could express graph re-write rules as graphs themselves.  It would be convenient if I could express logical implications as graphs themselves.  It would be convenient if my graphical programming language itself could be written as a graph. Well, it can be. In fact, it is easiest to do this if the graph is actually a hypergraph. I’ll explain why in the next section below.  If I had a hypergraph re-writing system, then I would have a place where I could unify natural language processing, logical reasoning and machine learning, all in one place.  So you’d think that anyone who was trying to build an AGI system would be writing its foundations on a hypergraph rewriting system, right? No, you’d be wrong. Apparently, OpenCog is the only system that does this.  Now, the OpenCog implementation has many design warts and architectural flaws. It’s hard to understand and hard to use.  But now, perhaps, you see why I’ve pledged allegiance to it, instead of running off with some other reasoning system or parser or Bayesian network or whatever.

Mathematical Foundations

In this section, I will try to put all of the above comments on a solid mathematical footing, invoking model theory, category theory, (and even n-categories!), and type theory.  The upshot of all of this will be that the easiest way to represent data structures so that machine learning algorithms can learn them, and then apply them both to natural-language parsing, and to logical reasoning, is to represent the data structures as hypergraphs.

From model theory and computer science, we have the concept of a signature: a set of functions which take some number of arguments and return some value (just like a signature in Java or C++).  If one ignores types for a moment (which is what lisp and scheme do), then, in principle, one can pass any value in any position of any function, and stack these up arbitrarily, recursively, even.  This is called a term algebra, or more precisely a free term algebra or ‘free theory’.  If the functions don’t have names, but are anonymous, then one has the lambda calculus.

One way to envision a member of a term algebra is as a directed tree graph.  So, if we have two functions f(x,y) and g(x,y) and three constants a,b,c, then f(a, g(b,c)) is a binary tree, with f at the root, a as its left leaf, and g as its right child, whose leaves are b and c. A term algebra is then just the collection of all such trees. Nothing more, nothing less.
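As a concrete sketch (just plain Python, no particular library): a term can be held as a nested tuple (function-name, arg, arg, …), which is exactly such a tree.

a, b, c = "a", "b", "c"            # constants are the leaves

def f(x, y): return ("f", x, y)    # named function symbols build interior nodes
def g(x, y): return ("g", x, y)

term = f(a, g(b, c))
print(term)    # ('f', 'a', ('g', 'b', 'c')) -- the tree for f(a, g(b,c))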

To do useful programming, one also needs predicates or relations: things that have truth values, and that relate or order terms. Thus, ‘greater than’ is a relation, and ‘a>b’ is either true or false.  Relations can also be things like IsA, HasA, BelongsTo, LivesIn, EmployedAt. The last two examples should make clear that relational algebras form the foundation of databases, whether SQL or noSQL.  Relations are combined with logical operators (employee X LivesIn city Y AND ReportsTo dept Z is a textbook example).

In general, one combines both term algebras and relational algebras, so that one writes things like 3<f(x,y), where f(x,y) is a term, < is a relation, and 3 is a constant.  Add to this the special free-variable binding operators ForAll and ThereExists, and one gets first-order logic. So, for example, ForAll x ThereExists y such that 3<f(x,y).

A special case of a relation is a term re-write rule.  This is a relation that takes a term, and replaces it with a different term: for example, ab->c, which says ‘whenever you see the string ab, replace it with c’. The BNF notation of computer languages is just a collection of term re-writing relations. One uses a term rewriting system to parse a (formal) language. Graph rewriting is just a variation of this: whenever you see a graph x, replace it with a graph y.
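As a quick sketch of term re-writing (a toy over strings, not a real parser): the rule ab->c is applied wherever it matches, and graph re-writing is the same idea with graphs in place of strings.

def rewrite(s, rules):
    """Apply string-rewrite rules repeatedly until nothing changes."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in s:
                s = s.replace(lhs, rhs, 1)
                changed = True
    return s

print(rewrite("aabb", [("ab", "c")]))   # 'aabb' -> 'acb': the embedded "ab" becomes "c"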

So far, I’ve avoided the issue of types.  In programming, types allow type safety.  Types make code more readable: f(string, int) is less mysterious than f(x,y). Types solve certain abstract recursion problems in lambda calculus.  A re-write rule in BNF notation is a typed rewrite rule: a substitution a->bc holds not just for any a, but specifically, only when a is a web page, or an IP address, or a URL.  A graph re-write rule that says ‘whenever you see x, replace it with y’ implicitly demands that x be typed: x can’t be just anything, it has to be a specific kind of graph, having a specific shape and linkage.  The rule applies for all graphs that have this shape, that are of this kind or type.  So a re-write rule x->y is really a rule (type x)->(type y). Graphically, it’s still two points x and y, with a directed edge -> in between them. Oh, wait, x and y aren’t points, x and y are graphs.  What kind of a graph has graphs as points?  What kind of graph has edges between graphs? A hypergraph!

And that is the main Ah-HA! moment.  Once you see that, you start seeing hypergraphs everywhere. Sure, you can visualize Set(a,b,c) as a tree-graph, with Set as the parent node, and three children a,b,c.  Or you can visualize this as a hypergraph: Set as a ‘link’ (a ‘hyper-edge’ with 3 endpoints, not 2), and the points a,b,c as the nodes contained in the link.  In fact, all hypergraphs are dual to these directed trees; if you have one, you can have the other.  Hypergraphs are just a convenient notation.

Let’s take a moment to look back on what just happened: a function f(x,y,z) is just a hyperedge f connecting three nodes x,y,z. A boolean expression a AND b AND c can be written as AND(a,b,c), which shows a specific example of a hypergraph equivalence. It can be written as a reduction rule: (a AND b AND c) -> AND(a,b,c), which is itself just a hypergraph of the form x->y, with x and y being hypergraphs.  The first-order logic constructs ‘for-all’ and ‘there-exists’ are just special cases of the lambda-calculus binding operation lambda, which binds free variables in an expression. Again, hypergraphs: lambda is just a hyperlink that binds a variable x in an expression y, and y was just a term, ahem, hypergraph!
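Here is a minimal sketch of that ‘everything is a hyperedge’ view (my own toy classes, not the actual OpenCog types): a Node is a named point, and a Link is a hyperedge that can connect any number of Atoms, including other Links, which is how a rule gets to have whole graphs as its endpoints.

class Atom:
    pass

class Node(Atom):
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return self.name

class Link(Atom):
    def __init__(self, kind, *outgoing):
        self.kind = kind          # e.g. "AND", "Implication", "Lambda"
        self.outgoing = outgoing  # any number of Atoms, possibly other Links
    def __repr__(self):
        return "%s(%s)" % (self.kind, ", ".join(map(repr, self.outgoing)))

a, b, c = Node("a"), Node("b"), Node("c")
conj = Link("AND", a, b, c)                  # AND(a,b,c): one 3-ary hyperedge
rule = Link("Implication", conj, Node("d"))  # a rewrite x->y whose x is itself a hypergraph
print(rule)                                  # Implication(AND(a, b, c), d)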

I mentioned categories and n-categories, and I suppose I should justify this mention. Insofar as category theory is the theory of points and arrows, then a rewrite rule between graphs is a morphism in the category of small diagrams.  A subtle but important point about category theory that is almost never discussed in intro-to-cat-theory texts, is that all objects are implicitly typed. In the category of Sets, the objects are all of the same kind: they are sets.  It’s not mentioned because in a given category, all objects are of the same type; types change only when a functor maps from one to another.   So, to understand the category-theoretic equivalent of types in computer science, we must think of functors.  But, as we just saw, a graph rewriting rule is a morphism between functors.  So you could say that graph re-writing is just the category Cat of small categories.  Or you could slide down this slope in a different direction, and start calling it a 2-category. Whatever.  Perhaps it’s useful to point out that graph rewriting algorithms are sometimes expressed as single-pushouts or as double-pushouts, with a pushout being a certain category-theoretic concept. Notable, for graph rewriting, is that any category with pushouts and equalizers has all (co-)limits. Except that, as we just saw, we want hyper-graph rewriting systems, not graph rewriting systems. So there.

What else are they good for?

In OpenCog, the Link and Node types inherit from the type Atom. This naming convention is intentionally suggestive: ‘Atom’ is meant to invoke the notion of an ‘atomic formula’ from model theory or first-order logic: that is, a formula that has no variables in it (it’s fully grounded), and that does have a truth value (it’s not composed of boolean connectives, and has no quantifiers in it).  This suggestive naming helps establish the intended use of OpenCog hypergraphs with regards to first-order logic.

The truth value is a different matter. The default (simplest) OpenCog truth value is a pair of floating point numbers: a probability and a confidence. These numbers allow several other AI concepts to be mapped into hypergraphs: Bayesian networks, Markov networks, and artificial neural networks. All three of these are graphs: directed graphs, at that. They differ in how they assign and propagate floating-point probabilities, entropies, activations. Ideas such as Markov logic networks, which implement maximum entropy principles (aka the Boltzmann partition function) on a network of first-order logic expressions, can be represented with OpenCog hypergraphs.  Oh, and I should mention PLN (Probabilistic Logic Networks), which is what the atomspace was originally designed for. That’s what I like about the OpenCog hypergraph atomspace: it has a tremendously powerful ability to succinctly and easily represent complex modern AI concepts.
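To show the plumbing only (this is my own simplification; the real PLN deduction formula is more involved): attach a (strength, confidence) pair to each atom, and propagate the pair through a link.

from dataclasses import dataclass

@dataclass
class TruthValue:
    strength: float     # probability-like value in [0, 1]
    confidence: float   # how much evidence backs it up, in [0, 1]

@dataclass
class Atom:
    name: str
    tv: TruthValue

def deduce(ab, bc):
    """A crude deduction blend, for illustration only."""
    return TruthValue(ab.tv.strength * bc.tv.strength,
                      min(ab.tv.confidence, bc.tv.confidence))

isa_binky_cat = Atom("IsA(Binky,Cat)", TruthValue(0.95, 0.9))
isa_cat_animal = Atom("IsA(Cat,Animal)", TruthValue(0.99, 0.95))
print(deduce(isa_binky_cat, isa_cat_animal))   # strength of about 0.94, confidence 0.9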

The good, the bad and the ugly.

You’ve heard about the good.  Now for the bad and the ugly.  First, the OpenCog atomspace implementation is slow and inefficient, over-complicated, badly architected, weakly-distributed, non-scalable, single-threaded. But let’s not go there.  All this might be fixable, after a lot of programming effort (and deep, architectural thinking). It’s been hotly debated in the past. Someday, maybe it’ll get fixed.

The bad thing about the OpenCog atomspace is that almost no one understands that, ahem, it is a programming language. Let me be very clear on this: OpenCog implements graph re-writing rules with the ImplicationLink. A sequence of ImplicationLinks can be used to compute things. In that sense, it is somewhat like the language Graph Programs, except that OpenCog allows fractional truth values, and logic programming, and other good things.  If we stick to using ImplicationLinks with crisp truth values (T/F), then the resulting system is essentially Prolog. Of course you know that Prolog is popular for AI programming projects, because it’s fairly easy to write reasoning engines and expert systems and the like in Prolog.  What you may not know is that closely related to Prolog is Answer-Set Programming (ASP). In fact, ASP uses exactly the same notation as Prolog does. It differs in two important ways: first, when you run a Prolog program, you get one answer. With ASP, you get all of the answers!  It’s dramatically more powerful, and the reason for this is that modern-day ASP solvers are built on top of modern-day Boolean SAT solvers. Which means that they are stunningly efficient and effective.
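To make the ‘crisp ImplicationLinks behave like Prolog’ claim concrete, here is a hedged sketch (toy Python, not the atomspace): grounded facts plus Horn-style implications, saturated by naive forward chaining.

facts = {("IsA", "Binky", "Cat"), ("IsA", "Cat", "Animal")}

# Each rule: (list of premise patterns, conclusion pattern); '?x' marks a variable.
rules = [([("IsA", "?x", "?y"), ("IsA", "?y", "?z")], ("IsA", "?x", "?z"))]

def match(pattern, fact, bindings):
    """Try to unify a pattern with a grounded fact, extending the bindings."""
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings

def satisfy(premises, facts, bindings):
    """Yield every way of grounding all the premises against the fact set."""
    if not premises:
        yield bindings
        return
    for fact in list(facts):
        b = match(premises[0], fact, bindings)
        if b is not None:
            yield from satisfy(premises[1:], facts, b)

def forward_chain(facts, rules):
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            for binding in satisfy(premises, facts, {}):
                new = tuple(binding.get(t, t) for t in conclusion)
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts

print(forward_chain(set(facts), rules))   # includes ('IsA', 'Binky', 'Animal')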

So what does this have to do with OpenCog? Well, here we have a system that, using ImplicationLinks, is essentially Prolog, more or less, when run in crisp-logic mode. Or, you could say, it’s like typed lambda calculus. But do we have a simple, easy-to-use syntax like Prolog for it? No we don’t. That’s bad. Can we take an existing Prolog program, run a tool on it, and convert it to ImplicationLinks? No we can’t.  Would it run fast? No it wouldn’t: it would probably be slower than the slowest Prolog ever: Borland Prolog running on a 5MHz IBM PC AT in 1986.  And forget an ASP solver for OpenCog.  For the special case where all OpenCog truth values are crisp T/F values, we do not have a Boolean SAT solver to find solutions for our graphs of ImplicationLinks.  This is bad, Really Bad. But I think that this is because very few people seem to understand that the OpenCog Atomspace really is a petri dish for programming languages.

Heck, we don’t even have anything equivalent to the RelEx Sentence Algorithms for OpenCog, even though RelEx is OpenCog-like. This absence is slowing down my efforts to continue work on the Link-Grammar parser, and to move natural language processing out of its stand-alone arena, into a general, flexible framework.

(And we’ve barely scratched the surface. In order to make implication and pattern mining run quickly in the atomspace, we need to implement something like the concept of ‘memoization’ from lisp/scheme. But it turns out that memoization is really just a relational algebra: it is a database of short expressions that stand in for long ones. The OpenCog Atomspace is also, among other things, a relational database that can store and query not only flat tables or key-value pairs, but full-blown hypergraphs. And this isn’t a day-dream; it’s crucial for performance (and it’s partially implemented)).
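A sketch of memoization viewed as a relational store (illustrative Python only): a table mapping long expressions to the short results that stand in for them, which is exactly the sort of key-to-value relation a hypergraph store can hold.

memo = {}   # expression -> previously computed result

def evaluate(expr):
    """Evaluate a nested ('+', args...) expression, caching every sub-result."""
    if expr in memo:
        return memo[expr]
    if isinstance(expr, int):
        return expr
    op, *args = expr
    if op != "+":
        raise ValueError("unknown operator: %r" % op)
    result = sum(evaluate(a) for a in args)
    memo[expr] = result
    return result

print(evaluate(("+", 1, ("+", 2, 3))))   # 6, and ('+', 2, 3) is now memoized
print(memo)                              # {('+', 2, 3): 5, ('+', 1, ('+', 2, 3)): 6}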

Why don’t we have these things? Well, it’s hard. It’s just not easy. We don’t have the infrastructure to make it easy, and we don’t have the users who demand these tools.   I don’t think most users are even aware of what the atomspace could do.   Almost no one is thinking about ‘how to program in the language of OpenCog’, even though it has the potential of far surpassing any of the existing probabilistic programming languages out there.  It’s time to change all this, but it will take someone smart and dedicated to do it. Many someones. This could be you.


Catalog of Current OpenCog Atom Types

Alex van der Peet (of the OpenCog Hong Kong team) has been working on cataloguing all Atom types currently in use in the OpenCog code on the wiki site.

This page lists them all, with a page for each one:

http://wiki.opencog.org/w/Category:Atom_Types

A few of the pages still don’t have any information on them.

To all OpenCog developers: If you’re working heavily with a certain set of Atom types, please check out the corresponding wiki page, and think about adding some comments or examples.


The Viterbi Parser

I’ve recently made some good progress on something that I’m calling “the Viterbi decoder”, a new parser for the Link Grammar natural language parser.  So I guess that means it’s time to talk a bit about the why and how of this parser.

The goal of providing this decoder is to present a flexible, powerful interface for implementing high-level semantic algorithms on top of the low-level link-grammar syntactic parser, and, in particular, for steering the parse based on high-level semantic knowledge. This allows the parser to move beyond being merely a syntactic parser, and to become fully integrated with general semantic artificial intelligence.

A less abstract list of expected benefits includes:

  • Incremental parsing: the ability to obtain partial results after providing partial sentences, a word at a time.
  • Less sensitivity to sentence boundaries, allowing longer, run-on sentences to be parsed far more quickly.
  • Mitigation of the combinatorial explosion of parses.
  • Allow grammatically broken/incorrect chat dialog to be parsed; in general, to do better with slang, hip-speak.
  • Enable co-reference resolution and anaphora resolution across sentences (resolve pronouns, etc.)
  • Enable annotation of the parse graph with word-sense data, entity markers.
  • Allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternative reference resolutions.
  • Allow a plug-in architecture, so that plugins, employing higher- level semantic (AGI) algorithms can provide parse guidance and parse disambiguation.
  • Eliminate many of the hard-coded array sizes in the code.

The data structures used to implement this resemble those of the OpenCog AtomSpace. All data classes inherit from a class called Atom (which is an atomic predicate, in the sense of mathematical logic). Atoms are typed; the two core types are Links and Nodes. Thus, all data is represented in the form of a “term algebra” (aka the “Free Theory”, in the sense of model theory). This structure allows all data to be represented as (hyper-)graphs, which in turn makes graph algorithms easier to implement. All these theoretical considerations provide a natural setting for storing Viterbi state information. Put differently, this provides a generic, uniform way of holding the various partly-finished parses, and effecting state transformations on them.

Since all of the data is represented dynamically (at run-time) by these (hyper-)graphs composed of atoms, developing custom algorithms to manipulate the parse becomes easy: there are no strange compile-time structures to master.  All algorithms can access the data in a uniform, common way.

Making the internal state directly visible allows low-level syntactic algorithms, as well as high-level, semantic algorithms to control parsing. In other words, the intended use of the Viterbi decoder is to provide a framework for parsing that should make it possible to integrate tightly (and cleanly) with high-level semantic analysis algorithms. Thus, reference and anaphora resolution can be done using the same graph structure as used for parsing; it should also allow graphical transformations, such as those currently implemented in RelEx.

One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found via the news story “Language Use is Simpler Than Previously Thought”, per Morten Christiansen, Cornell professor of psychology.

Currently, the parser can correctly parse many short sentences. It currently runs very slowly, as no pruning algorithms have yet been implemented. Instructions for turning it on can be found in the viterbi/README file. The code is not in the 4.7.10 tarball; you need something newer: i.e. pull from the svn source tree. It will be in 4.7.11, whenever that comes out.

Here’s an example parse of “this is a test”. First, the usual link-parser output:

         +--Ost--+
   +-Ss*b+  +-Ds-+
   |     |  |    |
this.p is.v a test.n

or, with the wall words:

    +---------------RW--------------+
    |              +--Ost--+        |
    +---Wd---+-Ss*b+  +-Ds-+        |
    |        |     |  |    |        |
LEFT-WALL this.p is.v a test.n RIGHT-WALL

The output of viterbi, with some explanatory comments, is this:

SEQ :                  # a sequence, an ordered set
  LING :               # a link-grammar link; naming conflict with opencog link.
    LING_TYPE : Wd     # the type of the link connecting two words.
    WORD_DISJ :        # holds the word and the connector used
      WORD : LEFT-WALL # all sentences begin with the left-wall.
      CONNECTOR : Wd+  # + means "connect to the right". - means left
    WORD_DISJ :
      WORD : this.p    # word with suffix as it appears in link-grammar dictionary
      CONNECTOR : Wd-
  LING :
    LING_TYPE : Ss*b   # and so on ...
    WORD_DISJ :
      WORD : this.p
      CONNECTOR : Ss*b+
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : Ss-
  LING :
    LING_TYPE : Ds
    WORD_DISJ :
      WORD : a
      CONNECTOR : Ds+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Ds-
  LING :
    LING_TYPE : Ost
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : O*t+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Os-

Oh, and I suppose it’s appropriate to answer the question “why is it called the Viterbi parser?”  I’m calling it that because it is inspired by (and vaguely resembles) the Viterbi algorithm famous from signal processing. A characteristic feature of that algorithm is that it maintains a set of states in parallel. As each new bit is received, some of the states become inherently inconsistent (e.g. because some checksum is violated), while other new states become possible. Once a certain number of bits have been received, the ones that can be consistently interpreted with the checksum constraints can be output. The process then repeats with each new bit streaming in.

In link-grammar, a “disjunct” can be thought of as a puzzle piece with a word printed on it. There are many different puzzle pieces with the same word on them. As each word comes in, one tries to find a piece that fits (this is like the Viterbi checksum). Sometimes, more than one fits, so one has multiple ‘alternatives’ (this is like the Viterbi state-vector). The algo keeps a set of these alternatives (of assembled pieces), and, as words come in, alternatives are either discarded (because nothing fits) or are elaborated on.
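A toy sketch of that bookkeeping (invented pieces, not real link-grammar disjuncts; the real decoder also discards states whose connectors can’t mate, which this sketch skips): each state is a partial assembly, and each incoming word spawns one new alternative per piece that carries it.

# word -> list of "puzzle pieces"; each piece names the connectors it offers
DISJUNCTS = {
    "this": ["Ss*b+"],
    "is":   ["Ss- O*t+"],
    "a":    ["Ds+"],
    "test": ["Ds- Os-", "Os-"],   # two alternative pieces for the same word
}

def step(states, word):
    """Extend every surviving alternative with every piece for the new word."""
    new_states = []
    for state in states:
        for piece in DISJUNCTS.get(word, []):
            new_states.append(state + [(word, piece)])
    return new_states

states = [[]]                           # start with one empty alternative
for word in ["this", "is", "a", "test"]:
    states = step(states, word)
    print(word, "->", len(states), "alternatives in play")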

Unlike the viterbi algorithm, in natural language processing, it is useful to keep some of these alternatives or ambiguities around until much later stages of processing, when the disambiguation can finally be performed. As a famous example: “I saw the man with the telescope” has two valid syntactic parses, and two valid semantic interpretations.  Who was holding the telescope, me, or the man? Resolving this would be like applying a checksum to two different paths very late in the Viterbi game.

I like this analogy because it is vaguely biological as well: or perhaps I should say “neural net-ish”. The multiple, provisional states that are kept around are sort of like the activation states of a feed-forward artificial neural network. But this is not very deep: the feed-forward neural net looks like a Hidden Markov Model (HMM), and the Viterbi algorithm is essentially an HMM algorithm. No surprise!

But all this talk of algorithms hides the true reason for this work. The above algo is not strong enough to reproduce the old parser behavior: it can create islands; it ignores post-processing. The original algorithm uses integer-valued “cost” to rank parses; I want to replace this by floating point values (probabilities! maximum entropy!).

I also want to implement an “algorithm plug-in” API — basically, a way of offering “here’s the current state, go and modify it” — to have ‘mind-agents’, in OpenCog terminology. The above puzzle-piece assembly algo would be the first to run, but clearly, others are needed to prevent islands, or to re-order states by probability/likelihood.   Some of these may be clearly distinct algos; others may end up as tangled balls of complexity.  Factorization into distinct algos is clearly possible: RelEx already had a list of algos that were applied in sequential order.  First, some POS tagging was done, then some head-word verb extraction, then some entity extraction, etc. Algorithms can be layered.
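A sketch of what such a plug-in API could look like (hypothetical names, not an existing interface): each plug-in is just a callable that takes the current parse state and returns a modified state, and plug-ins are run in sequence, RelEx-style.

def pos_tagger(state):
    """Toy plug-in: mark every word with a (fake) part of speech."""
    state["pos"] = {w: ("noun" if w == "test" else "other") for w in state["words"]}
    return state

def prune_unlikely(state):
    """Toy plug-in: keep only the most promising alternatives."""
    state["alternatives"] = state["alternatives"][:2]
    return state

PLUGINS = [pos_tagger, prune_unlikely]    # order matters; algorithms are layered

def run_pipeline(state, plugins=PLUGINS):
    for plugin in plugins:
        state = plugin(state)
    return state

state = {"words": ["this", "is", "a", "test"],
         "alternatives": ["parse-1", "parse-2", "parse-3"]}
print(run_pipeline(state))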

So really, the core issue I’m hoping to solve here is that of having a uniform development environment: link-grammar is in C, has no probability (besides cost), and no internal API. RelEx is in Java, is explicitly graphical, but is not a hypergraph, has no probabilities, and can’t provide parse feedback to control link-grammar. RelEx output was pumped into OpenCog, which is in C++; it cannot feed back into RelEx or Link Grammar.  The Link Grammar dictionaries are files: how can an automated system learn a new word, and stick it into a file?

At the moment, there aren’t really any new or novel algorithms: I’ve day-dreamed some in the past, but the fractured approach halts progress. All these boundaries are barriers; the hope here is to get past all of these barriers.  The work is really an architectural re-design, and not a new whiz-bang algo.


The MOSES Metapopulation

Kaj Sotala recently asked for an update on how MOSES selects a “species” to be “mutated”, when it is searching for the fittest program tree. I have some on-going, unfinished research in this area, but perhaps this is a good time to explain why.

To recap: MOSES is a program-learning system. That is, given some input data, MOSES attempts to learn a computer program that reproduces the data. It does so by applying a mixture of evolutionary algorithms: an “inner loop” and an “outer loop”. The inner loop explores all of the mutations of a “species” (a “deme”, in MOSES terminology), while the outer loop chooses the next deme to explore. (Each deme is a “program tree”, that is, a program written in a certain lisp-like programming language).

So: the outer loop selects some program tree, whose mutations will be explored by the inner loop. The question becomes, “which program tree should be selected next?” Now, nature gets to evolve many different species in parallel; but here, where CPU cycles are expensive, it’s important to pick a tree whose mutations are “most likely to result in an even fitter program”. This is a bit challenging.

MOSES works from a pool of candidate trees, of various fitnesses. With each iteration of the inner loop, the pool is expanded: when some reasonably fit mutations are found, they are added to the pool. Think of this pool as a collection of “species”, some similar, some not, some fit, some not so much. To iterate the outer loop, it seems plausible to take the fittest candidate in the pool, and mutate it, looking for improvements. If none are found, then in the next go-around, the second-most-fit program is explored, etc. (Terminology: in MOSES, the pool is called the “metapopulation”.)

It turns out (experimentally) that this results in a very slow algorithm. A much better approach is to pick randomly from the highest scorers: one has a much better chance of getting lucky this way. But how to pick randomly? The highest scorers are given a probability p ~ exp(score/T), so in fact the highest-scoring have the highest probability of being picked, but the poorly-scoring have a chance too. This distribution is the “Gibbs measure” aka “Boltzmann distribution” (T is a kind of “temperature”; it provides a scale; it’s held constant in the current algos).  I’m guessing that this is the right measure to apply here, and can do some deep theoretical handwaving, but haven’t really worked this out in detail. Experimentally, it works well; there even seems to be a preferred temperature that works well for most/all different problems (but this is not exactly clear).
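As a standalone sketch of that pick (toy Python, not the MOSES source): each candidate is chosen with probability proportional to exp(score/T).

import math, random

def boltzmann_pick(candidates, temperature=1.0):
    """candidates: list of (score, deme); fitter demes are picked more often."""
    weights = [math.exp(score / temperature) for score, _ in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

metapop = [(3.2, "deme-A"), (3.1, "deme-B"), (1.0, "deme-C")]
print(boltzmann_pick(metapop, temperature=0.5))   # usually deme-A or deme-B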

One can do even better. Instead of using the score, a blend of score minus program-tree complexity works better; again, this is experimentally verified.   Nil added this a while back, and his theoretical justification was to call it “Solomonoff complexity”, and turn it into a ‘Bayesian prior’. From an engineering viewpoint, it’s basically saying that, to create a good design suitable for some use, it’s better to start with a simple design and modify it, than to start with a complex design and modify it. In MOSES terminology, it’s better to pick an initial low-complexity but poorly scoring deme, and mutate it, than to start with something of high complexity, high score, and mutate that. Exactly what the blending ratio (between score and complexity) is, and how to interpret it, is an interesting question.

Experimentally, I see another interesting behaviour, that I am trying to “fix”.   I see a very classic “flight of the swallow” learning curve, dating back to the earliest measurements of the speed of telegraph operators at the end of the 19th century. At first, learning is fast, and then it stalls, until there is a break-through; then learning is again fast (for a very brief time — weeks for telegraph operators) and then stalls (years or a decade for telegraph operators). So, in MOSES, at first, one picks a deme, almost any deme, and almost any mutation will improve upon it. This goes on for a while, and then plateaus. Then there’s a long dry spell — picking deme after deme, mutating it, and finding very little or no improvement. This goes on for a long time (say, thousands of demes, hours of cpu time), when suddenly there is a break-through: dozens of different mutations to some very specific deme all improve the score by some large amount. The Boltzmann weighting above causes these to be explored in the next go-around, and mutations of these, in turn, all yield improvements too. This lasts for maybe 10-20 steps, and then the scores plateau again. Exactly like the signalling rate of 19th-century telegraph operators 🙂 Or the ability of guitar players. Or sportsmen, all of which have been measured in various social-science studies, and have shown the “flight of the swallow” curve.

(Can someone PLEASE fix the horribly deficient Wikipedia article on “learning curve”? It totally fails to cite any of the seminal research and breakthroughs on this topic. Check out Google Images for examples of fast learning, followed by a long plateau. So, e.g.:

[Figure: a stylized learning curve]

[Figure: a learning curve, in real life, for salesmen]

[Figure: an actual MOSES learning curve, taken from real data (the “bank” dataset) from a previous OpenCog blog post on MOSES. It shows rapid progress followed by stagnant plateaus, punctuated with rapid progress again, except the plateaus are much flatter and much longer, and the upward curves are much sharper and faster than in the stylized figures. This curve shows a combination of the inner and outer loops, and so, strictly speaking, does not represent exactly what I’m discussing here.]

All these curves beg the question: why does Google find only the highly stylized ones, and not show any raw, actual data? Has the learning curve turned into an urban legend?)

Recently, I have been trying to shorten the plateau, by trying to make sure that the next deme I pick for exploration is one that is least similar to the last one explored. The rationale here is that the metapopulation gets filled with lots of very, very similar species, all of which are almost equally fit, all of which are “genetically” very similar. Trying to pick among these, to find the magic one, the one whose mutations will yield a break-through, seems to be a losing strategy. So, instead, add a diversity penalty: explore those “species” that are as different as possible from the current one (but still have about the same fitness score). So far, this experiment is inconclusive; I wasn’t rewarded with instant success, but more work needs to be done. It’s actually fairly tedious to take the data…
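A sketch of that diversity idea (hypothetical scoring over toy genotype strings, not the actual MOSES code): among demes whose scores are close to the best, prefer the one most different from the deme just explored.

def dissimilarity(a, b):
    """Crude Hamming-style distance between two equal-length genotype strings."""
    return sum(x != y for x, y in zip(a, b))

def pick_diverse(metapop, last_explored, score_tolerance=1.0):
    """metapop: list of (score, genotype); pick a near-best, maximally different deme."""
    best = max(score for score, _ in metapop)
    near_best = [(s, g) for s, g in metapop if best - s <= score_tolerance]
    return max(near_best, key=lambda sg: dissimilarity(sg[1], last_explored))

metapop = [(9.8, "AABBA"), (9.7, "AABBB"), (9.6, "BBAAA"), (5.0, "CCCCC")]
print(pick_diverse(metapop, last_explored="AABBA"))   # -> (9.6, 'BBAAA')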


Fishgram: Frequent Interesting Subhypergraph Mining for OpenCog

One of the tools OpenCog has needed for a long time, is something that can relatively quickly scan an Atomspace and find the interesting patterns in it.  “Interesting” may be defined in a variety of ways, such as “frequent”, or “surprising” (as measured by information theory), etc.  This capability has often been referred to in OpenCog documents as “pattern mining.”

Jade O’Neill (fka Jared Wigmore) implemented python software doing this for the Atomspace some time ago — Fishgram, the Frequent Interesting SubHyperGRaph Miner.   Fishgram has been used to recognize patterns in Atomspaces resulting from OpenCog’s “perception” of a Unity3D-based virtual world.

Now,  a wiki page has been created, covering some details of Fishgram — including pseudocode, an explanation of the algorithm, and some indication of which software classes carry out which parts of the algorithm…

http://wiki.opencog.org/w/Fishgram

Plenty more work needs to be done with Fishgram, yet it does currently work and can extract some interesting patterns from Atomspaces….

Some simple examples have also been done, feeding patterns output via Fishgram into PLN…

I think this is a very valuable tool that could be used for a lot of different OpenCog applications, and it would be great to see others jump onto it and help with development.

The current version of Fishgram looks for frequent subhypergraphs (i.e. frequent subhypergraph patterns, which may contain multiple variables).  One thing that Jade and I have talked about a lot is extending Fishgram to search for “surprising” subhypergraphs, where surprisingness may be measured using interaction information or synergy, as described in these papers:

http://www.rni.org/bell/nara4.pdf

http://arxiv.org/abs/1004.2515/

Those who like logic may also enjoy this paper, which connects interaction information with the logic of questions:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.6110

It seems that a good implementation of a good measure of surprisingness will be a valuable thing to have in OpenCog generally, not just in Fishgram.   If we want “novelty seeking” to be one of the top-level goals of a young AGI or proto-AGI system (which I think we do), then having a nice way to measure novelty seems like a good thing — and the interaction information and the informational synergy, as described in these papers, seem like a good approach.
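Coming back to the ‘frequent’ half for a moment, here is a very rough sketch of the pattern-counting idea (toy Python, not Jade’s implementation): abstract one node of each link into a variable, and count how often each resulting pattern shows up in the atomspace.

from collections import Counter

links = [
    ("EvaluationLink", "near", "agent", "tree"),
    ("EvaluationLink", "near", "agent", "house"),
    ("EvaluationLink", "holds", "agent", "ball"),
]

patterns = Counter()
for link in links:
    for i in range(1, len(link)):                  # replace each argument in turn
        pattern = link[:i] + ("$var",) + link[i + 1:]
        patterns[pattern] += 1

min_support = 2
frequent = {p: n for p, n in patterns.items() if n >= min_support}
print(frequent)   # ('EvaluationLink', 'near', 'agent', '$var') occurs twice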

Onward and upward 😉

Ben G
