Cogistry: Accelerating Algorithmic Chemistry via Cognitive Synergy

This post describes some speculative ideas for new AI/AGI/Artificial-Life/Artificial-Chemistry development that could be done on top of the OpenCog platform, leveraging OpenCog tools in a novel way.

These are not necessarily proposed for immediate-term development – we have a lot of other stuff going on, and a lot of stuff that’s half-finished or ¾-finished or more, that needs to be wrapped up and improved, etc.   They are, however, proposed as potentially fundamental enhancements to the OpenCog and PrimeAGI (CogPrime) designs in their current form.

The core motivation here is to add more self-organizing creativity to the AGI design.   One can view these ideas as extending what MOSES does in the PrimeAGI design – MOSES (at least in its aspirational version) is artificial evolution enhanced by pattern mining and probabilistic reasoning; whereas Cogistry is, very loosely, more like artificial biochemistry similarly enhanced.

Historical Background

Walter Fontana, in a fascinating 1990 paper, articulated a novel approach to “artificial life” style digital evolution of emergent phenomena, called “algorithmic chemistry.”   The basic concept was: Create small codelets, so that when codelet A acts on codelet B, it produces a new codelet C.   Then put a bunch of these codelets in a “primordial algorithmic soup” and let them act on each other, and see what interesting structures and dynamics emerge.

The paper reports some interesting emergent phenomena, but the research programme was then dropped at an early stage.   Very broadly speaking, the story seems to have been similar to what happened with a lot of Alife-related work of that era: some cool-looking self-organizational phenomena occurred, but not the emergence of the highly complex structures and dynamics the researchers were looking for.

These sorts of results spawned the natural question “why?”   Did the simulations involved not have a large enough scale?   (After all, the real primordial soup was BIG and based on parallel processing and apparently brewed for quite a long time before producing anything dramatic.)   Or were the underlying mechanisms simply not richly generative enough, in some way?   Or both?

What I am going to propose here is not so much a solution to this old “why” question, but rather a novel potential route around the problem that spawned the question.   My proposal – which I call “Cogistry” — is to enhance good old Fontana-style algorithmic chemistry by augmenting its “self-modifying program soup” with AI algorithms such as hypergraph pattern mining and probabilistic reasoning.   I believe that, if this is done right, it can lead to “algorithm soups” with robust, hierarchical, complex emergent structures – and also to something related but new and different: emergent, self-organizing program networks that carry out functions an AI agent desires for achievement of its goals.

That is: my aim with Cogistry is to use probabilistic inference and pattern mining to enhance algorithmic chemistry, so as to create “digital primordial soups” that evolve into interesting digital life-forms, but ALSO so as to create “life-like program networks” that transform inputs into outputs in a way that carries out useful functions as requested by an AI agent’s goal system.   The pure “digital primordial soup” case would occur when the inference and pattern mining are operating with the objective of spawning interesting structures and dynamics; whereas the “useful program network” case would occur when the inference and pattern mining are operating with an additional, specific, externally-supplied objective as well.

There is a rough analogy here with the relation between genetic algorithms and estimation of distribution algorithms (EDAs).   EDAs aim to augment GA mutation and crossover with explicit pattern mining and probabilistic reasoning.  But there are significant differences between EDAs and the present proposal as well.   There is a lot more flexibility in an algorithmic chemistry network than in an evolving population of bit strings or typical GP program trees; and hence, I suspect, a lot more possibility for the "evolution/self-organization of evolvability" and the "evolution/self-organization of inferential analyzability" to occur.   Of course, though, this added flexibility also provides a lot more potential for messes to be made (including complex, original sorts of messes).

From an Alife point of view, the “chemistry” in the “algorithmic chemistry” metaphor is intended to be taken reasonably seriously.  A core intuition here is that to get rich emergent structures and dynamics from one’s “digital biology” it’s probably necessary to go a level deeper to some sort of “digital chemistry” with a rich combinatorial potential and a propensity to give rise to diverse stable structures, including some with complex hierarchical forms.    One might wonder whether this is even deep enough, whether one needs actually to dig down to a “digital physics” from which the digital chemistry emerges; my intuition is that this is not the case and focusing on the level of algorithmic chemistry (if it’s gotten right) is deep enough.

Actually, the “digital physics,” in the analogy pursued here, would be the code underlying the algorithms in the algorithm-soup: the programming-language interpreter and the underlying C and assembly code, etc.   So part of my suggestion here will be a suggestion regarding what kind of digital physics is likely to make algorithmic chemistry work best: e.g. functional programming and reversible computing.   But, according to my intuition, the core ideas of this “digital physics” have already been created by others for other purposes, and can be exploited for algorithmic-chemistry purposes without needing dramatic innovations on that level.

An Algorithmic Chemistry Framework

First I will define a framework for algorithmic chemistry in a reasonably general way.   I will then fill in more details, bit by bit.

The basic unit involved I will call a “codelet” – defined simply as a piece of code that maps one or more input codelets into one or more output codelets.   What language to use for specifying codelets is a subtle matter that I will address below, but that we don’t need to define in order to articulate a general algorithmic chemistry framework.

Fontana’s original work involved a chemical soup with no spatial structure, but other work with artificial life systems suggests that a spatial structure may be valuable.  So we propose a multi-compartment system, in which each compartment has its own “algorithmic chemical soup.”    A codelet, relative to a specific compartment, can be said to have a certain “count” indicating how many copies of that codelet exist in that compartment.

Inside each compartment, multiple dynamics happen concurrently (a toy simulation sketch follows the list):

  • Migration: codelets migrate from one compartment to neighboring compartments.    This can be done initially via random diffusion.
  • Importance spreading: codelets have short and long term importance values, which spread e.g. via ECAN dynamics
  • Forgetting: codelets with low long-term importance are “forgotten” from a compartment (and may or may not be saved to some persistent store, depending on system configuration)
  • Reaction: codelets in the compartment act on other codelets in the compartment, producing new codelets.   Codelets may be chosen for reaction with probability based on their short term importance values.
  • Generation: new codelets are introduced into the compartment, either representing inputs to the system from some external source, or generated randomly from some distribution, or drawn from some set of codelets believed to be generally useful
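To make the interplay of these processes concrete, here is a toy Python sketch of one compartment's update loop.  The Codelet and Compartment classes, the STI/LTI fields and the thresholds are invented scaffolding for illustration only (not OpenCog code), and ECAN-style importance spreading is left out.

import random
from collections import Counter

class Codelet:
    """A codelet: code that maps input codelets to output codelets.
    Here it just wraps a Python function, for illustration."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn        # fn: Codelet -> list of Codelets
        self.sti = 1.0      # short-term importance
        self.lti = 1.0      # long-term importance

class Compartment:
    """One 'algorithmic chemical soup'; counts track copies of each codelet."""
    def __init__(self):
        self.counts = Counter()     # Codelet -> count

    def step(self, neighbors, new_codelets):
        # Generation: inject external inputs or randomly generated codelets.
        for c in new_codelets:
            self.counts[c] += 1
        # Reaction: pick codelets with probability based on STI; one acts on another.
        pool = list(self.counts.elements())
        if len(pool) >= 2:
            a, b = random.choices(pool, weights=[c.sti for c in pool], k=2)
            for product in a.fn(b):
                self.counts[product] += 1
        # (Importance spreading via ECAN-like dynamics would go here; omitted.)
        # Forgetting: drop codelets whose long-term importance is too low.
        for c in [c for c in self.counts if c.lti < 0.1]:
            del self.counts[c]
        # Migration: random diffusion of one codelet copy to a random neighbor.
        if neighbors and self.counts:
            c = random.choice(list(self.counts.elements()))
            self.counts[c] -= 1
            if self.counts[c] == 0:
                del self.counts[c]
            random.choice(neighbors).counts[c] += 1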

I will use the word "codenet" to refer to a collection of codelets that interact with each other in a coherent way.   This is intentionally a vague, intuitive definition – because there are many kinds of coherent networks of codelets, and it's not obvious which ones are going to be the most useful to look at, in which contexts.   In some cases a chain of the form "In the presence of background codelet-set B, A1 reacts with B to make A2, which reacts with B to make A3, etc. …" may be very influential.   In other cases cycles of the form "A and B react to make C; then A and C react to make more B; and B and C react to make more A" may be critical.   In other cases it may be more complex sorts of networks.   Exactly how to chop up a soup of codelets, growing and interacting over time, into distinct or overlapping codenets is not entirely clear at present and may never be.   However, it's clear that for understanding what is going on in an algorithmic-chemistry situation, it's the codenets and not just the codelets that need to be looked at.   If codelets are like chemicals, then codenets are like chemical compounds and/or biological systems.

Implementation Notes

In terms of current computing architectures, it would be natural to run different compartments on different machines, and to run the five processes in different threads, perhaps with multiple threads handling reaction, which will generally be the most intensive process.

If implemented in an OpenCog framework, then potentially, separate compartments could be separate Atomspaces, and the dynamic processes could be separate MindAgents running in different threads, roughly similar to the agents now comprising the ECAN module.   Also, in OpenCog, codelets could be sub-hypergraphs in Atomspace, perhaps each codelet corresponding to a DefineLink.

Reactions would naturally be implemented using the Forward Chainer (a part of the Rule Engine, which leverages the Pattern Matcher).   This differs a bit from PLN’s use of the Forward Chainer, because in PLN one is applying an inference rule (drawn from a small set thereof) to premises, whereas here one is applying a codelet (drawn from a large set of possible codelets) to other codelets.

Measuring Interestingness

One interesting question is how to measure the interestingness of a codelet, or codenet.

For codelets, we can likely just bump the issue up a level: A codelet is as interesting as the codenets it’s part of.

For codenets, we can presumably rely on information theory.   A compartment, or a codenet, as it exists at a particular point in time, can be modeled using a sparse vector with an entry for each codelet that has nonzero count in the compartment or codenet (where the entry for a codelet contains the relevant count).      A compartment or codenet as it exists during an interval of time can then be modeled as a series of sparse vectors.   One can then calculate the interaction information of this vector-series (or the “surprisingness” as defined in the context of OpenCog’s Pattern Miner).   This is a good first stab at measuring how much novelty there is in the dynamics of a codenet or compartment.

In a codenet containing some codelets representing Inputs and others representing Outputs, one can also calculate interaction information based only on the Input and Output codelets.  This is a measure of the surprisingness or informativeness of the codenet’s relevant external behaviors, rather than its internal dynamics.
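As a crude stand-in for the interaction-information idea, here is a Python sketch that estimates the multi-information (total correlation) of codelet presence/absence across a series of compartment or codenet snapshots.  Higher values mean the codelets' appearances are more tightly coupled than independence would predict, which is one rough proxy for "surprising dynamics".  The snapshot encoding is invented for illustration.

import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (in bits) of a sequence of hashable samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def total_correlation(snapshots, codelets):
    """Multi-information of codelet presence over a series of snapshots.

    snapshots: list of dicts mapping codelet -> count, one per time step.
    codelets:  the codelets to consider; each contributes a binary
               presence/absence variable per snapshot.
    Returns the sum of marginal entropies minus the joint entropy: zero when
    the codelets appear independently, larger when their dynamics are coupled."""
    marginals = sum(
        entropy([snap.get(c, 0) > 0 for snap in snapshots]) for c in codelets)
    joint = entropy([tuple(snap.get(c, 0) > 0 for c in codelets)
                     for snap in snapshots])
    return marginals - joint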

Pattern Mining and Inference for Algorithmic Chemistry

Given the above approach to assessing interestingness, one can use a modification of OpenCog’s Pattern Miner to search for codenets that have surprising dynamics.   One can also, in this way, search for patterns among codenets, so that specific codenets fulfilling the patterns have surprising dynamics.   Such patterns may be expressed in the Atomspace, in terms of “abstract codelets” — codelets that have some of their internal Atoms represented as VariableNodes instead.

An "abstract codenet" may be defined as a set of (possibly-abstract codelet, count-interval, time-interval) triples, where the time-interval is defined as a pair (start, end), where start and end are defined as offsets from the initiation of the codenet.   The interpretation of such a triple is that (C, (m,n), (s,e)) means that some codelet instantiating abstract codelet C exists with count between m and n, during the time interval spanning from s to e.
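A minimal Python sketch of this data structure (the class and field names are mine, purely for illustration):

from dataclasses import dataclass

@dataclass(frozen=True)
class AbstractCodenetEntry:
    """One (abstract codelet, count-interval, time-interval) triple.

    Reading: some codelet instantiating `pattern` exists with count between
    min_count and max_count, during the interval from start to end
    (offsets from the codenet's initiation)."""
    pattern: object      # an abstract codelet, i.e. a codelet with VariableNodes
    min_count: int
    max_count: int
    start: float
    end: float

# An abstract codenet is then simply a set of such entries:
# {AbstractCodenetEntry(pattern=C, min_count=1, max_count=5, start=0.0, end=10.0), ...}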

Note that in order to form useful abstractions from codenets involving different codelets, it will be useful to be able to represent codelets in some sort of normal form, so that comparison of different codelets is tractable and meaningful.  This suggests that having the codelet language support some sort of Elegant Normal Form, similar to the Reduct library used in OpenCog’s MOSES subsystem, will be valuable.

Using the Pattern Miner to search for abstract codenets with high degrees of informational surprisingness should be a powerful way to drive algorithmic chemistry in interesting directions.   Once one has found abstract codenets that appear to systematically yield high surprisingness, one can then use these to drive probabilistic generation of concrete codenets, let them evolve in the algorithmic soup, and see what they lead to.

Furthermore, once one has abstract codenets with apparent value, one can apply probabilistic inference to generalize from these codenets, using deductive, inductive and abductive reasoning, e.g. using OpenCog’s Probabilistic Logic Networks.   This can be used to drive additional probabilistic generation of concrete codenets to be tried out.

“Mutation” and “crossover” of codenets or codelets can be experimented with on the inferential level as well – i.e. one can ask one’s inference engine to estimate the likely interestingness of a mutation or crossover of observed codenets or codelets, and then try out the mutation or crossover products that have passed this “fitness estimation” test.

This kind of pattern mining and inference certainly will be far from trivial to get right.   However, conceptually, it seems a route with a reasonably high probability of surmounting the fundamental difficulties faced by earlier work in artificial life and algorithmic chemistry.   It is something conceptually different than “mere self-organization” or “mere logical reasoning” – it is Alife/Achem-style self-organization and self-modification, but explicitly guided by abstract pattern recognition and reasoning.   One is doing symbolic AI to accelerate and accentuate the creativity of subsymbolic AI.

The above pertains to the case where one is purely trying to create algorithmic soups with interesting internal dynamics and structures.   However, it applies also in cases where one is trying to use algorithmic chemistry to learn effective input-output functions according to some fitness criteria.   In that case, after doing pattern mining of surprising abstract codenets, one can ask a different question: Which codenets, and which combinations thereof, appear to differentially lead to high-fitness transformation of inputs into outputs?   One can then generate new codenets from the distribution obtained via answering this question.   This is an approach to solving the “assignment of credit” problem from a “God’s eye” point of view – by mining patterns from the network over time … a fundamentally different approach to assignment of credit than has been taken in subsymbolic AI systems in the past.

Desirable Properties of a Cogistry Codelet Language

Designing the right language for the above general “Cogistry” approach is a subtle task, and I won’t try to do so fully here.   I’ll just sketch some ideas and possible requirements.

Fontana’s original algorithmic chemistry work uses a variety of LISP, which seems a sound and uncontroversial choice (and the same choice we made in OpenCog’s MOSES GA-EDA tool, for example).   However, a few variations and additions to this basic LISP-ish framework seem potentially valuable:

  • Addition of unordered sets as primitives, alongside ordered lists; and then associated set operations
  • Addition of float-weighted lists and sets as primitives; and of basic vector operations acting on such lists, such as the dot product, and the multiplication of a vector by a scalar.

Float-weighted lists are very handy for dealing with perceptual data, for example.   They also provide an element of continuity, which may help with robustness.   Codelets relying on float vectors of weights can be modified slightly via modifying the weights, leading to codelets with slightly different behaviors – and this continuity may make learning of new codelets via sampling from distributions implied by abstract codenets easier.

Further, it seems to me we may want to make all operations reversible.  If the atomic operations on bits, ints and floats are reversible, then corresponding operations on lists and sets and weighted vectors can also easily be made reversible.  (For instance, removing the final element of a list can be made into a reversible operation by, instead, using an operation that splits a list into two parts: the list with the final element removed, and the final element itself.)   The intuition here is that reversibility introduces a kind of “conservation of information” into the system, which should prevent the advent of various sorts of pathological runaway dynamics like Fontana observed in his early simulations.   If codelets can produce “more” than they take in, then evolution will naturally lead to codelets that try to exploit this potential and produce more and more and more than they take in.   But if codelets have to be “conservative” and essentially act only by rearranging their inputs, then they have to be cleverer to survive and flourish, and are more strongly pushed to create complexly interlocking self-organizing structures.
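The list-splitting example looks like this in Python (in an actual codelet language these would be primitive, built-in operations rather than library functions):

def split_last(lst):
    """Reversible counterpart of 'remove the last element': both parts are
    returned, so no information is destroyed."""
    return lst[:-1], lst[-1]

def unsplit_last(prefix, last):
    """Inverse of split_last: reassembles the original list."""
    return prefix + [last]

assert unsplit_last(*split_last([1, 2, 3])) == [1, 2, 3]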

I’m well aware that mining patterns among functions of float variables is difficult, and it would be easier to restrict attention to discrete variables – but ultimately I think this would be a mistake.   Perceptual data seems to be very naturally represented in terms of float vectors, for example.   Perhaps an innovative approach will be needed here, e.g. instead of floats one could use confidence intervals (x% chance of lying in the interval (L,U) ).   Reversible division on confidence intervals would require a bit of fiddling to work out, but seems unlikely to be fundamentally difficult.

Whoaaaoowwawwww!!

The idea of Cogistry seemed very simple when I initially thought of it; but when I sat down to write this post summarizing it, it started to seem a lot more complicated.  There are a lot of moving parts!   But, hey, nobody said building a thinking machine and creating digital life in one fell swoop was going to be easy….   (Well actually OK, some people did say that… er, actually, I did say that, but I took it back a decade and a half ago!! …)

What I like about the idea of Cogistry, though, is that – if it works (and I’m sure it will, after a suitable amount of research and fiddling!) — it provides a way to combine the fabulous generative creativity of biological systems, with the efficiency and precision of digital-computer-based pattern mining and probabilistic reasoning.   Such a combination has the potential to transcend the “symbolic versus subsymbolic” and “biological versus computational” dichotomies that have plagued the AI field since nearly its inception (and that indeed reflect deeper complex dichotomies and confusions in our contemporary culture with its mixture of information-age, machine-age, and biological/emotional/cultural-humanity aspects).   Some of the details are gonna be freakin’ complicated to work out but, though I realize it sounds a bit vain or whatever, I have to say I feel there is some profound stuff lurking here…

Notes Toward a Possible Development Plan

At first blush, it seems to me that most of the hard work here is either

  • A) testing and experimenting; or
  • B) re-using OpenCog AI tools that we need anyway for other reasons

From this perspective, an approach to making Cogistry real would be to start by

  • designing the programming language (with room for experimentation/variation) for codelets
  • implementing the basic framework in OpenCog
  • doing some basic experimentation with this as an Alife/Achem framework, without the added AI bells and whistles

Now, I would not expect this initial work to yield great results… since basically it’s a matter of reimplementing good old Alife/Achem stuff in a context where inference, pattern mining, ECAN etc. can be layered on.  Without the layering on of these AI tools, one would expect to find familiar Alife-y issues: some interesting structures emerging, but hitting a complexity ceiling … and then being uncertain whether increased scale or a change to the codelet language might be the key to getting more interesting things to emerge.

But beyond this basic framework, the other things needed for Cogistry are all things needed for other OpenCog AGI work anyway:

  • more general use of pattern mining than has been done so far
  • PLN on output of pattern mining
  • ECAN working together w/ pattern mining and PLN
  • probabilistic programming for guiding the forward chainer

With the basic codelet-system framework in place, using these things for Cogistry alongside their other uses would be "straightforward in principle".

– Thanks are due to Cassio Pennachin and Zar Goertzel for comments on an earlier version of the text contained in this post.


Smart Contract Blockchains

Blockchains and smart contracts are all the rage, these days. What does this have to do with OpenCog?  Let me explain.

TL;DR:

The idea of a block-chain comes from the idea of block ciphers, where you want to securely sign (or encrypt) some message by chaining together blocks of data, in such a way that each prior encrypted block provides "salt" or a "seed" for the next block. Both bitcoin and git use block-chaining to provide cryptographic signatures authenticating the data that they store.  Now, git stores big blobs of ASCII text (aka "source code"), while bitcoin stores a very simple (and not at all general) general ledger.  Instead of storing text-blobs, like git, or storing an oversimplified financial ledger, like bitcoin, what if, instead, we could store general structured data?  Better yet: what if it was tuned for knowledge representation and reasoning? Better still: what if you could store algorithms in it, that could be executed? Put all of these things together, and you've got exactly what you need for smart contracts: a "secure", cryptographically-authenticated general data store with an auditable transaction history.  Think of it as internet-plus: a way of doing distributed agreements, world-wide.  It has been the cypher-punk day-dream for decades, and it may now be within reach. The rest of this essay unpacks these ideas a bit more.

Git vs. Bitcoin

When I say "git, the block-chain", I'm not joking or misunderstanding: I mean it.  Bitcoin takes the core idea of git, and adds a new component: incentives to provide an "Acked-by" or a "Signed-off-by" line, which git does not provide: with git, people attach Ack and Sign-off lines only to increase their personal status, rather than to accumulate wealth.  What is more, git does NOT handle acked-by/signed-off-by in a cryptographic fashion: it is purely manual; Torvalds or Andrew Morton or the other maintainers accumulate these, and they get added manually to the block chain, by cut-n-paste from email into the git commit message.

Some of the key differences between git and bitcoin are:

  • Bitcoin handles acked-by messages automatically, not manually, and they accumulate as distinct crypto signatures on the block-chain.  By contrast, the git process is to cut-n-paste the acked-by sigs from the email into the commit, and only then crypto-sign.  A modern block-chain API should provide this automated acked-by handling that git does not.
  • Bitcoin provides (financial) incentives for people to generate acked-by messages: it does so through mining.  Unfortunately, mining is incredibly stupid and wasteful: mining is designed to use immense amounts of cpu time before a new bitcoin is found.  This stupidity is there to balance a human weakness, by appealing to human greed: the ability to get "free money" in exchange for the crypto-signed acked-by's.  A modern block-chain API does NOT need to support mining, except maybe as an add-on feature so that it can say "hey, me too".  Keeping up with the Joneses.

For the things that I am interested in, I really don't care about the mining aspect of blockchains. It's just stupid.  Git is a superior block-chain to bitcoin.  It's got more features, it's got a better API, it offers consistent histories — that is, merging! Which bitcoin does not.  Understandably — bitcoin wants to prevent double-spending. But there are other ways to avoid double-spending than to force a single master. Git shows the way.

Now, after building up git, it also has a lot of weaknesses: it does not provide any sort of built-in search or query.  You can say “git log” and view the commit messages, but you cannot search the contents: there is no such feature.

Structured Data

Git is designed for block-chaining unstructured ASCII (utf8) blobs of character strings — source code, basically — it started life as a source-code control system.  Let's compare that to structured data. In the 1960s, the core concepts of relations and relational queries got worked out: the meaning of "is-a", "has-a", "part-of", "is-owned-by", etc. The result of this research was the concept of a relational database, and a structured query language (SQL) to query that structured data.  Businesses loved SQL, and Oracle, Sybase and IBM DB2 boomed in the 1970s and 1980s, because the concept of relational data fit very well with the way that businesses organize data.

Let's compare SQL to bitcoin: in bitcoin, there is only ONE relational table, and it is hard-coded. It can store only one thing: a quantity of bitcoin.  There is only one thing you can do to that table: add or remove bitcoin. That's it.

In SQL, the user can design any kind of table at all, to hold any kind of data. Complete freedom.  So, if you wanted to implement block-chained smart contracts, that is what you would do: allow the user to create whatever structured data they might want.  For example: every month, the baker wants to buy X bags of flour from the miller for Y dollars: this is not just a contract, but a recurring contract: every month, it is the same.  To handle it, an SQL architect designs an SQL table to store dollars, bags of flour, and multiple datestamps: the datestamp of when the order was made, the datestamp of when the order was given to the shipping firm (who then crypto-signs the block-chain of this transaction), the datestamp of when the baker received the flour, and the datestamp of when the baker paid the miller.  Each of these lives on the block-chain, and each gets crypto-signed when the transaction occurs.
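For concreteness, here is a minimal sketch of what such a table might look like, using Python's built-in sqlite3 module.  The table and column names are invented; a real smart-contract ledger would also attach the parties' cryptographic signatures to each timestamp.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE flour_contract (
        order_id       INTEGER PRIMARY KEY,
        bags_of_flour  INTEGER NOT NULL,
        price_dollars  REAL    NOT NULL,
        ordered_at     TEXT,   -- when the baker placed the order
        shipped_at     TEXT,   -- when the miller handed it to the shipper
        received_at    TEXT,   -- when the baker received the flour
        paid_at        TEXT    -- when the baker paid the miller
    )""")
conn.execute(
    "INSERT INTO flour_contract (bags_of_flour, price_dollars, ordered_at) "
    "VALUES (?, ?, ?)", (20, 350.0, "2016-06-01"))
conn.commit()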

The SQL architect was able to design the data table in such a way that it is NATURAL for the purchase-ship-sell, inventory, accounts-payable, accounts-receivable way that this kind of business is conducted.

There are far more complicated transactions, in the petroleum industry, where revenue goes to pipeline owners, well owners, distillers, etc. in a very complicated process. Another example is music-industry royalties.  Both of these industries use a rather complex financial ledger system that resembles financial derivatives, except that there is no futures-contract structure to it: the pipeline owner cannot easily arbitrage the petroleum distiller. Anyway, this is what accounting programs and general ledgers excel at: they match up with the business process, and the reason they can match up is that the SQL architect can design the database tables so that they fit well with the business process.

If you want to build a blockchain-based smart contract, you need to add structured data to the block-chain.  So this is an example of where git falls flat: it's an excellent block-chain, but it can only store unstructured ASCII blobs.

Comparing git to SQL: git is also missing the ability to perform queries, but of course: the git data is unstructured, so queries are hard or impossible by nature. A smart-contract block-chain MUST provide a query language!  Without that, it is useless. Let me say it again: SQL is KEY to business contracts.  If you build a blockchain without SQL-like features in it, it will TOTALLY SUCK. The world does not need another bitcoin!

I hope you have followed me so far.

The AtomSpace

OK, now we are finally almost at where OpenCog is.  So: the idea of relational data and relational databases was fleshed out in the 1960s and the 1970s, and it appears to be enough for accounting.   However, it is not enough for other applications, in two different ways.

First, for "big data", it is much more convenient to replace SQL and ACID with NoSQL and BASE. The Google MapReduce system is a prime example of this.  It provides a highly distributed, highly parallelizable query mechanism for structured data.   Conclusion: if you build a block-chain for structured data, but use only SQL-type PRIMARY KEYs for your tables, it will fail to scale to big-data levels.  Your block-chain needs to support both SQL and NoSQL.  The good news is that this is a "solved problem": it is known that these are category-theoretic duals; there is a famous Microsoft paper on this: Erik Meijer and Gavin Bierman, "A co-Relational Model of Data for Large Shared Data Banks", ACM Queue, Volume 9, Issue 3, March 18, 2011. Contrary to popular belief, SQL and NoSQL are really just two sides of the same coin.

Next problem: for the task of "knowledge representation" (ontology, triple-stores, OWL, SPARQL) and "logical reasoning", the flat tables and structures offered by SQL/NoSQL are insufficient — it turns out that graph databases are much better suited for this task. Thus, we have the concept of a graph database; some well-known examples include Neo4j, TinkerPop, etc.

The OpenCog AtomSpace fits into this last category.  Here, the traditional 1960s-era "is-a" relation corresponds to the OpenCog InheritanceLink.  Named relations (such as "Billy Bob is a part-time employee" in an SQL table) are expressed using EvaluationLinks and PredicateNodes:

(EvaluationLink
    (PredicateNode "is-employee")
    (ListLink
        (ConceptNode "BillyBob")
        (ConceptNode "employee")))

It's a bit verbose, but it is another way of expressing the traditional SQL relations.  It is somewhat NoSQL-like, because you do not have to declare an "is-employee" table in advance, the way you do in SQL — there is no "hoisting" — instead, you can create new predicates dynamically, on the fly, at any time.

OpenCog has a centralized database, called the AtomSpace. Notice how the above is a tree, and so the AtomSpace becomes a "forest of trees". In the atomspace, each link or node is unique, and so the trees share nodes and links: this is called a "Levi graph" and is a general bipartite way of representing hypergraphs.  So, the atomspace is not just a graph database, it's a hypergraph database.

Edits to this database are very highly regulated and centralized, so there is a natural location where a blockchain signature could be computed: every time an atom is added or removed, that is a good place to hash the atomspace contents and apply a signature.
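To make the "hash on every edit" idea concrete, here is a minimal Python sketch of hash-chaining AtomSpace edits.  It uses only the standard hashlib and json modules; the edit format is invented, and real public-key signatures are omitted.

import hashlib
import json

def chain_edit(prev_hash, edit):
    """Append one edit to a hash chain.

    edit: a plain dict describing the change (e.g. which atom was added or
    removed).  Each link's hash covers both the edit and the previous hash,
    so tampering with any earlier edit invalidates everything after it."""
    payload = json.dumps({"prev": prev_hash, "edit": edit}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

genesis = chain_edit("0" * 64, {"op": "init"})
h1 = chain_edit(genesis, {"op": "add", "atom": '(ConceptNode "BillyBob")'})
h2 = chain_edit(h1, {"op": "remove", "atom": '(ConceptNode "BillyBob")'})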

The atomspace does NOT have any sort of history-of-transactions (we have not needed one, yet).  We are (actually, Nil is) working on something similar, though, called the "backwards-inference tree", which is used to store a chain of logical deductions or inferences that get made.   It's kind of like a transaction history, but instead of storing any kind of transaction, it only stores those transactions that can be chained together to perform a forward-chained logical deduction.  Because each of these deductions leads to yet another deduction, this is also a natural location to perform crypto block-chaining.  That is, if some early inference is wrong or corrupted, all later inferences become invalid – that is the chaining.  So we chain, but we have not needed crypto signatures on that chain.

The atomspace also has a query language, called the "pattern matcher".  It is designed to search only the current contents of the database.  I suppose it could be extended to search the transaction history.  The backward-inference-tree chains were designed by Nil to be explicitly compatible with the pattern matcher.

The AtomSpace is a typed graph store, and some of the types are taken from predicate logic: there is a boolean AndLink, boolean OrLink, a boolean NotLink; but also an intuitionist-logic ChoiceLink, AbsentLink, PresentLink, and to round it out, a Kripke-frame ContextLink (similar to a CYC “microtheory” but much, much better). The reason I am mentioning these logic types is because they are the natural constructor types for smart contracts: in a legal contract, you want to say “this must be fulfilled and this or this but not this”, and so the logical connectives provide what you need for specifying contractual obligations.

Next, the AtomSpace has LambdaLinks, which implement the lambda abstractor from lambda calculus.  This enables generic computation: you need this for smart contracts. The AtomSpace is NOT very strong in this area, though: it provides a rather basic computational ability with the LambdaLink, but it is very primitive, and does not go much farther.  We do some, but not a lot of, computation in the AtomSpace.  It was not meant to be the kind of programming language that humans would want to code in.

The atomspace does NOT have any lambda flow in it, e.g. Marius Buliga’s ChemLambda.  I am still wrestling with that. The atomspace does have a distinct beta-reduction type, called PutLink, dual to the LambdaLink abstractor.  However, for theorem-proving, I believe that a few more abstractors are needed: Buliga has four: lambda and beta, and two more.  I am also trying to figure out Jean-Yves Girard’s Ludics.  Not there, yet.

Security, Scalability

Perhaps I failed to mention: the current AtomSpace design has no security features in it, whatsoever. Absolutely zero. Even the most trivial hostile action will wipe out everything.  There is a reason for this: development is focused on reasoning and thinking. Also, the current atomspace is not scalable.  It's also rather low-performance. It's unsuitable for big data. None of the checkboxes that many people look for are satisfied by the atomspace. That's because these issues are, for this project, quite low priority. We are focused on reasoning and understanding, and just about nothing else.

So, taken at face value, it is absurd to contemplate a blockchain for the atomspace; without even basic security, decentralized distributed storage, byzantine fault tolerance, or high performance, it's a non-starter for serious consideration.  Can these checkboxes be added to the atomspace, someday? Maybe. Soon? Not at all likely.  These are nice-to-haves, but opencog's primary focus must remain reasoning and thinking, not scalable, distributed, secure storage.

Conclusion

So that's it, then: you can think of the OpenCog atomspace as a modern-day graphical, relational database that includes the datalog fragment of prolog, and lots of other parts as well.  It has an assortment of weaknesses and failures, which I know of, but won't get into here. It is probably a decent rough sketch of the data storage system that you'd want for a block-chained smart contract.  To make it successful, you would need to do a whole lotta things:

  • First, you’d need to actually get a real-life example of a smart-contract that people would want to have.
  • Next, you’d have to build a smart-phone app for it.
  • Next, it would have to appeal to some core class of users: I dunno — package tracking for UPS, or the collection of executive signatures by some legal dept, or maybe some accounts-receivable system that some billing dept. would want to use. There has got to be a hook: people have to want to use it.  It needs some magic ingredient.

The problem here is that, as a business, companies like IBM and PwC will trounce you at the high end, because they already have the business customers, and the IBM STSMs are smart enough to figure out how block-chains work, and so will get some architects to create that kind of system for them.  At the low end, there must be thousands of twenty-something programmers writing apps for cell phones, daydreaming of being the next big unicorn, and they are all exploring payment systems and smart-cards and whatever, at a furious pace.  So if you really want a successful block-chain, smart-contract business, here, hold on to your butt.

I think that the only hope is to go open source, work with the Apache foundation, have them do the marketing for the AtomSpace or something like it, and set up API’s that people want to use.   That’s a lot of work.  But that is the way to go.


Many Worlds: Reasoning about Reasoning

When one reasons about the world, what is one actually doing? This post is about that.

OpenCog has a reasoning system, called PLN, short for "Probabilistic Logic Networks".  It's actually two things: first and foremost, it's a set of "rules of inference", which can be applied to "real world knowledge" to deduce new "facts" about the world.  There are about half a dozen of these rules, and one of them resembles the classical "Modus Ponens", except that it assigns a probability to the outcome, based on the probabilities of the inputs.  For the rest of this post, the details of PLN mostly don't matter: if you wish, you can think of classical propositional logic, or of some kind of fuzzy logic, or even of competing systems such as NARS.  Anyway, PLN applies these rules of inference to the Atoms contained in the AtomSpace, to generate new Atoms. This is a fancy way of saying that the AtomSpace is the knowledge repository in OpenCog, and that the atoms are the "facts". It's not much more than that: it's just a big jumble of facts.

I want to talk about reasoning using PLN.  Now, this is NOT the way that the current opencog code base implements PLN reasoning; however, it's a conceptual description of what it could (or should, or might) do.

Now, I mentioned that PLN consists of maybe a half-dozen or a dozen rules of inference.  They have fancy names like “modus ponens” but we could call them just “rule MP” … or just “rule A”, “rule B”, and so on.

Suppose I start with some atomspace contents, and apply PLN rule A. As a result of this application, we have a "possible world 1".  If, instead, we started with the same original atomspace contents as before, but applied rule B, then we would get "possible world 2".  It might also be the case that PLN rule A can be applied to some different atoms from the atomspace, in which case we get "possible world 3".

Each possible world consists of a triple (some subset of the atomspace, some PLN inference rule, the result of applying the PLN rule to the input). Please note that some of these possible worlds are invalid or empty: it might not be possible to apply the chosen PLN rule to the chosen subset of the atomspace.  I guess we should call these "impossible worlds".  You can say that their probability is exactly zero.

Observe that the triple above is an arrow: the tail of the arrow is "some subset of the atomspace", the head of the arrow is "the result of applying PLN rule X", and the shaft of the arrow is given a name: it's "rule X". (In fancy-pants, peacock language, the arrows are morphisms, and the slinging-together here gives Kripke frames. But let's avoid the fancy language, since it confuses things a lot. Just know that it's there.)

Anyway — considering this process, it clearly results in a very shallow tree, with the original atomspace as the root, and each branch of the tree corresponding to a possible world.  Note that each possible world is a new and different atomspace: the rules of the game here are that we are NOT allowed to dump the results of the PLN inference back into the original atomspace.  Instead, we MUST fork the atomspace.  Thus, if we have N possible worlds, then we have N distinct atomspaces (not counting the original, starting atomspace).  This is very different from what the PLN code base does today: it currently dumps its results back into the original atomspace. But, for this conceptual model, we don't want to do that.
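Here is a toy Python sketch of that forking discipline: each rule application copies the atomspace rather than mutating it.  The facts-as-strings encoding and the toy modus-ponens-like rule are purely illustrative stand-ins, not PLN.

def expand(atomspace, rules):
    """Generate the possible worlds reachable in one inference step.

    atomspace: a frozenset of 'facts'.
    rules:     functions that, given the atomspace, yield (premises, conclusion)
               pairs.  Each application forks a fresh copy instead of mutating
               the original; worlds where no rule can fire simply never appear."""
    worlds = []
    for rule in rules:
        for premises, conclusion in rule(atomspace):
            forked = set(atomspace)      # fork: copy, don't mutate
            forked.add(conclusion)
            worlds.append((frozenset(premises), rule, frozenset(forked)))
    return worlds

def toy_mp(atomspace):
    # A modus-ponens-like rule over string facts of the form "A" and "A->B".
    for fact in atomspace:
        if "->" in fact:
            a, b = fact.split("->", 1)
            if a in atomspace and b not in atomspace:
                yield ({a, fact}, b)

worlds = expand(frozenset({"rain", "rain->wet"}), [toy_mp])
# -> one possible world whose atomspace also contains "wet"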

Now, for each possible world, we can apply the above procedure again. Naively, this is a combinatoric explosion. For the most part, each possible world will be different from the others. They will share a lot of atoms in common, but some will be different. Note, also, that *some* of these worlds will NOT be different, but will converge, or be "confluent", arriving at the same atomspace contents along different routes.  So, although, naively, we have a highly branching tree, it should be clear that sometimes some of the branches come back together again.

I already pointed out that some of the worlds are “impossible” i.e. have a probability of zero. These can be discarded.  But wait, there’s more.  Suppose that one of the possible worlds contains the statement “John Kennedy is alive” (with a very very high confidence) , while another one contains the statement “John Kennedy is dead” (with a very very high confidence).  What I wish to claim is that, no matter what future PLN inferences might be made, these two worlds will never become confluent.

There is also a different effect: during inferencing (i.e. the repeated application of PLN), one might find oneself in a situation where the atoms being added to the atomspace, at each inference step, have lower and lower probability. At some point, this suggests that one should just plain quit — that particular branch is just not going anywhere. It's a dead end. A similar situation occurs when no further PLN rules can be applied. Dead end.

OK, that’s it.  The above provides a very generic description of how inferencing can be performed. It doesn’t have to be PLN — it could be anything — classical logic using sequent calculus, for example.  So far, everything I said is very easy-peasy, direct and straightforward. So now is where the fun starts.

First (let's get it out of the way now): the above describes *exactly* how Link Grammar works.  For "atomspace" substitute "linkage", and for "PLN rule of inference" substitute "disjunct".  That's it. End of story. QED.

Oh, I forgot to introduce Link Grammar.  It is a system for parsing natural languages, such as English.  It does this by maintaining a dictionary of so-called "disjuncts", which can be thought of as "jigsaw puzzle pieces".  The act of parsing requires finding and joining together the jigsaw pieces into a coherent whole.  The final result of the act of parsing is a linkage (a parse is a linkage – same thing).  These jigsaw puzzle pieces are nicely illustrated in the very first paper on Link Grammar.

Notice that each distinct linkage in link-grammar is a distinct possible-world. The result of parsing is to create a list of possible worlds (linkages, aka “parses”).  Now, link-grammar has a “cost system” that assigns different probabilities (different costs) to each possible world: this is “parse ranking”: some parses (linkages) are more likely than others. Note that each different parse is, in a sense, “not compatible” with every other parse.  Two different parses may share common elements, but other parts will differ.

Claim: the link-grammar is a closed monoidal category, where the words are the objects, and the disjuncts are the morphisms. I don't have the time or space to articulate this claim, so you'll have to take it on faith, or think it through, or compare it to other papers on categorial grammar or maybe pregroup grammar. There is a nice example from Bob Coecke showing the jigsaw-puzzle pieces.  You can see a similar story develop in John Baez's "Rosetta Stone" paper, although the jigsaw pieces are less distinctly illustrated.

Theorem: the act of applying PLN, as described above, is a closed monoidal category. Proof:  A “PLN rule of inference” is, abstractly, exactly the same thing as a link-grammar disjunct. The contents of the atomspace is exactly the same thing as a (partially or fully) parsed sentence.  QED.

There is nothing more to this proof than that.  I mean, it could be fleshed out in much greater detail, but that's the gist of it.

Observe two very important things:  (1) During the proof, I never once had to talk about modus ponens, or any of the other PLN inference rules.  (2) During the proof, I never had to invoke the specific mathematical formulas that compute the PLN “TruthValues” — that compute the strength and confidence.   Both of these aspects of PLN are completely and utterly irrelevant to the proof.  The only thing that mattered is that PLN takes, as input, some atoms, and applies some transformation, and generates atoms. That’s it.

The above theorem is *why* I keep talking about possible worlds and Kripke-blah-blah and intuitionistic logic and linear logic. It's got nothing to do with the actual PLN rules! The only thing that matters is that there are rules, which get applied in some way.  The generic properties of linear logic etc. are the generic properties of rule systems and Kripke frames. Examples of such rule systems include link-grammar, PLN, NARS, classical logic, and many more.  The details of the specific rule system do NOT alter the fundamental process of rule application, aka "parsing", aka "reasoning", aka "natural deduction", aka "sequent calculus".    In particular, it is a category error to confuse the details of PLN with the act of parsing: the logic that describes parsing is not PLN, and PLN does not describe parsing; it's an error to confuse the two.

Phew.

What remains to be done:  I believe that what I describe above, the “many-worlds hypothesis” of reasoning, can be used to create a system that is far more efficient than the current PLN backward/forward chainer.  It’s not easy, though: the link-parser algorithm struggles with the combinatoric explosion, and has some deep, tricky techniques to beat it down.  ECAN was invented to deal with the explosion in PLN.  But there are other ways.

By the way: the act of merging the results of a PLN inference back into the original atomspace corresponds, in a very literal sense, to a "wave function collapse". As long as you keep around multiple atomspaces, each containing partial results, you have "many worlds", but every time you discard or merge some of these atomspaces back into one, it's a "collapse".  That includes some of the truth-value merge rules that currently plague the system. To truly understand these last three sentences, you will, unfortunately, have to do a lot of studying. But I hope this blog post provides a good signpost.


Putting Deep Perceptual Learning in OpenCog

This post presents some speculative ideas and plans, but I broadcast them here because I think they are of particular strategic importance for the OpenCog project….
The topic is: how OpenCog and “current-variety deep learning perception algorithms” can help each other.
Background: Modern Deep Learning Networks

“Deep learning” architectures have worked wonders on visual and auditory data in recent years, and have also shown limited interesting results on other sorts of data such as natural language.   The impressive applications have all involved training deep learning nets using a supervised learning methodology, on large training corpora; and the particulars of the network tend to be specifically customized to the problem at hand.   There is also work on unsupervised learning, though so far purely unsupervised learning has not yielded practically impressive results.  There is not much new conceptually in the new deep learning work, and nothing big that’s new mathematically; it’s mostly the availability of massive computing power and training data that has led to the recent, exciting successes…
These deep learning methods are founded on broad conceptual principles, such as
  • intelligence consists largely of hierarchical pattern recognition — recognition of patterns within patterns within patterns
  • a mind should use both bottom-up and top-down dynamics to recognize patterns in a given data-item based on its own properties and based on experience from looking at other items
  • in many cases, the dimensional structure of spacetime can be used to guide hierarchical pattern recognition (so that patterns higher-up in the hierarchy pertain to larger regions of spacetime)
However, the tools normally labeled “deep learning” these days constitute a very, very particular way of embodying these general principles, using certain sorts of “formal neural nets” and related structures.  There are many other ways to manifest the general principles of “hierarchical pattern recognition via top-down and bottom-up learning, guided by spatiotemporal structure.”
The strongest advocates of the current deep learning methods claim that the deep networks currently used for perception, can be taken as templates or at least close inspirations for creating deep networks to be used for everything else a human-level intelligence needs to do.  The use of human-labeled training examples obviously doesn’t constitute a general-intelligence-capable methodology, but if one substitutes a reinforcement signal for a human label, then one has an in-principle workable methodology.
Weaker advocates claim that networks such as these may serve as a large part of a general intelligence architecture, but may ultimately need to be augmented by other components with (at least somewhat) different structures and dynamics.
It is sometimes suggested that the “right” deep learning network might serve the role of the “one crucial learning algorithm” underlying human and human-like general intelligence.   However, the deep learning paradigm does not rely on this idea… it might also be that a human-level intelligence requires a significant number of differently-configured deep networks, connected together in an appropriate architecture.
Deep Learning + OpenCog

My own intuition is that, given the realities of current (or near future) computer hardware technology, deep learning networks are a great way to handle visual and auditory perception and some other sorts of data processing; but that for many other critical parts of human-like cognition, deep learning is best suited for a peripheral role (or no role at all).   Based on this idea, Ted Sanders, Jade O’Neill and I did some prototype experiments a few years ago, connecting a deep learning vision system (DeSTIN) with OpenCog via extracting patterns from DeSTIN states over time and importing relations among these patterns into the OpenCog Atomspace.   This prototype work served to illustrate a principle, but did not represent a scalable methodology (the example dataset used was very small, and the different components of the architecture were piped together using ad hoc specialized scripts).
I’ve now started thinking seriously about how to resume this direction of work, but “doing it for real” this time.   What I’d like to do is build a deep learning architecture inside OpenCog, initially oriented toward vision and audition, with a goal of making it relatively straightforward to interface between deep learning perception networks and OpenCog’s cognitive mechanisms.
What cognitive mechanisms am I thinking of?
  1. The OpenCog Pattern Miner, written by Shujing Ke (in close collaboration with me on the conceptual and math aspects), can be used to recognize (frequent or surprising) patterns among states of a deep learning network — if this network’s states are represented as Atoms.   Spatiotemporal patterns among these “generally common or surprising” patterns may then be recognized and stored in the Atomspace as well. Inference may be done, using PLN, on the links representing these spatiotemporal patterns.  Clusters of spatiotemporal patterns may be formed, and inference may be done regarding these clusters.
  2. Having recognized common patterns within a set of states of a deep network, one can then annotate new deep-network states with the "generally common patterns" that they contain.   One may then use the links known in the Atomspace regarding these patterns, to create new *features* associated with nodes in the deep network.  These features may be used as inputs for the processing occurring within the deep network (a crude sketch of this annotation step follows the list).
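As a crude illustration of these two steps, the following Python sketch stands in for the Pattern Miner with a simple frequent-pattern count over binarized network states, and then turns the mined patterns into extra feature inputs for a new state.  The thresholds and data here are made up.

from collections import Counter
from itertools import combinations

def mine_common_patterns(states, threshold=0.3, size=2):
    """Crude Pattern Miner stand-in: find small sets of units that are
    frequently active together across a corpus of binarized network states
    (each state is a set of active unit ids)."""
    n = len(states)
    counts = Counter()
    for state in states:
        for combo in combinations(sorted(state), size):
            counts[combo] += 1
    return [combo for combo, c in counts.items() if c / n >= threshold]

def annotate(state, patterns):
    """Feature vector for a new state: 1.0 if the state contains the pattern.
    These features could then be fed back into the network as extra inputs."""
    return [1.0 if set(p) <= state else 0.0 for p in patterns]

corpus = [{1, 2, 5}, {1, 2, 7}, {2, 5, 9}, {1, 2, 5}]
patterns = mine_common_patterns(corpus)
features = annotate({1, 2, 8}, patterns)   # -> [1.0, 0.0, 0.0]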
This would be a quite thorough and profound form of interaction between perceptual and cognitive algorithms.
This sort of interaction could be done without implementing deep learning networks in the Atomspace, but it will be much easier operationally if they are represented in the Atomspace.
A Specific Proposal
So I’ve put together a specific proposal for putting deep learning into OpenCog, for computer vision (at first) and audition.   In its initial version, this would let one build quite flexible deep learning networks in OpenCog, deferring the expensive number-crunching operations to the GPU via the Theano library developed by Yoshua Bengio’s group at U. Montreal.
As it may get tweaked and improved or augmented by others, I’ve put it at the OpenCog wiki site instead of packing it into this blog post… you can read it at


What is consciousness?

… and can we implement it in OpenCog?  I think we can.  It might not even be that hard!   Consciousness isn’t this magical pixie dust that it’s often made out to be.  I’d like to provide a sketch.

In order for machine intelligence to perform in the real world, it needs to create an internal model of the external world. This can be as trite as a model of a chessboard that a chess-playing algo maintains.  As information flows in from the senses, that model is updated; the current model is used to create future plans (e.g. the next move, for a chess-playing computer).

Another important part of an effective machine algo is “attentional focus”: so, for a chess-playing computer, it is focusing compute resources on exploring those chess-board positions that seem most likely to improve the score, instead of somewhere else. Insert favorite score-maximizing algo here.

Self-aware systems are those that have an internal model of self. Conscious systems are those that have an internal model of attentional focus.   I'm conscious because I maintain an internal model of what I am thinking about, and I can think about that, if I so choose. I can ask myself what I'm thinking about, and get an answer to that question, much in the same way that I can ask myself what my teenage son is doing, and sort-of get an answer to that (I imagine, in my mind's eye, that he is sitting in his room, doing his homework. I might be wrong.)    I can steer my attention the way I steer my limbs, but this is only possible because I have that internal model (of my focus, of my limbs), and I can use that model to plan, to adjust, to control.
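For what it's worth, here is a deliberately tiny Python sketch of the distinction being drawn: the agent's model of its own attentional focus is a separate thing from the focus itself, and it is the model that gets queried and used for steering.  Every name here is hypothetical.

class ReflectiveAgent:
    """Toy agent with (1) a world model, (2) an attentional focus, and
    (3) a model of its own focus that it can inspect and act on."""
    def __init__(self):
        self.world_model = {}      # facts about the external world
        self.focus = None          # what compute is currently being spent on
        self.focus_model = []      # the agent's own record of its focus

    def attend(self, topic):
        self.focus = topic
        self.focus_model.append(topic)   # self-observation of attention

    def what_am_i_thinking_about(self):
        # Answered by consulting the model of the focus, not the focus itself;
        # the model could in principle be stale or wrong.
        return self.focus_model[-1] if self.focus_model else None

    def steer(self, goal):
        # Use the self-model to decide whether to redirect attention.
        if self.what_am_i_thinking_about() != goal:
            self.attend(goal)

agent = ReflectiveAgent()
agent.attend("chess opening")
agent.steer("endgame")
print(agent.what_am_i_thinking_about())   # -> endgame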

So, can we use this to build an AGI?

Well, we already have machines that can add numbers together better than us, can play chess better than us, and apparently, can drive cars better than us.  Only the last can be said to have any inkling of self-awareness, and that is fairly minimal: just enough to locate itself in the middle of the road, and maintain a safe distance between it and obstacles.

I am not aware of any system that maintains an internal model of its own attentional focus (and then uses that model to perform prediction, planning and control of that focus). This, in itself, might not be that hard to do, if one set out to explicitly accomplish just that. I don’t believe anyone has ever tried it. The fun begins when you give such a system senses and a body to play with. It gets serious when you provide it with linguistic abilities.

I admit I'm not entirely clear on how to create a model of attentional focus when language is involved; I plan to think heavily on this topic in the coming weeks/months/years. At any rate, I suspect it's doable.

I believe that if someone builds such a device, they will have the fabled conscious, self-aware system of sci-fi. It's likely to be flawed, stupid, and psychotic: common-sense reasoning algorithms are in a very primitive state (among (many) other technical issues).  But I figure that we will notice, and agree that it's self-aware, long before it's intelligent enough to self-augment itself out of its pathetic state: I'm thinking it will behave a bit like a rabid talking dog: not a charming personality, but certainly "conscious", self-aware, intelligent, unpredictable, and dangerous.

To be charming, one must develop a very detailed model of humans, and what humans like, and how they respond to situations. This could prove to be quite hard.  Most humans can’t do it very well. For an AGI to self-augment itself, it would have to convince its human masters to let it tinker with itself.  Given that charm just might be a pre-requisite, that would be a significant challenge, even for a rather smart AGI.  Never mind that self-augmentation can be fatal, as anyone who’s overdosed on heroin might fail to point out.

I’m sure the military and certain darker political forces would have considerable interest in building a charming personality, especially if it’s really, really smart.  We already know that people can be charming and psychotic all at the same time; ethics or lack thereof is not somehow mutually exclusive of intelligence. That kind of a machine, unleashed on the world, would be … an existential threat.   Could end well, could end badly.

Anyway, I think that’s the outline of a valid course of research.  It leaves open some huge questions, but it does narrow the range of the project to some concrete and achievable goals.

Posted in Design, Theory | 43 Comments

The Relationship Between PLN Inference and Gibbs Sampling (Some Thought-Experiments)

This post describes some new thought-experiments regarding PLN, which have not yet been tested nor worked out mathematically in detail… Reader beware — there could be some mistakes here! But I think the ideas are interesting enough to be worth sharing….

These ideas are part of the same train of thought as the New PLN Design, currently being implemented bit-by-bit (and with interesting variations and deviations from the rough spec I just linked to) by Jade O’Neill and Ramin Barati. But this blog post contains new ideas not contained on that page.

Actually, I am unsure if I will end up recommending the ideas outlined here for implementation or not.   But even if not, I think they are interesting for the light they shed on what is going on with PLN conceptually and mathematically.

For one thing, on the theoretical side, I will outline here an argument why inference trails are ultimately unnecessary in PLN.   (They are needed in Pei Wang’s NARS system, from which PLN originally borrowed them; but this is because NARS is not probabilistic, so that the sorts of Gibbs sampling based arguments I outline here can’t be applied to NARS.)

Rough Summary / Prelude

Basically: in this post I will describe how to reformulate PLN inference so as (very broadly speaking) to make use of Gibbs sampling.   As Gibbs sampling is used in the standard approach to Markov Logic Networks, this also serves (among other more practical purposes) to make clearer the relationship between PLN and MLN.

Broadly speaking, the idea here is to have two different, interlocking levels of PLN inference, with different truth values and different dynamics associated with them:

  • a Gibbs sampling based layer, corresponding very roughly to shallow, massively parallel, “unconscious” inference (more like inference based “activation spreading”, to use a neural net metaphor)
  • a forward/backward chaining based layer, corresponding very roughly to “conscious”, deliberative inference

It seems possible that doing this might speed the convergence of a PLN network toward maximally intelligent conclusions based on the knowledge implicit in it.

Consideration of this possibility leads to an understanding of the relation between PLN dynamics and Gibbs sampling, which leads to an argument (at this stage, a sketch of a proof rather than a proof) that inference trails are not really needed in PLN.

Two preliminary notes before getting started:

  • The ideas given here are related, though far from identical, to the work by myself and Cassio Pennachin, reported in Section 3.1 of the paper “PLN and the Brain” from the proceedings of AGI-08:  ….
  • These ideas will make the most sense to the reader who knows the basic ideas of Gibbs sampling, and will make even more sense to readers who know about Markov Logic Networks.  Advanced knowledge of all the details and variations of these topics is not necessary, though.

Without further ado, I will now present two thought-experiments in PLN design: one fairly extreme, the other less so.

Thought-Experiment #1: PLN Inference via Gibbs Sampling on Distributional Truth Values

In this section I’ll describe a hypothetical way of doing PLN inference via Gibbs sampling.

Suppose that, instead of a single truth value, we let each PLN Atom have two truth values:

  • the current truth value (which we may call the “primary truth value”)
  • a new entity called the “instantaneous truth value,” which consists of a series of K values called the “sample distribution”

The sample distribution consists of a series of values that define the shape of a distribution.  For example, the template sample distribution might comprise K=5 values corresponding to the intervals [0, .2], [.2, .4], [.4, .6], [.6, .8], [.8, 1].  The values would be viewed as a step-function approximation to an underlying first-order probability distribution.

Next, the instantaneous truth values would be updated via Gibbs sampling. What I mean by this is a process in which the Atoms in the Atomspace are looped through, and when each Atom X is visited, its sampled strengths are replaced with the result of the following Gibbs-type update rule (a rough code sketch follows the list):

  1. Find all inference rules R that, in a single step from some set of premise Atoms existing in the Atomspace currently, would result in an estimate for the truth value of X
  2. Execute all the (rule, premise-set) pairs found in Step 1.   That is,
    1. for each pair, repeat the following process some number N of times: choose a specific value from the distribution comprising the instantaneous truth value for each premise, and draw a conclusion from these specific values.  This produces a truth value distribution for the conclusion.
    2. merge these distributions via revision (weighted averaging), obtaining an overall truth value distribution for the conclusion
  3. Replace the existing instantaneous truth value of X with (a discretized version of) the result of Step 2
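Here is the promised sketch of Steps 1–3, in Python.  The data structures and rule functions are stand-ins invented for this post (real PLN formulas also propagate confidence, which this ignores); it is meant only to make the update rule concrete, not to be an implementation.

import random

K = 5                        # number of bins in the sample distribution
BINS = [(i / K, (i + 1) / K) for i in range(K)]

def sample_from(dist):
    # Draw one strength value from a discretized distribution (a list of K bin weights).
    lo, hi = random.choices(BINS, weights=dist)[0]
    return random.uniform(lo, hi)

def discretize(samples):
    # Turn a list of strength values back into K bin weights.
    counts = [0] * K
    for s in samples:
        counts[min(int(s * K), K - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def gibbs_update(atom, applicable, instantaneous, n_samples=100):
    # Gibbs-type update rule for one Atom X.
    # 'applicable' is a list of (rule_fn, premise_atoms) pairs -- Step 1.
    # Each rule_fn maps premise strengths to a conclusion strength: a stand-in
    # for a PLN truth-value formula such as the deduction strength formula.
    per_pair = []
    for rule_fn, premises in applicable:                          # Step 2.1
        samples = [rule_fn(*[sample_from(instantaneous[p]) for p in premises])
                   for _ in range(n_samples)]
        per_pair.append(discretize(samples))
    if not per_pair:
        return
    # Step 2.2: merge the per-pair distributions via revision (plain averaging here)
    merged = [sum(d[i] for d in per_pair) / len(per_pair) for i in range(K)]
    instantaneous[atom] = merged                                  # Step 3

# Toy usage: A and B are premises for C, via a made-up "deduction-like" rule.
instantaneous = {"A": [0, 0, 0, 0.5, 0.5], "B": [0, 0, 0.2, 0.8, 0], "C": [0.2] * 5}
gibbs_update("C", [(lambda a, b: a * b, ["A", "B"])], instantaneous)
print(instantaneous["C"])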

The instantaneous truth value would then impact the primary truth value as follows:

Periodically (every N cycles), the primary truth value of X is revised with the instantaneous truth value of X

(i.e. the primary truth value is replaced with a weighted average of itself & the instantaneous truth value)

Note that one could vary on this process in multiple ways — e.g. via making the instantaneous truth value an imprecise or indefinite probability, or a second order probability distribution.   The above procedure is given as it is, more out of a desire for relative simplicity of presentation, than because it necessarily seems the best approach.

If nothing else besides this updating happened with the primary truth values of logical Atoms (and if the various logical relations in the Atomspace all possessed a consistent probabilistic interpretation in terms of some grounding) — then according to the theory of Gibbs sampling, each Atom would get a primary strength approximating its correct strength according to the joint distribution implicit in all the logical Atoms in the Atomspace.

(The above description, involved as it is, still finesses a bit of mathematical fancy footwork.)   It’s important to remember that, in spite of the Gibbs sampling, the PLN heuristic inference rules (which are derived using probability theory, but also various other heuristics) are being used to define the relationships between the variables (i.e. the truth value strengths of Atoms) in the network.

So the Gibbs sampling must be viewed as taking place, not on the variables (the Atom strengths) themselves, but on propositions of the form “the strength of Atom A lies in interval [x,y]”.   One can thus view the sampling as happening on a second-order probability distribution defined over the main probability distribution of strengths.

So the joint distribution on the truth value strength distributions in the PLN network, has to be calculated consistently with the results of the PLN probabilistic/heuristic inference rules.   If the PLN inference rules deviated far from probability theory, then the Gibbs sampling would result in a network that didn’t make sense as a probabilistic model of the world to which the variables in the network refer, but did make sense as a model of the relationship between the variables according to the PLN  inference rules.

This is pretty different from an MLN, because in an MLN the Gibbs sampling just has to find a distribution consistent with certain propositional logic relations, not consistent with certain heuristic uncertain truth value estimation functions.

Anyway: this sort of subtlety is the reason that the idea presented here is not “obvious” and hasn’t emerged in PLN theory before.

So then, if this were the only kind of inference dynamic happening in PLN, we could view PLN as something vaguely analogous to a second-order Markov Logic Network incorporating a wider variety of logical constructs (more general quantifier logic, intensional inference, etc.) via heuristic formulas.

However, the thought-experiment I am outlining in this section is not to have this kind of sampling be the only thing happening in PLN.   My suggestion is that in any new PLN, just like in the current and prior PLN, primary strengths may also be modified via forward and backward chaining inference. These inference methods do something different than the Gibbs-type updating mentioned above, because they add new logical links (and in some cases nodes) to the network.

This is vaguely comparable to how, in some cases, Gibbs sampling or message-passing in Markov Logic Networks has been coupled with Inductive Logic Programming.  ILP, vaguely similarly to PLN forward and backward inference, adds new logical links to a network. I.e., to use MLN / Bayes Nets terminology, both ILP and PLN chaining are concerned with structure building, whereas Gibbs sampling, message-passing and other comparable methods of probabilistic inference are concerned with calculating probabilities based on a given network structure.

Also note: If there is information coming into the system from outside PLN, then this information should be revised into the instantaneous truth values as well as the primary ones.  (This point was raised by Abram Demski in response to an earlier version of this post.) ….  And this leads to the interesting question of when, and to what extent, it is useful to revise the primary truth values back into the instantaneous truth values, based on the modifications of the primary truth values due to regular PLN forward and backward inference.

If we do both the Gibbs sampling suggested above and the traditional PLN chaining on the same network, what we have is a probabilistic network that is constantly adapting its structure (and a subset of its truth values) based on chains of inference rules, and constantly updating its truth values based on its structure according to Gibbs type (and vaguely MLN-ish) methods.

Note that the Gibbs sampling forms a consistent model of the joint distribution of all the Atoms in the Atomspace, without needing a trail-like mechanism. Clearly the Gibbs-type approach is much more like what could be realized in a brain-like system (though OpenCog is not really a brain-like system in any strong sense).

Inference trails would still be useful for chaining-based inferences, in the suggested framework. However, if the trail mechanism screws up in some cases and we get truth values that handle dependencies incorrectly — in the medium run, this won’t matter so much, because the Gibbs sampling mechanism will eventually find more correct versions for those truth values, which will be revised into the truth values. Note that incorrect truth values gotten by inadequate use of trails will still affect the results of the sampling, because they will weight some of the links used in the sampling-based inference — but the sampling-based inference will “merge” these incorrect truth values with the truth values of the relations embodying the dependencies they ignore, muting the influence of the incorrect values.

Also: one problem I’ve noted before with MLN and related ideas is that they assume a fully consistent interpretation of all the links in their network.    But a complex knowledge network reflecting the world-understanding of an AGI system, is not going to be fully consistent.  However, I believe the approach described here would inherit PLN’s robustness with regard to inconsistency.   The PLN heuristic inference rules are designed to dampen inconsistencies via locally ignoring them (e.g. if the premises of the PLN deduction rule are wildly inconsistent so that the rule gives a truth value strength outside [0,1], the resultant inference will simply not be revised into the truth value of the conclusion Atom).   In the current proposal, this sort of mechanism would be used both in the Gibbs sampling and the chaining control mechanisms.

Revision versus Gibbs Sampling

Now — if anyone is still following me by this point — I want to take the discussion in a slightly different direction.   I’m going to use the above ideas to make an argument why inference trails are unnecessary in PLN even without Gibbs sampling.

Reading through Thought Experiment #1 above, one might wonder why bother to maintain two truth values, an instantaneous and a primary one.  Why is this better than the traditional PLN approach, where you do the updating directly on the primary truth values, but instead of (as in Gibbs sampling) replacing the old truth value with the new one at each step, just revise the new truth value with the old one?

The answer seems to be: In the long run, if one assumes a fixed set of knowledge in the inference network during the learning process, both approaches amount to the same thing.  So in this somewhat artificial “fixed knowledge” setting, it’s really mainly a matter of convergence rates.   (Which means it’s a matter of the speed of coming to modestly intelligent conclusions, since in a real-world system in a dynamic environment, there is no hope of an inference network converging to a fully coherent conclusion based on its existing data before new data comes in and disrupts things).

Viewed at a sufficient level of abstraction, the Gibbs sampling approach corresponds to taking a Markov matrix M and iterating it: one takes the power M^n as n goes to infinity, until (M^n x), where x is the initial condition, converges to a stationary distribution.

Specifically, in the approach outlined above, one can think about a long vector, each entry of which refers to a “truth value state” of the PLN system as a whole.   The k’th truth value state corresponds to a proposition of the form “Truth value of Atom 1 lies in interval I_k(1), AND truth value of Atom 2 lies in interval I_k(2), AND … truth value of Atom n lies in interval I_k(n).”   So this is a very high dimensional vector.  Given the specific set of inference rules and truth value formulas in a PLN system, if one iterates PLN using parallel forward chaining (i.e. executing all possible single-step forward inferences at the same time, and revising the results together), then PLN execution corresponds to multiplying by a large Markov matrix M.

On the other hand, the standard PLN approach with only one truth value for each Atom and a fixed weight c in the revision rule, corresponds roughly to taking the limit of the power ( c I + (1-c) M )^n as n goes to infinity.   The latter approach will generally take significantly longer to converge to the stationary distribution, because the ratio (second largest eigenvalue) / (largest eigenvalue) will be closer to 1.

Actually it’s a bit subtler than that, because the revision weight c isn’t a constant in PLN. Rather, as the system accumulates more evidence, c gets larger, so that the existing evidence is weighted more and the new evidence is weighted less.

But for each fixed value of c, the iteration would converge to the same stationary distribution as the Gibbs sampling approach (under reasonable assumptions, for a network with fixed knowledge).   And we may assume that as the network learns, eventually c will reach some maximum short of 1 (c=.9999 or whatever).   Under this assumption, it seems PLN iteration with adaptive revision weight will converge to the stationary distribution — eventually.
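A toy numerical illustration of this convergence-rate argument, using numpy and an arbitrary small Markov matrix as a stand-in for the PLN update operator (this exhibits the eigenvalue effect only; it does not model PLN itself):

import numpy as np

# An arbitrary 3-state column-stochastic Markov matrix M (columns sum to 1).
M = np.array([[0.8, 0.1, 0.3],
              [0.1, 0.7, 0.3],
              [0.1, 0.2, 0.4]])
x0 = np.array([1.0, 0.0, 0.0])             # initial condition

# Stationary distribution: the eigenvector of M for eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(M)
stat = np.real(vecs[:, np.argmax(np.real(vals))])
stat = stat / stat.sum()

def steps_to_converge(A, x, tol=1e-6, max_iter=100000):
    for n in range(1, max_iter + 1):
        x = A @ x
        if np.abs(x - stat).max() < tol:
            return n
    return max_iter

c = 0.9                                     # fixed revision weight
damped = c * np.eye(3) + (1 - c) * M        # "revision"-style iteration matrix

print("pure Gibbs-style iteration: ", steps_to_converge(M, x0), "steps")
print("revision-style iteration:   ", steps_to_converge(damped, x0), "steps")
# Both iterations share the same stationary distribution, but the damped matrix
# has second-largest eigenvalue c + (1-c)*lambda_2, much closer to 1, hence the
# much slower convergence.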

So the apparent conclusion of this somewhat sketchy mathematical thinking (if all the details work out!) is that, if one makes the (unrealistic) assumption of a fixed body of knowledge in the system,

  • The current PLN revision-based approach will get to the same place as the hypothetical Gibbs Sampling based approach outlined in Thought-Experiment #1 above
  • In this setting, we don’t need trails.  Dependencies will take care of themselves eventually as the network iterates.  (i.e., since Gibbs sampling doesn’t need trails, and the standard PLN approach is equivalent to Gibbs sampling on second-order distributions in the long run, the standard PLN approach also doesn’t need trails)

Now, it may be that trails are still useful in the short run.   On the other hand, there seem to be other ways to handle the matter.  For instance: if one has a sub-network of tightly interlinked Atoms, then one can do a lot of inference on these Atoms, thereby accelerating the iterative sampling process as regards the relationships between these Atoms.  In this way the mutual dependencies among those Atoms will get resolved faster, much as if one were using trails.

Thought-Experiment #2

Finally, I’ll present a less extreme thought-experiment, which I think has a greater likelihood of actually being useful for PLN in OpenCog.

Instead of having two truth values per Atom — one the primary, traditional PLN truth value and the other an instantaneous truth value used for Gibbs sampling — what if one had two truth values, both updated via the standard PLN approach, but with widely differing default revision weights?

The standard default revision weight in PLN now is driven by the confidence factor

c = n/(n+k)

where n is the number of observations, and k is the “personality parameter.”  But layered on top of this (in the PLN theory, though not currently in the code) is a “confidence decay factor”, which decays confidence values over time.

One possibility would be to have two different truth values associated with each Atom: one conservative and one adventurous.   The two would differ in their personality parameters.  The conservative truth value would get updated with a small value of k, meaning that it would tend to weight its past experience highly and its new conclusions not so much.   The adventurous truth value would get updated with a large value of k, meaning that it would weight its new conclusions much more than its past experience.
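A minimal sketch of the two regimes, using the c = n/(n+k) revision weight (the update stream, the starting strength and the two k values below are made up purely for illustration):

def revise(old_strength, new_strength, n, k):
    # Standard PLN-style revision with confidence factor c = n/(n+k):
    # weight the existing value by c and the new evidence by (1-c).
    c = n / (n + k)
    return c * old_strength + (1 - c) * new_strength

# The same stream of new conclusions, revised under two personality parameters.
conclusions = [0.9, 0.8, 0.95, 0.2, 0.85]   # made-up inference results
conservative, adventurous = 0.5, 0.5        # both start at strength 0.5
n = 10                                      # observations accumulated so far

for s in conclusions:
    conservative = revise(conservative, s, n, k=1)    # small k: trusts its past
    adventurous  = revise(adventurous,  s, n, k=100)  # large k: trusts new evidence
    n += 1

print(conservative, adventurous)
# The conservative value drifts slowly toward the new conclusions; the
# adventurous value tracks them almost immediately, approaching the Gibbs-like
# "replace with the new sample" behaviour as k grows.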

What Thought Experiment #1 teaches us is that: As k goes to infinity, if one follows a simple inference control strategy as outlined there, the adventurous truth value will basically be getting updated according to Gibbs sampling (on second order probability distributions).

We have seen that both the adventurous and conservative truth values will converge to the same stationary distribution in the long run, under unrealistic assumptions of fixed knowledge in the network.  But so what?  Under realistic conditions they will behave quite differently.

There is much to experiment with here.   My point in this post has merely been to suggest some new experiments, and indicate some theoretical connections between PLN, sampling theory, and other probabilistic inference methods like MLN.

OK, that’s a rough summary of my train of thought on these topics at the moment. Feedback from folks with knowledge of PLN, MLNs and sampling would be valued. Am I thinking about this stuff in a sensible way? What do you think?

The current version of this post owes something to a critique of the previous version by Abram Demski.

Posted in Theory, Uncategorized | 2 Comments

Why Hypergraphs?

OpenCog uses hypergraphs to represent knowledge.  Why?  I don’t think this is clearly, succinctly explained anywhere, so I will try to do so here.  This is a very important point: I can’t begin to tell you how many times I went searching for some whiz-bang logic programming system, or inference engine, or theorem-prover, or some graph re-writing engine, or some probabilistic programming system, only to throw my hands up and realize that, after many wasted hours, none of them do what I want.  If you’re interested in AGI, then let me assure you: they don’t do what you want, either.  So, what do I want them to do, and why?

Well, let’s begin easy: with graph re-writing systems.  These days, almost everyone agrees that a great way to represent knowledge is with graphs.  The structure IsA(Cat, Animal) looks like a graph with two vertexes, Cat and Animal, and a labelled edge, IsA, between them.  If I also know that IsA(Binky, Cat), then, in principle, I should be able to deduce that IsA(Binky, Animal).  This is a simple transitive relationship, and the act of logical deduction, for this example, is a simple graph re-write rule: if you see two IsA edges in a row, you should draw a third IsA edge between the first and the last vertex.  Easy, right?
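For concreteness, here is that rewrite rule as a few lines of Python, operating on a graph stored as a set of labelled edges.  This is a stand-alone toy, not OpenCog’s pattern matcher or any real API:

# A knowledge graph as a set of labelled edges: (label, source, target).
graph = {("IsA", "Binky", "Cat"), ("IsA", "Cat", "Animal")}

def transitive_rewrite(edges, label="IsA"):
    # If X -label-> Y and Y -label-> Z, draw X -label-> Z (one rewrite pass).
    new = set()
    for (l1, x, y1) in edges:
        for (l2, y2, z) in edges:
            if l1 == l2 == label and y1 == y2 and x != z:
                new.add((label, x, z))
    return edges | new

print(transitive_rewrite(graph))
# {('IsA','Binky','Cat'), ('IsA','Cat','Animal'), ('IsA','Binky','Animal')}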

So perhaps you’d think that all logic induction and reasoning engines have graph rewrite systems at their core, right? So you’d think. In fact, almost none of them do.  And those that do, do it in some internal, ad hoc, non-public, undocumented way: there’s no API, it’s not exposed externally; it’s not an ‘official’ part of the system for you to use or tinker with.

OK, so why do I need a graph re-write system? Well, I’ve been working on a natural language parser, a so-called Viterbi decoder for Link Grammar.  My initial graph is a string of words: a sentence.  The vertexes are words, and the edges are arrows called “next word”. Real simple. To parse this sentence, I want to apply a certain set of simple graph-rewrite rules: for example, if word X is a noun, then create an arrow, called ‘part-of-speech’ (POS), from word X to the special vertex ‘noun’.  If the word immediately before word X is an adjective (i.e. if it has a POS arrow pointing to ‘adjective’), then create a new arrow, called ‘noun modifier’, pointing from X to this word before it.   This kind of graph markup is called ‘dependency parsing’, and is a very popular way of doing natural language parsing.  So you’d think that all dependency parsers have a graph re-write system at their core, right?  Hardly. In fact, just about none of them do.  And if they do, they’re walled off, hacked up, undocumented … you get the idea.
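The same flavour of rule, sketched for the word-markup just described — a toy lexicon and toy rules, purely illustrative, nothing to do with the actual Link Grammar dictionaries:

# A sentence as a chain of "next word" edges, plus a toy part-of-speech lexicon.
words = ["the", "big", "dog", "barked"]
lexicon = {"big": "adjective", "dog": "noun"}      # toy POS assignments

edges = [("next-word", words[i], words[i + 1]) for i in range(len(words) - 1)]

# Rule 1: add a 'part-of-speech' arrow from each tagged word to its POS vertex
# (in the example above: from a noun to the special vertex 'noun').
for w, pos in lexicon.items():
    edges.append(("part-of-speech", w, pos))

# Rule 2: if the word immediately before a noun X is an adjective,
# add a 'noun-modifier' arrow from X to that preceding word.
for i in range(1, len(words)):
    if lexicon.get(words[i]) == "noun" and lexicon.get(words[i - 1]) == "adjective":
        edges.append(("noun-modifier", words[i], words[i - 1]))

print(edges)   # ('noun-modifier', 'dog', 'big') appears among the added markup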

The only dependency parser that I know of that has an explicit graph-rewriting system in it, that is open for tinkering, and is documented (!) is RelEx.  And that’s great.  Wow!  Although RelEx invented and used its own, custom, graph rewrite system, I suppose that, in principle, it could have used some other, pre-existing system to do this (Well, it couldn’t, because in 2005, there weren’t any robust, open-source graph rewriting systems. Whatever).

What else do I want to do? Well, I daydream about using a machine-learning system to learn new rules!   I mean, this is the goal of AGI, right? Have a machine that can learn new things?  Well, to learn new rules, let’s see, I need to have some simple syntax for representing rules.  Basically a simple graph language.  So you’d think that all graph re-writing systems have some simple, easy-to-use graph language, right?  No. Emphatically, hell no. With maybe one exception, you have to program in Java or C++ or C#.net. Unfortunately, my machine learning system doesn’t yet know how to program in those languages.

Here’s the leap of faith, the leap of logic that I’ve come to: it would be convenient if I could express graph re-write rules as graphs themselves.  It would be convenient if I could express logical implications as graphs themselves.  It would be convenient if my graphical programming language itself could be written as a graph. Well, it can be. In fact, it is easiest to do this if the graph is actually a hypergraph. I’ll explain why in the next section below.  If I had a hypergraph re-writing system, then I would have a place where I could unify natural language processing, logical reasoning and machine learning, all in one place.  So you’d think that anyone who was trying to build an AGI system would be writing its foundations on a hypergraph rewriting system, right? No, you’d be wrong. Apparently, OpenCog is the only system that does this.  Now, the OpenCog implementation has many design warts and architectural flaws. It’s hard to understand and hard to use.  But now, perhaps, you see why I’ve pledged allegiance to it, instead of running off with some other reasoning system or parser or Bayesian network or whatever.

Mathematical Foundations

In this section, I will try to put all of the above comments on a solid mathematical footing, invoking model theory, category theory, (and even n-categories!), and type theory.  The upshot of all of this will be that the easiest way to represent data structures so that machine learning algorithms can learn them, and then apply them both to natural-language parsing, and to logical reasoning, is to represent the data structures as hypergraphs.

From model theory and computer science, we have the concept of a signature: a set of functions which take some number of arguments and return some value (just like a signature in Java or C++).  If one ignores types for a moment (which is what lisp and scheme do), then, in principle, one can pass any value in any position of any function, and stack these up arbitrarily, recursively, even.  This is called a term algebra, or more precisely a free term algebra or ‘free theory’.  If the functions don’t have names, but are anonymous, then one has the lambda calculus.

One way to envision a member of a term algebra is as a directed tree graph.  So, if we have two functions f(x,y) and g(x,y) and three constants a,b,c, then f(a, g(b,c)) is a binary tree, with f at the top node, a as its left leaf, g as its right child, and b and c as g’s leaves. A term algebra is then just the collection of all such trees. Nothing more, nothing less.

To do useful programming, one also needs predicates or relations: things that have truth values, and that order terms. Thus, ‘greater than’ is a relation, and ‘a>b’ is either true or false.  Relations can also be things like IsA, HasA, BelongsTo, LivesIn, EmployedAt. The last two examples should make clear that relational algebras form the foundation of databases, whether SQL or noSQL.  Relations are combined with logical operators (employee X LivesIn city Y AND ReportsTo dept Z is a textbook example).

In general, one combines both term algebras and relational algebras, so that one writes things like 3<f(x,y), where f(x,y) is a term, < is a relation, and 3 is a constant.  Add to this the special free-variable binding operators ForAll and ThereExists, and one gets a first-order logic. So, for example, ForAll x ThereExists y such that 3<f(x,y).

A special case of a relation is a term re-write rule.  This is a relation that takes a term, and replaces it with a different term: for example, ab->c, which says ‘whenever you see the string ab, replace it with c’. The BNF notation of computer languages is just a collection of term re-writing relations. One uses a term rewriting system to parse a (formal) language. Graph rewriting is just a variation of this: whenever you see a graph x, replace it with a graph y.

So far, I’ve avoided the issue of types.  In programming, types allow type safety.  Types make code more readable: f(string, int) is less mysterious than f(x,y). Types solve certain abstract recursion problems in lambda calculus.  A re-write rule in BNF notation is a typed rewrite rule: a substitution a->bc holds not just for any a, but specifically, only when a is a web page, or an IP address or a URL.  A graph re-write rule that says ‘whenever you see x, replace it with y’ implicitly demands that x be typed: x can’t be just anything, it has to be a specific kind of graph, having a specific shape and linkage.  The rule applies for all graphs that have this shape, that are of this kind or type.  So a re-write rule x->y is really a rule (type x)->(type y). Graphically, it’s still two points x and y, with a directed edge -> in between them. Oh, wait, x and y aren’t points, x and y are graphs.  What kind of a graph has graphs as points?  What kind of graph has edges between graphs? A hypergraph!

And that is the main Ah-HA! moment.  Once you see that, you start seeing hypergraphs everywhere. Sure, you can visualize Set(a,b,c) as a tree-graph, with Set as the parent node, and three children a,b,c.  Or you can visualize this as a hypergraph: Set as a ‘link’ (a ‘hyper-edge’ with 3 endpoints, not 2), and the points a,b,c as the nodes contained in the link.  In fact, all hypergraphs are dual to these directed trees; if you have one, you can have the other.  Hypergraphs are just a convenient notation.

Let’s take a moment to look back on what just happened: a function f(x,y,z) is just a hyperedge f connecting three nodes x,y,z. A boolean expression a AND b AND c can be written as AND(a,b,c), which shows a specific example of a hypergraph equivalence. It can be written as a reduction rule: (a AND b AND c) -> AND(a,b,c), which is itself just a hypergraph of the form x->y, with x and y being hypergraphs.  The first-order logic constructs ‘for-all’ and ‘there-exists’ are just special cases of the lambda-calculus binding operation lambda, which binds free variables in an expression. Again, hypergraphs: lambda is just a hyperlink that binds a variable x in an expression y, and y was just a term, ahem, hypergraph!
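Here is that tree/hypergraph duality in miniature, as a toy Python data structure (the representation is invented for this post; OpenCog’s actual Atom types are richer than this):

# The same structure, written two ways.

# 1. As a directed tree: Set is a parent node with three children.
tree = ("Set", ["a", "b", "c"])

# 2. As a hypergraph: Set is a single hyperedge (a "link") whose endpoints
#    are the three nodes a, b, c -- one 3-ary link, not three binary edges.
nodes = {"a", "b", "c"}
links = [("Set", ("a", "b", "c"))]

# One level deeper, the term f(a, g(b, c)) is again just links whose
# endpoints may themselves be links ('g1' names the inner link here):
term_tree  = ("f", ["a", ("g", ["b", "c"])])
term_links = {"g1": ("g", ("b", "c")),
              "f1": ("f", ("a", "g1"))}

print(tree, links, term_tree, term_links)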

I mentioned categories and n-categories, and I suppose I should justify this mention. Insofar as category theory is the theory of points and arrows, then a rewrite rule between graphs is a morphism in the category of small diagrams.  A subtle but important point about category theory that is almost never discussed in intro-to-cat-theory texts is that all objects are implicitly typed. In the category of Sets, the objects are all of the same kind: they are sets.  It’s not mentioned because in a given category, all objects are of the same type; types change only when a functor maps from one category to another.   So, to understand the category-theoretic equivalent of types in computer science, we must think of functors.  But, as we just saw, a graph rewriting rule is a morphism between functors.  So you could say that graph re-writing is just the category Cat of small categories.  Or you could slide down this slope in a different direction, and start calling it a 2-category. Whatever.  Perhaps it’s useful to point out that graph rewriting algorithms are sometimes expressed as single-pushout or double-pushout constructions, with a pushout being a certain category-theoretic concept. Notable, for graph rewriting, is that any category with pushouts and equalizers has all (co-)limits. Except that, as we just saw, we want hyper-graph rewriting systems, not graph rewriting systems. So there.

What else are they good for?

In OpenCog, the Link and Node types inherit from the type Atom. This naming convention is intentionally suggestive: ‘Atom’ is meant to invoke the notion of an ‘atomic formula’ from model theory or first-order logic: that is, a formula that has no variables in it (it’s fully grounded), and that does have a truth value (it’s not composed of boolean connectives, and has no quantifiers in it).  This suggestive naming helps establish the intended use of OpenCog hypergraphs with regards to first-order logic.

The truth value is a different matter. The default (simplest) OpenCog truth value is a pair of floating point numbers: a probability and a confidence. These numbers allow several other AI concepts to be mapped into hypergraphs: Bayesian networks, Markov networks, and artificial neural networks. All three of these are graphs: directed graphs, at that. They differ in how they assign and propagate floating-point probabilities, entropies, activations. Ideas such as Markov logic networks, which implement maximum entropy principles (aka the Boltzmann partition function) on a network of first-order logic expressions, can be represented with OpenCog hypergraphs.  Oh, and I should mention PLN (Probabilistic Logic Networks), which is what the atomspace was originally designed for. That’s what I like about the OpenCog hypergraph atomspace: it has a tremendously powerful ability to succinctly and easily represent complex modern AI concepts.

The good, the bad and the ugly.

You’ve heard about the good.  Now for the bad and the ugly.  First, the OpenCog atomspace implementation is slow and inefficient, over-complicated, badly architected, weakly-distributed, non-scalable, single-threaded. But let’s not go there.  All this might be fixable, after a lot of programming effort (and deep, architectural thinking). It’s been hotly debated in the past. Someday, maybe it’ll get fixed.

The bad thing about the OpenCog atomspace is that almost no one understands that, ahem, it is a programming language. Let me be very clear on this: OpenCog implements graph re-writing rules with the ImplicationLink. A sequence of ImplicationLinks can be used to compute things. In that sense, it is somewhat like the language Graph Programs, except that OpenCog allows fractional truth values, and logic programming and other good things.  If we stick to using ImplicationLinks with crisp truth values (T/F), then the resulting system is essentially Prolog. Of course you know that Prolog is popular for AI programming projects, because it’s fairly easy to write reasoning engines and expert systems and the like in Prolog.  What you may not know is that closely related to Prolog is Answer-Set Programming (ASP). In fact, ASP uses exactly the same notation as Prolog does. It differs in two important ways: first, when you run a Prolog program, you get one answer. With ASP, you get all of the answers!  It’s dramatically more powerful, and the reason for this is that modern-day ASP solvers are built on top of modern-day Boolean SAT solvers. Which means that they are stunningly efficient and effective.
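To make the “it really is a programming language” point concrete, here is a toy Python sketch of crisp-truth-value forward chaining, with plain tuples standing in for ImplicationLinks.  The notation is invented for this post and is emphatically not the Atomspace API:

# Facts and rules in a crisp (True/False) setting. Each "rule" plays the role
# of an ImplicationLink: if the premises hold, assert the conclusion.
facts = {("IsA", "Binky", "Cat")}
rules = [
    # Prolog-style: isa(X, animal) :- isa(X, cat).
    lambda f: {("IsA", x, "Animal") for (r, x, y) in f if r == "IsA" and y == "Cat"},
    # isa(X, mortal) :- isa(X, animal).
    lambda f: {("IsA", x, "Mortal") for (r, x, y) in f if r == "IsA" and y == "Animal"},
]

def forward_chain(facts, rules):
    # Apply every rule until no new facts appear (a naive fixpoint).
    changed = True
    while changed:
        new = set().union(*(rule(facts) for rule in rules)) - facts
        changed = bool(new)
        facts = facts | new
    return facts

print(forward_chain(facts, rules))
# {('IsA','Binky','Cat'), ('IsA','Binky','Animal'), ('IsA','Binky','Mortal')}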

So what does this have to do with OpenCog? Well, here we have a system that, using ImplicationLinks, is essentially Prolog, more or less, when run in crisp-logic mode. Or, you could say, it’s like typed lambda calculus. But do we have a simple, easy-to-use syntax like Prolog for it? No we don’t. That’s bad. Can we take an existing Prolog program, run a tool on it, and convert it to ImplicationLinks? No we can’t.  Would it run fast? No it wouldn’t: it would probably be slower than the slowest Prolog ever: Borland prolog running on a 5MHz IBM PC AT in 1986.  And forget an ASP solver for OpenCog.  For the special case where all OpenCog truth values are crisp T/F values, we do not have a Boolean SAT solver to find solutions for our graphs of ImplicationLinks.  This is bad, Really Bad. But I think that this is because very few people seem to understand that the OpenCog Atomspace really is a petri dish for programming languages.

Heck, we don’t even have anything equivalent to the RelEx Sentence Algorithms for OpenCog, even though RelEx is OpenCog-like. This absence is slowing down my efforts to continue work on the Link-Grammar parser, and to move natural language processing out of its stand-alone arena, into a general, flexible framework.

(And we’ve barely scratched the surface. In order to make implication and pattern mining run quickly in the atomspace, we need to implement something like the concept of ‘memoization’ from lisp/scheme. But it turns out that memoization is really just a relational algebra: it is a database of short expressions that stand in for long ones. The OpenCog Atomspace is also, among other things, a relational database that can store and query not only flat tables or key-value pairs, but full-blown hypergraphs. And this isn’t a day-dream; it’s crucial for performance (and it’s partially implemented)).
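The memoization-as-relation idea is easy to sketch: a table whose keys are (sub-)expressions and whose values are previously computed results — short expressions standing in for long ones.  A toy version, invented for this post:

# Memoization as a relation: a table from (sub-)expressions to results.
memo = {}   # key: a hashable expression, value: its previously computed result

def evaluate(expr):
    # Evaluate a nested ('op', args...) expression, caching every sub-result.
    if expr in memo:
        return memo[expr]            # the short key stands in for the long computation
    op, *args = expr
    vals = [evaluate(a) if isinstance(a, tuple) else a for a in args]
    result = sum(vals) if op == "add" else None
    memo[expr] = result
    return result

print(evaluate(("add", ("add", 1, 2), ("add", 1, 2))))   # 6; the inner sum is computed once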

Why don’t we have these things? Well, it’s hard. It’s just not easy. We don’t have the infrastructure to make it easy, and we don’t have the users who demand these tools.   I don’t think most users are even aware of what the atomspace could even do.   Almost no one is thinking about ‘how to program in the language of OpenCog’ even though it has the potential of far surpassing any of the existing probabilistic programming languages out there.  It’s time to change all this, but it will take someone smart and dedicated to do this. Many someones. This could be you.

Posted in Design, Introduction, Theory | Tagged , , , , , , , , , | 39 Comments

Catalog of Current OpenCog Atom Types

Alex van der Peet (of the OpenCog Hong Kong team) has been working on cataloguing all Atom types currently in use in the OpenCog code on the wiki site.

This page lists them all, with a page for each one:

http://wiki.opencog.org/w/Category:Atom_Types

A few of the pages still don’t have any information on them.

To all OpenCog developers: If you’re working heavily with a certain set of Atom types, please check out the corresponding wiki page, and think about adding some comments or examples.

Posted in Uncategorized | Leave a comment

The Viterbi Parser

I’ve recently made some good progress on something that I’m calling “the Viterbi decoder”, a new parser for the Link Grammar natural language parser.  So I guess that means it’s time to talk a bit about the why and how of this parser.

The goal of providing this decoder is to present a flexible, powerful interface for implementing high-level semantic algorithms on top of the low-level link-grammar syntactic parser, and, in particular, for steering the parse based on high-level semantic knowledge. This allows the parser to move beyond being merely a syntactic parser, and to become fully integrated with general semantic artificial intelligence.

A less abstract list of expected benefits includes:

  • Incremental parsing: the ability to obtain partial results after providing partial sentences, a word at a time.
  • Less sensitivity to sentence boundaries, allowing longer, run-on sentences to be parsed far more quickly.
  • Mitigation of the combinatorial explosion of parses.
  • Allow grammatically broken/incorrect chat dialog to be parsed; in general, to do better with slang, hip-speak.
  • Enable co-reference resolution and anaphora resolution across sentences (resolve pronouns, etc.)
  • Enable annotation of the parse graph with word-sense data, entity markers.
  • Allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternative reference resolutions.
  • Allow a plug-in architecture, so that plugins employing higher-level semantic (AGI) algorithms can provide parse guidance and parse disambiguation.
  • Eliminate many of the hard-coded array sizes in the code.

The data structures used to implement this resemble those of the OpenCog AtomSpace. All data classes inherit from a class called Atom (which is an atomic predicate, in the sense of mathematical logic). Atoms are typed; the two core types are Links and Nodes. Thus, all data is represented in the form of a “term algebra” (aka the “Free Theory”, in the sense of model theory). This structure allows all data to be represented as (hyper-)graphs, which in turn makes graph algorithms easier to implement. All these theoretical considerations provide a natural setting for storing Viterbi state information. Put differently, this provides a generic, uniform way of holding the various partly-finished parses, and effecting state transformations on them.

Since all of the data is represented dynamically (at run-time) by these (hyper-)graphs composed of atoms, developing custom algorithms to manipulate the parse becomes easy: there are no strange compile-time structures to master.  All algorithms can access the data in a uniform, common way.

Making the internal state directly visible allows low-level syntactic algorithms, as well as high-level, semantic algorithms to control parsing. In other words, the intended use of the Viterbi decoder is to provide a framework for parsing that should make it possible to integrate tightly (and cleanly) with high-level semantic analysis algorithms. Thus, reference and anaphora resolution can be done using the same graph structure as used for parsing; it should also allow graphical transformations, such as those currently implemented in RelEx.

One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found via the news story “Language Use is Simpler Than Previously Thought”, per Morten Christiansen, Cornell professor of psychology.

Currently, the parser can correctly parse many short sentences. It currently runs very slowly, as no pruning algorithms have yet been implemented. Instructions for turning it on can be found in the viterbi/README file. The code is not in the 4.7.10 tarball; you need something newer: i.e. pull from the svn source tree. It will be in 4.7.11, whenever that comes out.

Here’s an example parse of “this is a test”. First, the usual link-parser output:

         +--Ost--+
   +-Ss*b+  +-Ds-+
   |     |  |    |
this.p is.v a test.n

or, with the wall words:

    +---------------RW--------------+
    |              +--Ost--+        |
    +---Wd---+-Ss*b+  +-Ds-+        |
    |        |     |  |    |        |
LEFT-WALL this.p is.v a test.n RIGHT-WALL

The output of viterbi, with some explanatory comments,  is this:

SEQ :                  # a sequence, an ordered set
  LING :               # a link-grammar link; naming conflict with opencog link.
    LING_TYPE : Wd     # the type of the link connecting two words.
    WORD_DISJ :        # holds the word and the connector used
      WORD : LEFT-WALL # all sentences begin with the left-wall.
      CONNECTOR : Wd+  # + means "connect to the right". - means left
    WORD_DISJ :
      WORD : this.p    # word with suffix as it appears in link-grammar dictionary
      CONNECTOR : Wd-
  LING :
    LING_TYPE : Ss*b   # and so on ...
    WORD_DISJ :
      WORD : this.p
      CONNECTOR : Ss*b+
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : Ss-
  LING :
    LING_TYPE : Ds
    WORD_DISJ :
      WORD : a
      CONNECTOR : Ds+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Ds-
  LING :
    LING_TYPE : Ost
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : O*t+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Os-

Oh, and I suppose it’s appropriate to answer the question “why is it called the Viterbi parser”?  I’m calling it that because it is inspired by (and vaguely resembles) the Viterbi algorithm famous from signal processing. A characteristic feature of that algorithm is that it maintains a set of states in parallel. As each new bit is received, some of the states become inherently inconsistent (e.g. because some checksum is violated), while other new states become possible. Once some certain number of bits have been received, the ones that can be consistently interpreted with the checksum constraints can be output. The process then repeats with each new bit streaming in.

In link-grammar, a “disjunct” can be thought of as a puzzle piece with a word printed on it. There are many different puzzle pieces with the same word on them. As each word comes in, one tries to find a piece that fits (this is like the viterbi checksum). Sometimes, more than one fits, so one has multiple ‘alternatives’ (this is like the viterbi state-vector). The algo keeps a set of these alternatives (of assembled pieces), and, as words come in, alternatives are either discarded (because nothing fits) or are elaborated on.
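Here is a highly simplified sketch of that incremental assembly loop, with a made-up toy connector notation (nothing like the real link-grammar dictionaries, and ignoring ordering and crossing constraints entirely):

# Toy "disjuncts": each word offers pieces; each piece is a list of connectors
# it either requires to its right ('+') or satisfies leftward ('-').  Invented notation.
DICT = {
    "this": [["Ss+"]],
    "is":   [["Ss-", "O+"]],
    "a":    [["D+"]],
    "test": [["D-", "O-"], ["O-"]],
}

def extend(state, word):
    # Try every piece for the new word against one partial-parse state.
    results = []
    for piece in DICT.get(word, []):
        unmet = list(state)                   # '+' connectors still waiting to be matched
        ok = True
        for c in piece:
            if c.endswith("-"):               # must match an earlier unmet '+' connector
                want = c[:-1] + "+"
                if want in unmet:
                    unmet.remove(want)
                else:
                    ok = False
                    break
            else:
                unmet.append(c)               # a new rightward requirement
        if ok:
            results.append(unmet)
    return results

# Maintain a set of alternative states in parallel, discarding or elaborating
# each one as every new word streams in -- the Viterbi-like part.
states = [[]]
for word in "this is a test".split():
    states = [new for s in states for new in extend(s, word)]

print(states)   # [[], ['D+']] -- the first alternative has no unmet connectors

At the end, any alternative with no unmet connectors corresponds to a completed parse; the others are the ambiguities that get carried along, and eventually discarded or resolved.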

Unlike the viterbi algorithm, in natural language processing, it is useful to keep some of these alternatives or ambiguities around until much later stages of processing, when the disambiguation can finally be performed. As a famous example: “I saw the man with the telescope” has two valid syntactic parses, and two valid semantic interpretations.  Who was holding the telescope, me, or the man? Resolving this would be like applying a checksum to two different paths very late in the Viterbi game.

I like this analogy because it is vaguely biological as well: or perhaps I should say “neural net-ish”. The multiple, provisional states that are kept around are sort of like the activation states of a feed-forward artificial neural network. But this is not very deep: the feed-forward neural net looks like a Hidden Markov Model (HMM), and the Viterbi algorithm is essentially an HMM algorithm. No surprise!

But all this talk of algorithms hides the true reason for this work. The above algo is not strong enough to reproduce the old parser behavior: it can create islands; it ignores post-processing. The original algorithm uses integer-valued “cost” to rank parses; I want to replace this by floating point values (probabilities! maximum entropy!).

I also want to implement an “algorithm plug-in” API — basically, a way of offering “here’s the current state, go and modify it” — to have ‘mind-agents’ in OpenCog terminology. The above puzzle-piece assembly algo would be the first to run, but clearly, others are needed to prevent islands, or to re-order states by probability/likelihood.   Some of these may be clearly distinct algos; others may end up as tangled balls of complexity.  Factorization into distinct algos is clearly possible: RelEx already had a list of algos that were applied in sequential order.  First, some POS tagging was done, then some head-word verb extraction, then some entity extraction, etc. Algorithms can be layered.

So really, the core issue I’m hoping to solve here is that of having a uniform development environment: link-grammar is in C, has no probability (besides cost), and no internal API. RelEx is in Java, is explicitly graphical, but is not a hypergraph, has no probabilities, and can’t provide parse feed-back to control link-grammar. RelEx output was pumped into OpenCog, which is in C++; it cannot feedback into RelEx or Link Grammar.  The Link Grammar dictionaries are files: how can an automated system learn a new word, and stick it into a file?

At the moment, there aren’t really any new or novel algorithms: I’ve day-dreamed some in the past, but the fractured approach halts progress. All these boundaries are barriers; the hope here is to get past all of these barriers.  The work is really an architectural re-design, and not a new whiz-bang algo.

Posted in Design, Development, Theory | Tagged , , , , , , | 3 Comments

The MOSES Metapopulation

Kaj Sotala recently asked for an update on how MOSES selects a “species” to be “mutated”, when it is searching for the fittest program tree. I have some on-going, unfinished research in this area, but perhaps this is a good time to explain why.

To recap: MOSES is a program-learning system. That is, given some input data, MOSES attempts to learn a computer program that reproduces the data. It does so by applying a mixture of evolutionary algorithms: an “inner loop” and an “outer loop”. The inner loop explores all of the mutations of a “species” (a “deme”, in MOSES terminology), while the outer loop chooses the next deme to explore. (Each deme is a “program tree”, that is, a program written in a certain lisp-like programming language).

So: the outer loop selects some program tree, whose mutations will be explored by the inner loop. The question becomes, “which program tree should be selected next?” Now, nature gets to evolve many different species in parallel; but here, where CPU cycles are expensive, its important to pick a tree whose mutations are “most likely to result in an even fitter program”. This is a bit challenging.

MOSES works from a pool of candidate trees, of various fitnesses. With each iteration of the inner loop, the pool is expanded: when some reasonably fit mutations are found, they are added to the pool. Think of this pool as a collection of “species”, some similar, some not, some fit, some not so much. To iterate the outer loop, it seems plausible to take the fittest candidate in the pool, and mutate it, looking for improvements. If none are found, then in the next go-around, the second-most-fit program is explored, etc. (Terminology: in MOSES, the pool is called the “metapopulation”.)

It turns out (experimentally) that this results in a very slow algorithm. A much better approach is to pick randomly from the highest scorers: one has a much better chance of getting lucky this way. But how to pick randomly? The highest scorers are given a probability: p ~ exp(score / T), so in fact the highest-scoring have the highest probability of being picked, but the poorly-scoring have a chance too. This distribution is the “Gibbs measure” aka “Boltzmann distribution”. (T is a kind of “temperature”; it provides a scale; it’s held constant in the current algos.)  I’m guessing that this is the right measure to apply here, and can do some deep theoretical handwaving, but haven’t really worked this out in detail. Experimentally, it works well; there even seems to be a preferred temperature that works well for most/all different problems (but this is not exactly clear).

One can do even better. Instead of using the score, a blend of score minus program tree complexity works better; again, this is experimentally verified.   Nil added this back when, and his theoretical justification was to call it “Solomonoff complexity”, and turn it into a ‘Bayesian prior’. From an engineering viewpoint, it’s basically saying that, to create a good design suitable for some use, it’s better to start with a simple design and modify it, than to start with a complex design and modify it. In MOSES terminology, it’s better to pick an initial low-complexity but poorly scoring deme, and mutate it, than to start with something of high complexity, high score, and mutate that. Exactly what the blending ratio (between score and complexity) is, and how to interpret it, is an interesting question.
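Here is a rough sketch of that selection step — Boltzmann weighting over (score minus a complexity penalty).  The scores, complexities, temperature and blending weight below are all arbitrary, chosen only for illustration:

import math, random

# A toy metapopulation: (deme_name, fitness_score, program_tree_complexity).
metapop = [("deme-A", 0.90, 40), ("deme-B", 0.88, 12),
           ("deme-C", 0.50, 5),  ("deme-D", 0.87, 30)]

T = 0.05               # "temperature": sets the scale of the Boltzmann weighting
COMPLEXITY_W = 0.002   # blending ratio between score and the complexity penalty

def pick_next_deme(pool):
    # Boltzmann / Gibbs-measure selection over (score - complexity penalty).
    blended = [score - COMPLEXITY_W * cplx for (_, score, cplx) in pool]
    weights = [math.exp(b / T) for b in blended]
    return random.choices([name for (name, _, _) in pool], weights=weights)[0]

counts = {}
for _ in range(10000):
    d = pick_next_deme(metapop)
    counts[d] = counts.get(d, 0) + 1
print(counts)
# High scorers dominate, but lower-scoring demes still get picked occasionally;
# with these made-up numbers, the simpler deme-B actually overtakes deme-A
# because of the complexity penalty.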

Experimentally, I see another interesting behaviour, that I am trying to “fix”.   I see a very classic “flight of the swallow” learning curve, dating back to the earliest measurements of the speed of telegraph operators at the turn of the 20th century. At first, learning is fast, and then it stalls, until there is a break-through; then learning is again fast (for a very brief time — weeks for telegraph operators) and then stalls (years or a decade for telegraph operators). So, in MOSES, at first, one picks a deme, almost any deme, and almost any mutation will improve upon it. This goes on for a while, and then plateaus. Then there’s a long dry spell — picking deme after deme, mutating it, and finding very little or no improvement. This goes on for a long time (say, thousands of demes, hours of cpu time), when suddenly there is a break-through: dozens of different mutations to some very specific deme all improve the score by some large amount. The Boltzmann weighting above causes these to be explored in the next go-around, and mutations of these, in turn, all yield improvements too. This lasts for maybe 10-20 steps, and then the scores plateau again. Exactly like the signalling rate of 19th century telegraph operators :-) Or the ability of guitar players. Or sportsmen, all of whom have been measured in various social-science studies, and have shown the “flight of the swallow” curve.

(Can someone PLEASE fix the horribly deficient Wikipedia article on “learning curve”? It totally fails to cite any of the seminal research and breakthroughs on this topic. Check out google images for examples of fast learning, followed by long plateau.  So, e.g.:

[Figure: a stylized “learning curve”]

[Figure: a learning curve, in real life, for salesmen]

[Figure: actual MOSES curves look more like this, with rapid progress followed by stagnant plateaus, punctuated with rapid progress, again — except the plateaus are much flatter and much longer, and the upward curves are much sharper and faster]

All these curves beg the question: why is google finding only the highly stylized ones, and not showing any for raw, actual data? Has the learning curve turned into an urban legend?

[Figure: a real-life learning curve, taken from MOSES, using real data (the “bank” dataset) from a previous OpenCog blog post on MOSES — although this curve shows a combination of the inner and outer loops, and so, strictly speaking, does not represent what I’m discussing here]

).

Recently, I have been trying to shorten the plateau, by trying to make sure that the next deme I pick for exploration is one that is least similar to the last one explored. The rationale here is that the metapopulation gets filled with lots of very very similar species, all of which are almost equally fit, all of which are “genetically” very similar. Trying to pick among these, to find the magic one, the one whose mutations will yield a break-through, seems to be a losing strategy. So, instead, add a diversity penalty: explore those “species” that are as different as possible from the current one (but still have about the same fitness score). So far, this experiment is inconclusive; I wasn’t rewarded with instant success, but more work needs to be done. It’s actually fairly tedious to take the data…

Posted in Design, Documentation, Theory | Tagged , , , | 6 Comments