AGI stands for Artificial General Intelligence: it is the idea of creating a thinking machine, as smart and clever and powerful as a human being. OpenCog is a project pursuing basic research in AGI. It is attempting to build such a thinking machine.
Let’s compare the search for AGI to astronomy. An amateur with binoculars can do astronomy and have lots of fun, but will not contribute to scientific advance. For that, you need highly-trained scientists, and you need sophisticated instruments (viz., fancy telescopes).
OpenCog-the-project is a loose affiliation of about a half-dozen “scientists” of varying talents and varying time devoted to it. It has a loosely articulated research program; some roughly-agreed-upon directions to go in.
OpenCog-the-software is a telescope. It’s the primary tool that I (the author of this blog entry) personally use for research. Others use it, but they struggle to be effective. They lose interest after a while. (Seriously: if you had full-time access to some major scientific instrument: some big telescope or particle accelerator, would you be effective? Would you lose interest after a while? What would you even do with it? Use it to explode cans of soda and post the videos on YouTube?)
It’s worth mentioning what OpenCog isn’t: it is not a grand theory of AGI. It is not a master blue-print of a machine, waiting to be built, certain of success once completed. We hope to find some portion of that grand theory. We hope to build some portion of that machine. Bit by bit.
Let’s compare this to mainstream big-science. OpenCog the project is comparable to a small department at a small university. A handful of half-active people. OpenCog the software is… well, I spent a decade building it out. It was not clear what to build, or what it should even do. It’s a bit clearer now. I still think it’s on the right track.
The first particle accelerators were built by people like Ben Franklin, who had no clue that there were particles, or that they were being accelerated. Neither of those ideas gelled until over 100 years later. When they did, the first intentionally-designed particle accelerators were built by single individuals, over the course of a few months or a few years. (The story of early telescopes is similar). In the meantime, the general idea of why particle accelerators are useful became clear. (It’s not immediately obvious, not the way a telescope is obviously for looking through. What would you even look at? The neighbor’s house?)
But once you get the idea of what something is, and why it is useful, you can build bigger and better ones, and when you can’t build it yourself in a few years, you can organize a staff of grad students and assistants. Maybe even a part-time engineer. Eventually, the general idea becomes so very clear, and you are so certain of success, that you can raise funds to pour concrete for a new building to house it in.
Neither OpenCog nor AGI is at that level yet. Deep learning is like the invention of the Geiger counter: thousands of people went off and built themselves a Geiger counter (or downloaded one off the net) and started messing with it. And they’re having lots of fun with it; some are making great discoveries, while others are building better Geiger counters.
OpenCog-the-software is not anywhere near as fun as a Geiger counter. Is it even on the right track? Well, ten years ago, OpenCog-the-software was almost the only graph database on the planet. Now there are more than you can easily count, all enjoying huge commercial success. So maybe we did something right, when we designed it to be a graph database. Meanwhile, OpenCog has grown with new features, capabilities, and stuff that no other graph DB has, or has even contemplated. So I feel like we’re way ahead in the race. But…
But are these features/functions what are actually needed for AGI? Maybe it’s not a telescope/particle-accelerator at all, but just a crazy back-yard sculpture? How can you tell?
I would like to attract more people to the project, but we (OpenCog) have a way of under-publicizing it. We consistently fail to write up high-quality research, we fail to even write blog posts (making amends, here). We fail to tweet about it. We fail to keep compelling demos running. (Ten years ago, we had a decent question-answering chatbot that could have attracted attention. It … bit-rotted.)
There’s some excuse, some reason for that: every time you create an experiment, and run it, you will typically get some insightful results. While interesting, those results are not “true AGI”, not by a long shot. So what do you do with your experiment? Announce to the world, “Hey, we once again failed to create AGI”? Do you moth-ball the code? Do you even mention it to newcomers? Perhaps we need to keep a gallery of neat things that we’ve built… or at least, a collection of basic experiments with basic instruments that newcomers need to master.
The OpenCog AtomSpace is a typed, distributed, in-RAM graph knowledgebase. The meaning of all those buzzwords is that it is trying to fix many different, related problems that computers (and people) have when working with data. This blog-post discusses some basic issues of working with knowledge, and the various ways that people have tried to work around these issues over the decades. It will illuminate why graph databases are explosively popular these days, and will also show how that’s not the end of the story. I’m trying to start the next chapter already.
But first, let’s start at the very beginning. Everyone calls it “data” instead of “knowledge” because everyone is a bit embarrassed to admit that they don’t really know how to stick “knowledge” onto a computer. The 1970’s saw the invention and development of relational databases and their mathematical underpinning, the relational algebra. This formalized the idea of “is-a”, “has-a”, “is a part of”, “inherits from”, “is-similar-to” and whatever other relation you could possibly think of. Not a bad start for representing knowledge. The API for this relational knowledge eventually congealed into SQL.
A lot of people don’t like SQL. Those three letters make them run in the opposite direction. What’s wrong with SQL? Many things; here’s one: you have to put all your data into tables. You have to invent the table first, before you can use it. You have to plan ahead. Compare this to DataLog, the data-only subset of ProLog. Here, you could declare any fact, any statement, any relation at any time, without prior planning. Great, right? What’s wrong with that? Well, two things: a lack of indexes for quickly finding the data you want, and the lack of a uniform query language that runs in some predictable, non-surprising amount of time. Indexes are great: you just jump to where you want, instead of having to crawl all over your immense dataset.
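To make the index point concrete, here’s a toy sketch in Python (the facts and variable names are invented for illustration):

```python
# Free-form facts, DataLog-style: declare any relation at any time.
facts = [
    ("is-a", "cat", "mammal"),
    ("has-a", "cat", "tail"),
    ("is-a", "dog", "mammal"),
]

# Without an index, every lookup crawls the whole dataset: O(n).
mammals_slow = [s for (rel, s, o) in facts if rel == "is-a" and o == "mammal"]

# An index lets you jump straight to the data you want: O(1) lookup.
by_relation_object = {}
for rel, s, o in facts:
    by_relation_object.setdefault((rel, o), []).append(s)

mammals_fast = by_relation_object[("is-a", "mammal")]
assert mammals_slow == mammals_fast == ["cat", "dog"]
```

On three facts the difference is invisible; on a billion, it is the difference between an instant answer and a coffee break.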
Sometimes structure is nice. For example (a bad example?) with XML, you can also declare knowledge in a free-form fashion: put a tag here, a tag there, anywhere, you’re all good. Mark it up with your markup language. The people who want some structure in their life use XML schemas, which constrain some of this free-form, anything-goes attitude. The people who want to treat XML as a large distributed database have a query language to search XML from all over the web: XQuery, the XML query language. We should be in seventh heaven: world-wide searchable free-form data? What could possibly be wrong with that?
Well, one thing is that you don’t write XML Schemas in XML; you write them in this other weird language. You don’t write XQuery in XML; it’s got its own new, weird and different syntax and structure. It’s a lot like what’s wrong with SQL: an SQL query is not actually a row or a table. It’s so very different from being a “row in a table” that it’s hard to even imagine that. This is one thing that ProLog got right: a ProLog query to find data stored in ProLog is written in ProLog. There’s no clear distinction between the data itself, and the way you work with it. Yes, that’s a good thing. A very good thing, actually.
These various stumbling blocks have driven the explosive popularity of graph databases. You get to write free-form declarations of relationships, abandoning the strictures of tables. You get something a bit more modern than DataLog, which is a bit old and set in its ways: the syntax isn’t liberating. You get something that is infinitely more readable and writable than XML. And, with the well-designed graph databases (I’m thinking of you, grakn.ai) you get to avoid a lot of arcane markup and obtuse documentation. At least in the beginning. Graph databases are the second coming. We should be in heaven by now, right? What could possibly be wrong with graph databases?
Well, in most cases, graph queries (expressed in graph query languages) are not themselves actually graphs. They still sit “outside the system”, like SQL, but unlike ProLog. But when you represent your query as a graph itself, you can do neat things. You can go recursive. You can query for queries. You can find, edit and update queries with other queries. Create new ones on the fly. Even go backwards: given an “answer”, you can ask “what are the questions (the queries) that would return this answer?” This is actually a big deal in chatbot systems, where a human says something, and the chatbot has to search for all patterns (all templates) that match what was said. Amazingly, pretty much no general-purpose database can do this, even though it’s foundational for certain algorithms.
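Here’s a minimal Python sketch of the queries-as-data idea, including the backwards, chatbot-style direction: given an utterance, find the stored templates that would match it. (The patterns and the “?”-wildcard convention are invented for illustration; this is not the AtomSpace API.)

```python
# A query is just data: a pattern with variables ("?…"-style wildcards).
# Because queries are ordinary data, they can be stored and searched
# exactly like facts -- you can query for queries.
patterns = [
    ("greeting", ("hello", "?name")),
    ("weather",  ("how", "is", "the", "weather", "in", "?city")),
]

def matches(pattern, utterance):
    """True if the utterance fits the pattern, treating '?…' as a wildcard."""
    if len(pattern) != len(utterance):
        return False
    return all(p.startswith("?") or p == u for p, u in zip(pattern, utterance))

# Going backwards: given an "answer" (something the human said), find
# every stored query/template that would return it.
said = ("hello", "Alice")
hits = [label for label, pat in patterns if matches(pat, said)]
assert hits == ["greeting"]
```

The point is not the five lines of matching logic; it is that `patterns` lives in the same store as everything else, so the same machinery searches both.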
Working directly with queries is a big deal in rule engines, where each rule is effectively a query, and you want to chain them together. But first you have to find the rule. How do you do that, if you can’t query your queries? Of course you can hack it. But seriously? hack? And then there’s theorem-proving and inferencing, where the reasoning/inferencing has to pick and choose through a collection of rules to find which ones might be attachable next. And sometimes, you want to learn new rules: statistically, by counting, or randomly, by trial and error and cross-over (aka “genetic algorithms”), or perhaps by hill-climbing or gradient-descent. If you want to learn a new query, it’s best if you can store the query itself in your database.
There’s also a different problem. Graph databases don’t have a type system. It turns out that types are incredibly useful for structuring your data. They’re the foundation-stone of object-oriented programming. A Java class is a type. A Python class/object is a type. C++ classes are types. For a while, object-oriented databases were all the rage. This is in contrast to SQL, which has a poor type system with only a handful of types: A table. A row in a table. An entry in a row in a table. Which could be a string, or a number. That’s pretty much it. Compare that to every object you’ve ever known.
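A toy sketch of what a type system over graph atoms buys you, in Python. The type names mimic AtomSpace conventions, but this miniature is invented for illustration, not the real thing:

```python
# A miniature type hierarchy over graph atoms.
SUBTYPE_OF = {
    "Node": None,                 # root of the hierarchy
    "ConceptNode": "Node",
    "PredicateNode": "Node",
}

def is_a(atom_type, ancestor):
    """Walk up the hierarchy: is atom_type a kind of ancestor?"""
    while atom_type is not None:
        if atom_type == ancestor:
            return True
        atom_type = SUBTYPE_OF[atom_type]
    return False

class Atom:
    def __init__(self, atom_type, name):
        # Typed construction: unknown types are rejected up front,
        # much as a class system rejects nonsense objects.
        if atom_type not in SUBTYPE_OF:
            raise TypeError(f"unknown atom type: {atom_type}")
        self.type, self.name = atom_type, name

cat = Atom("ConceptNode", "cat")
assert is_a(cat.type, "Node")
```

Untyped graph stores would happily accept the nonsense; the type check turns a subtle data-corruption bug into an immediate, loud error.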
But there’s something way cooler about types that the Java&Python crowd has missed, but the functional-programming crowd (Scala and Haskell) has grokked: type constructors. The true geeks have gone even farther: proofs-as-programs, aka Curry-Howard correspondence. All the hip theorem provers (Agda and Coq) are dependent on types. What makes type constructors neat? Well, you can write programs to compute types, but unlike ordinary programs, they are guaranteed to terminate. This means that many basic tasks become simple, doable, predictable.
Like what? Well, reasoning, inference, learning. The sorts of things you want your AI/AGI system to do. It’s not an accident that theorem provers are fundamentally tangled with type theory – blecch, you might think, but reasoning and theorem proving are kind-of the same thing. Who needs all that when we’ve got deep-learning neural nets? OK, off-topic – my earlier blog post dealt with how neural nets and symbolic reasoning are a lot more similar than they are different. The similarities are a bit hidden, but they are there, and clear, once you see them. Anyway, off-topic, but when deep-learning hits a wall, and it will, it will be because it lacks any notion of types. Not that it matters just right now: neural nets already lack a formal relational database structure. You can’t ask a neural net “How many sales has Joe Smith made this month?”
Here’s my sales pitch: you want a graph database with a sophisticated type system built into it. Maybe you don’t know this yet. But you do. You will. You’ll have trouble doing anything reasonable with your knowledge (like reasoning, inferencing and learning) if you don’t. This is why the OpenCog AtomSpace is a graph database, with types.
There’s one more thing that the AtomSpace does, that no other graph database does (well, there are several things that no one else does, but this is the only one I want to talk about now). They’re called “Values”. Turns out there’s a cost to storing a graph: you have to know which edge is attached to what vertex, if you are going to walk the graph. You have to keep an index to track edges (I can hear the boss now: “You’re telling me we have five guys named Joe Smith in Sales? Which one did we promote?”) Bummer if your database can’t keep track of who’s who. But sometimes you don’t need to stick every last piece of trivia in an index, to place it where it’s instantly searchable. Sometimes, a key-value store is enough. This is kind-of the noSQL ethos: Got the key? Get the value; don’t need a query to do that.
Every vertex, every edge in the AtomSpace has a key-value database built into it. It’s small, it’s lightweight. It enables fast-changing data without the overhead of indexing it. It takes up less RAM, literally. We’re experimenting with using it for streaming data: video streams, audio streams. Hooking them up the way that TensorFlow might hook up data with Keras. It’s pretty neat. You can stick truth values, attention values; anything changing quickly over time, into Values. Any floating point number. And if you have a number that almost never changes (like Joe Smith’s salary), and you really do want to query it (so you can pay him), then just store that number in the graph database (and not in the per-edge key-value mini-database). You have a choice. Speed when you need it, searchability where you need it.
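A rough Python sketch of that split between indexed graph data and per-atom Values (invented names and data structures, not the actual AtomSpace API):

```python
class Atom:
    """A vertex or edge, with a built-in per-atom key-value mini-database."""
    def __init__(self, name):
        self.name = name
        self.values = {}          # Values: fast-changing, no indexing overhead

graph_index = {}                  # the graph proper: indexed and searchable

joe = Atom("Joe Smith")

# Slowly-changing, queryable data belongs in the indexed graph:
graph_index[("salary", joe.name)] = 50000

# Fast-changing data (truth values, attention values, streams) goes
# into the atom's own key-value store -- no query needed, just the key:
joe.values["attention"] = 0.87
joe.values["confidence"] = [0.9, 0.4]

assert graph_index[("salary", "Joe Smith")] == 50000
assert joe.values["attention"] == 0.87
```

Updating `joe.values` a thousand times a second touches nothing but that one atom; updating the indexed salary pays the indexing cost, but in exchange you can search for it.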
Well, that’s it for now. The OpenCog AtomSpace is an in-RAM distributed typed graph knowledgebase. I guess I forgot to talk about “distributed”. It is; run the demo. You can share data across many different machines. It’s in-RAM, which mostly means that you can work with things at more-or-less the speed of accessing RAM. Unlike SQL, it has a rich type system. Unlike SQL or pretty much any graph database, (but like ProLog) the queries are themselves graphs. So anyway, that’s what we tried to do with the AtomSpace. Everything I talked about above works today. Worked last year. Most of it worked just fine 5 years ago. It is still a bit rough around the edges. It’s got some usability issues. There’s no pretty website (one can dream). No easy documentation. (The documentation is accurate. There’s lots of it. It’s almost unreadable. I know; I wrote it.) But we’re working on it. Come help us out.
Deep learning and neural nets are all the rage, today, and have displaced symbolic AI systems in most applications. It’s commonly believed that the two approaches have nothing to do with each other; that they’re just completely different, and that’s that. But this is false: there are some profound similarities; they are not only variants of one-another, but, in a deep way, they have commonalities that render them essentially identical. This blog post attempts to explain how. It introduces a much longer and rather mathematically dense writeup articulating these ideas more precisely.
The clearest starting point for seeing this seems to be natural language, where neural net methods have made great strides, but have not surpassed traditional (symbolic) linguistic theory. However, once this similarity is understood, it can be ported over to other domains, including deep-learning strongholds such as vision. To keep the discussion anchored, and to avoid confusing abstractions, what follows will focus entirely on linguistics; it is up to you, the reader, to imagine other, more general settings.
The starting point for probabilistic approaches (including deep learning) is the Bayesian network: a probability P(x1,x2,…,xn) of observing n events. For language, the xk are taken to be words, and n is the length of the sentence, so that P(x1,x2,…,xn) is the “probability” of observing the sequence x1,x2,…,xn of words. The technical problem with this viewpoint is the explosively large space: if one limits oneself to a vocabulary of 10 thousand words (and many people don’t) and sentences of 20 words or less, that’s (10^4)^20 = 10^80 ≈ 2^270 probabilities, an absurdly large number even for Jupiter-sized computers. If n is the length of this blog post, then it really seems impossible. The key, of course, is to realize that almost all of these probabilities are effectively zero; the goal of machine learning is to find a format, a representation for grammar (and meaning) that effortlessly avoids that vast ocean of zero entries.
Traditional linguistics has done exactly that: when one has a theory of syntax, one has a formulation that clearly states which sentences should be considered to be grammatically valid, and which ones not. The trick is to provide a lexis (a lookup table), and some fairly small number of rules that define how words can be combined; i.e. arranged to the left and right of one-another. You look up a word in the lexis (the dictionary) to find a listing of what other words are allowed to surround it. Try every possible combination of rules until you find one that works, where all of the words can hook up to one-another. For the purposes here, the easiest and the best way to visualize this is with Link Grammar, a specific kind of dependency grammar. All theories of grammar will constrain allowed syntax; but Link Grammar is useful for this blog post because it explicitly identifies words to the left, and words to the right. Each lexical entry is like a template, a fill-in-the-blanks form, telling you exactly what other words are allowed to the left, and to the right, of the given word. This left-right sequence makes it directly comparable to what neural-net approaches, such as Word2Vec or SkipGram do.
What does Word2Vec do? Clearly, the 2^270 probabilities in a twenty-word sentence are overwhelming; one obvious simplification is to look only at N-grams: that is, to only look at the closest neighboring words, working in a window that is N words wide. For N=5, this gives (10^4)^5 = 10^20 ≈ 2^65, which is still huge, but is bearable. When scanning actual text, almost all of these combinations won’t be observed; this is just an upper bound. In practice, a table of 5-grams fits in latter-day computer RAM. The statistical model is to map each N-gram to a vector, use that vector to define a Boltzmann distribution (P=exp(v⋅w)/Z), and then use gradient ascent (hill-climbing) to adjust the vector coefficients, so as to maximize the probability P.
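A toy illustration in Python of both halves of that: slicing a sentence into N=5 windows, and a Boltzmann distribution P = exp(v⋅w)/Z over a couple of made-up vectors (the numbers are invented; a real model would learn them):

```python
import math

sentence = "the quick brown fox jumps over the lazy dog".split()

# Collect the N-gram windows, N = 5: the only contexts the model ever sees.
N = 5
windows = [tuple(sentence[i:i + N]) for i in range(len(sentence) - N + 1)]
assert len(windows) == 5
assert windows[0] == ("the", "quick", "brown", "fox", "jumps")

# A toy Boltzmann distribution over two candidate vectors:
# P = exp(v.w) / Z, with Z normalizing over the candidates.
v = [0.2, 0.5, 0.1]
candidates = {"w1": [0.3, 0.4, 0.0], "w2": [0.1, 0.1, 0.9]}
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
Z = sum(math.exp(dot(v, w)) for w in candidates.values())
P = {k: math.exp(dot(v, w)) / Z for k, w in candidates.items()}
assert abs(sum(P.values()) - 1.0) < 1e-12   # a proper probability distribution
```

Gradient ascent then nudges the vector coefficients so that the P of observed windows goes up; that part is omitted here.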
How are Word2Vec and Link Grammar similar? Well, I hope that the above description of Link Grammar already planted the idea that each lexical entry is a lot like an N-gram. Each lexical entry tells you which words can appear to the right, and which words to the left of a given word. It’s a bit less constrained than an N-gram: there’s no particular distance limitation on the dependency links. It can also skip over words: a lexical entry is more like a skip-gram. Although there is no distance limitation, lexical entries still have a small-N-like behavior, not in window size, but in attachment complexity. Determiners have a valency of 1; nouns a valency of 2 or 3 (a link to a verb, to an adjective, to a determiner); verbs a valency of 2, 3 or 4 (subject, object, etc.). So lexical entries are like skip-grams: the window size is effectively unbounded, but the size of the context remains small (N is small).
Are there other similarities? Yes, but first, a detour.
What happened to the 2^270 probabilities? A symbolic theory of grammar, such as Link Grammar, is saying that nearly all of these are zero; the only ones that are not zero are the ones that obey the rules of the grammar. Consider the verb “throw” (“Kevin threw the ball”). A symbolic theory of grammar effectively states that the only non-vanishing probabilities are those of the form P(x1,x2,…,xk=noun,…,xm=throw,…,xp=object,…,xn) and that all the others must be zero. Now, since nouns make up maybe 1/2 of all words (the subject and object are both nouns), this constraint eliminates 10^4 × 10^4/(2×2) ≈ 2^24 possibilities (shrinking 2^270 to 2^(270−24) = 2^246). Nothing to sneeze at, given that it’s just one fairly simple rule. But this is only one rule: there are others, which say things like “singular count nouns must be preceded by a determiner” (so, “the ball”). These constraints are multiplicative: if the determiner is missing, then the probability is exactly zero. There are only a handful of determiners, so another factor of 10^4 ≈ 2^13 is vaporized. And so on. A relatively small lexis quickly collapses the set of possibilities. Can we make a back-of-the-envelope estimate? A noun-constraint eliminates half the words (leaving 5K of 10K possibilities). A determiner constraint removes all but 10 possibilities. Many grammatical classes have only a few hundred members in them (“throw” is like “hit”, but is not like “smile”). So, realistically, each xk will have from 10 to 1000 possibilities, maybe about 100 possibilities “on average” (averages are dicey to talk about, as these are Zipfian distributions). Thus there are only about 100^20 = 10^40 ≈ 2^130 grammatically valid sentences that are 20 words long, and these can be encoded fairly accurately with a few thousand lexical entries.
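Redoing that arithmetic mechanically (the exact binary exponents land a little below the rounded powers of two used above, but the orders of magnitude hold):

```python
from math import log2

vocab = 10**4        # ten-thousand-word vocabulary
n = 20               # sentence length, in words

# The raw, unconstrained space of 20-word sequences:
raw = vocab**n                       # 10^80
assert round(log2(raw)) == 266       # i.e. roughly the 2^270 quoted above

# After grammatical constraints: ~100 plausible choices per word slot.
valid = 100**n                       # 10^40
assert round(log2(valid)) == 133     # roughly the 2^130 quoted above

# The grammar wipes out all but one sequence in ~10^40:
assert raw // valid == 10**40
```

The point survives any rounding: a few thousand lexical entries carve a space of ~10^80 sequences down to ~10^40 grammatical ones.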
In essence, a symbolic theory of grammar, and more specifically, dependency grammars, accomplish the holy grail of Bayesian networks: factorizing the Bayesian network. The lexical rules state that there is a node in the network, for example, P(x1,x2,…,xk=noun,…,xm=throw,…,xp=object,…,xn), and that the entire sentence is a product of such nodes: the probability of “Kevin threw the ball” is the product P(x1=Kevin, x2=verb, …, xn) × P(x1,x2, …, xk=noun, …, xm=throw, …, xp=object, …, xn) × P(x1,x2, …, xk=the, …, xm=noun, …, xn) × P(x1,x2, …, xn=ball). Stitch them all together, and you’ve got a sentence, and its probability. (In Link Grammar, -log P is called the “cost”, and costs are additive, for parse ordering.) To be explicit: lexical entries are exactly the same thing as the factors of a factorized Bayesian network. What’s more, figuring out which of these factors come into play in analyzing a specific sentence is called “parsing”. One picks through the lookup table of possible network factors, and wires them up, so that there are no dangling endpoints. Lexical entries look like fragments of a graph: a vertex, and some dangling edges hanging from the vertex. Pick out the right vertexes (one per word), wire them together so that there are no dangling unconnected edges, and voilà! One has a graph: the graph is the Bayesian network. Linguists use a different name: they call it a dependency parse.
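The additive-cost bookkeeping is easy to check in Python (toy probabilities, one factor per word; real costs come out of the dictionaries):

```python
import math

# Each lexical entry is one factor of the factorized network.
# Link Grammar stores cost = -log P per entry, so costs add up
# exactly where probabilities multiply.
factor_probs = [0.5, 0.8, 0.9, 0.7]          # one factor per word of the parse

sentence_prob = math.prod(factor_probs)      # product of the network factors
costs = [-math.log(p) for p in factor_probs]
total_cost = sum(costs)                      # additive, used for parse ordering

# The two views agree: exp(-total cost) recovers the sentence probability.
assert abs(math.exp(-total_cost) - sentence_prob) < 1e-12
```

Sorting parses by total cost is therefore the same as sorting them by probability, which is why additive costs suffice for parse ranking.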
The Word2Vec/SkipGram model also factorizes, in much the same way! First, note that the above parse can be written as a product of factors of the form P(word|context), the product running over all of the words in the sentence. For a dependency grammar, the context expresses the dependency relations. The Word2Vec factorization is identical; the context is simpler. In its most naive form, it’s just a bag of the N nearest words, ignoring the word-order. But the word-order is ignored for practical reasons, not theoretical ones: it reduces the size of the computational problem; it speeds convergence. Smaller values of N mean that long-distance dependencies are hard to discover; the skipgram model partly overcomes this by keeping the bag small, while expanding the size of the window. If one uses the skip-gram model, with a large window size, and also keeps track of the word-order, and restricts to low valencies, then one very nearly has a dependency grammar, in the style of Link Grammar. The only difference is that such a model does not force any explicit dependency constraints; rather, they are implicit, as the words must appear in the context. Compared to a normal dependency grammar, this might allow some words to be accidentally double-linked, when they should not have been. Dependency grammar constraints are sharper than merely asking for the correct context. To summarize: the notions of context are different, but there’s a clear path from one to the other, with several interesting midway points.
The next obvious difference between Link Grammar and Word2Vec/SkipGram is the mechanism for obtaining the probabilities. But this is naive: in fact, they are much more similar than it first appears. In both systems, the starting point is (conceptually) a matrix, of dimension WxK, with W the size of the vocabulary, and K the size of the context. In general, K is much much larger than W; for Word2Vec, K could get as large as W^N for a window size of N, although, in practice, only a tiny fraction of that is observed. Both Link Grammar and Word2Vec perform approximate matrix factorization on this matrix. The way the approximations are done is different. In the case of Word2Vec, one picks some much smaller dimension M, typically around M=200, or maybe twice as large; this is the number of “neurons” in the middle layer. Then, all of W is projected down to this M-dimensional space (with a linear projection matrix). Separately, the K-dimensional context is also projected down. Given a word, let its projection be the (M-dimensional) vector u. Given a context, let its projection be the vector v. The probability of a word-in-context is given by a Boltzmann distribution, as exp(u⋅v)/Z where u⋅v is the dot product and Z is a scale factor (called the “partition function”). The basis elements in this M-dimensional space have no specific meaning; the grand-total vector space is rotationally invariant (only the dot product matters, and dot products are scalars).
The primary task for Word2Vec/SkipGram is to discover the two projection matrices. This can be done by gradient ascent (hillclimbing), looking to maximize the probability. The primary output of Word2Vec are the two projection matrices: one that is WxM-dimensional, the other that is MxK-dimensional. In general, neither of these matrices are sparse (that is, most entries are non-zero).
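A toy version of this setup in Python, with tiny dimensions and random (untrained) projection matrices; a real model would learn the two matrices by gradient ascent:

```python
import math, random

random.seed(0)
W, K, M = 50, 400, 8     # vocab, context, hidden sizes (toy; real M ~ 200)

# The two dense projection matrices that training must discover:
word_proj = [[random.gauss(0, 0.1) for _ in range(M)] for _ in range(W)]  # W x M
ctx_proj  = [[random.gauss(0, 0.1) for _ in range(M)] for _ in range(K)]  # K x M

dot = lambda a, b: sum(x * y for x, y in zip(a, b))

def prob(word_id, ctx_id):
    """P(word|context) = exp(u.v)/Z, with Z summing over the vocabulary."""
    v = ctx_proj[ctx_id]                                   # context projection
    scores = [math.exp(dot(word_proj[w], v)) for w in range(W)]
    return scores[word_id] / sum(scores)                   # Boltzmann / softmax

# Sanity check: for a fixed context, the probabilities over words sum to 1.
p = [prob(w, 3) for w in range(W)]
assert abs(sum(p) - 1.0) < 1e-9
```

Note that only dot products of projections ever appear, which is the rotational invariance mentioned above: rotating both matrices by the same rotation changes nothing.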
Link Grammar also performs a dimensional reduction, but not quite exactly by using projection matrices. Rather, a word can be assigned to several different word-categories (there are de facto about 2300 of these in the hand-built English dictionaries). Associated with each category is a list of dozens to thousands of “disjuncts” (dependency-grammar dependencies), which play the role analogous to “context”. However, there are far, far fewer disjuncts than there are contexts. This is because every (multi-word) context is associated with a handful of disjuncts, in such a way that each disjunct stands for hundreds to as many as millions of different contexts. Effectively, the lexis of Link Grammar is a sparse CxD-dimensional matrix, with C grammatical categories, and D disjuncts, and most entries in this CxD dimensional matrix being zero. (The upper bound on D is L^V, where L is the number of link types, and V is the maximum valency – about 5 or 6. In practice, D is in the tens of thousands.) The act of parsing selects a single entry from this matrix for each word in the sentence. The probability associated to that word is exp(-c) where c is the “cost”, the numerical value stored at this matrix entry.
Thus, both systems perform a rather sharp dimensional reduction, to obtain a much-lower dimensional intermediate form. Word2Vec is explicitly linear; Link Grammar is not exactly. However (and this is important, but very abstract) Link Grammar can be described by a (non-symmetric) monoidal category. This category is similar to that of the so-called “pregroup grammar”, and is described in a number of places (some predating both Link Grammar and pregroup grammar). The curious thing is that linear algebra is also described by a monoidal category. One might say that this “explains” why Word2Vec works well: it is using the same underlying structural framework (monoidal categories) as traditional symbolic linguistics. The precise details are too complex to sketch here, and must remain cryptic, for now, although they are open to those versed in category theory. The curious reader is encouraged to explore category-theoretic approaches to grammar, safe in the understanding that they provide a foundational understanding, no matter which detailed theory one works in. At the same time, the category-theoretic approach suggests how Word2Vec (or any other neural-net or vector-based approach to grammar) can be improved upon: it shows how syntax can be restored, with the result still looking like a funny/unusual kind of sparse neural-net. These are not conflicting approaches; they have far far more in common than meets the eye.
A few words about word-senses and semantics are in order. It has been generally observed that Word2Vec seems to encode “semantics” in some opaque way, in that it can distinguish different word-senses, based on context. The same is true for Link Grammar: when a word is used in a specific context, the result of parsing selects a single disjunct. That disjunct can be thought of as a hyper-fine grammatical category; but these are strongly correlated with meaning. Synonyms can be discovered in similar ways in both systems: if two different words share a lot of the same disjuncts, they are effectively synonymous, and can be used interchangeably in sentences.
Similarly, given two different words in Word2Vec/SkipGram, if they both project down to approximately the same vector in the intermediate layer, they can be considered to be synonymous. This illustrates yet another way that Link Grammar and Word2Vec/SkipGram are similar: the list of all possible disjuncts associated with a word is also a vector, and, if two words have almost co-linear disjunct vectors, they are effectively synonymous. That is, disjunct-vectors behave almost exactly like neuron intermediate-layer vectors. They encode similar kinds of information into a vector format.
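A small Python sketch of synonym-discovery via disjunct vectors. The disjunct names and counts are invented for illustration (real counts would come from parsing a corpus):

```python
import math

# Toy disjunct-count vectors: how often each word parsed with each disjunct.
# Disjunct labels are made-up, Link-Grammar-flavored placeholders.
disjuncts = ["S-O+", "D-S+", "A+"]
counts = {
    "throw": [120, 3, 0],
    "hit":   [110, 5, 0],     # grammatically behaves much like "throw"
    "smile": [2, 90, 1],      # behaves quite differently
}

def cosine(a, b):
    """Cosine similarity: 1.0 for co-linear vectors, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Nearly co-linear disjunct vectors => effectively synonymous usage:
assert cosine(counts["throw"], counts["hit"]) > 0.99
assert cosine(counts["throw"], counts["smile"]) < 0.5
```

This is exactly the computation one does with neural-net embedding vectors; the only difference is that each coordinate here has a readable, syntactic meaning.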
This is also where we have the largest, and the most important difference between neural net approaches, and Link Grammar. In the neural net approach, the intermediate neuron layer is a black box, completely opaque and un-analyzable, just some meaningless collection of floating-point numbers. In Link Grammar, the disjunct vector is clear, overt, and understandable: you can see exactly what it is encoding, because each disjunct tells you exactly the syntactic relationship between a word, and its neighbors. This is the great power of symbolic approaches to natural-language: they are human-auditable, human-understandable in a way that neural nets are not. (Currently, at least; I think that what this essay describes is an effective sketch of a technique for prying open the lid of the black box of neural nets. But that’s for a different day.)
Many thanks and appreciation to Hanson Robotics for providing the time to think and write about such things. (Much much earlier and kludgier versions of these ideas have been seen in public, in older robot demos.)
A cherry picked quote from a review of a new book:
“In his new book, Pearl, now 81, elaborates a vision for how truly intelligent machines would think. The key, he argues, is to replace reasoning by association with causal reasoning. Instead of the mere ability to correlate fever and malaria, machines need the capacity to reason that malaria causes fever. Once this kind of causal framework is in place, it becomes possible for machines to ask counterfactual questions—to inquire how the causal relationships would change given some kind of intervention—which Pearl views as the cornerstone of scientific thought.” 
The exploration of causal reasoning, and how it works, dates back to Medieval times, and specifically to the work of the Scholastics, on whose theories our modern legal system is built.
Recall a classic example: a dead body is discovered, and a man with a bloody sword stands above it. Did the man commit the crime, or was he simply the first on the scene, and picked up the sword? How can you know? What is the “probability” of guilt?
The Scholastics struggled mightily with the concept of “probability”. Eventually, Blaise Pascal and many others (Huygens, etc.) developed the modern mathematical theory that explains dice games, and how to place bets on them. This mathematical theory is called “probability theory”, and it works. However, in a courtroom trial for murder, it is not this theory that is applied to determine the “probability” of innocence or guilt.
What actually happens is that the prosecution, and the defense, assemble two different “cases”, two competing “theories”, two different networks of “facts”, of “proofs”, one network showing innocence, the other showing guilt. The jury is asked to select the one, or the other network, as the true model of what actually happened.
The networks consist of logically inter-connected “facts”. Ideally, those facts are related in a self-consistent fashion, without contradicting one another. Ideally, the network of facts connects to the broader shared network that we call “reality”. Ideally, the various “facts” have been demonstrated to be “true”, by eye-witness testimony, or by forensic reasoning (which itself is a huge network of “facts”, e.g. that it takes x hours for blood to clot, y hours for rigor mortis to set in). Ideally, the network connections themselves are “logical”, rather than being various forms of faulty argumentation (appeals to emotion, appeals to authority, etc.) You really have to put yourself into a Medieval state of mind to grok what is happening here: men pacing in long, fur-trimmed coats, presenting arguments, writing hundred-page-long commentaries-on-commentaries-on-commentaries. Picking apart statements that seem superficially coherent and believable, but can be shown to be built from faulty reasoning.
What is being done, here? Two different, competing models of the world are being built: in one model, the accused is guilty. In the other model, the accused is innocent. Clearly, both models cannot be right. Only one can be believed, and not the other. Is there a basis on which to doubt one of the models? Is that doubt “reasonable”? If the accused is to be harmed, viz. imprisoned, and it’s been shown that the accused is guilty “beyond a reasonable doubt”, then one must “believe”, accept as “reality”, that model, that version of the network of facts. The other proof is unconvincing; it must be wrong.
I think that it is possible to build AI/AGI systems, within the next 5-10 years, that construct multiple, credible, competing networks of “facts”, tied together with various kinds of evidence and inference and deduction, relation and causation. Having constructed such competing models, such alternative world-views of deeper “reality”, these AI/AGI systems will be able to disentangle the nature of reality in a rational manner. And, as I hope to have demonstrated, the mode of reasoning that these systems will employ will be distinctly Medieval.
(Image: University students being disciplined.)
P.S. The current tumult in social-media, society and politics is very much one of trying to build different, competing “models of reality”, of explanations for the world-as-it-is. In the roughest of outlines, this is (in the USA) the red-state vs. blue-state political divides, the “liberal” vs. “conservative” arguments. Looking more carefully, one can see differences of opinion, of world-view on a vast number of topics. Every individual human appears to hold a puzzle-piece, a little micro-vision of what is (social) reality, a quasi-coherent tangle of networked facts that stick together, for them, and that they try to integrate into the rest of society, via identity politics, via social-media posts, via protest marches.
The creation and integration of (competing) models of reality is no longer just a courtroom activity; it is a modern-day social-media and political obsession. It is possible today, unlike before, only because of the high brain-to-brain (person-to-person) data bandwidth that the Internet and social media now provide. One can encounter more competing theories of reality than ever before, and one can investigate them to a greater level of detail than before.
If and when we build AGI systems capable of simultaneously manipulating multiple competing models of the world, I think we can take a number of lessons from social science and psychology as to how these networks might behave. There is currently tremendous concern about propaganda and brain-washing (beliefs in quasi-coherent networks of “facts” that are nonetheless disconnected from mainstream “reality”). There is tremendous concern about the veracity of mainstream media, and various well-documented pathologies thereof: viz. the need to be profitable forces mainstream media to propagate outrageous but bogus and unimportant news. The equivalent AGI risk is that the sensory-input system floods the reasoning system with bogus information, and that there is no countervailing mechanism to adjust for it. Viz.: we can’t just unplug journalism and journalists from the capitalist system; nor is it clear that doing so would increase the quality of the broadcast news.
Some of the issues facing society are because human brains are not sufficiently integrated, in the sense of “integrated information”. Any individual human mind can make sense of only one small part of the world; we do not have the short-term memory, the long-term memory, or the reasoning capacity to take on more. This is not a limitation we can expect in AGI. However, instead of having individual humans each representing the pro and con of a given issue, it’s reasonable to expect that the AGI will simultaneously develop multiple competing theories, in an attempt to find the better, stronger one – the debate does not stop; it only gets bigger and more abstract.
Another on-line concern is that much of on-line political posting and argumentation is emotionally driven, anchored in gut-sense arguments, which could certainly be found to be full of logical fallacies, if they were to be picked apart. “Humans are irrational”, it is said. But are they really? In Bayesian inference, one averages together, blurs together, vast tracts of “information”, and reduces it to a single number, a probability. Why? Because all of those inputs look like “noise”, and any given, specific Bayesian model cannot discriminate between all the different things going on. Thus, average it together, boil it all down, despite the fact that this is sure to erase important distinctions (e.g. “all cats have fur”, except, of course, when they don’t.) This kind of categorical, lumped-together, based-on-prior-experience, “common sense” kind of thinking sure seems to be exactly what we accuse our debate opponents of doing: either they’re “ignoring the details”, or failing to “see the big picture”. I don’t see how this is avoidable in the reasoning of Bayesian networks, or any other kind of network reasoning. Sooner or later, the boundaries of a fact network terminate in irrelevant facts, or details that are too small to consider. Those areas are necessarily fuzzed over, averaged, and ignored: gut-intuition, common-sense will be applied to them, and this suffers from all of the well-known pitfalls of gut-sense reasoning. AGI might be smarter than us; and yet, it might suffer from a very similar set of logical deficiencies and irrational behaviors.
The future will look very much like today. But different.
 Meaningness – Thinking, feeling, and acting—about problems of meaning and meaninglessness; self and society; ethics, purpose, and value. (A work in progress, consisting of a hypertext book and a metablog that comments on it.)
This is the introductory part of a blog series about my work on inference control learning for OpenCog. It introduces the problem, an initial design for a solution, and some early results. More interesting results will come in subsequent parts.
Inference chainers, automatic theorem provers and the like are programs that, given a theory, attempt to build a proof of a given theorem. They are generally inefficient. A proof, or inference, is in essence a program, as the Curry–Howard correspondence says. In OpenCog this is very apparent, as an inference is an actual atomese program. A function call in this program is a formula associated with an inference rule; the conclusion is the output of the call, and the premises are its inputs. See for instance this double-deduction inference tree
a notation borrowed from classical rule-of-inference notation (see https://en.wikipedia.org/wiki/Rule_of_inference), except that the premises are aligned horizontally, and the rule formula, here f, overlays the line separating the premises from the conclusion.
The cryptic numbers indicate the hash values of the atoms involved. The hash value at the bottom of the tree is the target’s, the hash values at the leaves are the premises’, and the remaining one is an intermediary target. bc-deduction-formula is the name of the formula associated with the deduction rule.
This inference tree would correspond to the atomese program
The special feature here is that the pattern matcher (the query corresponds to the outer BindLink) is used to fetch the relevant axioms from the atomspace; these are the most afferent inputs of that program.
The question we attempt to address here is: how to efficiently build inferences, specifically back-inferences.
A back-inference, an inference built backward, is done by providing a target, or theorem, and growing an inference tree from the target toward the axioms, so that, once run, the inference will evaluate the truth value of that target. The problem is that such growth, if not carefully controlled, can be very inefficient.
The idea is to learn from past inferences. This is not a new idea, but it is still fairly under-explored. I give some references at the end.
Let me sketch how we intend to do that in OpenCog:
Run the backward chainer in a greedy manner over a collection of problems
Store the traces of the backward chainer in a history atomspace
Mine the history atomspace to discover regularities
Turn these regularities into control rules for the backward chainer
Repeat 1, passing these control rules to the backward chainer on the same or a related collection of problems, and see how it improves.
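The five-step loop above can be sketched as follows. This is a toy Python sketch, not the actual OpenCog implementation: the helper functions (backward_chain, mine_patterns, to_control_rules) are hypothetical stand-ins for the real backward chainer, pattern miner and PLN components.

```python
# Trivial stand-ins so the loop runs end-to-end; these names are
# hypothetical placeholders for OpenCog components.
def backward_chain(target, control_rules):
    # Pretend control rules help: with rules available, everything is solved.
    solved = bool(control_rules) or target % 2 == 0
    return solved, [(target, solved)]

def mine_patterns(history):
    # "Discover regularities": keep the traces that led to a solution.
    return [h for h in history if h[1]]

def to_control_rules(patterns):
    return patterns

def icl_loop(problems, iterations=2):
    """Sketch of the five-step inference-control-learning loop."""
    control_rules = []      # empty control atomspace on the first pass
    solved_counts = []
    for _ in range(iterations):
        history = []        # step 2: the history atomspace
        solved = 0
        for target in problems:
            ok, trace = backward_chain(target, control_rules)  # step 1
            history.extend(trace)
            solved += ok
        solved_counts.append(solved)
        patterns = mine_patterns(history)           # step 3
        control_rules = to_control_rules(patterns)  # step 4
    return solved_counts                            # step 5: compare passes

print(icl_loop(list(range(10))))  # [5, 10]: the second pass solves more
```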
The backward chainer is a rather elementary algorithm. Given a target, pick a rule such that its conclusion unifies with the target (that is, a rule that could possibly produce that target), and create an initial inference tree. The leaves of that tree are the premises; the root is the target. Treat the premises as new targets and re-iterate the process. Grow the tree (or rather the trees, as we keep the former ones around) until you get something that, if evaluated, produces the target. For that, of course, the premises need to be present, as axioms, in the atomspace.
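A minimal sketch of this backward growth, over toy string propositions rather than atoms (real OpenCog targets are atoms and rule selection goes through unification; the representation here is an assumption for illustration):

```python
def prove(target, axioms, rules, depth=5):
    """Minimal backward-chaining sketch. A rule is (conclusion, [premises]);
    the target is proved if it is an axiom, or if some rule concludes it
    and all of that rule's premises can themselves be proved."""
    if target in axioms:          # premise present as an axiom: done
        return True
    if depth == 0:                # crude guard against unbounded growth
        return False
    # Try every rule whose conclusion matches the target and treat its
    # premises as new targets; this choice is the combinatorial hard part.
    for conclusion, premises in rules:
        if conclusion == target:
            if all(prove(p, axioms, rules, depth - 1) for p in premises):
                return True
    return False

axioms = {"a<b", "b<c"}
rules = [("a<c", ["a<b", "b<c"])]    # one toy transitivity instance
print(prove("a<c", axioms, rules))   # True
print(prove("a<d", axioms, rules))   # False
```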
The problem, though, comes from the combinatorial explosion of having to choose the inference tree, the premise to expand, and the rule. The complexity of the algorithm grows exponentially with respect to the size of the inference tree.
So here’s the crux of the idea of how to make it practical:
Defer the hard decisions to a cognitive process.
To repeat myself, the hard decisions in this program, the backward chainer, are
pick an inference tree
given an inference tree, pick a premise
given an inference tree and a premise, pick a rule to expand
These are very hard decisions that only an intelligent entity, ultimately an AGI, can make. We don’t have an AGI, but we are building one; thus, if we can plug in our proto-AGI to solve these sub-problems, it becomes a recursive, divide-and-conquer type of construct. Turtles all the way down, if you like, that hopefully ends. The idea of SampleLink, introduced by Ben, was intended to make this recursive construct explicit. Here we do not use SampleLink, as it hasn’t been implemented yet, but that doesn’t stop us from using OpenCog to learn control rules and utilize them in the backward chainer. What it means is that in the backward chainer code, when such a hard decision occurs, the code that would otherwise be a hard-wired heuristic now looks like this:
Try to find control rules in the atomspace related to that decision
If you find some, combine them to make a decision
Otherwise, assume an uninformative control rule, or possibly a default one encapsulating some simple heuristic.
That is what is meant by a cognitive process: a process that uses the available knowledge to make decisions.
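The three-step decision procedure above might be sketched like this (the rule format and the max-strength combination are simplifying assumptions; the real implementation aggregates truth values rather than taking a maximum):

```python
def decide(options, control_rules, default=None):
    """Sketch of a 'cognitive' decision point: consult applicable control
    rules if any are found, otherwise fall back to an uninformative or
    simple default heuristic."""
    applicable = [r for r in control_rules if r["decision"] in options]
    if applicable:
        # Combine the applicable rules; here we simply follow the
        # strongest rule's recommendation.
        return max(applicable, key=lambda r: r["strength"])["decision"]
    # Uninformative fallback: the caller's default, or the first option.
    return default if default is not None else options[0]

control = [{"decision": "deduction", "strength": 0.9},
           {"decision": "instantiation", "strength": 0.2}]
print(decide(["deduction", "instantiation"], control))  # deduction
print(decide(["modus-ponens"], []))                     # fallback: modus-ponens
```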
More specifically, these control rules are Cognitive Schematics: pieces of knowledge that tell our cognitive process how actions relate to goals (or subgoals), conditioned by contexts
Context & Action -> Goal
They are usually represented by an Implication (or a PredictiveImplication, if time matters) in OpenCog. The implication’s TV represents a conditional probability, or meta-probability.
Here, these cognitive schematics will be about rules, premises, proofs, etc. For instance, a cognitive schematic about the third hard decision
3. given an inference tree and a premise, pick a rule to expand
could be formulated as follows
ImplicationScope <TV>
  VariableList A L R B T
  And
    Preproof A T
    <some pattern>
    Expand (List A L R) B
  Preproof B T
meaning that if inference tree A is a preproof of theorem T (a preproof is an inference tree that may not prove T yet, but that may eventually prove T if adequately expanded), and, conditioned by some pattern involving A, L and R, some action expands inference tree A into inference tree B, via premise L, using rule R, then the produced inference tree B will also be a preproof of T, with truth value TV. The cognitive schematic parts are
Context = (Preproof A T) And <some pattern>
Action = Expand (List A L R) B
Goal = Preproof B T
Once we have such cognitive schematics
C1 & A -> G
...
Cn & A -> G
we need to combine them. We could consider a rule formed from the conjunction or disjunction of all contexts, or any other way of aggregating them. The dilemma here is that:
The smaller (more restrictive) the context, the lower the confidence of the rule.
The larger (more abstract) the context, the lower the discriminative power of the rule.
So, to rephrase: if a context is too small, its associated rule might overfit; but if the context is too large, its rule will not be informative enough, and its conditional probability will tend to the marginal probability of the goal.
To address this dilemma we have chosen to rely on Universal Operator Induction, albeit with some modification to account for the fact that a control rule is only a partial operator (see the references at the end for more details). Once this aggregation is done, we can assign a TV to each action, i.e. inference rule, and sample our next inference rule accordingly (here we have chosen Thompson Sampling, due to its asymptotic optimality property).
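Thompson sampling over inference rules can be sketched as follows. The success/failure counts are hypothetical, and treating each rule's TV as a Beta posterior is a simplifying assumption; it illustrates why the sampler mostly exploits the best rule while still occasionally exploring the others:

```python
import random

def thompson_pick(rule_stats, rng=None):
    """Treat each rule's (successes, failures) as a Beta posterior over
    'expanding with this rule leads to a proof'; sample each posterior
    and pick the argmax."""
    rng = rng or random.Random()
    samples = {rule: rng.betavariate(succ + 1, fail + 1)
               for rule, (succ, fail) in rule_stats.items()}
    return max(samples, key=samples.get)

# Hypothetical evidence gathered from past expansions.
stats = {"deduction": (40, 10), "instantiation": (5, 45), "other": (0, 20)}
picks = [thompson_pick(stats, random.Random(seed)) for seed in range(100)]
print(picks.count("deduction"))  # deduction dominates; weaker rules still
                                 # get the occasional exploratory pick
```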
At this point the reader might ask: what about managing the complexity of this decision process? The short answer is ECAN. The less short answer is turtles all the way down, SampleLink, etc. But suffice it to say that ECAN has been designed to be a cognitive process as well, utilizing knowledge via a dedicated atom type, the HebbianLink.
The problem space of our first experiments is very simple. Given knowledge about the alphabet, infer that 2 letters are alphabetically ordered.
There are two collections of axioms
1. Enumeration of all consecutively ordered letters
where X⊂Y is a notation for the atomese program
(Inheritance (stv 1 1) X Y)
Inheritance is used so that we can directly use the PLN deduction rule to infer order transitivity. Given this collection of axioms, all the backward chainer needs to do is chain a series of deduction rules, as many as required. For instance, inferring a⊂c requires only 1 deduction, while inferring h⊂z requires 17 deductions.
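To make the counting explicit: chaining n consecutive-letter axioms with the deduction rule takes n-1 applications of the rule. A one-line Python sketch:

```python
# Each consecutive pair (a⊂b, b⊂c, ...) is an axiom; chaining n consecutive
# links with the deduction rule takes n-1 applications of the rule.
def deductions_needed(x, y):
    """Deduction steps to prove x⊂y from consecutive-letter axioms."""
    return ord(y) - ord(x) - 1

print(deductions_needed("a", "c"))  # 1
print(deductions_needed("h", "z"))  # 17
```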
In the end, only the deduction rule is required. Figuring that out is not challenging, but it serves as a simple test for our algorithms. That is what we have accomplished so far. To create a more interesting case, we introduce a second collection of axioms
2. Express that letter a occurs before any other letter
where X<Y is a notation for the atomese program
(Evaluation (stv 1 1) (Predicate "alphabetical-order") (List X Y))
Alongside an implication
X<Y ⇒ X⊂Y
or in atomese
ImplicationScope (stv 1 1)
  VariableList X Y
  Evaluation (Predicate "alphabetical-order") (List X Y)
  Inheritance X Y
This second collection of axioms allows us to prove with just one inference step, using the PLN instantiation rule, that
a⊂X for any letter X != a.
However, unlike deduction, using this rule is only fruitful if the first letter is a; otherwise it will actually slow down the backward chainer, so it is important to be able to discover this context.
Learning Inference Control Rules
Context-Free Control Rules
The control rule for deduction is very simple, as it has no extra context:
ImplicationScope <TV>
  VariableList A L B T
  And
    Preproof A T
    Expand (List A L <deduction-rule>) B
  Preproof B T
This tells us that if A is a preproof and gets expanded into B by a deduction rule, then B has a certain probability, expressed by the TV, of being a preproof.
Learning that control rule is easy: we just need to apply the PLN direct evaluation rule to calculate the TV based on the available evidence, the traces gathered while solving the problem collection. Indeed, while the backward chainer is running, it stores in a history atomspace every hard decision that has been made, in particular all inference tree expansions, and of course which expansions led to a proof; this allows us to build a corpus of expansions and preproofs. The PLN direct evaluation rule merely counts the positive and negative instances and comes up with a conditional probability and a confidence.
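The counting itself is straightforward. Here is a sketch of it in Python (not the actual PLN code): the strength is the fraction of expansions that led to a preproof, and the confidence uses the simple n/(n+K) count-to-confidence formula, with K=800 assumed here as the conventional OpenCog lookahead constant:

```python
def direct_evaluation(history):
    """Sketch of direct evaluation for a context-free control rule:
    count how often expanding with a given rule led to a preproof.
    Returns rule -> (strength, confidence)."""
    K = 800  # assumed count-to-confidence lookahead constant
    tvs = {}
    for rule in {h["rule"] for h in history}:
        events = [h for h in history if h["rule"] == rule]
        positive = sum(h["preproof"] for h in events)  # positive instances
        n = len(events)
        tvs[rule] = (positive / n, n / (n + K))
    return tvs

# Hypothetical traces: deduction expansions mostly led to preproofs.
history = ([{"rule": "deduction", "preproof": True}] * 90
           + [{"rule": "deduction", "preproof": False}] * 10
           + [{"rule": "other", "preproof": False}] * 50)
print(direct_evaluation(history)["deduction"])  # (0.9, 0.111...)
```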
Context-Sensitive Control Rules
Learning context-sensitive control rules is harder. In fact it may be arbitrarily hard, but the initial plan is to experiment with frequent subgraph mining, using OpenCog's pattern miner.
We haven't reached that part yet, but it is expected that such rules will look something like
ImplicationScope <TV>
  VariableList A L B T
  And
    Preproof A T
    <some pattern>
    Expand (List A L <conditional-instantiation-rule>) B
  Preproof B T
The pattern in question will have to express that the premise L looks like
or more generally that the expansion looks like
GroundedSchemaNode "scm: conditional-full-instantiation-scope-formula"
Although this part will rely on the pattern miner, in the end the calculation of these rules will be performed by PLN, so in a way PLN will be handling the meta-learning part as well. I will come back to that.
Results so far
Let me detail the experiment a bit further. The problem set is composed of 100 targets, randomly generated so that 2 ordered letters are queried, such as
We run 2 iterations: the first with an empty control atomspace; then we ask OpenCog to discover control rules, populate the control atomspace with them, rerun, and see how many more problems we have solved.
Just by learning context-free control rules (saying, basically, that deduction is often useful, conditional instantiation is sometimes useful, and all other rules are useless), we can go from solving 34 problems to solving 52.
Below are examples of control rules that we have learned.
These rules are rather simple, but, as reported, can already speed up the backward chainer.
Related Work and Conclusion
This relates to the work of Ray Solomonoff, Juergen Schmidhuber with OOPS, Eray Ozkural with HAM, Irvin Hwang et al. with BPM, Josef Urban with MaLARea, and Alexander A. Alemi et al. with DeepMath, to cite only a few. What I believe sets it apart, though, is that the system used for solving the problem is the same one used for solving the meta-problem. This leads to an interesting question, one that we may well be able to put to the test in the future: can skills acquired to solve a problem be transferred to the meta-level?
Let me expand: if you ask an AI to solve a collection of down-to-earth problems, it will accumulate a lot of knowledge, say rules. Some will be very concrete, related to the specificities of the problems, such as pushing the gas pedal while driving a car, and will not be transferable to the meta-level, because pushing a gas pedal is unrelated to discovering control rules to speed up a program. They will basically remain mute when asked to serve the cognitive process in charge of solving the meta-problem. But some will be more abstract, abstract enough that they will be recognized by the meta-solver as potentially useful. If these abstract rules can indeed help and be transferred to the upper levels, then it opens the possibility for true intelligence bootstrapping. If it can, then it means we can improve not just learning, but also meta-learning, meta-meta-learning, and so on to infinity, all at once. But realistically, even if it doesn't, or does only to some limited extent, possibly evaporating as the meta-levels go higher, meta-learning may still result in considerable performance gains. In any case, it is our only magic bullet, isn't it?
Thanks to Linas for his feedback.
Ben Goertzel. Probabilistic Growth and Mining of Combinations: A Unifying Meta-Algorithm for Practical General Intelligence
Ben Goertzel. Cognitive Synergy: a Universal Principle For Feasible General Intelligence?
Ray Solomonoff. Three Kinds of Probabilistic Induction: Universal Distributions and Convergence Theorems
Jan Leike et al. Thompson Sampling is Asymptotically Optimal in General Environments
Nil Geisweiller. Inference Control Learning Experiment README.md
Matthew Ikle’ et al. Economic Attention Networks: Associative Memory and Resource Allocation for General Intelligence
Ben Goertzel et al. Integrating Deep Learning Based Perception with Probabilistic Logic via Frequent Pattern Mining
Yun Chi et al. Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees
Ray J. Solomonoff. Progress in Incremental Machine Learning. NIPS Workshop on Universal Learning Algorithms and Optimal Search, Whistler, B.C., Canada, December 2002.
Juergen Schmidhuber. Optimal Ordered Problem Solver. Machine Learning 54 (2004) 211–256. https://arxiv.org/pdf/cs/0207097.pdf
Eray Ozkural. Towards Heuristic Algorithmic Memory.
Irvin Hwang et al. Inducing Probabilistic Programs by Bayesian Program Merging
Josef Urban. MaLARea: a Metasystem for Automated Reasoning in Large Theories
Alexander A. Alemi et al. DeepMath – Deep Sequence Models for Premise Selection
Ben Goertzel et al. Metalearning for Feature Selection.