The clearest starting point for seeing this seems to be natural language, where neural net methods have made great strides, but have not surpassed traditional (symbolic) linguistic theory. However, once this similarity is understood, it can be ported over to other domains, including deep-learning strongholds such as vision. To keep the discussion anchored, and to avoid confusing abstractions, what follows will focus entirely on linguistics; it is up to you, the reader, to imagine other, more general settings.
The starting point for probabilistic approaches (including deep learning) is the Bayesian network: a probability P(x_{1},x_{2},…,x_{n}) of observing n events. For language, the x_{k} are taken to be words, and n is the length of the sentence, so that P(x_{1},x_{2},…,x_{n}) is the “probability” of observing the sequence x_{1},x_{2},…,x_{n} of words. The technical problem with this viewpoint is the explosively large space: if one limits oneself to a vocabulary of 10 thousand words (and many people don’t) and sentences of 20 words or less, that’s (10^{4})^{20} = 10^{80} ≈ 2^{270} probabilities, an absurdly large number even for Jupiter-sized computers. If n is the length of this blog post, then it really does seem impossible. The key, of course, is to realize that almost all of these probabilities are effectively zero; the goal of machine learning is to find a format, a representation for grammar (and meaning), that effortlessly avoids that vast ocean of zero entries.
Traditional linguistics has done exactly that: when one has a theory of syntax, one has a formulation that clearly states which sentences should be considered grammatically valid, and which ones not. The trick is to provide a lexis (a lookup table), and some fairly small number of rules that define how words can be combined; i.e. arranged to the left and right of one another. You look up a word in the lexis (the dictionary) to find a listing of what other words are allowed to surround it. Try every possible combination of rules until you find one that works, where all of the words can hook up to one another. For the purposes here, the easiest and the best way to visualize this is with Link Grammar, a specific kind of dependency grammar. All theories of grammar will constrain allowed syntax; but Link Grammar is useful for this blog post because it explicitly identifies words to the left, and words to the right. Each lexical entry is like a template, a fill-in-the-blanks form, telling you exactly what other words are allowed to the left, and to the right, of the given word. This left-right sequence makes it directly comparable to what neural-net approaches, such as Word2Vec or SkipGram, do.
What does Word2Vec do? Clearly, the 2^{270} probabilities in a twenty-word sentence are overwhelming; one obvious simplification is to look only at N-grams: that is, to look only at the closest neighboring words, working in a window that is N words wide. For N=5, this gives (10^{4})^{5} = 10^{20} ≈ 2^{66}, which is still huge, but is bearable. When scanning actual text, almost all of these combinations won’t be observed; this is just an upper bound. In practice, a table of 5-grams fits in latter-day computer RAM. The statistical model is to map each N-gram to a vector, use that vector to define a Boltzmann distribution (P=exp(v⋅w)/Z), and then use gradient ascent (hill-climbing) to adjust the vector coefficients, so as to maximize the probability P.
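To make the windowing concrete, here is a minimal sketch (not Word2Vec itself, just the pair-extraction step): for each word, every other word inside a window N words wide becomes a (center, context) training pair.

```python
# A minimal sketch of how (center, context) training pairs fall out of
# a sliding window over a token list.
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs from a list of tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("Kevin threw the ball".split(), window=1))
```

With window=1 each word pairs only with its immediate neighbors; widening the window (or keeping the bag small while the window grows, as in the skip-gram variant discussed below) trades sparsity against long-distance coverage.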
How are Word2Vec and Link Grammar similar? Well, I hope that the above description of Link Grammar already planted the idea that each lexical entry is a lot like an N-gram. Each lexical entry tells you which words can appear to the right, and which words to the left, of a given word. It’s a bit less constrained than an N-gram: there’s no particular distance limitation on the dependency links. It can also skip over words: a lexical entry is more like a skip-gram. Although there is no distance limitation, lexical entries still have a small-N-like behavior, not in window size, but in attachment complexity. Determiners have a valency of 1; nouns a valency of 2 or 3 (a link to a verb, to an adjective, to a determiner); verbs a valency of 2, 3 or 4 (subject, object, etc.). So lexical entries are like skip-grams: the window size is effectively unbounded, but the size of the context remains small (N is small).
Are there other similarities? Yes, but first, a detour.
What happened to the 2^{270} probabilities? A symbolic theory of grammar, such as Link Grammar, is saying that nearly all of these are zero; the only ones that are not zero are the ones that obey the rules of the grammar. Consider the verb “throw” (“Kevin threw the ball”). A symbolic theory of grammar effectively states that the only non-vanishing probabilities are those of the form P(x_{1},x_{2},…,x_{k}=noun,…,x_{m}=throw,…,x_{p}=object,…,x_{n}) and that all the others must be zero. Now, since nouns make up maybe 1/2 of all words (the subject and object are both nouns), this constraint eliminates 10^{4}×10^{4}/(2×2) ≈ 2^{24} possibilities (shrinking 2^{270} to 2^{270−24} = 2^{246}). Nothing to sneeze at, given that it’s just one fairly simple rule. But this is only one rule: there are others, which say things like “singular count nouns must be preceded by a determiner” (so, “the ball”). These constraints are multiplicative: if the determiner is missing, then the probability is exactly zero. There are only a handful of determiners, so another factor of 10^{4} ≈ 2^{13} is vaporized. And so on. A relatively small lexis quickly collapses the set of possibilities. Can we make a back-of-the-envelope estimate? A noun-constraint eliminates half the words (leaving 5K of 10K possibilities). A determiner constraint removes all but 10 possibilities. Many grammatical classes have only a few hundred members (“throw” is like “hit”, but is not like “smile”). So, realistically, each x_{k} will have from 10 to 1000 possibilities, maybe about 100 possibilities “on average” (averages are dicey to talk about, as these are Zipfian distributions). Thus there are only about 100^{20} = 10^{40} ≈ 2^{133} grammatically valid sentences that are 20 words long, and these can be encoded fairly accurately with a few thousand lexical entries.
In essence, a symbolic theory of grammar, and more specifically a dependency grammar, accomplishes the holy grail of Bayesian networks: factorizing the Bayesian network. The lexical rules state that there is a node in the network, for example P(x_{1},x_{2},…,x_{k}=noun,…,x_{m}=throw,…,x_{p}=object,…,x_{n}), and that the entire sentence is a product of such nodes. The probability of “Kevin threw the ball” is the product

P(x_{1}=Kevin, x_{2}=verb, …, x_{n})
P(x_{1}, x_{2}, …, x_{k}=noun, …, x_{m}=throw, …, x_{p}=object, …, x_{n})
P(x_{1}, x_{2}, …, x_{k}=the, …, x_{m}=noun, …, x_{n})
P(x_{1}, x_{2}, …, x_{n}=ball)

Stitch them all together, and you’ve got a sentence, and its probability. (In Link Grammar, −log P is called the “cost”; costs are additive, and are used for parse ordering.) To be explicit: lexical entries are exactly the same thing as the factors of a factorized Bayesian network. What’s more, figuring out which of these factors come into play in analyzing a specific sentence is called “parsing”. One picks through the lookup table of possible network factors, and wires them up, so that there are no dangling endpoints. Lexical entries look like fragments of a graph: a vertex, with some dangling edges hanging from it. Pick out the right vertices (one per word), wire them together so that there are no dangling unconnected edges, and voilà! One has a graph: the graph is the Bayesian network. Linguists use a different name: they call it a dependency parse.
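As an illustration of entries-as-factors, here is a toy linker (invented for this post, and far simpler than real Link Grammar: one connector list per word, greedy matching, no disjunct choice): each lexical entry lists its dangling connectors, and a “parse” succeeds when every connector pairs up.

```python
# Toy lexis: each word lists (link type, direction) connectors.
# '+' means "links rightward to a later word", '-' "links leftward".
LEXIS = {
    "Kevin": [("S", "+")],              # subject links right to a verb
    "threw": [("S", "-"), ("O", "+")],  # links left to subject, right to object
    "the":   [("D", "+")],              # determiner links right to a noun
    "ball":  [("O", "-"), ("D", "-")],  # links left to the verb and determiner
}

def links(sentence):
    """Greedy toy linker: each '-' connector matches the most recent
    unmatched '+' connector of the same type. Returns the list of links
    (left index, right index, type), or None if anything dangles."""
    open_plus = {}   # type -> index of the word waiting for a partner
    made = []
    for i, word in enumerate(sentence):
        for ctype, d in LEXIS[word]:
            if d == "+":
                open_plus[ctype] = i
            elif ctype not in open_plus:
                return None              # dangling '-' connector: no parse
            else:
                made.append((open_plus.pop(ctype), i, ctype))
    return made if not open_plus else None

print(links("Kevin threw the ball".split()))
# [(0, 1, 'S'), (1, 3, 'O'), (2, 3, 'D')]
```

Dropping the subject (“threw the ball”) leaves the verb’s S- connector dangling, so the linker returns None: exactly the “probability is exactly zero” behavior described above.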
The Word2Vec/SkipGram model also factorizes, in much the same way! First, note that the above parse can be written as a product of factors of the form P(word|context), the product running over all of the words in the sentence. For a dependency grammar, the context expresses the dependency relations. The Word2Vec factorization is identical; the context is simpler. In its most naive form, it’s just a bag of the N nearest words, ignoring the word-order. But the word-order is ignored for practical reasons, not theoretical ones: it reduces the size of the computational problem; it speeds convergence. Smaller values of N mean that long-distance dependencies are hard to discover; the skip-gram model partly overcomes this by keeping the bag small, while expanding the size of the window. If one uses the skip-gram model with a large window size, and also keeps track of the word-order, and restricts to low valencies, then one very nearly has a dependency grammar, in the style of Link Grammar. The only difference is that such a model does not force any explicit dependency constraints; rather, they are implicit, as the words must appear in the context. Compared to a normal dependency grammar, this might allow some words to be accidentally double-linked, when they should not have been. Dependency-grammar constraints are sharper than merely asking for the correct context. To summarize: the notions of context are different, but there’s a clear path from one to the other, with several interesting midway points.
The next obvious difference between Link Grammar and Word2Vec/SkipGram is the mechanism for obtaining the probabilities. But this impression is naive: in fact, they are much more similar than first appears. In both systems, the starting point is (conceptually) a matrix, of dimension WxK, with W the size of the vocabulary, and K the size of the context. In general, K is much, much larger than W; for Word2Vec, K could get as large as W^{N} for a window size of N, although, in practice, only a tiny fraction of that is observed. Both Link Grammar and Word2Vec perform approximate matrix factorization on this matrix. The way the approximations are done differs. In the case of Word2Vec, one picks some much smaller dimension M, typically around M=200, or maybe twice as large; this is the number of “neurons” in the middle layer. Then, all of W is projected down to this M-dimensional space (with a linear projection matrix). Separately, the K-dimensional context is also projected down. Given a word, let its projection be the (M-dimensional) vector u. Given a context, let its projection be the vector v. The probability of a word-in-context is given by a Boltzmann distribution, as exp(u⋅v)/Z, where u⋅v is the dot product and Z is a scale factor (called the “partition function”). The basis elements in this M-dimensional space have no specific meaning; the grand-total vector space is rotationally invariant (only the dot product matters, and dot products are scalars).
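A toy sketch of this setup, with the dimensions shrunk to a handful (the matrix entries below are arbitrary numbers standing in for learned projections; real models use M around 200 and learn both matrices by gradient ascent):

```python
# Two projections and a Boltzmann distribution: P(word|context) = exp(u.v)/Z.
import math

P_word = [[0.5, 1.0], [1.0, -0.5], [-1.0, 0.2]]   # W x M projection (W=3, M=2)
P_ctx  = [[0.8, 0.1], [-0.3, 0.9]]                # K x M projection (K=2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def prob(word_id, ctx_id):
    """P(word | context): Boltzmann weight of u.v, normalized over all words."""
    v = P_ctx[ctx_id]                          # projected context vector
    scores = [dot(u, v) for u in P_word]       # u.v for every word u
    Z = sum(math.exp(s) for s in scores)       # partition function
    return math.exp(scores[word_id]) / Z

print(sum(prob(w, 0) for w in range(3)))       # probabilities sum to 1
```

Note that rotating both matrices by the same orthogonal transform leaves every u⋅v, and hence every probability, unchanged: the rotational invariance mentioned above.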
The primary task for Word2Vec/SkipGram is to discover the two projection matrices. This can be done by gradient ascent (hill-climbing), looking to maximize the probability. The primary output of Word2Vec is the pair of projection matrices: one that is WxM-dimensional, the other that is MxK-dimensional. In general, neither of these matrices is sparse (that is, most entries are non-zero).
Link Grammar also performs a dimensional reduction, though not exactly via projection matrices. Rather, a word can be assigned to several different word-categories (there are de facto about 2300 of these in the hand-built English dictionaries). Associated with each category is a list of dozens to thousands of “disjuncts” (dependency-grammar dependencies), which play a role analogous to “context”. However, there are far, far fewer disjuncts than there are contexts. This is because every (multi-word) context is associated with a handful of disjuncts, in such a way that each disjunct stands for hundreds to as many as millions of different contexts. Effectively, the lexis of Link Grammar is a sparse CxD-dimensional matrix, with C grammatical categories and D disjuncts, and most entries in this CxD-dimensional matrix being zero. (The upper bound on D is L^{V}, where L is the number of link types, and V is the maximum valency – about 5 or 6. In practice, D is in the tens of thousands.) The act of parsing selects a single entry from this matrix for each word in the sentence. The probability associated with that word is exp(−c), where c is the “cost”: the numerical value stored at this matrix entry.
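A hypothetical miniature of such a lexis (the categories, disjuncts and costs below are invented for illustration), showing how a matrix entry’s cost becomes a probability, and how absent entries are exact zeros:

```python
# Sparse C x D lexis as a dict-of-dicts: only allowed (category, disjunct)
# pairs are stored, each with its cost c; probability is exp(-c).
import math

lexis = {
    "noun":       {"D- & S+": 0.1, "D- & O-": 0.2},
    "trans-verb": {"S- & O+": 0.0, "S- & O+ & A-": 0.7},
    "determiner": {"D+": 0.0},
}

def probability(category, disjunct):
    """exp(-cost) for a stored entry; 0.0 for anything not in the lexis."""
    cost = lexis.get(category, {}).get(disjunct)
    return 0.0 if cost is None else math.exp(-cost)

print(probability("determiner", "D+"))     # exp(-0) = 1.0
print(probability("noun", "S- & O+"))      # not in the lexis: exactly 0.0
```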
Thus, both systems perform a rather sharp dimensional reduction, to obtain a much-lower-dimensional intermediate form. Word2Vec is explicitly linear; Link Grammar is not, exactly. However (and this is important, but very abstract), Link Grammar can be described by a (non-symmetric) monoidal category. This category is similar to that of the so-called “pregroup grammar”, and is described in a number of places (some predating both Link Grammar and pregroup grammar). The curious thing is that linear algebra is also described by a monoidal category. One might say that this “explains” why Word2Vec works well: it is using the same underlying structural framework (monoidal categories) as traditional symbolic linguistics. The precise details are too complex to sketch here, and must remain cryptic, for now, although they are open to those versed in category theory. The curious reader is encouraged to explore category-theoretic approaches to grammar, safe in the understanding that they provide a foundational understanding, no matter which detailed theory one works in. At the same time, the category-theoretic approach suggests how Word2Vec (or any other neural-net or vector-based approach to grammar) can be improved upon: it shows how syntax can be restored, with the result still looking like a funny/unusual kind of sparse neural net. These are not conflicting approaches; they have far, far more in common than meets the eye.
A few words about word-senses and semantics are in order. It has been generally observed that Word2Vec seems to encode “semantics” in some opaque way, in that it can distinguish different word-senses, based on context. The same is true for Link Grammar: when a word is used in a specific context, the result of parsing selects a single disjunct. That disjunct can be thought of as a hyper-fine grammatical category; but these are strongly correlated with meaning. Synonyms can be discovered in similar ways in both systems: if two different words share many of the same disjuncts, they are effectively synonymous, and can be used interchangeably in sentences.
Similarly, given two different words in Word2Vec/SkipGram, if they both project down to approximately the same vector in the intermediate layer, they can be considered to be synonymous. This illustrates yet another way that Link Grammar and Word2Vec/SkipGram are similar: the list of all possible disjuncts associated with a word is also a vector, and, if two words have almost co-linear disjunct vectors, they are effectively synonymous. That is, disjunct-vectors behave almost exactly like neuron intermediate-layer vectors. They encode similar kinds of information into a vector format.
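A small sketch of that comparison (the disjunct counts below are made up): treat each word’s disjunct counts as a vector, and compare directions with cosine similarity.

```python
# Words as sparse vectors of disjunct counts; near-colinear vectors
# (cosine near 1) mark near-synonyms.
import math

disjunct_counts = {
    "throw": {"S- & O+": 40, "S- & O+ & PP+": 12},
    "hit":   {"S- & O+": 35, "S- & O+ & PP+": 9},
    "smile": {"S-": 50, "S- & PP+": 7},
}

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

print(cosine(disjunct_counts["throw"], disjunct_counts["hit"]))    # close to 1
print(cosine(disjunct_counts["throw"], disjunct_counts["smile"]))  # 0.0: no shared disjuncts
```

The same computation, applied to the intermediate-layer vectors of Word2Vec, is the standard way of measuring word similarity there; only the basis differs: disjuncts are interpretable, neurons are not.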
This is also where we have the largest, and the most important, difference between neural-net approaches and Link Grammar. In the neural-net approach, the intermediate neuron layer is a black box, completely opaque and unanalyzable, just some meaningless collection of floating-point numbers. In Link Grammar, the disjunct vector is clear, overt, and understandable: you can see exactly what it is encoding, because each disjunct tells you exactly the syntactic relationship between a word and its neighbors. This is the great power of symbolic approaches to natural language: they are human-auditable and human-understandable in a way that neural nets are not. (Currently; I think that what this essay describes is an effective sketch of a technique for prying open the lid of the black box of neural nets. But that’s for a different day.)
The arguments presented above are worked out in greater detail, with all the mathematical trimmings, in this linked PDF. The commonality, by means of category theory, is explored in a different paper, on sheaves.
Many thanks and appreciation to Hanson Robotics for providing the time to think and write about such things. (Much much earlier and kludgier versions of these ideas have been seen in public, in older robot demos.)
“In his new book, Pearl, now 81, elaborates a vision for how truly intelligent machines would think. The key, he argues, is to replace reasoning by association with causal reasoning. Instead of the mere ability to correlate fever and malaria, machines need the capacity to reason that malaria causes fever. Once this kind of causal framework is in place, it becomes possible for machines to ask counterfactual questions—to inquire how the causal relationships would change given some kind of intervention—which Pearl views as the cornerstone of scientific thought.” [1]
The exploration of causal reasoning, and how it works, dates back to Medieval times, and specifically to the work of the Scholastics, on whose theories our modern legal system is built.
Recall a classic example: a dead body is discovered, and a man with a bloody sword stands above it. Did the man commit the crime, or was he simply the first on the scene, and picked up the sword? How can you know? What is the “probability” of guilt?
The Scholastics struggled mightily with the concept of “probability”. Eventually, Blaise Pascal, and many others (Huygens, etc…) developed the modern mathematical theory that explains dice games, and how to place bets on them. [2] This mathematical theory is called “probability theory”, and it works. However, in a courtroom trial for murder, it is not the theory that is applied to determine the “probability of innocence or guilt”.
What actually happens is that the prosecution, and the defense, assemble two different “cases”, two competing “theories”, two different networks of “facts”, of “proofs”, one network showing innocence, the other showing guilt. The jury is asked to select the one, or the other network, as the true model of what actually happened.
The networks consist of logically inter-connected “facts”. Ideally, those facts are related in a self-consistent fashion, without contradicting one another. Ideally, the network of facts connects to the broader shared network that we call “reality”. Ideally, the various “facts” have been demonstrated to be “true”, by eye-witness testimony, or by forensic reasoning (which itself is a huge network of “facts”, e.g. that it takes x hours for blood to clot, y hours for rigor mortis to set in). Ideally, the network connections themselves are “logical”, rather than being various forms of faulty argumentation (appeals to emotion, appeals to authority, etc.) You really have to put yourself into a Medieval state of mind to grok what is happening here: men pacing in long, fur-trimmed coats, presenting arguments, writing hundred-page-long commentaries-on-commentaries-on-commentaries. Picking apart statements that seem superficially coherent and believable, but can be shown to be built from faulty reasoning.
What is being done, here? Two different, competing models of the world are being built: in one model, the accused is guilty. In the other model, the accused is innocent. Clearly, both models cannot be right. Only one can be believed, and not the other. Is there a basis on which to doubt one of the models? Is that doubt “reasonable”? If the accused is to be harmed, viz. imprisoned, and it’s been shown that the accused is guilty “beyond a reasonable doubt”, then one must “believe”, accept as “reality”, that model, that version of the network of facts. The other proof is unconvincing; it must be wrong.
I think that it is possible to build AI/AGI systems, within the next 5-10 years, that construct multiple, credible, competing networks of “facts”, tied together with various kinds of evidence and inference and deduction, relation and causation. Having constructed such competing models, such alternative world-views of deeper “reality”, these AI/AGI systems will be able to disentangle the nature of reality in a rational manner. And, as I hope to have demonstrated, the mode of reasoning that these systems will employ will be distinctly Medieval.
P.S. The current tumult in social-media, society and politics is very much one of trying to build different, competing “models of reality”, of explanations for the world-as-it-is. In the roughest of outlines, this is (in the USA) the red-state vs. blue-state political divides, the “liberal” vs. “conservative” arguments. Looking more carefully, one can see differences of opinion, of world-view on a vast number of topics. Every individual human appears to hold a puzzle-piece, a little micro-vision of what is (social) reality, a quasi-coherent tangle of networked facts that stick together, for them, and that they try to integrate into the rest of society, via identity politics, via social-media posts, via protest marches.[4]
The creation and integration of (competing) models of reality is no longer just a courtroom activity; it is a modern-day social-media and political obsession. It is possible today, unlike before, only because of the high brain-to-brain (person-to-person) data bandwidth that the Internet and social media now provide. One can encounter more competing theories of reality than ever before, and one can investigate them in greater detail than before.
If and when we build AGI systems capable of simultaneously manipulating multiple competing models of the world, I think we can take a number of lessons from social science and psychology as to how these networks might behave. There is currently tremendous concern about propaganda and brain-washing (beliefs in quasi-coherent networks of “facts” that are nonetheless disconnected from mainstream “reality”). There is tremendous concern about the veracity of mainstream media, and various well-documented pathologies thereof: viz. the need to be profitable forces mainstream media to propagate outrageous but bogus and unimportant news. The equivalent AGI risk is that the sensory-input system floods the reasoning system with bogus information, and that there is no countervailing mechanism to adjust for it. Viz.: we can’t just unplug journalism and journalists from the capitalist system; nor is it clear that doing so would increase the quality of the broadcast news.
Some of the issues facing society arise because human brains are not sufficiently integrated, in the sense of “integrated information”. [3] Any individual human mind can make sense of only one small part of the world; we do not have the short-term memory, the long-term memory, or the reasoning capacity to take on more. This is not a limitation we can expect in AGI. However, instead of having individual humans each representing the pro and con of a given issue, it’s reasonable to expect that the AGI will simultaneously develop multiple competing theories, in an attempt to find the better, stronger one. The debate does not stop; it only gets bigger and more abstract.
Another on-line concern is that much of on-line political posting and argumentation is emotionally driven, anchored in gut-sense arguments, which could certainly be found to be full of logical fallacies, if they were to be picked apart. “Humans are irrational”, it is said. But are they really? In Bayesian inference, one averages together, blurs together, vast tracts of “information”, and reduces it to a single number, a probability. Why? Because all of those inputs look like “noise”, and any given, specific Bayesian model cannot discriminate between all the different things going on. Thus, average it together, boil it all down, despite the fact that this is sure to erase important distinctions (e.g. “all cats have fur”, except, of course, when they don’t.) This kind of categorical, lumped-together, based-on-prior-experience, “common sense” kind of thinking sure seems to be exactly what we accuse our debate opponents of doing: either they’re “ignoring the details”, or failing to “see the big picture”. I don’t see how this is avoidable in the reasoning of Bayesian networks, or any other kind of network reasoning. Sooner or later, the boundaries of a fact network terminate in irrelevant facts, or details that are too small to consider. Those areas are necessarily fuzzed over, averaged, and ignored: gut-intuition, common-sense will be applied to them, and this suffers from all of the well-known pitfalls of gut-sense reasoning. AGI might be smarter than us; and yet, it might suffer from a very similar set of logical deficiencies and irrational behaviors.
The future will look very much like today. But different.
[1] Judea Pearl, The Book of Why / The New Science of Cause and Effect
[2] James Franklin, The Science of Conjecture: Evidence and Probability Before Pascal
[3] Integrated information theory – Wikipedia
[4] Meaningness – Thinking, feeling, and acting—about problems of meaning and meaninglessness; self and society; ethics, purpose, and value. (A work in progress, consisting of a hypertext book and a metablog that comments on it.)
Inference chainers, automatic theorem provers and the like are programs that, given a theory, attempt to build a proof of a given theorem. They are generally inefficient. A proof, or inference, is in essence a program, as the Curry-Howard correspondence says. In OpenCog this is very apparent, as an inference is an actual atomese program. A function call in this program is a formula associated with an inference rule, the conclusion being the output of the call, the premises being its inputs. See for instance a double deduction inference tree
      X->Z  Z->Y
      -----f-----
A->X     X->Y
-------f-------
      A->Y
a notation borrowed from the classical notation for rules of inference (see https://en.wikipedia.org/wiki/Rule_of_inference), except that the premises are aligned horizontally, and the rule formula, here f, overlays the line separating the premises from the conclusion.
In OpenCog, such a tree would be represented as
[12731261225409633207][6] [11133117073607658831][6]
---------------bc-deduction-formula----------------
[17146615216377982335][6] [16015351290941397556][6]
---------------bc-deduction-formula----------------
[13295355732252075959][1]
The cryptic numbers indicate the hash values of the atoms involved. The hash value [13295355732252075959][1] at the bottom of the tree is the target’s; the hash values at the leaves, [17146615216377982335][6], [12731261225409633207][6] and [11133117073607658831][6], are the premises’, and [16015351290941397556][6] is an intermediary target. bc-deduction-formula is the name of the formula associated with the deduction rule.
This inference tree would correspond to the atomese program
(BindLink
;; Pattern to fetch premises from the atomspace
(AndLink
(InheritanceLink
(VariableNode "$B-6229393a") ; [6185394777777469381][6]
(ConceptNode "D") ; [246112806454457922][1]
) ; [11133117073607658831][6]
(InheritanceLink
(VariableNode "$B-6266d6f2") ; [4097372290580364298][6]
(VariableNode "$B-6229393a") ; [6185394777777469381][6]
) ; [12731261225409633207][6]
(InheritanceLink
(VariableNode "$X") ; [6809909406030619949][1]
(VariableNode "$B-6266d6f2") ; [4097372290580364298][6]
) ; [17146615216377982335][6]
) ; [12553878300592245761][6]
;; Formula calls computing the inference tree
(ExecutionOutputLink
(GroundedSchemaNode "scm: bc-deduction-formula") ; [5481509939359570705][1]
(ListLink
(InheritanceLink
(VariableNode "$X") ; [6809909406030619949][1]
(ConceptNode "D") ; [246112806454457922][1]
) ; [13295355732252075959][1]
(InheritanceLink
(VariableNode "$X") ; [6809909406030619949][1]
(VariableNode "$B-6266d6f2") ; [4097372290580364298][6]
) ; [17146615216377982335][6]
(ExecutionOutputLink
(GroundedSchemaNode "scm: bc-deduction-formula") ; [5481509939359570705][1]
(ListLink
(InheritanceLink
(VariableNode "$B-6266d6f2") ; [4097372290580364298][6]
(ConceptNode "D") ; [246112806454457922][1]
) ; [16015351290941397556][6]
(InheritanceLink
(VariableNode "$B-6266d6f2") ; [4097372290580364298][6]
(VariableNode "$B-6229393a") ; [6185394777777469381][6]
) ; [12731261225409633207][6]
(InheritanceLink
(VariableNode "$B-6229393a") ; [6185394777777469381][6]
(ConceptNode "D") ; [246112806454457922][1]
) ; [11133117073607658831][6]
) ; [10363704109909197645][6]
) ; [18135989265351680839][6]
) ; [14126565831644143184][6]
) ; [12675478950231850570][6]
) ; [15856494860655100711][6]
What is specific here is that the pattern matcher (the query corresponds to the outer BindLink) is used to fetch the relevant axioms from the atomspace; these are the most afferent inputs of that program.
The question we attempt to address here is: how to efficiently build inferences, specifically back-inferences.
A back-inference, an inference built backward, is done by providing a target, or theorem, and growing an inference tree from the target toward the axioms, so that, once run, the inference will evaluate the truth value of that target. The problem is that such growth, if not carefully controlled, can be very inefficient.
The idea is to learn from past inferences. This is not a new idea, but it is still fairly under-explored. I give some references at the end.
Let me sketch how we intend to do that in OpenCog:
The backward chainer is a rather elementary algorithm. Given a target, pick a rule whose conclusion unifies with the target (that is, a rule that could possibly produce the target), and create an initial inference tree. The leaves of that tree are the premises; the root is the target. Treat the premises as new targets and re-iterate the process. Grow the tree (or rather the trees, as we keep the former ones around) till you get something that, if evaluated, produces the target. For that, of course, the premises need to be present, as axioms, in the atomspace.
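A stripped-down sketch of this loop (a Python stand-in for the real OpenCog chainer: pairs like ("a", "d") stand for implications a->d, and the only rule is deduction): to prove X->Y, either find it among the axioms, or pick a middle term Z and recursively prove X->Z and Z->Y.

```python
# Naive depth-bounded backward chainer with a single deduction rule.
AXIOMS = {("a", "b"), ("b", "c"), ("c", "d")}
TERMS = "abcd"

def backward_prove(target, depth=6):
    if target in AXIOMS:
        return target                      # a leaf of the inference tree
    if depth == 0:
        return None                        # give up: growth must be bounded
    x, y = target
    for z in TERMS:                        # the combinatorial choice point
        if z in (x, y):
            continue
        left = backward_prove((x, z), depth - 1)
        if left is None:
            continue
        right = backward_prove((z, y), depth - 1)
        if right is not None:
            return ("deduction", left, right)  # inner node of the tree
    return None

print(backward_prove(("a", "d")))
# ('deduction', ('a', 'b'), ('deduction', ('b', 'c'), ('c', 'd')))
```

Even on this four-term toy, the chainer blindly explores hopeless middle terms before stumbling on the right chain: the combinatorial explosion discussed next.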
The problem though comes from the combinatorial explosion of having to choose the inference tree, the premise to expand and the rule. The complexity of the algorithm grows exponentially with respect to the size of the inference tree.
So here’s the crux of the idea of how to make it practical:
Defer the hard decisions to a cognitive process.
To repeat myself, the hard decisions in this program, the backward chainer, are:

1. given the set of inference trees built so far, pick one to expand,
2. given an inference tree, pick a premise (leaf) to expand,
3. given an inference tree and a premise, pick a rule to expand.
These are very hard decisions that only an intelligent entity can make, ultimately an AGI. We don’t have an AGI, but we are building one; thus, if we can plug our proto-AGI in to solve these sub-problems, it becomes a recursive divide-and-conquer type of construct. Turtles all the way down, if you like, that hopefully ends. The idea of SampleLink introduced by Ben [1] was intended to make this recursive construct explicit. Here we do not use SampleLink, as it hasn’t been implemented yet, but that doesn’t stop us from using OpenCog to learn control rules and utilize them in the backward chainer. What this means is that in the backward chainer code, when such a hard decision occurs, the code that would otherwise be a hard-wired heuristic instead queries the available control knowledge and samples a decision from it.
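A hypothetical sketch of such a decision point (all names here are invented for illustration; the real chainer queries the atomspace): the hard choice becomes a weighted sample over learned control knowledge rather than a fixed heuristic.

```python
# Stub "cognitive" decision point: weigh options by learned estimates of
# success and sample, instead of hard-wiring a preference.
import random

class ControlKnowledge:
    """Stub knowledge base mapping each option to an estimated P(success)."""
    def __init__(self, weights):
        self.weights = weights

    def choose(self, options):
        # Unknown options get a small prior weight rather than zero,
        # so they can still be explored occasionally.
        w = [self.weights.get(o, 0.01) for o in options]
        return random.choices(options, weights=w, k=1)[0]

kb = ControlKnowledge({"deduction-rule": 0.9, "modus-ponens-rule": 0.1})
print(kb.choose(["deduction-rule", "modus-ponens-rule"]))
```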
That is what is meant by a cognitive process, a process that uses the available knowledge to make decisions.
More specifically, these control rules are Cognitive Schematics [2]: pieces of knowledge that tell our cognitive process how actions relate to goals (or subgoals), conditioned by contexts
Context & Action -> Goal
They are usually represented by an Implication (or a PredictiveImplication, if time matters) in OpenCog. That is, a conditional probability, or meta-probability, as that is what a TV (truth value) represents.
Here these cognitive schematics will be about rules, premises, proofs, etc. For instance a cognitive schematic about the third hard decision
3. given an inference tree and a premise, pick a rule to expand
could be formulated as follows
ImplicationScope <TV>
  VariableList A L R B T
  And
    Preproof A T
    <some pattern>
    Expand (List A L R) B
  Preproof B T
meaning that if inference tree A is a preproof of theorem T (a preproof is an inference tree that may not prove T yet, but that may eventually prove T if adequately expanded), conditioned by some pattern involving A, L and R, then the action of expanding inference tree A into inference tree B, via premise L, using rule R, will produce an inference tree B that is also a preproof of T, with truth value TV. The cognitive schematic parts are
Context = (Preproof A T) And <some pattern>
Action = Expand (List A L R) B
Goal = Preproof B T
Once we have such cognitive schematics
C1 & A -> G
…
Cn & A -> G
we need to combine them. We could consider a rule formed with the conjunction or disjunction of all contexts, or any other way of aggregating. The dilemma is this: if a context is too small, its associated rule might overfit; but if the context is too large, its rule will not be informative enough, and its conditional probability will tend to the marginal probability of the goal.
To address this dilemma we have chosen to rely on Universal Operator Induction [3], albeit with some modification to account for the fact that a control rule is only a partial operator (see [5] for more details). Once this aggregation is done, we can assign a TV to each action, i.e. each inference rule, and sample our next inference rule accordingly (here we have chosen Thompson Sampling due to its asymptotic optimality property [4]).
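As a concrete illustration of that last step, here is a minimal Python sketch of Thompson Sampling over inference rules. This is not the actual URE code; the rule names and the success/failure counts are invented for the example, and each rule's probability of leading to a preproof is modeled as a Beta posterior over its observed expansion outcomes.

```python
import random

# Hypothetical per-rule counts of expansions that did / did not lead to a
# preproof, as might be gathered from inference traces.
rule_stats = {
    "deduction": {"pos": 26, "neg": 32},
    "conditional-instantiation": {"pos": 4, "neg": 27},
    "modus-ponens": {"pos": 0, "neg": 30},
}

def thompson_select(stats, rng=random):
    """Sample a success probability from each rule's Beta posterior
    (with a uniform Beta(1,1) prior) and pick the rule whose sample
    is largest. Rules with poor records are still occasionally tried,
    which is the exploration/exploitation balance Thompson Sampling gives."""
    best_rule, best_sample = None, -1.0
    for rule, c in stats.items():
        sample = rng.betavariate(c["pos"] + 1, c["neg"] + 1)
        if sample > best_sample:
            best_rule, best_sample = rule, sample
    return best_rule
```

Over many selections, deduction (the rule with the best record) gets picked most often, while the others are still sampled occasionally.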
At this point the reader might ask: what about managing the complexity of this decision process itself? The short answer is ECAN [6]. The less short answer is turtles all the way down… SampleLink, etc. But suffice it to say that ECAN has been designed to be a cognitive process as well, utilizing knowledge via a dedicated type of atom called HebbianLink.
The problem space of our first experiments is very simple: given knowledge about the alphabet, infer that two letters are alphabetically ordered.
There are two collections of axioms
a⊂b
…
y⊂z
where X⊂Y is a notation for the atomese program
(Inheritance (stv 1 1) X Y)
Inheritance is used so that we can directly use the PLN deduction rule to infer the transitivity of the order. Given this collection of axioms, all the backward chainer needs to do is chain a series of deductions, as many as required. For instance, inferring a⊂c requires a single deduction, while inferring h⊂z requires 17 deductions.
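To make the deduction counting concrete, here is a toy Python sketch (not OpenCog code; the function name is invented) of how many deduction-rule applications chaining the adjacent axioms requires: a chain of n adjacent Inheritance axioms is combined by n-1 deductions.

```python
import string

# The first collection of axioms: adjacent Inheritance links a⊂b, …, y⊂z.
axioms = set(zip(string.ascii_lowercase, string.ascii_lowercase[1:]))

def deductions_needed(x, y):
    """Number of PLN deduction steps to prove x⊂y by chaining adjacent
    axioms: a chain of n axioms needs n-1 applications of deduction.
    Returns None if y does not come strictly after x."""
    chain = string.ascii_lowercase.index(y) - string.ascii_lowercase.index(x)
    return chain - 1 if chain >= 1 else None
```

For example, h to z spans 18 adjacent axioms, hence 17 deductions.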
In the end, only the deduction rule is required. Figuring that out is not challenging, but it serves as a simple test for our algorithms, and it is what we have accomplished so far. To create a more interesting case, we introduce a second collection of axioms,
expressing that a occurs before any other letter:
a<b
…
a<z
where X<Y is a notation for the atomese program
(Evaluation (stv 1 1) (Predicate "alphabetical-order") (List X Y))
Alongside an implication
X<Y ⇒ X⊂Y
or in atomese
ImplicationScope (stv 1 1)
  VariableList
    TypedVariable
      Variable "$X"
      Type "ConceptNode"
    TypedVariable
      Variable "$Y"
      Type "ConceptNode"
  Evaluation
    Predicate "alphabetical-order"
    List
      Variable "$X"
      Variable "$Y"
  Inheritance
    Variable "$X"
    Variable "$Y"
This second collection of axioms allows us to prove, with just one inference step using the PLN instantiation rule, that a⊂X for any letter X ≠ a. However, unlike deduction, using this rule is only fruitful if the first letter is a; otherwise it will actually slow down the backward chainer, so it is important to be able to discover this context.
The control rule of deduction is very simple, as it has no extra context:
ImplicationScope <TV>
  VariableList A L B T
  And
    Preproof A T
    Expand (List A L <deduction-rule>) B
  Preproof B T
This tells us that if A is a preproof and gets expanded into B by a deduction rule, then B has a certain probability, expressed by TV, of being a preproof.
Learning that control rule is easy: we just need to apply the PLN direct evaluation rule to calculate the TV based on the available evidence, the traces gathered while solving the problem collection. Indeed, while the backward chainer is running, it stores every hard decision that has been made in a history atomspace, in particular all inference tree expansions, and of course which expansions led to a proof, which allows us to build a corpus of expansions and preproofs. The PLN direct evaluation rule will merely count the positive and negative instances and come up with a conditional probability and a confidence.
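As an illustration of what this counting amounts to, here is a rough Python sketch. The n/(n+k) confidence formula follows PLN's simple-truth-value convention; the trace format, the counts, and the choice k=800 are illustrative assumptions, not the actual implementation.

```python
# Count, over recorded expansions, how often an expansion by a given rule
# led from a preproof to a preproof, yielding a (strength, confidence) TV.
def direct_evaluation(expansions, rule, k=800):
    """expansions: list of {"rule": str, "preproof": bool} trace records.
    Returns (strength, confidence) where strength is the conditional
    probability estimate and confidence = n / (n + k) for n observations."""
    pos = sum(1 for e in expansions if e["rule"] == rule and e["preproof"])
    neg = sum(1 for e in expansions if e["rule"] == rule and not e["preproof"])
    n = pos + neg
    strength = pos / n if n else 0.0
    confidence = n / (n + k)
    return strength, confidence
```

With about 30 observed deduction expansions, roughly 13 of them positive, this would produce a TV close to the (0.448, 0.036) figures in the learned rules shown below.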
Learning context-sensitive control rules is harder. In fact it may be arbitrarily hard, but the initial plan is to experiment with frequent subgraph mining [8], using OpenCog's pattern miner [7].
We haven't reached that part yet, but it is expected that such a rule will look something like
ImplicationScope <TV>
  VariableList A L B T
  And
    Preproof A T
    Expand (List A L <conditional-instantiation-rule>) B
    <pattern-discovered-by-the-pattern-miner>
  Preproof B T
The pattern in question, <pattern-discovered-by-the-pattern-miner>, will have to express that the premise L looks like
Inheritance
  ConceptNode "a"
  Variable "$X"
or more generally that the expansion looks like
Execution
  Schema "expand"
  List
    Variable "$A"
    Inheritance
      ConceptNode "a"
      Variable "$X"
    GroundedSchemaNode "scm: conditional-full-instantiation-scope-formula"
  Variable "$B"
Although this part will rely on the pattern miner, in the end the calculation of these rules will be performed by PLN, so in a way PLN will be handling the meta-learning part as well. I will come back to that.
Let me detail the experiment a bit further. The problem set is composed of 100 targets, randomly generated so that two ordered letters are queried, such as
w⊂y
q⊂u
g⊂v
y⊂z
…
We run two iterations: the first with an empty control atomspace; then we ask OpenCog to discover control rules, populate the control atomspace with them, rerun, and see how many more targets we have solved.
Just by learning context-free control rules, saying basically that deduction is often useful, conditional instantiation is sometimes useful, and all other rules are useless, we go from solving 34 targets to solving 52.
Below are examples of control rules that we have learned.
;; Modus ponens is useless
(ImplicationScopeLink (stv 0 0.03625)
(VariableList
(VariableNode "$T")
(TypedVariableLink
(VariableNode "$A")
(TypeNode "DontExecLink")
)
(VariableNode "$L")
(TypedVariableLink
(VariableNode "$B")
(TypeNode "DontExecLink")
)
)
(AndLink
(EvaluationLink
(PredicateNode "URE:BC:preproof-of")
(ListLink
(VariableNode "$A")
(VariableNode "$T")
)
)
(ExecutionLink
(SchemaNode "URE:BC:expand-and-BIT")
(ListLink
(VariableNode "$A")
(VariableNode "$L")
(DontExecLink
(DefinedSchemaNode "modus-ponens-implication-rule")
)
)
(VariableNode "$B")
)
)
(EvaluationLink
(PredicateNode "URE:BC:preproof-of")
(ListLink
(VariableNode "$B")
(VariableNode "$T")
)
)
)
;; Deduction is often useful
(ImplicationScopeLink (stv 0.44827586 0.03625)
(VariableList
(VariableNode "$T")
(TypedVariableLink
(VariableNode "$A")
(TypeNode "DontExecLink")
)
(VariableNode "$L")
(TypedVariableLink
(VariableNode "$B")
(TypeNode "DontExecLink")
)
)
(AndLink
(EvaluationLink
(PredicateNode "URE:BC:preproof-of")
(ListLink
(VariableNode "$A")
(VariableNode "$T")
)
)
(ExecutionLink
(SchemaNode "URE:BC:expand-and-BIT")
(ListLink
(VariableNode "$A")
(VariableNode "$L")
(DontExecLink
(DefinedSchemaNode "deduction-inheritance-rule")
)
)
(VariableNode "$B")
)
)
(EvaluationLink
(PredicateNode "URE:BC:preproof-of")
(ListLink
(VariableNode "$B")
(VariableNode "$T")
)
)
)
;; Conditional instantiation is sometimes useful
(ImplicationScopeLink (stv 0.12903226 0.03875)
(VariableList
(VariableNode "$T")
(TypedVariableLink
(VariableNode "$A")
(TypeNode "DontExecLink")
)
(VariableNode "$L")
(TypedVariableLink
(VariableNode "$B")
(TypeNode "DontExecLink")
)
)
(AndLink
(EvaluationLink
(PredicateNode "URE:BC:preproof-of")
(ListLink
(VariableNode "$A")
(VariableNode "$T")
)
)
(ExecutionLink
(SchemaNode "URE:BC:expand-and-BIT")
(ListLink
(VariableNode "$A")
(VariableNode "$L")
(DontExecLink
(DefinedSchemaNode "conditional-full-instantiation-implication-scope-meta-rule")
)
)
(VariableNode "$B")
)
)
(EvaluationLink
(PredicateNode "URE:BC:preproof-of")
(ListLink
(VariableNode "$B")
(VariableNode "$T")
)
)
)
These rules are rather simple, but, as reported, can already speed up the backward chainer.
This relates to the work of Ray Solomonoff [9], Juergen Schmidhuber with OOPS [10], Eray Ozkural with HAM [11], Irvin Hwang et al. with BPM [12], Josef Urban with MaLARea [13], and Alexander A. Alemi et al. with DeepMath [14], to cite only a few. What I believe sets it apart, though, is that the system used for solving the problem is the same one used for solving the meta-problem. Which leads to an interesting question that we may well be able to put to the test in the future: can skills acquired to solve a problem be transferred to the meta-level?
Let me expand. If you ask an AI to solve a collection of down-to-earth problems, it will accumulate a lot of knowledge, say rules. Some will be very concrete, related to the specificities of the problems, such as pushing the gas pedal while driving a car, and will not be transferable to the meta-level, because pushing a gas pedal is unrelated to discovering control rules to speed up a program. They will basically remain mute when asked to serve the cognitive process in charge of solving the meta-problem. But some will be more abstract, abstract enough to be recognized by the meta-solver as potentially useful. If these abstract rules can indeed help and be transferred to the upper levels, then it opens the possibility of true intelligence bootstrapping. If it can, then it means we can improve not just learning, but also meta-learning, meta-meta-learning, and so on to infinity, all at once. But realistically, even if it doesn't, or does only to some limited extent, possibly evaporating as the meta-levels go higher, meta-learning may still result in considerable performance gains. In any case, it is our only magic bullet, isn't it?
Thanks to Linas for his feedback.
[1] Ben Goertzel. Probabilistic Growth and Mining of Combinations: A Unifying Meta-Algorithm for Practical General Intelligence. https://link.springer.com/chapter/10.1007/978-3-319-41649-6_35?no-access=true
[2] Ben Goertzel. Cognitive Synergy: A Universal Principle for Feasible General Intelligence? https://pdfs.semanticscholar.org/511e/5646bc1d643585933549b5321a9da5ee5f55.pdf
[3] Ray Solomonoff. Three Kinds of Probabilistic Induction: Universal Distributions and Convergence Theorems. http://world.std.com/%7Erjs/publications/chris1.pdf
[4] Jan Leike et al. Thompson Sampling is Asymptotically Optimal in General Environments. http://auai.org/uai2016/proceedings/papers/20.pdf
[5] Nil Geisweiller. Inference Control Learning Experiment README.md. https://github.com/opencog/opencog/tree/master/examples/pln/inference-control-learning
[6] Matthew Iklé et al. Economic Attention Networks: Associative Memory and Resource Allocation for General Intelligence. http://agi-conf.org/2009/papers/paper_63.pdf
[7] Ben Goertzel et al. Integrating Deep Learning Based Perception with Probabilistic Logic via Frequent Pattern Mining. http://goertzel.org/agi-13/DeSTIN_PLN_v3.pdf
[8] Yun Chi et al. Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees. http://ftp.cs.ucla.edu/tech-report/2004-reports/040020.pdf
[9] Ray J. Solomonoff. Progress in Incremental Machine Learning. NIPS Workshop on Universal Learning Algorithms and Optimal Search, Whistler, B.C., Canada, December 2002. http://raysolomonoff.com/publications/nips02.pdf
[10] Juergen Schmidhuber. Optimal Ordered Problem Solver. Machine Learning 54 (2004) 211–256. https://arxiv.org/pdf/cs/0207097.pdf
[11] Eray Ozkural. Towards Heuristic Algorithmic Memory. https://link.springer.com/chapter/10.1007%2F978-3-642-22887-2_47
[12] Irvin Hwang et al. Inducing Probabilistic Programs by Bayesian Program Merging. https://arxiv.org/pdf/1110.5667.pdf
[13] Josef Urban. MaLARea: a Metasystem for Automated Reasoning in Large Theories. http://ceur-ws.org/Vol-257/05_Urban.pdf
[14] Alexander A. Alemi et al. DeepMath – Deep Sequence Models for Premise Selection. https://arxiv.org/pdf/1606.04442v1.pdf
[15] Ben Goertzel et al. Metalearning for Feature Selection. https://arxiv.org/abs/1703.06990
After some musing, I came to the conclusion that this may be another area where it could make sense to insert a deep neural network inside OpenCog, for carrying out particular functions.
I note that Itamar Arel and others have proposed neural net based AGI architectures in which deep neural nets for perception, action and reinforcement are coupled together. In OpenCog, one could use deep neural nets for perception, action and reinforcement; but, these networks would all be interfaced via the Atomspace, and in this way would be interfaced with symbolic algorithms such as probabilistic logical inference, hypergraph pattern mining and concept blending, as well as with each other.
One interesting meta-point regarding these musings is that they don’t imply any huge design changes to the OpenCog “OpenPsi” action selector. Rather, one could implement deep neural net policies for action selection, learned via reinforcement learning algorithms, as a new strategy within the current action selection framework. This speaks well of the flexible conceptual architecture of OpenPsi.
Action Selection as a Contextual Bandit Problem
For links to fill you in on OpenCog’s current action selection paradigm and code, start here.
The observation I want to make in this blog post is basically this: the problem of action selection in OpenCog (as described at the above link and the others it points to) is an example of the “contextual bandit problem” (CBP).
In the case where more than one action can be chosen concurrently, we have a variation of the contextual bandit problem that has been called “slates” (see the end of this presentation).
So basically the problem is: we have a current context, we have a goal (or several), and we have a bunch of alternative actions. We want, in the long run, to choose actions that will maximize goal achievement. But we don’t know the expected payoff of each possible action. So we need to make a probabilistic, context-dependent choice of which action to do; and we need to balance exploration and exploitation appropriately, to maximize long-term gain. (Goals themselves are part of the OpenCog self-organizing system and may get modified as learning progresses and as a result of which actions are chosen, but we won’t deal with that part of the feedback tangle here.)
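A minimal sketch of this framing in Python, assuming a discrete context and simple epsilon-greedy exploration (OpenCog's actual action selector is much richer; contexts, actions, and payoffs here are invented):

```python
import random
from collections import defaultdict

# Toy contextual bandit: tracks per-(context, action) payoff estimates
# and explores with probability epsilon, otherwise exploits the current
# best estimate for the given context.
class ContextualBandit:
    def __init__(self, actions, epsilon=0.1, rng=random):
        self.actions = actions
        self.epsilon = epsilon
        self.rng = rng
        self.counts = defaultdict(int)    # (context, action) -> pulls
        self.values = defaultdict(float)  # (context, action) -> mean payoff

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)            # explore
        return max(self.actions,
                   key=lambda a: self.values[(context, a)])  # exploit

    def update(self, context, action, payoff):
        key = (context, action)
        self.counts[key] += 1
        # incremental running-mean update of the payoff estimate
        self.values[key] += (payoff - self.values[key]) / self.counts[key]
```

After enough interaction, the estimates become context-dependent: the same action gets a high value in one context and a low value in another, which is exactly what distinguishes the contextual bandit from the plain multi-armed bandit.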
Psi and MicroPsi and OpenPsi formulate this problem in a certain way. Contextual bandit problems represent an alternate formulation of basically the same problem.
Some simple algorithms for contextual bandit problems are here and a more interesting neural net based approach is here. A deep neural net approach for a closely related problem is here.
CBP and OpenPsi
These approaches and ideas can be incorporated into OpenCog’s Psi based action selector, though this would involve using Psi a little differently than we are now.
A “policy” in the CBP context is a function mapping the current context into a set of weightings on implications of the form (Procedure ⇒ Goal).
Most of the time in the reinforcement learning literature a single goal is considered, whereas in Psi/OpenCog one considers multiple goals; but that’s not an obstacle to using RL ideas in OpenCog. One can use RL to figure out procedures likely to lead to fulfillment of an individual goal; or one can apply RL to synthetic goals defined as weighted averages of system goals.
What we have in OpenPsi are implications of the form (Context & Procedure ⇒ Goal) — obviously just a different way of doing what RL is doing…
That is, one can write either (Context & Procedure ⇒ Goal) or (Context ⇒ (Procedure ⇒ Goal)), and these two options are obviously logically equivalent.
A policy, in the above sense, could be used to generate a bunch of Psi implications with appropriate weights. In general a policy may be considered as a concise expression of a large set of Psi implications.
In CBP learning what we often have is a set of competing policies (e.g. competing linear functions, or competing neural networks), each of which provides its own mapping from contexts into (Procedure ⇒ Goal) implications. So, if doing action selection in this approach: to generate an action, one would first choose a policy, then use that policy to generate weighted (Context & Procedure ⇒ Goal) implications (where the Context is very concrete, being simply the current situation), and then use the weights on these implications to choose an action.
In OpenCog verbiage, each policy could in fact be considered a context, so we could have
ContextLink
  ConceptNode "policy_5"
  ImplicationLink
    AndLink
      Context
      Procedure
    Goal
and one would then do action selection using the weighting for the current policy.
If, for instance, a policy were a neural network, it could be wrapped up in a GroundedSchemaNode. A neural net learning algorithm could then be used to manage an ensemble of policies (corresponding behind the scenes to neural networks), and experiment with these policies for action selection.
This does not contradict the use of PLN to learn Psi implications. PLN would most naturally be used to learn Psi implications with abstract Contexts; whereas in the RL approach, the abstraction goes into the policy, and the policy generates Psi implications that have very specific Contexts. Both approaches are valid.
In general, the policy-learning-based approach may often be better when the Context consists of a large number of different factors, with fuzzy degrees of relevance. In this case learning a neural net mapping these contextual factors into weightings across Psi implications may be effective. On the other hand, when the context consists of a complex, abstract combination of a smaller number of factors, a logical-inference approach to synthesizing Psi implications may be superior.
It may also be useful, sometimes, to learn neural nets for CBP policies, and then abstract patterns from these neural nets using pattern mining; these patterns would then turn into Psi implications with abstract Contexts.
(Somewhat Sketchy) Examples
To make these ideas a little more concrete, let’s very briefly/roughly go through some example situations.
First, consider question-answering. There may be multiple sources within an OpenCog system, capable of providing an answer to a certain question, e.g.:
A hard-wired response, which could be coded into the Atomspace by a human or learned via imitation
Fuzzy matcher based QA taking into account the parse and interpretation of the sentence
Pattern matcher lookup, if the Atomspace has definite knowledge regarding the subject of the query
PLN reasoning
The weight to be given to each method, in each case, needs to be determined adaptively based on the question and the context.
A “policy” in this case would map some set of features associated with the question and the context, into a weight vector across the various response sources.
One question is what the right way is to quantify the “context” in a question-answering case. The most obvious approach is to use word-occurrence or bigram-occurrence vectors. One can also potentially add in, say, extracted RelEx relations or RelEx2Logic relations.
If one has multiple examples of answers provided by the system, and knows which answers were accepted by the questioner and which were not, then this knowledge can be used to drive learning of policies. Such a policy would tell the system, given a particular question and the words and semantic relationships therein as well as the conversational context, which answer sources to rely on with what probabilities.
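A toy sketch of such a policy, assuming bag-of-words features of the question and a linear-plus-softmax mapping to weights over answer sources. The source names, feature encoding, and parameters are all invented for illustration; a real policy could equally be a neural network trained on the accepted/rejected answer record.

```python
import math

# Candidate answer sources, loosely following the list above.
SOURCES = ["hardwired", "fuzzy-matcher", "pattern-matcher", "pln"]

def policy_weights(features, params):
    """Map question features to a probability weighting over sources.
    features: dict word -> count (bag of words for the question).
    params:   dict (word, source) -> learned weight."""
    scores = [sum(cnt * params.get((w, s), 0.0)
                  for w, cnt in features.items())
              for s in SOURCES]
    # numerically stable softmax over the per-source scores
    zmax = max(scores)
    exps = [math.exp(z - zmax) for z in scores]
    total = sum(exps)
    return {s: e / total for s, e in zip(SOURCES, exps)}
```

Reinforcement from accepted answers would then adjust `params` so that, e.g., factual “who”-questions shift weight toward direct pattern matcher lookup.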
A rather different example would be physical movement. Suppose one has a collection of movement patterns (e.g. “animations” moving parts of a robot body, each of which may have multiple parameters). In this case one has a slate problem, meaning that one can choose multiple movement patterns at the same time. Further, one has to specify the parameters of each animation chosen; these are part of the action. Here a neural network will be very valuable as a policy representation, as one’s policy needs to take in floating-point variables quantifying the context, and output floating-point variables representing the parameters of the chosen animations. Real-time reinforcement data will be easily forthcoming, thus driving the underlying neural net learning.
(If movement is controlled by a deep neural network, these “animations” may be executed via clamping them in the higher-level nodes of the network, and then allowing the lower-level nodes to self-organize into compatible states, thus driving action.)
Obviously a lot of work and detailed thinking will be required to put these ideas into practice. However, I thought it might be useful to write this post just to clarify the connections between parts of the RL literature and the cognitive modeling approach used in OpenCog (drawn from Dorner, Bach, Psi, etc.). Often it happens that the close relationships between two different AI approaches or subfields are overlooked, due to “surface level” issues such as different habitual terminologies or different historical roots.
Potentially, the direction outlined in this post could enable OpenCog to leverage code and insights created in the deep reinforcement learning community; and to enable deep reinforcement neural networks to be used in more general-purpose ways via embedding them in OpenCog’s neural-symbolic framework.
These are not necessarily proposed for immediate-term development – we have a lot of other stuff going on, and a lot of stuff that’s half-finished or ¾-finished or more, that needs to be wrapped up and improved, etc. They are, however, proposed as potentially fundamental enhancements to the OpenCog and PrimeAGI (CogPrime) designs in their current form.
The core motivation here is to add more self-organizing creativity to the AGI design. One can view these ideas as extending what MOSES does in the PrimeAGI design – MOSES (at least in its aspirational version) is artificial evolution enhanced by pattern mining and probabilistic reasoning; whereas Cogistry is, very loosely, more like artificial biochemistry similarly enhanced.
Historical Background
Walter Fontana, in a fascinating 1990 paper, articulated a novel approach to “artificial life” style digital evolution of emergent phenomena, called “algorithmic chemistry.” The basic concept was: Create small codelets, so that when codelet A acts on codelet B, it produces a new codelet C. Then put a bunch of these codelets in a “primordial algorithmic soup” and let them act on each other, and see what interesting structures and dynamics emerge.
The paper reports some interesting emergent phenomena, but then the research programme was dropped off at an early stage. Very broadly speaking, the story seems to have been similar to what happened with a lot of Alife-related work of that era: Some cool-looking self-organizational phenomena occurred, but not the emergence of highly complex structures and dynamics like the researchers were looking for.
These sorts of results spawned the natural question “why?” Did the simulations involved not have a large enough scale? (After all, the real primordial soup was BIG and based on parallel processing and apparently brewed for quite a long time before producing anything dramatic.) Or were the underlying mechanisms simply not richly generative enough, in some way? Or both?
What I am going to propose here is not so much a solution to this old “why” question, but rather a novel potential route around the problem that spawned the question. My proposal – which I call “Cogistry” — is to enhance good old Fontana-style algorithmic chemistry by augmenting its “self-modifying program soup” with AI algorithms such as hypergraph pattern mining and probabilistic reasoning. I believe that, if this is done right, it can lead to “algorithm soups” with robust, hierarchical, complex emergent structures – and also to something related but new and different: emergent, self-organizing program networks that carry out functions an AI agent desires for achievement of its goals.
That is: my aim with Cogistry is to use probabilistic inference and pattern mining to enhance algorithmic chemistry, so as to create “digital primordial soups” that evolve into interesting digital life-forms, but ALSO so as to create “life-like program networks” that transform inputs into outputs in a way that carries out useful functions as requested by an AI agent’s goal system. The pure “digital primordial soup” case would occur when the inference and pattern mining are operating with the objective of spawning interesting structures and dynamics; whereas the “useful program network” case would occur when the inference and pattern mining are operating with an additional, specific, externally-supplied objective as well.
There is a rough analogy here with the relation between genetic algorithms and estimation of distribution algorithms (EDAs). EDAs aim to augment GA mutation and crossover with explicit pattern mining and probabilistic reasoning. But there are significant differences between EDAs and the present proposal as well. There is a lot more flexibility in an algorithmic chemistry network than in an evolving populations of bit strings or typical GP program trees; and hence, I suspect, a lot more possibility for the “evolution/self-organization of evolvability” and the “evolution/self-organization of inferential analyzability” to occur. Of course, though, this added flexibility also provides a lot more potential for messes to be made (including complex, original sorts of messes).
From an Alife point of view, the “chemistry” in the “algorithmic chemistry” metaphor is intended to be taken reasonably seriously. A core intuition here is that to get rich emergent structures and dynamics from one’s “digital biology” it’s probably necessary to go a level deeper to some sort of “digital chemistry” with a rich combinatorial potential and a propensity to give rise to diverse stable structures, including some with complex hierarchical forms. One might wonder whether this is even deep enough, whether one needs actually to dig down to a “digital physics” from which the digital chemistry emerges; my intuition is that this is not the case and focusing on the level of algorithmic chemistry (if it’s gotten right) is deep enough.
Actually, the “digital physics,” in the analogy pursued here, would be the code underlying the algorithms in the algorithm-soup: the programming-language interpreter and the underlying C and assembly code, etc. So part of my suggestion here will be a suggestion regarding what kind of digital physics is likely to make algorithmic chemistry work best: e.g. functional programming and reversible computing. But, according to my intuition, the core ideas of this “digital physics” have already been created by others for other purposes, and can be exploited for algorithmic-chemistry purposes without needing dramatic innovations on that level.
An Algorithmic Chemistry Framework
First I will define a framework for algorithmic chemistry in a reasonably general way. I will then fill in more details, bit by bit.
The basic unit involved I will call a “codelet” – defined simply as a piece of code that maps one or more input codelets into one or more output codelets. What language to use for specifying codelets is a subtle matter that I will address below, but that we don’t need to define in order to articulate a general algorithmic chemistry framework.
Fontana’s original work involved a chemical soup with no spatial structure, but other work with artificial life systems suggests that a spatial structure may be valuable. So we propose a multi-compartment system, in which each compartment has its own “algorithmic chemical soup.” A codelet, relative to a specific compartment, can be said to have a certain “count” indicating how many copies of that codelet exist in that compartment.
Inside each compartment, multiple dynamics happen concurrently:
I will use the word “codenet” to refer to a collection of codelets that interact with each other in a coherent way. This is intentionally a vague, intuitive definition – because there are many kinds of coherent networks of codelets, and it’s not obvious which ones are going to be the most useful to look at, in which contexts. In some cases a chain of the form “In the presence of background codelet-set B, A1 reacts with B to make A2, which reacts with B to make A3, etc. …” may be very influential. In other cases cycles of the form “A and B react to make C; then A and C react to make more B; and B and C react to make more A” may be critical. In other cases it may be more complex sorts of networks. Exactly how to chop up a soup of codelets, growing and interacting over time, into distinct or overlapping codenets is not entirely clear at present and may never be. However, it’s clear that for understanding what is going on in an algorithmic-chemistry situation, it’s the codenets and not just the codelets that need to be looked at. If codelets are like chemicals, then codenets are like chemical compounds and/or biological systems.
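A minimal toy version of such a compartment can be sketched in a few lines. Here codelets are encoded as tiny tuple “programs,” and the reaction rule concatenates two programs and reduces the result to a normal form; everything below is an invented stand-in for a real codelet language, intended only to show the reaction-loop structure.

```python
import random

# Toy algorithmic-chemistry compartment. Codelets are tuples of (op, arg)
# pairs; reacting codelet A with codelet B concatenates their programs and
# reduces the result (adjacent "add" ops merge) into a normal form.
def reduce_program(prog):
    out = []
    for op, arg in prog:
        if out and out[-1][0] == "add" and op == "add":
            out[-1] = ("add", out[-1][1] + arg)  # merge adjacent adds
        else:
            out.append((op, arg))
    return tuple(out)

def react(a, b):
    """Codelet a acts on codelet b, producing a new codelet."""
    return reduce_program(a + b)

def step(soup, rng):
    """One reaction event: pick two codelets at random, add the product."""
    a, b = rng.sample(soup, 2)
    soup.append(react(a, b))

rng = random.Random(0)
soup = [(("add", 1),), (("add", 2),), (("mul", 2),)]
for _ in range(50):
    step(soup, rng)
```

Even this trivial soup illustrates the key property: the population of codelets grows and changes composition purely through codelets acting on codelets, and it is the recurring reaction patterns (the codenets), not individual products, that carry the interesting structure.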
Implementation Notes
In terms of current computing architectures, it would be natural to run different compartments on different machines, and to run the four processes in different threads, perhaps with multiple threads handling reaction, which will generally be the most intensive process.
If implemented in an OpenCog framework, then potentially, separate compartments could be separate Atomspaces, and the dynamic processes could be separate MindAgents running in different threads, roughly similar to the agents now comprising the ECAN module. Also, in OpenCog, codelets could be sub-hypergraphs in Atomspace, perhaps each codelet corresponding to a DefineLink.
Reactions would naturally be implemented using the Forward Chainer (a part of the Rule Engine, which leverages the Pattern Matcher). This differs a bit from PLN’s use of the Forward Chainer, because in PLN one is applying an inference rule (drawn from a small set thereof) to premises, whereas here one is applying a codelet (drawn from a large set of possible codelets) to other codelets.
Measuring Interestingness
One interesting question is how to measure the interestingness of a codelet, or codenet.
For codelets, we can likely just bump the issue up a level: A codelet is as interesting as the codenets it’s part of.
For codenets, we can presumably rely on information theory. A compartment, or a codenet, as it exists at a particular point in time, can be modeled using a sparse vector with an entry for each codelet that has nonzero count in the compartment or codenet (where the entry for a codelet contains the relevant count). A compartment or codenet as it exists during an interval of time can then be modeled as a series of sparse vectors. One can then calculate the interaction information of this vector-series (or the “surprisingness” as defined in the context of OpenCog’s Pattern Miner). This is a good first stab at measuring how much novelty there is in the dynamics of a codenet or compartment.
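As a crude first step in that direction, here is a sketch that scores a single sparse count vector (one compartment snapshot) by the Shannon entropy of its codelet count distribution. The actual proposal, interaction information over a *series* of such vectors, is considerably more involved; this only illustrates the sparse-vector representation and the information-theoretic flavor.

```python
import math

def snapshot_entropy(counts):
    """Shannon entropy (bits) of the codelet count distribution in one
    compartment or codenet snapshot. counts: dict codelet -> nonzero count
    (the sparse-vector representation described in the text)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

A compartment dominated by copies of one codelet scores near zero, while a diverse soup scores high; a time series of such scores gives a rough dynamical profile to compare codenets by.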
In a codenet containing some codelets representing Inputs and others representing Outputs, one can also calculate interaction information based only on the Input and Output codelets. This is a measure of the surprisingness or informativeness of the codenet’s relevant external behaviors, rather than its internal dynamics.
Pattern Mining and Inference for Algorithmic Chemistry
Given the above approach to assessing interestingness, one can use a modification of OpenCog’s Pattern Miner to search for codenets that have surprising dynamics. One can also, in this way, search for patterns among codenets, so that specific codenets fulfilling the patterns have surprising dynamics. Such patterns may be expressed in the Atomspace, in terms of “abstract codelets” — codelets that have some of their internal Atoms represented as VariableNodes instead.
An “abstract codenet” may be defined as a set of (possibly-abstract codelet, count-interval, time-interval) triples, where the time-interval is defined as a pair (start, end), where start and end are defined as offsets from the initiation of the codenet. The interpretation of such a triple is that (C, (m,n) (s,e)) means that some codelet instantiating abstract codelet C exists with count between m and n, during the time interval spanning from s to e.
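The triple representation can be encoded quite directly; the data structure and matching predicate below are an invented illustration of how one (count, time) observation would be checked against a triple, under the interpretation just given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """(abstract codelet, count-interval, time-interval), where start/end
    are offsets from the initiation of the codenet."""
    codelet: str     # identifier of the abstract codelet pattern C
    count_lo: int    # m: lower bound on instance count
    count_hi: int    # n: upper bound on instance count
    start: float     # s: interval start offset
    end: float       # e: interval end offset

def matches(triple, codelet_count, t):
    """(C, (m,n), (s,e)) holds for an observation if the instantiating
    codelet's count lies in [m, n] at a time t within [s, e]."""
    return (triple.count_lo <= codelet_count <= triple.count_hi
            and triple.start <= t <= triple.end)
```

An abstract codenet would then be a set of such triples, and a concrete codenet instantiates it when every triple is satisfied over its history.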
Note that in order to form useful abstractions from codenets involving different codelets, it will be useful to be able to represent codelets in some sort of normal form, so that comparison of different codelets is tractable and meaningful. This suggests that having the codelet language support some sort of Elegant Normal Form, similar to the Reduct library used in OpenCog’s MOSES subsystem, will be valuable.
Using the Pattern Miner to search for abstract codenets with high degrees of informational surprisingness should be a powerful way to drive algorithmic chemistry in interesting directions. Once one has found abstract codenets that appear to systematically yield high surprisingness, one can use these to drive probabilistic generation of concrete codenets, let them evolve in the algorithmic soup, and see what they lead to.
Furthermore, once one has abstract codenets with apparent value, one can apply probabilistic inference to generalize from these codenets, using deductive, inductive and abductive reasoning, e.g. using OpenCog’s Probabilistic Logic Networks. This can be used to drive additional probabilistic generation of concrete codenets to be tried out.
“Mutation” and “crossover” of codenets or codelets can be experimented with on the inferential level as well – i.e. one can ask one’s inference engine to estimate the likely interestingness of a mutation or crossover of observed codenets or codelets, and then try out the mutation or crossover products that have passed this “fitness estimation” test.
This kind of pattern mining and inference will certainly be far from trivial to get right. However, conceptually, it seems a route with a reasonably high probability of surmounting the fundamental difficulties faced by earlier work in artificial life and algorithmic chemistry. It is something conceptually different from “mere self-organization” or “mere logical reasoning” – it is Alife/Achem-style self-organization and self-modification, but explicitly guided by abstract pattern recognition and reasoning. One is doing symbolic AI to accelerate and accentuate the creativity of subsymbolic AI.
The above pertains to the case where one is purely trying to create algorithmic soups with interesting internal dynamics and structures. However, it applies also in cases where one is trying to use algorithmic chemistry to learn effective input-output functions according to some fitness criteria. In that case, after doing pattern mining of surprising abstract codenets, one can ask a different question: Which codenets, and which combinations thereof, appear to differentially lead to high-fitness transformation of inputs into outputs? One can then generate new codenets from the distribution obtained via answering this question. This is an approach to solving the “assignment of credit” problem from a “God’s eye” point of view – by mining patterns from the network over time … a fundamentally different approach to assignment of credit than has been taken in subsymbolic AI systems in the past.
Desirable Properties of a Cogistry Codelet Language
Designing the right language for the above general “Cogistry” approach is a subtle task, and I won’t try to do so fully here. I’ll just sketch some ideas and possible requirements.
Fontana’s original algorithmic chemistry work uses a variety of LISP, which seems a sound and uncontroversial choice (and the same choice we made in OpenCog’s MOSES GA-EDA tool, for example). However, a few variations and additions to this basic LISP-ish framework seem potentially valuable:
Float-weighted lists are very handy for dealing with perceptual data, for example. They also provide an element of continuity, which may help with robustness. Codelets relying on float vectors of weights can be modified slightly via modifying the weights, leading to codelets with slightly different behaviors – and this continuity may make learning of new codelets via sampling from distributions implied by abstract codenets easier.
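A toy illustration of the continuity point: a codelet that chooses among outputs according to a float weight vector. Nudging a weight slightly shifts the empirical behavior slightly, rather than flipping it discontinuously. All names here are hypothetical:

```python
import random

def weighted_choice(items, weights, rng):
    """Pick an item with probability proportional to its weight."""
    total = sum(weights)
    r = rng.random() * total
    for item, w in zip(items, weights):
        r -= w
        if r <= 0:
            return item
    return items[-1]

def behavior(weights, rng, trials=10000):
    """Empirical output distribution of a codelet driven by `weights`."""
    items = list(range(len(weights)))
    counts = [0] * len(weights)
    for _ in range(trials):
        counts[weighted_choice(items, weights, rng)] += 1
    return [c / trials for c in counts]

rng = random.Random(0)
base    = behavior([1.0, 1.0], rng)
mutated = behavior([1.0, 1.1], rng)   # a small weight nudge ...
# ... yields a small behavioral shift, not a discontinuous jump
print(base, mutated)
```

It is this smoothness that could make sampling new codelets from distributions implied by abstract codenets better behaved than mutating purely discrete code.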
Further, it seems to me we may want to make all operations reversible. If the atomic operations on bits, ints and floats are reversible, then corresponding operations on lists and sets and weighted vectors can also easily be made reversible. (For instance, removing the final element of a list can be made into a reversible operation by, instead, using an operation that splits a list into two parts: the list with the final element removed, and the final element itself.) The intuition here is that reversibility introduces a kind of “conservation of information” into the system, which should prevent the advent of various sorts of pathological runaway dynamics like Fontana observed in his early simulations. If codelets can produce “more” than they take in, then evolution will naturally lead to codelets that try to exploit this potential and produce more and more and more than they take in. But if codelets have to be “conservative” and essentially act only by rearranging their inputs, then they have to be cleverer to survive and flourish, and are more strongly pushed to create complexly interlocking self-organizing structures.
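The list-splitting example can be written out directly. This is just a demonstration that the split/join pair conserves information, not OpenCog code:

```python
def split_last(lst):
    """Reversible stand-in for 'remove the final element': nothing is
    discarded, so the inverse (join) can always reconstruct the input."""
    if not lst:
        raise ValueError("cannot split an empty list")
    return lst[:-1], lst[-1]

def join_last(init, last):
    """Exact inverse of split_last."""
    return init + [last]

xs = [1, 2, 3]
init, last = split_last(xs)
assert join_last(init, last) == xs   # round-trips: information conserved
print(init, last)   # [1, 2] 3
```

Contrast this with a plain "drop the last element", whose inverse cannot exist because the dropped element is gone.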
I’m well aware that mining patterns among functions of float variables is difficult, and it would be easier to restrict attention to discrete variables – but ultimately I think this would be a mistake. Perceptual data seems to be very naturally represented in terms of float vectors, for example. Perhaps an innovative approach will be needed here, e.g. instead of floats one could use confidence intervals (an x% chance of lying in the interval (L,U)). Reversible division on confidence intervals would require a bit of fiddling to work out, but seems unlikely to be fundamentally difficult.
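A sketch of the easy case: interval addition with an exact inverse. The fiddly reversible division mentioned above is deliberately left out, and the `Interval` class is invented for illustration:

```python
class Interval:
    """Toy confidence interval: the value lies in [lo, hi]."""
    def __init__(self, lo, hi):
        assert lo <= hi
        self.lo, self.hi = lo, hi
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"
    def __eq__(self, other):
        return (self.lo, self.hi) == (other.lo, other.hi)

def add(x, y):
    """Forward op: endpoint-wise interval addition."""
    return Interval(x.lo + y.lo, x.hi + y.hi)

def unadd(z, y):
    """Reverse op: recover x from z = add(x, y).  Exact, so the pair
    (add, unadd) conserves information for this operation."""
    return Interval(z.lo - y.lo, z.hi - y.hi)

x, y = Interval(1.0, 2.0), Interval(0.5, 1.0)
z = add(x, y)
assert unadd(z, y) == x
print(z)   # [1.5, 3.0]
```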
Whoaaaoowwawwww!!
The idea of Cogistry seemed very simple when I initially thought of it; but when I sat down to write this post summarizing it, it started to seem a lot more complicated. There are a lot of moving parts! But, hey, nobody said building a thinking machine and creating digital life in one fell swoop was going to be easy…. (Well actually OK, some people did say that… er, actually, I did say that, but I took it back a decade and a half ago!! …)
What I like about the idea of Cogistry, though, is that – if it works (and I’m sure it will, after a suitable amount of research and fiddling!) — it provides a way to combine the fabulous generative creativity of biological systems, with the efficiency and precision of digital-computer-based pattern mining and probabilistic reasoning. Such a combination has the potential to transcend the “symbolic versus subsymbolic” and “biological versus computational” dichotomies that have plagued the AI field since nearly its inception (and that indeed reflect deeper complex dichotomies and confusions in our contemporary culture with its mixture of information-age, machine-age, and biological/emotional/cultural-humanity aspects). Some of the details are gonna be freakin’ complicated to work out but, though I realize it sounds a bit vain or whatever, I have to say I feel there is some profound stuff lurking here…
Notes Toward a Possible Development Plan
At first blush, it seems to me that most of the hard work here is either
From this perspective, an approach to making Cogistry real would be to start by
Now, I would not expect this initial work to yield great results… since basically it’s a matter of reimplementing good old Alife/Achem stuff in a context where inference, pattern mining, ECAN etc. can be layered on. Without the layering on of these AI tools, one would expect to find familiar Alife-y issues: some interesting structures emerging, but hitting a complexity ceiling … and then being uncertain whether increased scale or a change to the codelet language might be the key to getting more interesting things to emerge.
But beyond this basic framework, the other things needed for Cogistry are all things needed for other OpenCog AGI work anyway:
With the basic codelet-system framework in place, using these things for Cogistry alongside their other uses would be “straightforward in principle”.
— Thanks are due to Cassio Pennachin and Zar Goertzel for comments on an earlier version of the text contained in this post.
The idea of a block-chain comes from the idea of block ciphers, where you securely sign (or encrypt) some message by chaining together blocks of data, in such a way that each prior encrypted block provides “salt” or a “seed” for the next block. Both bitcoin and git use block-chaining to provide cryptographic signatures authenticating the data that they store. Now, git stores big blobs of ASCII text (aka “source code”), while bitcoin stores a very simple (and not at all general) ledger. Instead of storing text-blobs, like git, or storing an oversimplified financial ledger, like bitcoin, what if, instead, we could store general structured data? Better yet: what if it was tuned for knowledge representation and reasoning? Better still: what if you could store algorithms in it, that could be executed? Put all of these things together, and you’ve got exactly what you need for smart contracts: a “secure”, cryptographically-authenticated general data store with an auditable transaction history. Think of it as internet-plus: a way of doing distributed agreements, world-wide. It has been the cypher-punk day-dream for decades, and it is now maybe within reach. The rest of this essay unpacks these ideas a bit more.
When I say “git, the block-chain”, I’m not joking or misunderstanding; I mean it. Bitcoin takes the core idea of git, and adds a new component: incentives to provide an “Acked-by” or a “Signed-off-by” line, which git does not provide: with git, people attach Ack and Sign-off lines only to increase their personal status, rather than to accumulate wealth. What is more, git does NOT handle acked-by/signed-off-by in a cryptographic fashion: it is purely manual; Torvalds or Andrew Morton or the other maintainers accumulate these, and they get added manually to the block chain, by cut-n-paste from email into the git commit message.
Some of the key differences between git and bitcoin are:
For the things that I am interested in, I really don’t care about the mining aspect of blockchains. It’s just stupid. Git is a superior block-chain to bitcoin. It’s got more features, it’s got a better API, and it offers consistent histories, that is, merging! Which bitcoin does not. Understandably: bitcoin wants to prevent double-spending. But there are other ways to avoid double-spending than to force a single master. Git shows the way.
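To make the git-versus-bitcoin comparison concrete, here is a toy content-addressed commit DAG: like git, a commit’s id hashes its content plus its parents’ ids, so every commit seals the whole history behind it, and a merge commit simply has two parents — something bitcoin’s single chain cannot express. This is a sketch of the idea, not git’s actual object format:

```python
import hashlib

def commit(message, parents=()):
    """A git-style commit: its id is a hash over its content and its
    parents' ids, so each commit seals the entire history behind it."""
    payload = message + "".join(parents)
    cid = hashlib.sha256(payload.encode()).hexdigest()
    return {"id": cid, "message": message, "parents": list(parents)}

a = commit("genesis")
b = commit("alice's change", parents=[a["id"]])
c = commit("bob's change",   parents=[a["id"]])
# Unlike bitcoin's single chain, two divergent histories can be merged:
m = commit("merge alice and bob", parents=[b["id"], c["id"]])
assert len(m["parents"]) == 2
# Tampering with an ancestor changes every descendant's id:
b2 = commit("alice's EDITED change", parents=[a["id"]])
assert b2["id"] != b["id"]
```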
Now, having built git up, I should note it also has a lot of weaknesses: it does not provide any sort of built-in search or query over structured data. You can say “git log” and view the commit messages, but you cannot run structured queries over the stored contents: there is no such feature.
Git is designed for block-chaining unstructured ASCII (utf8) blobs of character strings (source-code, basically; it started life as a source-code control system). Let’s compare that to structured data. Around 1970, the core concepts of relations and relational queries got worked out: the meaning of “is-a”, “has-a”, “part-of”, “is-owned-by”, etc. The result of this research was the concept of a relational database, and a structured query language (SQL) to query that structured data. Businesses loved SQL, and Oracle, Sybase, and IBM DB2 boomed in the 1970s and 1980s, because the concept of relational data fit very well with the way that businesses organize data.
Let’s compare SQL to bitcoin: in bitcoin, there is only ONE relational table, and it is hard-coded. It can store only one thing: a quantity of bitcoin. There is only one thing you can do to that table: add or remove bitcoin. That’s it.
In SQL, the user can design any kind of table at all, to hold any kind of data. Complete freedom. So, if you wanted to implement block-chained smart contracts, that is what you would do: allow the user to create whatever structured data they might want. For example: every month, the baker wants to buy X bags of flour from the miller for Y dollars: this is not just a contract, but a recurring contract: every month, it is the same. To handle it, an SQL architect designs an SQL table to store dollars, bags of flour, multiple date-stamps: datestamp of when the order was made, date-stamp of when the order was given to the shipping firm (who then crypto-signs the block-chain of this transaction), the datestamp of when the baker received the flour, the datestamp of when the baker paid the miller. Each of these live on the block-chain, each get crypto-signed when the transaction occurs.
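A sketch of what that table might look like, using Python’s built-in sqlite3. The schema and column names are invented for the example, and the crypto-signing of each stage is omitted:

```python
import sqlite3

# A table shaped to fit the recurring purchase contract, with one
# date-stamp column per stage of the real-world process.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE flour_contract (
        id            INTEGER PRIMARY KEY,
        bags_of_flour INTEGER NOT NULL,
        price_dollars REAL    NOT NULL,
        ordered_on    TEXT    NOT NULL,
        shipped_on    TEXT,      -- signed by the shipping firm
        received_on   TEXT,      -- signed by the baker
        paid_on       TEXT       -- signed by the miller
    )""")
db.execute("INSERT INTO flour_contract"
           " (bags_of_flour, price_dollars, ordered_on)"
           " VALUES (?, ?, ?)", (20, 350.0, "2016-06-01"))
db.execute("UPDATE flour_contract SET shipped_on = ? WHERE id = 1",
           ("2016-06-03",))
row = db.execute("SELECT bags_of_flour, shipped_on, paid_on"
                 " FROM flour_contract").fetchone()
print(row)   # (20, '2016-06-03', None) -- payment stage still open
```

Each UPDATE filling in a date-stamp is exactly the event that, on a block-chain, would be hashed and signed.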
The SQL architect was able to design the data table in such a way that it is NATURAL for the purchase-ship-sell, inventory, accounts-payable, accounts-receivable way that this kind of business is conducted.
There are far more complicated transactions, in the petroleum industry, where revenue goes to pipeline owners, well owners, distillers, etc. in a very complicated process. Another example is the music-industry royalties. Both of these industries use a rather complex financial ledger system that resemble financial derivatives, except that there is no futures contract structure to it: the pipeline owner cannot easily arbitrage the petroleum distiller. Anyway, this is what accounting programs and general ledgers excel at: they match up with the business process, and the reason they can match up is because the SQL architect can design the database tables so that they fit well with the business process.
If you want to build a blockchain-based smart contract, you need to add structured data to the block-chain. So this is an example of where git falls flat: it’s an excellent block-chain, but it can only store unstructured ASCII blobs.
Comparing Git to SQL: Git is also missing the ability to perform queries: but of course: the git data is unstructured, so queries are hard/impossible, by nature. A smart-contract block-chain MUST provide a query language! Without that, it is useless. Let me say it again: SQL is KEY to business contracts. If you build a blockchain without SQL-like features in it, it will TOTALLY SUCK. The world does not need another bitcoin!
I hope you have followed me so far.
OK, now, we are finally almost at where OpenCog is. So: the idea of relational data and relational databases was fleshed out in the 1960s and 1970s, and it appears to be enough for accounting. However, it is not enough for other applications, in two different ways.
First, for “big data”, it is much more convenient to replace SQL and ACID with NoSQL and BASE. The Google MapReduce system is a prime example: it provides a highly distributed, highly parallelizable query mechanism for structured data. Conclusion: if you build a block-chain for structured data, but use only SQL-type PRIMARY KEYs for your tables, it will fail to scale to big-data levels. Your block-chain needs to support both SQL and NoSQL. The good news is that this is a “solved problem”: SQL and NoSQL are known to be category-theoretic duals; there is a famous Microsoft paper on this: Erik Meijer and Gavin Bierman, “A co-Relational Model of Data for Large Shared Data Banks”, ACM Queue, Volume 9, Issue 3, March 18, 2011. As its subtitle puts it: “Contrary to popular belief, SQL and noSQL are really just two sides of the same coin.”
Next problem: for the task of “knowledge representation” (ontology, triple-stores, OWL, SPARQL) and “logical reasoning”, the flat tables and structures offered by SQL/NoSQL are insufficient; it turns out that graph databases are much better suited for this task. Thus, we have the concept of a graph database; some well-known examples include Neo4j, Apache TinkerPop, etc.
The OpenCog AtomSpace fits into this last category. Here, the traditional 1960s-era “is-a” relation corresponds to the OpenCog InheritanceLink. Named relations (such as “Billy Bob is a part-time employee” in an SQL table) are expressed using EvaluationLinks and PredicateNodes:
(EvaluationLink
    (PredicateNode "is-employee")
    (ListLink
        (ConceptNode "BillyBob")
        (ConceptNode "employee")))
It’s a bit verbose, but it is another way of expressing the traditional SQL relations. It is somewhat NoSQL-like, because you do not have to declare an “is-employee” table in advance, the way you do in SQL (there is no “hoisting”); instead, you can create new predicates dynamically, on the fly, at any time.
OpenCog has a centralized database, called the AtomSpace. Notice how the above is a tree, and so the AtomSpace becomes a “forest of trees”. In the atomspace, each link or node is unique, and so the trees share nodes and links: the result is called a “Levi graph”, a general bipartite way of representing hypergraphs. So, the atomspace is not just a graph database, it’s a hypergraph database.
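The node-sharing can be sketched with a toy interning (“hash-consing”) table. This illustrates the forest-over-shared-vertices idea, not the actual AtomSpace implementation:

```python
_atomspace = {}

def atom(*parts):
    """Intern an atom (a node, or a link over other atoms).  Because each
    distinct atom is stored exactly once, trees inserted into the space
    automatically share sub-structure: a forest over shared vertices,
    which is what makes the space a (Levi-graph style) hypergraph."""
    key = parts
    if key not in _atomspace:
        _atomspace[key] = key
    return _atomspace[key]

billy = atom("ConceptNode", "BillyBob")
emp   = atom("ConceptNode", "employee")
is_emp = atom("EvaluationLink", atom("PredicateNode", "is-employee"),
              atom("ListLink", billy, emp))
is_person = atom("EvaluationLink", atom("PredicateNode", "is-a"),
                 atom("ListLink", billy, atom("ConceptNode", "person")))
# Both trees reference the *same* BillyBob object, not copies:
assert is_emp[2][1] is is_person[2][1]
```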
Edits to this database are very highly regulated and centralized: so there is a natural location where a blockchain signature could be computed: every time an atom is added or removed, that is a good place to hash the atomspace contents, and apply a signature.
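A sketch of what such a signature point might look like: a hash-chained log of add/remove edits, where each entry’s hash covers the previous entry’s hash. This is hypothetical, not an existing AtomSpace feature:

```python
import hashlib

def chain_edit(log, op, atom_repr):
    """Append an add/remove edit, hashing it together with the previous
    entry's hash -- the natural 'signature point' for each edit."""
    prev = log[-1]["hash"] if log else "0" * 64
    h = hashlib.sha256(f"{prev}:{op}:{atom_repr}".encode()).hexdigest()
    log.append({"op": op, "atom": atom_repr, "hash": h})

def verify(log):
    """Recompute every hash; any tampering breaks the chain."""
    prev = "0" * 64
    for entry in log:
        h = hashlib.sha256(
            f"{prev}:{entry['op']}:{entry['atom']}".encode()).hexdigest()
        if h != entry["hash"]:
            return False
        prev = h
    return True

log = []
chain_edit(log, "add", '(ConceptNode "BillyBob")')
chain_edit(log, "add", '(PredicateNode "is-employee")')
assert verify(log)
log[0]["atom"] = '(ConceptNode "Mallory")'   # tamper with history ...
assert not verify(log)                       # ... and the chain detects it
```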
The atomspace does NOT have any sort of history-of-transactions (we have not needed one, yet). We are (actually, Nil is) working on something similar, though, called the “backwards-inference tree”, which is used to store a chain of logical deductions or inferences that get made. It’s kind of like a transaction history, but instead of storing any kind of transaction, it only stores those transactions that can be chained together to perform a forward-chained logical deduction. Because each of these deductions leads to yet another deduction, this is also a natural location to perform crypto block-chaining. That is, if some early inference is wrong or corrupted, all later inferences become invalid; that is the chaining. So we chain, but we have not needed crypto signatures on that chain.
The atomspace also has a query language, called the “pattern matcher”. It is designed to search only the current contents of the database. I suppose it could be extended to search the transaction history. The backward-inference-tree chains were designed by Nil to be explicitly compatible with the pattern matcher.
The AtomSpace is a typed graph store, and some of the types are taken from predicate logic: there is a boolean AndLink, boolean OrLink, a boolean NotLink; but also an intuitionist-logic ChoiceLink, AbsentLink, PresentLink, and to round it out, a Kripke-frame ContextLink (similar to a CYC “microtheory” but much, much better). The reason I am mentioning these logic types is because they are the natural constructor types for smart contracts: in a legal contract, you want to say “this must be fulfilled and this or this but not this”, and so the logical connectives provide what you need for specifying contractual obligations.
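To see why these connectives suit contractual obligations, here is a toy evaluator over clause trees; the tuple encoding and names are invented, and the real Atomese connectives are of course richer than this:

```python
# Hypothetical mini-evaluator: AndLink/OrLink/NotLink-style connectives
# applied to contract clauses, each clause being fulfilled or not.
def evaluate(expr, fulfilled):
    op = expr[0]
    if op == "clause":
        return expr[1] in fulfilled
    if op == "and":
        return all(evaluate(e, fulfilled) for e in expr[1:])
    if op == "or":
        return any(evaluate(e, fulfilled) for e in expr[1:])
    if op == "not":
        return not evaluate(expr[1], fulfilled)
    raise ValueError(op)

# "Flour delivered AND (paid cash OR paid credit) AND NOT contract voided"
contract = ("and",
            ("clause", "delivered"),
            ("or", ("clause", "paid-cash"), ("clause", "paid-credit")),
            ("not", ("clause", "voided")))
print(evaluate(contract, {"delivered", "paid-credit"}))   # True
print(evaluate(contract, {"delivered", "voided"}))        # False
```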
Next, the AtomSpace has LambdaLinks, which implement the lambda abstractor from lambda calculus. This enables generic computation: you need this for smart contracts. The AtomSpace is NOT very strong in this area, though: it provides a rather basic computational ability with the LambdaLink, but it is very primitive, and does not go much farther. We do some, but not a lot of, computation in the AtomSpace. It was not meant to be the kind of programming language that humans would want to code in.
The atomspace does NOT have any lambda flow in it, e.g. Marius Buliga’s ChemLambda. I am still wrestling with that. The atomspace does have a distinct beta-reduction type, called PutLink, dual to the LambdaLink abstractor. However, for theorem-proving, I believe that a few more abstractors are needed: Buliga has four: lambda and beta, and two more. I am also trying to figure out Jean-Yves Girard’s Ludics. Not there, yet.
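The LambdaLink/PutLink pairing can be sketched as plain beta-reduction over nested tuples. This is a capture-naive toy, not the AtomSpace’s actual reduction machinery:

```python
def substitute(body, var, value):
    """Capture-naive substitution over nested-tuple 'atoms'."""
    if body == var:
        return value
    if isinstance(body, tuple):
        return tuple(substitute(b, var, value) for b in body)
    return body

def put(lam, value):
    """A PutLink-style beta reduction: apply a ("Lambda", var, body)
    abstraction to a value by substituting it for the variable."""
    _, var, body = lam
    return substitute(body, var, value)

lam = ("Lambda", "$x",
       ("EvaluationLink", "is-employee", ("ListLink", "$x", "employee")))
print(put(lam, "BillyBob"))
# → ('EvaluationLink', 'is-employee', ('ListLink', 'BillyBob', 'employee'))
```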
Perhaps I failed to mention: the current AtomSpace design has no security features in it, whatsoever. Absolutely zero. Even the most trivial hostile action will wipe out everything. There is a reason for this: development is focused on reasoning and thinking. Also, the current atomspace is not scalable. It’s also rather low-performance. It’s unsuitable for big-data. None of the checkboxes that many people look for are satisfied by the atomspace. That’s because these issues are, for this project, quite low priority. We are focused on reasoning and understanding, and just about nothing else.
So, taken at face value, it is absurd to contemplate a blockchain for the atomspace; without even basic security, or decentralized, distributed storage, byzantine fault tolerance, and high performance, it’s a non-starter for serious consideration. Can these checkboxes be added to the atomspace, someday? Maybe. Soon? Not at all likely. These are nice-to-haves, but opencog’s primary focus must remain reasoning and thinking, not scalable, distributed, secure storage.
So that’s it, then: you can think of the OpenCog atomspace as a modern-day graphical, relational database that includes the datalog fragment of prolog, and lots of other parts as well. It has an assortment of weaknesses and failures, which I know of, but won’t get into here. It is probably a decent rough sketch for the data storage system that you’d want for a block-chained smart contract. To make it successful, you would need to do a whole lotta things:
The problem here is that, as a business, companies like IBM and PwC will trounce you at the high-end, cause they already have the business customers, and the IBM STSM’s are smart enough to figure out how block-chains work, and so will get some architects to create that kind of system for them. At the low-end, there must be thousands of twenty-something programmers writing apps for cell-phones, daydreaming of being the next big unicorn, and they are all exploring payment systems and smart-cards and whatever, at a furious pace. So if you really want a successful block-chain, smart-contract business, here, hold on to your butt.
I think that the only hope is to go open source, work with the Apache foundation, have them do the marketing for the AtomSpace or something like it, and set up API’s that people want to use. That’s a lot of work. But that is the way to go.
OpenCog has a reasoning system, called PLN, short for “Probabilistic Logic Networks”. It’s actually two things: first and foremost, it’s a set of “rules of inference”, which can be applied to “real world knowledge” to deduce new “facts” about the world. There are about half a dozen of these rules, and one of them resembles the classical “Modus Ponens”, except that it assigns a probability to the outcome, based on the probabilities of the inputs. For the rest of this post, the details of PLN mostly don’t matter: if you wish, you can think of classical propositional logic, or of some kind of fuzzy logic, or even of competing systems such as NARS. Anyway, PLN applies these rules of inference to the Atoms contained in the AtomSpace, to generate new Atoms. This is a fancy way of saying that the AtomSpace is the knowledge repository in OpenCog, and that the atoms are the “facts”. It’s not much more than that: it’s just a big jumble of facts.
I want to talk about reasoning using PLN. Now, this is NOT the way that the current opencog code base implements PLN reasoning; however, it’s a conceptual description of what it could (or should, or might) do.
Now, I mentioned that PLN consists of maybe a half-dozen or a dozen rules of inference. They have fancy names like “modus ponens” but we could call them just “rule MP” … or just “rule A”, “rule B”, and so on.
Suppose I start with some atomspace contents, and apply the PLN rule A. As a result of this application, we have a “possible world 1”. If, instead, we started with the same original atomspace contents as before, but applied rule B, then we would get “possible world 2”. It might also be the case that PLN rule A can be applied to some different atoms from the atomspace, in which case, we get “possible world 3”.
Each possible world consists of the triple (some subset of the atomspace, some PLN inference rule, the result of applying the PLN rule to the input). Please note that some of these possible worlds are invalid or empty: it might not be possible to apply the chosen PLN rule to the chosen subset of the atomspace. I guess we should call these “impossible worlds”. You can say that their probability is exactly zero.
Observe that the triple above is an arrow: the tail of the arrow is “some subset of the atomspace”, the head of the arrow is “the result of applying PLN rule X”, and the shaft of the arrow is given a name: it’s “rule X”. (In fancy-pants, peacock language, the arrows are morphisms, and the slinging-together, here, gives Kripke frames. But let’s avoid the fancy language, since it confuses things a lot. Just know that it’s there.)
Anyway, considering this process: it clearly results in a very shallow tree, with the original atomspace as the root, and each branch of the tree corresponding to a possible world. Note that each possible world is a new and different atomspace: the rules of the game here are that we are NOT allowed to dump the results of the PLN inference back into the original atomspace. Instead, we MUST fork the atomspace. Thus, if we have N possible worlds, then we have N distinct atomspaces (not counting the original, starting atomspace). This is very different from what the PLN code base does today: it currently dumps its results back into the original atomspace. But, for this conceptual model, we don’t want to do that.
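The fork-per-rule-application discipline can be sketched in a few lines; the toy rules and frozenset atomspaces here are invented for illustration:

```python
def apply_rule(rule, atomspace):
    """Try a rule against every atom; each successful application forks
    a fresh atomspace rather than mutating the original."""
    worlds = []
    for atom in atomspace:
        derived = rule(atom)
        if derived is not None and derived not in atomspace:
            worlds.append(frozenset(atomspace | {derived}))
    return worlds

# Two toy inference rules, standing in for "rule A" and "rule B":
def rule_a(atom):
    return ("mortal", atom[1]) if atom[0] == "man" else None

def rule_b(atom):
    return ("featherless-biped", atom[1]) if atom[0] == "man" else None

root = frozenset({("man", "Socrates"), ("man", "Plato")})
worlds = apply_rule(rule_a, root) + apply_rule(rule_b, root)
print(len(worlds))   # → 4 possible worlds branch off the root
# Because forked atomspaces are values, identical worlds reached along
# different routes collapse together when collected into a set:
deeper = {w2 for w in worlds
          for w2 in apply_rule(rule_a, w) + apply_rule(rule_b, w)}
```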
Now, for each possible world, we can apply the above procedure again. Naively, this is a combinatoric explosion. For the most part, each different possible world will be different from the others. They will share a lot of atoms in common, but some will be different. Note, also, that *some* of these worlds will NOT be different, but will converge, or be “confluent”, arriving at the same atomspace contents along different routes. So, although, naively, we have a highly branching tree, it should be clear that sometimes, some of the branches come back together again.
I already pointed out that some of the worlds are “impossible”, i.e. have a probability of zero. These can be discarded. But wait, there’s more. Suppose that one of the possible worlds contains the statement “John Kennedy is alive” (with a very, very high confidence), while another one contains the statement “John Kennedy is dead” (with a very, very high confidence). What I wish to claim is that, no matter what future PLN inferences might be made, these two worlds will never become confluent.
There is also a different effect: during inferencing (i.e. the repeated application of PLN), one might find oneself in a situation where the atoms being added to the atomspace, at each inference step, have lower and lower probability. At some point, this suggests that one should just plain quit: that particular branch is just not going anywhere. It’s a dead end. A similar situation occurs when no further PLN rules can be applied. Dead end.
OK, that’s it. The above provides a very generic description of how inferencing can be performed. It doesn’t have to be PLN — it could be anything — classical logic using sequent calculus, for example. So far, everything I said is very easy-peasy, direct and straightforward. So now is where the fun starts.
First, (let’s get it out of the way now) the above describes *exactly* how Link Grammar works. For “atomspace” substitute “linkage” and for “PLN rule of inference” substitute “disjunct”. That’s it. End of story. QED.
Oh, I forgot to introduce Link Grammar. It is a system for parsing natural languages, such as English. It does this by maintaining a dictionary of so-called “disjuncts”, which can be thought of “jigsaw puzzle pieces”. The act of parsing requires finding and joining together the jigsaw pieces into a coherent whole. The final result of the act of parsing is a linkage (a parse is a linkage – same thing). These jigsaw puzzle pieces are nicely illustrated in the very first paper on Link Grammar.
Notice that each distinct linkage in link-grammar is a distinct possible-world. The result of parsing is to create a list of possible worlds (linkages, aka “parses”). Now, link-grammar has a “cost system” that assigns different probabilities (different costs) to each possible world: this is “parse ranking”: some parses (linkages) are more likely than others. Note that each different parse is, in a sense, “not compatible” with every other parse. Two different parses may share common elements, but other parts will differ.
Claim: the link-grammar is a closed monoidal category, where the words are the objects, and the disjuncts are the morphisms. I don’t have the time or space to articulate this claim, so you’ll have to take it on faith, or think it through, or compare it to other papers on categorial grammar or maybe pregroup grammar. There is nice example from Bob Coecke showing the jigsaw-puzzle pieces. You can see a similar story develop in John Baez’s “Rosetta Stone” paper, although the jigsaw-pieces are less distinctly illustrated.
Theorem: the act of applying PLN, as described above, is a closed monoidal category. Proof: A “PLN rule of inference” is, abstractly, exactly the same thing as a link-grammar disjunct. The contents of the atomspace are exactly the same thing as a (partially or fully) parsed sentence. QED.
There is nothing more to this proof than that. I mean, it can be fleshed out in much greater detail, but that’s the gist of it.
Observe two very important things: (1) During the proof, I never once had to talk about modus ponens, or any of the other PLN inference rules. (2) During the proof, I never had to invoke the specific mathematical formulas that compute the PLN “TruthValues” — that compute the strength and confidence. Both of these aspects of PLN are completely and utterly irrelevant to the proof. The only thing that mattered is that PLN takes, as input, some atoms, and applies some transformation, and generates atoms. That’s it.
The above theorem is *why* I keep talking about possible worlds and kripke-blah-blah and intuitionistic logic and linear logic. It’s got nothing to do with the actual PLN rules! The only thing that matters is that there are rules, that get applied in some way. The generic properties of linear logic etc. are the generic properties of rule systems and Kripke frames. Examples of such rule systems include link-grammar, PLN, NARS, classical logic, and many more. The details of the specific rule system do NOT alter the fundamental process of rule application aka “parsing” aka “reasoning” aka “natural deduction” aka “sequent calculus”. In particular, it is a category error to confuse the details of PLN with the act of parsing: the logic that describes parsing is not PLN, and PLN does not describe parsing; it’s an error to confuse the two.
Phew.
What remains to be done: I believe that what I describe above, the “many-worlds hypothesis” of reasoning, can be used to create a system that is far more efficient than the current PLN backward/forward chainer. It’s not easy, though: the link-parser algorithm struggles with the combinatoric explosion, and has some deep, tricky techniques to beat it down. ECAN was invented to deal with the explosion in PLN. But there are other ways.
By the way: the act of merging the results of a PLN inference back into the original atomspace corresponds, in a very literal sense, to a “wave function collapse”. As long as you keep around multiple atomspaces, each containing partial results, you have “many worlds”, but every time you discard or merge some of these atomspaces back into one, it’s a “collapse”. That includes some of the truth-value merge rules that currently plague the system. To truly understand these last three sentences, you will, unfortunately, have to do a lot of studying. But I hope this blog post provides a good signpost.
In order for machine intelligence to perform in the real world, it needs to create an internal model of the external world. This can be as trite as a model of a chessboard that a chess-playing algo maintains. As information flows in from the senses, that model is updated; the current model is used to create future plans (e.g. the next move, for a chess-playing computer).
Another important part of an effective machine algo is “attentional focus”: for a chess-playing computer, that means focusing compute resources on exploring those chess-board positions that seem most likely to improve the score, instead of elsewhere. Insert favorite score-maximizing algo here.
Self-aware systems are those that have an internal model of self. Conscious systems are those that have an internal model of attentional focus. I’m conscious because I maintain an internal model of what I am thinking about, and I can think about that, if I so choose. I can ask myself what I’m thinking about, and get an answer to that question, much in the same way that I can ask myself what my teenage son is doing, and sort-of get an answer to that (I imagine, in my mind's eye, that he is sitting in his room, doing his homework. I might be wrong.) I can steer my attention the way I steer my limbs, but this is only possible because I have that internal model (of my focus, of my limbs), and I can use that model to plan, to adjust, to control.
So, can we use this to build an AGI?
Well, we already have machines that can add numbers together better than us, can play chess better than us, and apparently, can drive cars better than us. Only the last can be said to have any inkling of self-awareness, and that is fairly minimal: just enough to locate itself in the middle of the road, and maintain a safe distance between it and obstacles.
I am not aware of any system that maintains an internal model of its own attentional focus (and then uses that model to perform prediction, planning and control of that focus). This, in itself, might not be that hard to do, if one set out to explicitly accomplish just that. I don’t believe anyone has ever tried it. The fun begins when you give such a system senses and a body to play with. It gets serious when you provide it with linguistic abilities.
I admit I’m not entirely clear on how to create a model of attentional focus when language is involved; I plan to think heavily on this topic in the coming weeks/months/years. At any rate, I suspect it's doable.
I believe that if someone builds such a device, they will have the fabled conscious, self-aware system of sci-fi. It’s likely to be flawed, stupid, and psychotic: common-sense reasoning algorithms are in a very primitive state (among (many) other technical issues). But I figure that we will notice, and agree that it's self-aware, long before it's intelligent enough to self-augment itself out of its pathetic state: I’m thinking it will behave a bit like a rabid talking dog: not a charming personality, but certainly “conscious”, self-aware, intelligent, unpredictable, and dangerous.
To be charming, one must develop a very detailed model of humans, and what humans like, and how they respond to situations. This could prove to be quite hard. Most humans can’t do it very well. For an AGI to self-augment itself, it would have to convince its human masters to let it tinker with itself. Given that charm just might be a pre-requisite, that would be a significant challenge, even for a rather smart AGI. Never mind that self-augmentation can be fatal, as anyone who’s overdosed on heroin might fail to point out.
I’m sure the military and certain darker political forces would have considerable interest in building a charming personality, especially if it's really, really smart. We already know that people can be charming and psychotic all at the same time; ethics or lack thereof is not somehow mutually exclusive of intelligence. That kind of a machine, unleashed on the world, would be … an existential threat. Could end well, could end badly.
Anyway, I think that’s the outline of a valid course of research. It leaves open some huge questions, but it does narrow the range of the project to some concrete and achievable goals.
These ideas are part of the same train of thought as the New PLN Design, currently being implemented bit-by-bit (and with interesting variations and deviations from the rough spec I just linked to) by Jade O’Neill and Ramin Barati. But this blog post contains new ideas not contained on that page.
Actually, I am unsure if I will end up recommending the ideas outlined here for implementation or not. But even if not, I think they are interesting for the light they shed on what is going on with PLN conceptually and mathematically.
For one thing, on the theoretical side, I will outline here an argument why inference trails are ultimately unnecessary in PLN. (They are needed in Pei Wang’s NARS system, from which PLN originally borrowed them; but this is because NARS is not probabilistic, so that the sorts of Gibbs sampling based arguments I outline here can’t be applied to NARS.)
Rough Summary / Prelude
Basically: In this post I will describe how to reformulate PLN inference so that it (very broadly speaking) makes use of Gibbs sampling. As Gibbs sampling is used in the standard approach to Markov Logic Networks, this also serves (among other more practical purposes) to make clearer the relationship between PLN and MLN.
Broadly speaking, the idea here is to have two different, interlocking levels of PLN inference, with different truth values and different dynamics associated with them.
It seems possible that doing this might speed the convergence of a PLN network toward maximally intelligent conclusions based on the knowledge implicit in it.
Consideration of this possibility leads to an understanding of the relation between PLN dynamics and Gibbs sampling, which leads to an argument (at this stage, a sketch of a proof rather than a proof) that inference trails are not really needed in PLN.
Two preliminary notes before getting started:
Without further ado, I will now present two thought-experiments in PLN design: one fairly extreme, the other less so.
Thought-Experiment #1: PLN Inference via Gibbs Sampling on Distributional Truth Values
In this section I’ll describe a hypothetical way of doing PLN inference via Gibbs sampling.
Suppose that, instead of a single truth value, we let each PLN Atom have two truth values: the primary truth value (i.e. the traditional PLN truth value), and an instantaneous truth value, consisting of a sample distribution.
The sample distribution consists of a series of values that define the shape of a distribution. For example, the template sample distribution might comprise K=5 values corresponding to the intervals [0, .2] , [.2, .4], [.4,.6], [.6,.8], [.8,1]. The values would be viewed as a step value approximation to an underlying first-order probability distribution.
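As a rough sketch, such a K-bin step-distribution truth value might look like the following (the class and method names here are illustrative, not part of any actual PLN code):

```python
# A minimal sketch of a distributional truth value: a K-bin step-function
# approximation to a first-order probability distribution over strength.
import random

K = 5  # number of intervals, as in the K=5 example above

class SampleDistribution:
    """Step-value approximation over K equal intervals of [0, 1]."""
    def __init__(self, weights=None):
        # weights[i] ~ probability mass on the interval [i/K, (i+1)/K]
        w = weights or [1.0 / K] * K  # default: uniform
        total = sum(w)
        self.weights = [x / total for x in w]  # normalize to sum to 1

    def sample_strength(self):
        """Draw a strength value from the step distribution."""
        i = random.choices(range(K), weights=self.weights)[0]
        return (i + random.random()) / K  # uniform within the chosen bin

    def mean(self):
        """Expected strength under the step approximation."""
        return sum(w * (i + 0.5) / K for i, w in enumerate(self.weights))

d = SampleDistribution([0.0, 0.1, 0.2, 0.4, 0.3])
print(round(d.mean(), 3))  # → 0.68
```

The `mean()` here treats each bin's mass as sitting at the bin midpoint, which is the simplest reading of "step value approximation".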
Next, the instantaneous truth values would be updated via Gibbs sampling. What I mean by this is a process in which the Atoms in the Atomspace are looped through, and when each Atom X is visited, its sampled strengths are replaced with the result of the following Gibbs-type update rule:
The instantaneous truth value would then impact the primary truth value as follows:
Periodically (every N cycles), the primary truth value of A is revised with the instantaneous truth value of A
(i.e. the primary truth value is replaced with a weighted average of itself & the instantaneous truth value)
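The two-level loop described above can be sketched as follows. This is a hedged illustration only: the actual Gibbs-type conditional-update rule is stubbed out as a placeholder (`resample_given_neighbors`), since the real PLN inference formulas are not given here, and all names and parameter values are invented for the sketch.

```python
# Sketch of the two-level update: Gibbs-style sweeps over instantaneous
# truth values, with periodic revision into the primary truth values.
import random

class Atom:
    def __init__(self, name, strength=0.5):
        self.name = name
        self.primary = strength        # primary (traditional PLN) strength
        self.instantaneous = strength  # instantaneous strength, resampled below

def resample_given_neighbors(atom, atomspace):
    # Placeholder: in real PLN this would draw a new strength from the
    # conditional distribution implied by the inference-rule relationships
    # among Atoms. Here we just perturb around the neighborhood mean.
    mean = sum(a.instantaneous for a in atomspace) / len(atomspace)
    return min(1.0, max(0.0, random.gauss(mean, 0.05)))

def run_cycles(atomspace, cycles, revise_every=10, weight=0.9):
    for t in range(1, cycles + 1):
        for atom in atomspace:          # one Gibbs-type sweep over the Atomspace
            atom.instantaneous = resample_given_neighbors(atom, atomspace)
        if t % revise_every == 0:       # every N cycles, revise into the primary TV
            for atom in atomspace:      # weighted average of old and instantaneous
                atom.primary = weight * atom.primary + (1 - weight) * atom.instantaneous

atoms = [Atom(f"A{i}", random.random()) for i in range(4)]
run_cycles(atoms, cycles=100)
print([round(a.primary, 2) for a in atoms])
```

Note how replacement (the Gibbs step) and revision (the weighted average) are kept distinct, which is the crux of the two-truth-value proposal.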
Note that one could vary on this process in multiple ways — e.g. via making the instantaneous truth value an imprecise or indefinite probability, or a second order probability distribution. The above procedure is given as it is, more out of a desire for relative simplicity of presentation, than because it necessarily seems the best approach.
If nothing else besides this updating happened with the primary truth values of logical Atoms (and if the various logical relations in the Atomspace all possessed a consistent probabilistic interpretation in terms of some grounding) — then according to the theory of Gibbs sampling, each Atom would get a primary strength approximating its correct strength according to the joint distribution implicit in all the logical Atoms in the Atomspace.
(The above description, involved as it is, still finesses a bit of mathematical fancy footwork. It’s important to remember that, in spite of the Gibbs sampling, the PLN heuristic inference rules (which are derived using probability theory, but also various other heuristics) are being used to define the relationships between the variables (i.e. the truth value strengths of Atoms) in the network.)
So the Gibbs sampling must be viewed as taking place, not on the variables (the Atom strengths) themselves, but on propositions of the form “the strength of Atom A lies in interval [x,y]”. One can thus view the sampling as happening on a second-order probability distribution defined over the main probability distribution of strengths.
So the joint distribution on the truth value strength distributions in the PLN network, has to be calculated consistently with the results of the PLN probabilistic/heuristic inference rules. If the PLN inference rules deviated far from probability theory, then the Gibbs sampling would result in a network that didn’t make sense as a probabilistic model of the world to which the variables in the network refer, but did make sense as a model of the relationship between the variables according to the PLN inference rules.
This is pretty different from a MLN, because in an MLN the Gibbs sampling just has to find a distribution consistent with certain propositional logic relations, not consistent with certain heuristic uncertain truth value estimation functions.
Anyway: this sort of subtlety is the reason that the idea presented here is not “obvious” and hasn’t emerged in PLN theory before.
So then, if this were the only kind of inference dynamic happening in PLN, we could view PLN as something vaguely analogous to a second-order Markov Logic Network incorporating a wider variety of logical constructs (more general quantifier logic, intensional inference, etc.) via heuristic formulas.
However, the thought-experiment I am outlining in this section is not to have this kind of sampling be the only thing happening in PLN. My suggestion is that in any new PLN, just like in the current and prior PLN, primary strengths may also be modified via forward and backward chaining inference. These inference methods do something different than the Gibbs-type updating mentioned above, because they add new logical links (and in some cases nodes) to the network.
This is vaguely comparable to how, in some cases, Gibbs sampling or message-passing in Markov Logic Networks have been coupled with Inductive Logic Programming. ILP, vaguely similarly to PLN forward and backward inference, adds new logical links to a network. I.e., to use MLN / Bayes Nets terminology, both ILP and PLN chaining are concerned with structure building, whereas Gibbs sampling, message-passing and other comparable methods of probabilistic inference are concerned with calculating probabilities based on a given network structure.
Also note: If there is information coming into the system from outside PLN, then this information should be revised into the instantaneous truth values as well as the primary ones. (This point was raised by Abram Demski in response to an earlier version of this post.) …. And this leads to the interesting question of when, and to what extent, it is useful to revise the primary truth values back into the instantaneous truth values, based on the modifications of the primary truth values due to regular PLN forward and backward inference.
If we do both the Gibbs sampling suggested above and the traditional PLN chaining on the same network, what we have is a probabilistic network that is constantly adapting its structure (and a subset of its truth values) based on chains of inference rules, and constantly updating its truth values based on its structure according to Gibbs type (and vaguely MLN-ish) methods.
Note that the Gibbs sampling forms a consistent model of the joint distribution of all the Atoms in the Atomspace, without needing a trail-like mechanism. Clearly the Gibbs-type approach is much more like what could be realized in a brain-like system (though OpenCog is not really a brain-like system in any strong sense).
Inference trails would still be useful for chaining-based inferences, in the suggested framework. However, if the trail mechanism screws up in some cases and we get truth values that handle dependencies incorrectly — in the medium run, this won’t matter so much, because the Gibbs sampling mechanism will eventually find more correct versions for those truth values, which will be revised into the truth values. Note that incorrect truth values gotten by inadequate use of trails will still affect the results of the sampling, because they will weight some of the links used in the sampling-based inference — but the sampling-based inference will “merge” these incorrect truth values with the truth values of the relations embodying the dependencies they ignore, muting the influence of the incorrect values.
Also: one problem I’ve noted before with MLN and related ideas is that they assume a fully consistent interpretation of all the links in their network. But a complex knowledge network reflecting the world-understanding of an AGI system, is not going to be fully consistent. However, I believe the approach described here would inherit PLN’s robustness with regard to inconsistency. The PLN heuristic inference rules are designed to dampen inconsistencies via locally ignoring them (e.g. if the premises of the PLN deduction rule are wildly inconsistent so that the rule gives a truth value strength outside [0,1], the resultant inference will simply not be revised into the truth value of the conclusion Atom). In the current proposal, this sort of mechanism would be used both in the Gibbs sampling and the chaining control mechanisms.
Revision versus Gibbs Sampling
Now — if anyone is still following me by this point — I want to take the discussion in a slightly different direction. I’m going to use the above ideas to make an argument why inference trails are unnecessary in PLN even without Gibbs sampling.
Reading through Thought Experiment #1 above, one might wonder why bother to maintain two truth values, an instantaneous and a primary one. Why is this better than the traditional PLN approach, where you do the updating directly on the primary truth values, but instead of (as in Gibbs sampling) replacing the old truth value with the new one at each step, just revise the new truth value with the old one?
The answer seems to be: In the long run, if one assumes a fixed set of knowledge in the inference network during the learning process, both approaches amount to the same thing. So in this somewhat artificial “fixed knowledge” setting, it’s really mainly a matter of convergence rates. (Which means it’s a matter of the speed of coming to modestly intelligent conclusions, since in a real-world system in a dynamic environment, there is no hope of an inference network converging to a fully coherent conclusion based on its existing data before new data comes in and disrupts things).
Viewed at a sufficient level of abstraction, the Gibbs sampling approach corresponds to taking a Markov matrix M and computing the limit of the power M^n as n goes to infinity, until (M^n x), where x is the initial condition, converges to a stationary distribution.
Specifically, in the approach outlined above, one can think about a long vector, each entry of which refers to a “truth value state” of the PLN system as a whole. The k’th truth value state corresponds to a proposition of the form “Truth value of Atom 1 lies in interval I_k(1), AND truth value of Atom 2 lies in interval I_k(2), AND … truth value of Atom n lies in interval I_k(n).” So this is a very high dimensional vector. Given the specific set of inference rules and truth value formulas in a PLN system, if one iterates PLN using parallel forward chaining (i.e. executing all possible single-step forward inferences at the same time, and revising the results together), then PLN execution corresponds to multiplying by a large Markov matrix M.
On the other hand, the standard PLN approach with only one truth value for each Atom and a fixed weight c in the revision rule, corresponds roughly to taking the limit of the power ( c I + (1-c) M )^n as n goes to infinity. The latter approach will generally take significantly longer to converge to the stationary distribution, because the ratio (second largest eigenvalue) / (largest eigenvalue) will be closer to 1.
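The convergence-rate claim can be checked numerically on a toy example (the 2×2 matrix, the value of c, and the tolerance below are illustrative choices, not anything from PLN itself):

```python
# Numerical illustration: iterating a Markov matrix M reaches its stationary
# distribution faster than iterating cI + (1-c)M, because mixing in the
# identity pushes the (second eigenvalue)/(largest eigenvalue) ratio toward 1.
import numpy as np

M = np.array([[0.9, 0.3],
              [0.1, 0.7]])  # column-stochastic; eigenvalues 1 and 0.6
c = 0.8
R = c * np.eye(2) + (1 - c) * M  # "revision-style" matrix; eigenvalues 1 and 0.92

stationary = np.array([0.75, 0.25])  # eigenvector for eigenvalue 1 (of both M and R)

def steps_to_converge(A, x, target, tol=1e-6, max_steps=10_000):
    """Count matrix-vector iterations until x is within tol of target."""
    for n in range(1, max_steps + 1):
        x = A @ x
        if np.max(np.abs(x - target)) < tol:
            return n
    return max_steps

x0 = np.array([1.0, 0.0])  # initial condition
print(steps_to_converge(M, x0, stationary), "steps for M")
print(steps_to_converge(R, x0, stationary), "steps for cI + (1-c)M")
```

With these numbers the error shrinks by a factor of 0.6 per step under M, but only 0.92 per step under R, so the revision-style iteration needs several times as many steps to hit the same tolerance — the eigenvalue-ratio effect described above.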
Actually it’s a bit subtler than that, because the revision weight c isn’t a constant in PLN. Rather, as the system accumulates more evidence, c gets larger, so that the existing evidence is weighted more and the new evidence is weighted less.
But for each fixed value of c, the iteration would converge to the same stationary distribution as the Gibbs sampling approach (under reasonable assumptions, for a network with fixed knowledge). And we may assume that as the network learns, eventually c will reach some maximum short of 1 (c=.9999 or whatever). Under this assumption, it seems PLN iteration with adaptive revision weight will converge to the stationary distribution — eventually.
So the apparent conclusion of this somewhat sketchy mathematical thinking (if all the details work out!) is that, if one makes the (unrealistic) assumption of a fixed body of knowledge in the system, then inference trails are unnecessary: the PLN iteration will eventually converge to the same stationary distribution with or without them.
Now, it may be that trails are still useful in the short run. On the other hand, there seem to be other ways to handle the matter. For instance: If one has a sub-network of tightly interlinked Atoms, then one can do a lot of inference on these Atoms, i.e. accelerate the iterative sampling process as regards the relationships between these Atoms. In this way the mutual dependencies among those Atoms will get resolved faster, much as if one were using trails.
Thought-Experiment #2
Finally, I’ll present a less extreme thought-experiment, which I think has a greater likelihood of actually being useful for PLN in OpenCog.
Instead of having two truth values per Atom — one the primary, traditional PLN truth value and the other an instantaneous truth value used for Gibbs sampling — what if one had two truth values, both updated via the standard PLN approach, but with widely differing default revision weights?
The standard default revision weight in PLN now is driven by the confidence factor
c = n/(n+k)
where n is a number of observations, and k is the “personality parameter.” But layered on top of this (in the PLN theory, though not currently in the code), is a “confidence decay factor”, which decays confidence values over time.
One possibility would be to have two different truth values associated with each Atom: one conservative and one adventurous. The two would differ in their personality parameters. The conservative truth value would get updated with a small value of k, meaning that it would tend to weight its past experience highly and its new conclusions not so much. The adventurous truth value would get updated with a large value of k, meaning that it would weight its new conclusions much more than its past experience.
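The two personalities can be sketched directly from the confidence formula c = n/(n+k) given above (the `revise` function and the particular parameter values here are illustrative):

```python
# Conservative vs. adventurous truth values via the personality parameter k.
# With c = n/(n+k): small k -> c near 1, past experience dominates;
# large k -> c near 0, new conclusions dominate.

def confidence(n, k):
    """PLN confidence factor: n observations, personality parameter k."""
    return n / (n + k)

def revise(old, new, n, k):
    """Revision as a confidence-weighted average of old and new strength."""
    c = confidence(n, k)
    return c * old + (1 - c) * new

n = 10  # observations accumulated so far
print(confidence(n, k=1))    # conservative: c ≈ 0.91, trusts the past
print(confidence(n, k=100))  # adventurous: c ≈ 0.09, trusts the new

# With old strength 0.8 and new evidence suggesting 0.2:
print(revise(0.8, 0.2, n, k=1))    # barely moves off 0.8
print(revise(0.8, 0.2, n, k=100))  # swings most of the way toward 0.2
```

As k grows without bound, c goes to 0 and `revise` degenerates into pure replacement of the old value by the new one — which is exactly the limiting connection to the Gibbs-sampling update noted in the next paragraph's reference to Thought Experiment #1.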
What Thought Experiment #1 teaches us is that: As k goes to infinity, if one follows a simple inference control strategy as outlined there, the adventurous truth value will basically be getting updated according to Gibbs sampling (on second order probability distributions).
We have seen that both the adventurous and conservative truth values will converge to the same stationary distribution in the long run, under unrealistic assumptions of fixed knowledge in the network. But so what? Under realistic conditions they will behave quite differently.
There is much to experiment with here. My point in this post has merely been to suggest some new experiments, and indicate some theoretical connections between PLN, sampling theory, and other probabilistic inference methods like MLN.
OK, that’s a rough summary of my train of thought on these topics at the moment. Feedback from folks with knowledge of PLN, MLNs and sampling would be valued. Am I thinking about this stuff in a sensible way? What do you think?
The current version of this post owes something to a critique of the previous version by Abram Demski.