# Inference Meta-Learning (part I)

This is the introductory part of a blog series about my work on inference control learning for OpenCog. It introduces the problematic, an initial design solution and some early results. More interesting results will come in subsequent parts.

## Introduction

Inference chainers, automatic theorem provers and the likes are programs that given a theory attempt to build a proof of a given theorem. They are generally inefficient. An proof, or inference is in essence a program, as the Curry-Howard correspondence says. In OpenCog it is very apparent as an inference is an actual atomese program. A function call in this program is a formula associated to an inference rule, the conclusion being the output of the call, the premises being its inputs. See for instance a double deduction inference tree

``````     X->Z    Z->Y
-----f------
A->X     X->Y
------f------
A->B
``````

a bottom-down notation borrowed from the classical notations such as https://en.wikipedia.org/wiki/Rule_of_inference except that premises are aligned horizontally, and the rule formula, here `f`, overlays the line separating the premises and the conclusion.

In OpenCog such tree would be represented as

``````             [12731261225409633207][6] [11133117073607658831][6]
---------------bc-deduction-formula----------------
[17146615216377982335][6] [16015351290941397556][6]
---------------bc-deduction-formula----------------
[13295355732252075959][1]
``````

the cryptic numbers indicate the hash values of the atoms involved. The hash value `[13295355732252075959][1]` at the bottom of the tree is the target’s, the hash values at the leaves `[17146615216377982335][6]`, `[12731261225409633207][6]` and `[11133117073607658831][6]` are the premises, and `[16015351290941397556][6]` is an intermediary target. `bc-deduction-formula` is the name of the formula associated to the deduction rule.

This inference tree would correspond to the atomese program

``````(BindLink
;; Pattern to fetch premises from the atomspace
(VariableNode "\$B-6229393a") ; [6185394777777469381][6]
(ConceptNode "D") ; [246112806454457922][1]
) ; [11133117073607658831][6]
(VariableNode "\$B-6266d6f2") ; [4097372290580364298][6]
(VariableNode "\$B-6229393a") ; [6185394777777469381][6]
) ; [12731261225409633207][6]
(VariableNode "\$X") ; [6809909406030619949][1]
(VariableNode "\$B-6266d6f2") ; [4097372290580364298][6]
) ; [17146615216377982335][6]
) ; [12553878300592245761][6]
;; Formula calls computing the inference tree
(GroundedSchemaNode "scm: bc-deduction-formula") ; [5481509939359570705][1]
(VariableNode "\$X") ; [6809909406030619949][1]
(ConceptNode "D") ; [246112806454457922][1]
) ; [13295355732252075959][1]
(VariableNode "\$X") ; [6809909406030619949][1]
(VariableNode "\$B-6266d6f2") ; [4097372290580364298][6]
) ; [17146615216377982335][6]
(GroundedSchemaNode "scm: bc-deduction-formula") ; [5481509939359570705][1]
(VariableNode "\$B-6266d6f2") ; [4097372290580364298][6]
(ConceptNode "D") ; [246112806454457922][1]
) ; [16015351290941397556][6]
(VariableNode "\$B-6266d6f2") ; [4097372290580364298][6]
(VariableNode "\$B-6229393a") ; [6185394777777469381][6]
) ; [12731261225409633207][6]
(VariableNode "\$B-6229393a") ; [6185394777777469381][6]
(ConceptNode "D") ; [246112806454457922][1]
) ; [11133117073607658831][6]
) ; [10363704109909197645][6]
) ; [18135989265351680839][6]
) ; [14126565831644143184][6]
) ; [12675478950231850570][6]
) ; [15856494860655100711][6]
``````

The specificity here is that the pattern matcher (the query corresponds to the outer BindLink) is used to fetch the relevant axioms from the atomspace, the most afferent inputs of that program.

The question we attempt to address here is: how to efficiently build inferences, specifically back-inferences.

A back-inference, an inference built backward, is done by providing a target, or theorem, and grow an inference tree going from the target to the axioms, so that once run, the inference will evaluate the truth value of that target. The problem is that such growth, if not carefully controlled, can be very inefficient.

## Design

The idea is to learn from past inferences. This is not a new idea, but is still fairely under-explored. I give some references at the end.

Let me sketch how we intend to do that in OpenCog:

1. Run the backward chainer in a greedy manner over a collection of problems
2. Store the traces of the backward chainer in a history atomspace
3. Mine the history atomspace to discover regularities
4. Turn these regularities into control rules for the backward chainer
5. Repeat 1, passing these control rules to the backward chainer on the same or related collection of problems, see how it improves.

The backward chainer is a rather a elementary algorithm. Given a target, pick a rule so that its conclusion unifies with the target (that is such rule can possibly produce such target), create an initial inference tree. The leaves of that tree are the premises, the root is the target. Treat the premises as new targets and re-iterate the process. Grow the tree (or rather the trees, as we keep the former ones around) till you get something that, if evaluated, produces the target. For that of course the premises need to be present, as axioms, in the atomspace.

The problem though comes from the combinatorial explosion of having to choose the inference tree, the premise to expand and the rule. The complexity of the algorithm grows exponentially with respect to the size of the inference tree.

So here’s the crux of the idea of how to make it practical:

Defer the hard decisions to a cognitive process.

To repeat myself, the hard decisions in this program, the backward chainer, are

1. pick an inference tree
2. given an inference tree, pick a premise
3. given an inference tree and a premise, pick a rule to expand

These are very hard decisions that only an intelligent entity can make, ultimately an AGI. We don’t have an AGI, but we are building one, thus if we can plug our proto-AGI to solve these sub-problems, it becomes a recursive divide-and-conquer type of construct. Turtles all the way down if you like, that hopefully ends. The idea of SampleLink introduced by Ben [1] was intended to make this recursive construct explicit. Here we do not use SampleLink, it hasn’t been implemented yet, but that doesn’t stop us from using OpenCog to learn control rules and utilize them in the backward chainer. What it means is that in the backward chainer code, when such a hard decision occurs, the code that would otherwise look like a hard wired heuristic now looks like this

1. Try to find control rules in the atomspace related to that decision
2. If you find some, combine them to make a decision
3. Otherwise, assume an uninformative control rule, or possibly a default one encapsulating some simple heuristic.

That is what is meant by a cognitive process, a process that uses the available knowledge to make decisions.

More specifically these control rules are Cognitive Schematics [2], pieces of knowledge that tells our cognitive process how actions relate to goals (or subgoals) conditioned by contexts

``````Context & Action -> Goal
``````

They are usually represented by an Implication (or PredictiveImplication if time matters) in OpenCog. That is a conditional probability, or meta-probability, as that is what a TV represents.

Here these cognitive schematics will be about rules, premises, proofs, etc. For instance a cognitive schematic about the third hard decision

``````3. given an inference tree and a premise, pick a rule to expand
``````

could be formulated as followed

``````ImplicationScope <TV>
VariableList A L R B T
And
Preproof A T
<some pattern>
Expand (List A L R) B
Preproof B T
``````

meaning that if inference tree A is a preproof of theorem T (a preproof in an inference tree that may not proove T yet, but that may eventually proove T if adequately expanded), conditioned by some pattern involving A, L, R, and some action, expanding inference tree A into inference tree B, via premise L, using rule R, then the produced inference tree B will also be a preproof of T, with truth value TV. The cognitive schematic parts are

``````Context = (Preproof A T) And <some pattern>
Action = Expand (List A L R) B
Goal = Preproof B T
``````

Once we have such cognitive schematics

C1 & A -> G

Cn & A -> G

we need to combine then. We could consider a rule formed with the conjunction or disjunction of all contexts, or any other way of aggregating. The dilemma here is that

1. The smaller (more restrictive) the context, the lower the confidence of the rule.
2. The larger (more abstract) the context, the lower the discriminative power of the rule.

So to rephrase, if a context is too small its associated rule might overfit, but if the context is too large, its rule will not be informative enough, its conditional probability will tend to the marginal probability of the goal.

To address this dilemma we have choosen to rely on Universal Operator Induction [3], albeit some modification of it to account for the fact that a control rule is only a partial operator (see [5] for more details). Once this aggregation is done, we can assign a TV for each action, i.e. inference rule, and sample our next inference rule accordingly (here we have choosen Thomson Sampling due to its asymptotical optimality property [4]).

At this point the reader might ask, what about managing the complexity of this decision process? The short answer is ECAN [6]. The less short answer is turtles all the way down… SampleLink, etc. But suffice to say that ECAN has been designed to be a cognitive process as well, utilizing knowledge via a dedicated type of atoms called HebbianLinks.

## Experiments

### Settings

The problem space of our first experiments is very simple. Given knowledge about the alphabet, infer that 2 letters are alphabetically ordered.

There are two collections of axioms

#### 1. Enumeration of all consecutively ordered letters

a⊂b

y⊂z

where X⊂Y is a notation for the atomese program

(Inheritance (stv 1 1) X Y)

Inheritance is used so that we can directly use the PLN deduction rule to infer order transitivity. Given this collection of axioms, all the backward chainer needs is to chain a series of deduction rules, as many as it requires. For instance inferring a⊂c will only require 2 deductions, while inferring h⊂z will require 17 deductions.

In the end, only the deduction rule is required. Figuring that out is not challenging but serves as a simple test for our algorithms. That is what we have accomplished so far. To create a more interesting case, we introduce a second collection of axioms

#### 2. Express that letter `a` occurs before any other letter

a<b

a<z

where X<Y is a notation for the atomese program

``````(Evaluation (stv 1 1) (Predicate "alphabetical-order") (List X Y))
``````

Alongside an implication

X<Y ⇒ X⊂Y

or in atomese

``````ImplicationScope (stv 1 1)
VariableList
TypedVariable
Variable "\$X"
Type "ConceptNode"
TypedVariable
Variable "\$Y"
Type "ConceptNode"
Evaluation
Predicate "alphabetical-order"
List
Variable "\$X"
Variable "\$Y"
Inheritance
Variable "\$X"
Variable "\$Y"
``````

This second collection of axioms allows us to prove with just one inference step, using the PLN instantiation rule, that

a⊂X for any letter X != a.

however, unlike deduction, using this rule is only fruitful if the first letter is `a`, otherwise it will actually slow down the backward chainer, so it is important to be able to discover this context.

### Learning Inference Control Rules

#### Context-Free Control Rules

The control rule of deduction is very simple as it has no extra context

``````ImplicationScope <TV>
VariableList A L B T
And
Preproof A T
Expand (List A L <deduction-rule>) B
Preproof B T
``````

tells that if A is a preproof and gets expanded into B by a deduction rule, then B has a certain probability, expressed by TV, of being a preproof.

Learning that control rule is easy, we just need to apply PLN direct evaluation rule to calculate the TV based on the available evidence, the traces gathered while solving the problem collection. Indeed, while the backward chainer is running, it stores in a history atomspace every hard decision that has been made. In particular all inference tree expensions, and of course which expansion led to a proof, which allows us to build a corpus of expansions and preproofs. The PLN direct evaluation rule will merely count the positive and negative instances and come up with a conditional probability and a confidence.

#### Context-Sensitive Control Rules

Learning context-sensitive control rules is harder. In fact it may be arbitrary hard, but the initial plan is to experiment with frequent subgraph mining [8], using OpenCog's pattern miner [7].

We haven't reached that part yet, but it is expected that such rule will look something like

``````ImplicationScope <TV>
VariableList A L B T
And
Preproof A T
Expand (List A L <conditional-instantiation-rule>) B
<pattern-discovered-by-the-pattern-miner>
Preproof B T
``````

the pattern in question

``````<pattern-discovered-by-the-pattern-miner>
``````

will have to express that the premise L, looks like

``````    Inheritance
ConceptNode "a"
Variable "\$X"
``````

or more generally that the expansion looks like

``````Execution
Schema "expand"
List
Variable "\$A"
Inheritance
ConceptNode "a"
Variable "\$X"
GroundedSchemaNode "scm: conditional-full-instantiation-scope-formula"
Variable "\$B"
``````

Although this part will rely on the pattern miner, in the end the calculations of these rules will be performed by PLN, so in a way PLN will be handling the meta-learning part as well. Will come back to that.

### Results so far

Let me detail the experiment a bit further. The problem set is composed one 100 targets, randomly generated so that 2 ordered letters are queried, such as

w⊂y
q⊂u
g⊂v
y⊂z

We run 2 iterations, the first one with an empty control atomspace, then ask OpenCog to discover control rules, populate the control atomspace with them and rerun, and see how many more we have solved.

Just by learning context-free control rules, saying basically that deduction is often useful, conditional instantiation is sometimes useful, and all other rules are useless, we can go from solving 34 to 52.

Below are examples of control rules that we have learned.

``````;; Modus ponens is useless
(VariableList
(VariableNode "\$T")
(VariableNode "\$A")
)
(VariableNode "\$L")
(VariableNode "\$B")
)
)
(PredicateNode "URE:BC:preproof-of")
(VariableNode "\$A")
(VariableNode "\$T")
)
)
(SchemaNode "URE:BC:expand-and-BIT")
(VariableNode "\$A")
(VariableNode "\$L")
(DefinedSchemaNode "modus-ponens-implication-rule")
)
)
(VariableNode "\$B")
)
)
(PredicateNode "URE:BC:preproof-of")
(VariableNode "\$B")
(VariableNode "\$T")
)
)
)

;; Deduction is often useful
(VariableList
(VariableNode "\$T")
(VariableNode "\$A")
)
(VariableNode "\$L")
(VariableNode "\$B")
)
)
(PredicateNode "URE:BC:preproof-of")
(VariableNode "\$A")
(VariableNode "\$T")
)
)
(SchemaNode "URE:BC:expand-and-BIT")
(VariableNode "\$A")
(VariableNode "\$L")
(DefinedSchemaNode "deduction-inheritance-rule")
)
)
(VariableNode "\$B")
)
)
(PredicateNode "URE:BC:preproof-of")
(VariableNode "\$B")
(VariableNode "\$T")
)
)
)

;; Conditional instantiation is sometimes useful
(VariableList
(VariableNode "\$T")
(VariableNode "\$A")
)
(VariableNode "\$L")
(VariableNode "\$B")
)
)
(PredicateNode "URE:BC:preproof-of")
(VariableNode "\$A")
(VariableNode "\$T")
)
)
(SchemaNode "URE:BC:expand-and-BIT")
(VariableNode "\$A")
(VariableNode "\$L")
(DefinedSchemaNode "conditional-full-instantiation-implication-scope-meta-rule")
)
)
(VariableNode "\$B")
)
)
(PredicateNode "URE:BC:preproof-of")
(VariableNode "\$B")
(VariableNode "\$T")
)
)
)
``````

These rules are rather simple, but, as reported, can already speed up the backward chainer.

## Related Work and Conclusion

This relates to the work of Ray Solomonoff [9], Juergen Schimdhuber with OOPS [10], Eray Ozkural with HAM [11], Irvin Hwang et al with BPM [12], Josef Urban with MaLARea [13], Alexander A. Alemi et al with DeepMath [14] to cite only a few. What I believe sets it apart though, is that the system used for solving the problem is the same one used for solving the meta-problem. Which leads to an interesting question, that we may well be able to put to the test in the future. Can skills acquired to solve a problem be transfered to the meta-level?

Let me expand, if you ask an AI to solve a collections of down to earth problems, it will accumulate a lot of knowledge, say rules. Some will be very concrete, related to specificities of the problems, such as pushing the gas pedal while driving a car, and will not be transferable to the meta-level, because pushing a gas pedal is unrelated to discovering control rules to speed up a program. They will basically remain mute when asked to serve the cognitive process in charge of solving the meta-problem. But some will be more abstract, abstract enough that they will be recognized by the meta-solver as potentially useful. If these abstract rules can indeed help and be transfered to the upper levels, then it opens the possiblity for true intelligence bootstrapping. If it can, then it means we can improve not just learning, but also meta-learning, meta-meta-learning, and so on to infinity, at once. But realistically, even if it doesn't, or does to some limited extend, possibly evaporating as the meta-levels go higher, meta-learning may still result in considerable performance gains. In any instance, it is our only magic bullet, isn't it?

## Acknowledgement

Thanks to Linas for his feedback.

## References

[1] Ben Goertzel. Probabilistic Growth and Mining of Combinations: A
Unifying Meta-Algorithm for Practical General Intelligence

[2] Ben Goertzel. Cognitive Synergy: a Universal Principle For
Feasible General Intelligence?

Click to access 5646bc1d643585933549b5321a9da5ee5f55.pdf

[3] Ray Solomonoff. Three Kinds of Probabilistic Induction: Universal
Distributions and Convergence Theorems

Click to access chris1.pdf

[4] Jan Leike et al. Thompson Sampling is Asymptotically Optimal in
General Environments

Click to access 20.pdf

[5] Nil Geisweiller. Inference Control Learning Experiment README.md
https://github.com/opencog/opencog/tree/master/examples/pln/inference-control-learning

[6] Matthew Ikle’ et al. Economic Attention Networks: Associative
Memory and Resource Allocation for General Intelligence

Click to access paper_63.pdf

[7] Ben Goertzel et al. Integrating Deep Learning Based Perception
with Probabilistic Logic via Frequent Pattern Mining

Click to access DeSTIN_PLN_v3.pdf

[8] Yun Chi et al. Mining Closed and Maximal Frequent Subtrees from
Databases of Labeled Rooted Trees

Click to access 040020.pdf

[9] Ray J. Solomonoff. Progress in incremental machine learning. In
NIPS Workshop on Universal Learning Algorithms and Optimal Search,

Click to access nips02.pdf

[10] Schmidhuber J. Optimal ordered problem solver. Machine Learning
54 (2004) 211–256 https://arxiv.org/pdf/cs/0207097.pdf

[11] Eray Ozkural. Towards Heuristic Algorithmic Memory.

[12] Irvin Hwang et al. Inducing Probabilistic Programs by Bayesian
Program Merging

Click to access 1110.5667.pdf

[13] Josef Urban. MaLARea: a Metasystem for Automated Reasoning in
Large Theories

Click to access 05_Urban.pdf

[14] Alexander A. Alemi et al. DeepMath – Deep Sequence Models for
Premise Selection

Click to access 1606.04442v1.pdf

[15] Ben Goertzel et al. Metalearning for Feature Selection.
https://arxiv.org/abs/1703.06990

This entry was posted in Development, Theory. Bookmark the permalink.

### 3 Responses to Inference Meta-Learning (part I)

1. You say “the complexity of the algorithm grows exponentially” – that’s not true, its much worse, it grows combinatorially. If only it was so slow as to be exponential! 🙂

• nilg says:

What makes you say that it grows combinatorially?

I believe my statement is correct. It grows exponentially w.r.t. the proof length, that is what Marcus Hutter’s famous paper The Fastest and Shortest Algorithm for All Well-Defined Problems says. He means the length of the binary representation of the proof, but I think it applies here too. There is a finite number of rules and axioms, ultimately a inference tree of size S can be turned into a binary string of length O(S).

Maybe you mean w.r.t. to something else. For instance the complexity w.r.t. the length of the theorem is unbounded.

• Oh, well, OK, if you keep the number of axioms, rules, and the number of premises constant, then, yes, exponential growth provides an absolute upper bound. However, I thought that the point of inference learning was to datamine for inference chains that can be converted into new rules. A side-effect of storing things in the atomspace is that the number of premises, viz potential starting points for new inferences, also increases.

So, if you look at the “big picture”, the upper bound is exponential. But if you look at it locally, the introduction of new rules and premises makes it combinatorial.

This site uses Akismet to reduce spam. Learn how your comment data is processed.