The Relationship Between PLN Inference and Gibbs Sampling (Some Thought-Experiments)

This post describes some new thought-experiments regarding PLN, which have not yet been tested nor worked out mathematically in detail… Reader beware — there could be some mistakes here! But I think the ideas are interesting enough to be worth sharing….

These ideas are part of the same train of thought as the New PLN Design, currently being implemented bit-by-bit (and with interesting variations and deviations from the rough spec I just linked to) by Jade O’Neill and Ramin Barati. But this blog post contains new ideas not contained on that page.

Actually, I am unsure if I will end up recommending the ideas outlined here for implementation or not.   But even if not, I think they are interesting for the light they shed on what is going on with PLN conceptually and mathematically.

For one thing, on the theoretical side, I will outline here an argument why inference trails are ultimately unnecessary in PLN.   (They are needed in Pei Wang’s NARS system, from which PLN originally borrowed them; but this is because NARS is not probabilistic, so that the sorts of Gibbs sampling based arguments I outline here can’t be applied to NARS.)

Rough Summary / Prelude

Basically: In this post I will describe how to reformulate PLN inference as (very broadly speaking) to make use of Gibbs Sampling.   As Gibbs Sampling is used in the standard approach to Markov Logic Networks, this also serves (among other more practical purposes) to make clearer the relationship between PLN and MLN.

Broadly speaking, the idea here is to have two different, interlocking levels of PLN inference, with different truth values and different dynamics associated with them

  • a Gibbs sampling based layer, corresponding very roughly to shallow, massively parallel, “unconscious” inference (more like inference based “activation spreading”, to use a neural net metaphor)
  • a forward/backward chaining based layer, corresponding very roughly to “conscious”, deliberative inference

It seems possible that doing this might speed the convergence of a PLN network toward maximally intelligent conclusions based on the knowledge implicit in it.

Consideration of this possibility leads to an understanding of the relation between PLN dynamics and Gibbs sampling, which leads to an argument (at this stage, a sketch of a proof rather than a proof) that inference trails are not really needed in PLN.

Two preliminary notes before getting started:

  • The ideas given here are related, though far from identical, to the work by myself and Cassio Pennachin, reported in Section 3.1 of the paper “PLN and the Brain” from the proceedings of AGI-08:  ….
  • These ideas will make the most sense to the reader who knows the basic ideas of Gibbs sampling, and will make even more sense to readers who know about Markov Logic Networks.  Advanced knowledge of all the details and variations of these topics is not necessary, though.

Without further ado, I will now present two thought-experiments in PLN design: one fairly extreme, the other less so.

Thought-Experiment #1: PLN Inference via Gibbs Sampling on Distributional Truth Values

In this section I’ll describe a hypothetical way of doing PLN inference via Gibbs sampling.

Suppose that, instead of a single truth value, we let each PLN Atom have two truth values:

  • the current truth value (which we may call the “primary truth value”)
  • a new entity called the “instantaneous truth value,” which consists of: a series of K values called the “sample distribution”

The sample distribution consists of a series of values that define the shape of a distribution.    For example, the template sample distribution might comprise K=5 values corresponding to the intervals [0, .2] , [.2, .4], [.4,.6], [.6,.8], [.8,1].  The values would be viewed as a step value approximation to an underlying first-order probability distribution.

Next, the instantaneous truth values would be updated via Gibbs sampling. What I mean by this is, a process by which: the Atoms in the Atomspace are looped through, and when each Atom X is visited, its sampled strengths are replaced with the result of the following Gibbs-type Update Rule:

  1. Find all inference rules R that, in a single step from some set of premise Atoms existing in the Atomspace currently, would result in an estimate for the truth value of X
  2. Execute all the (rule, premise-set) pairs found in Step 1.   That is,
    1. for each pair, repeat the following process some number N of times: choose a specific value from the distribution comprising the instantaneous truth value for each premise, and draw a conclusion from these specific values.  This produces a truth value distribution for the conclusion.
    2. merge these distributions via revision (weighted averaging), obtaining an overall truth value distribution for the conclusion
  3. Replace the existing instantaneous truth value of X with (a discretized version of) the result of Step 2

The instantaneous truth value would then impact the primary truth value as follows

Periodically (every N cycles), the primary truth value of A is revised with the instantaneous truth value of A

(i.e. the primary truth value is replaced with a weighted average of itself & the instantaneous truth value)

Note that one could vary on this process in multiple ways — e.g. via making the instantaneous truth value an imprecise or indefinite probability, or a second order probability distribution.   The above procedure is given as it is, more out of a desire for relative simplicity of presentation, than because it necessarily seems the best approach.

If nothing else besides this updating happened with the primary truth values of logical Atoms (and if the various logical relations in the Atomspace all possessed a consistent probabilistic interpretation in terms of some grounding) — then according to the theory of Gibbs sampling, each Atom would get a primary strength approximating its correct strength according to the joint distribution implicit in all the logical Atoms in the Atomspace.

(The above description, involved as it is, still finesses a bit of mathematical fancy footwork.   It’s important to remember that, in spite of the Gibbs sampling, the PLN heuristic inference rules (which are derived using probability theory, but also various other heuristics) are being used to define the relationships between the variables (i.e. the truth value strengths of Atoms) in the network.

So the Gibbs sampling must be viewed as taking place, not on the variables (the Atom strengths) themselves, but on propositions of the form “the strength of Atom A lies in interval [x,y]“.   One can thus view the sampling as happening on a second-order probability distribution defined over the main probability distribution of strengths.

So the joint distribution on the truth value strength distributions in the PLN network, has to be calculated consistently with the results of the PLN probabilistic/heuristic inference rules.   If the PLN inference rules deviated far from probability theory, then the Gibbs sampling would result in a network that didn’t make sense as a probabilistic model of the world to which the variables in the network refer, but did make sense as a model of the relationship between the variables according to the PLN  inference rules.

This is pretty different from a MLN, because in an MLN the Gibbs sampling just has to find a distribution consistent with certain propositional logic relations, not consistent with certain heuristic uncertain truth value estimation functions.

Anyway: this sort of subtlety is the reason that the idea presented here is not “obvious” and hasn’t emerged in PLN theory before.

So then, if this were the only kind of inference dynamic happening in PLN, we could view PLN as something vaguely analogous to a second-order Markov Logic Network incorporating a wider variety of logical constructs (more general quantifier logic, intensional inference, etc.) via heuristic formulas.

However, the thought-experiment I am outlining in this section is not to have this kind of sampling be the only thing happening in PLN.   My suggestion is that in any new PLN, just like in the current and prior PLN, primary strengths may also be modified via forward and backward chaining inference. These inference methods do something different than the Gibbs-type updating mentioned above, because they add new logical links (and in some cases nodes) to the network.

This is vaguely comparable to how, in some cases, Gibbs sampling or message-passing in Markov Logic Networks have been coupled with Inductive Logic Programming.  ILP, vaguely similarly to PLN forward and backward inference, adds new logical links to a network. I.e., to use MLN / Bayes Nets terminology, both ILP and PLN chaining are concerned with structure building, whereas Gibbs sampling, message-passing and other comparable methods of probabilistic inference are concerned with calculating probabilities based on a given network structure.

Also note: If there is information coming into the system from outside PLN, then this information should be revised into the instantaneous truth values as well as the primary ones.  (This point was raised by Abram Demski in response to an earlier version of this post.) ….  And this leads to the interesting question of when, and to what extent, it is useful to revise the primary truth values back into the instantaneous truth values, based on the modifications of the primary truth values due to regular PLN forward and backward inference.

If we do both the Gibbs sampling suggested above and the traditional PLN chaining on the same network, what we have is a probabilistic network that is constantly adapting its structure (and a subset of its truth values) based on chains of inference rules, and constantly updating its truth values based on its structure according to Gibbs type (and vaguely MLN-ish) methods.

Note that the Gibbs sampling forms a consistent model of the joint distribution of all the Atoms in the Atomspace, without needing a trail-like mechanism. Clearly the Gibbs-type approach is much more like what could be realized in a brain-like system (though OpenCog is not really a brain-like system in any strong sense).

Inference trails would still be useful for chaining-based inferences, in the suggested framework. However, if the trail mechanism screws up in some cases and we get truth values that handle dependencies incorrectly — in the medium run, this won’t matter so much, because the Gibbs sampling mechanism will eventually find more correct versions for those truth values, which will be revised into the truth values. Note that incorrect truth values gotten by inadequate use of trails will still affect the results of the sampling, because they will weight some of the links used in the sampling-based inference — but the sampling-based inference will “merge” these incorrect truth values with the truth values of the relations embodying the dependencies they ignore, muting the influence of the incorrect values.

Also: one problem I’ve noted before with MLN and related ideas is that they assume a fully consistent interpretation of all the links in their network.    But a complex knowledge network reflecting the world-understanding of an AGI system, is not going to be fully consistent.  However, I believe the approach described here would inherit PLN’s robustness with regard to inconsistency.   The PLN heuristic inference rules are designed to dampen inconsistencies via locally ignoring them (e.g. if the premises of the PLN deduction rule are wildly inconsistent so that the rule gives a truth value strength outside [0,1], the resultant inference will simply not be revised into the truth value of the conclusion Atom).   In the current proposal, this sort of mechanism would be used both in the Gibbs sampling and the chaining control mechanisms.

Revision versus Gibbs Sampling

Now — if anyone is still following me by this point — I want to take the discussion in a slightly different direction.   I’m going to use the above ideas to make an argument why inference trails are unnecessary in PLN even without Gibbs sampling.

Reading through Thought Experiment #1 above, one might wonder why bother to maintain two truth values, an instantaneous and a primary one.  Why is this better than the traditional PLN approach, where you do the updating directly on the primary truth values, but instead of (as in Gibbs sampling) replacing the old truth value with the new one at each step, just revise the new truth value with the old one?

The answer seems to be: In the long run, if one assumes a fixed set of knowledge in the inference network during the learning process, both approaches amount to the same thing.  So in this somewhat artificial “fixed knowledge” setting, it’s really mainly a matter of convergence rates.   (Which means it’s a matter of the speed of coming to modestly intelligent conclusions, since in a real-world system in a dynamic environment, there is no hope of an inference network converging to a fully coherent conclusion based on its existing data before new data comes in and disrupts things).

Viewed at a sufficient level of abstraction, the Gibbs sampling approach corresponds to taking a Markov matrix M and taking the limit of the power M^n as n goes to infinity, till (M^n x), where x is the initial condition, converges to a stationary distribution.

Specifically, in the approach outlined above, one can think about a long vector, each entry of which refers to a “truth value state” of the PLN system as a whole.   The k’th truth value state corresponds to a proposition of the form “Truth value of Atom 1 lies in interval I_k(1), AND truth value of Atom 2 lies in interval I_k(2), AND … truth value of Atom lies in interval I_k(n).”   So this is a very high dimensional vector.  Given the specific set of inference rules and truth value formulas in a PLN system, if one iterates PLN using parallel forward chaining (i.e. executing all possible single-step forward inferences at the same time, and revising the results together); then PLN execution corresponds to multiplying by a large Markov matrix M.

On the other hand, the standard PLN approach with only one truth value for each Atom and a fixed weight c in the revision rule, corresponds roughly to taking the limit of the power ( c I + (1-c) M )^n as n goes to infinity.   The latter approach will generally take significantly longer to converge to the stationary distribution, because the ratio (second largest eigenvalue) / (largest eigenvalue) will be closer to 1.

Actually it’s a bit subtler than that, because the revision weight c isn’t a constant in PLN. Rather, as the system accumulates more evidence, c gets larger, so that the existing evidence is weighted more and the new evidence is weighted less.

But for each fixed value of c, the iteration would converge to the same stationary distribution as the Gibbs sampling approach (under reasonable assumptions, for a network with fixed knowledge).   And we may assume that as the network learns, eventually c will reach some maximum short of 1 (c=.9999 or whatever).   Under this assumption, it seems PLN iteration with adaptive revision weight will converge to the stationary distribution — eventually.

So the apparent conclusion of this somewhat sketchy mathematical thinking (if all the details work out!) is that, if one makes the (unrealistic) assumption of a fixed body of knowledge in the system,

  • The current PLN revision-based approach will get to the same place as the hypothetical Gibbs Sampling based approach outlined in Thought-Experiment #1 above
  • In this setting, we don’t need trails.  Dependencies will take care of themselves eventually as the network iterates.  (i.e., since Gibbs sampling doesn’t need trails, and the standard PLN approach is equivalent to Gibbs sampling on second-order distributions in the long run, the standard PLN approach also doesn’t need trails)

Now, it may be that trails are still useful in the short run.   On the other hand, there seem other ways to handle the matter.  For instance: If one has a sub-network of tightly interlinked Atoms, then one can do a lot of inference on these Atoms, i.e. accelerating the iterative sampling process as regards the relationships between these Atoms.  In this way the mutual dependencies among those Atoms will get resolved faster, much as if one were using trails.

Thought-Experiment #2

Finally, I’ll present a less extreme thought-experiment, which I think has a greater likelihood of actually being useful for PLN in OpenCog.

Instead of having two truth values per Atom — one the primary, traditional PLN truth value and the other an instantaneous truth value used for Gibbs sampling — what if one had two truth values, both updated via the standard PLN approach, but with widely differing default revision weights?

The standard default revision weight in PLN now is driven by the confidence factor

c = n/(n+k)

where n is a number of observations, and k is the “personality parameter.”  But layered on top of this (in the PLN theory, though not currently in the code), is a “confidence decay factor”, which decays confidence values over time.

One possibility would be to have two different truth values associated with each Atom: one conservative and one adventurous.   The two would differ in their personality parameters.  The conservative truth value would get updated with a small value of k, meaning that it would tend to weight its past experience highly and its new conclusions not so much.   The adventurous truth value would get updated with a large value of k, meaning that it would weight its new conclusions much more than its past experience.

What Thought Experiment #1 teaches us is that: As k goes to infinity, if one follows a simple inference control strategy as outlined there, the adventurous truth value will basically be getting updated according to Gibbs sampling (on second order probability distributions).

We have seen that both the adventurous and conservative truth values will converge to the same stationary distribution in the long run, under unrealistic assumptions of fixed knowledge in the network.  But so what?  Under realistic conditions they will behave quite differently.

There is much to experiment with here.   My point in this post has merely been to suggest some new experiments, and indicate some theoretical connections between PLN, sampling theory, and other probabilistic inference methods like MLN.

OK, that’s a rough summary of my train of thought on these topics at the moment. Feedback from folks with knowledge of PLN, MLNs and sampling would be valued. Am I thinking about this stuff in a sensible way? What do you think?

The current version of this post owes something to a critique of the previous version by Abram Demski.

This entry was posted in Theory, Uncategorized. Bookmark the permalink.