Reframing OpenCog Action Selection: Contextual Bandit Problems and Reinforcement Learning

I thought a bit today about how OpenCog’s action selector (based on the Psi model from Dietrich Dörner and Joscha Bach) relates to approaches to action selection and behavior learning one sees in the reinforcement learning literature.

After some musing, I came to the conclusion that this may be another area where it could make sense to insert a deep neural network inside OpenCog, for carrying out particular functions.

I note that Itamar Arel and others have proposed neural-net-based AGI architectures in which deep neural nets for perception, action and reinforcement are coupled together.   In OpenCog, one could likewise use deep neural nets for perception, action and reinforcement; but these networks would all communicate via the Atomspace, and in this way would be interfaced with symbolic algorithms such as probabilistic logical inference, hypergraph pattern mining and concept blending, as well as with each other.

One interesting meta-point regarding these musings is that they don’t imply any huge design changes to the OpenCog “OpenPsi” action selector.   Rather, one could implement deep neural net policies for action selection, learned via reinforcement learning algorithms, as a new strategy within the current action selection framework.   This speaks well of the flexible conceptual architecture of OpenPsi.

Action Selection as a Contextual Bandit Problem

For links to fill you in on OpenCog’s current action selection paradigm and code, start here.

The observation I want to make in this blog post is basically this: the problem of action selection in OpenCog (as described at the above link and the others it points to) is an example of the “contextual bandit problem” (CBP).

In the case where more than one action can be chosen concurrently, we have a variation of the contextual bandit problem that has been called “slates” (see the end of this presentation).

So basically the problem is: We have a current context, and we have a goal (or several), and we have a bunch of alternative actions.  We want in the long run to choose actions that will maximize goal achievement.  But we don’t know the expected payoff of each possible action.  So we need to make a probabilistic, context-dependent choice of which action to do; and we need to balance exploration and exploitation appropriately, to maximize long-term gain.   (Goals themselves are part of the OpenCog self-organizing system and may get modified as learning progresses and as a result of which actions are chosen, but we won’t deal with that part of the feedback tangle here.)
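To make the bandit framing concrete, here is a minimal Python sketch of an epsilon-greedy contextual bandit loop.  The context features, the number of actions and the reward function are all made-up stand-ins for OpenCog’s actual context representation and goal-satisfaction signals:

import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 4, 8
weights = np.zeros((n_actions, n_features))        # per-action linear payoff models
true_w = rng.normal(size=(n_actions, n_features))  # hidden "environment", for the demo only
epsilon, lr = 0.1, 0.05                            # exploration rate, learning rate

def observe_context():
    # Stand-in for reading the current situation out of the Atomspace.
    return rng.normal(size=n_features)

def observe_reward(context, action):
    # Stand-in for the degree of goal satisfaction following the action.
    return float(true_w[action] @ context) + rng.normal(scale=0.1)

for step in range(1000):
    x = observe_context()
    estimates = weights @ x                        # predicted payoff of each action here
    if rng.random() < epsilon:
        a = int(rng.integers(n_actions))           # explore
    else:
        a = int(np.argmax(estimates))              # exploit
    r = observe_reward(x, a)
    weights[a] += lr * (r - estimates[a]) * x      # update the chosen action's model

The point is just the shape of the loop: observe a context, choose an action partly at random, observe how well the goal was served, and update the estimate for that (context, action) combination.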

Psi and MicroPsi and OpenPsi formulate this problem in a certain way.   Contextual bandit problems represent an alternate formulation of basically the same problem.

Some simple algorithms for contextual bandit problems are here and a more interesting neural net based approach is here.   A deep neural net approach for a closely related problem is here.

CBP and OpenPsi

These approaches and ideas can be incorporated into OpenCog’s Psi-based action selector, though this would involve using Psi a little differently than we do now.

A “policy” in the CBP context is a function mapping the current context into a set of weightings on implications of the form (Procedure ⇒ Goal).

Most of the time in the reinforcement learning literature a single goal is considered, whereas in Psi/OpenCog one considers multiple goals; but that’s not an obstacle to using RL ideas in OpenCog.   One can use RL to figure out procedures likely to lead to fulfillment of an individual goal; or one can apply RL to synthetic goals defined as weighted averages of system goals.
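As a trivial sketch of the second option (the goal names, satisfaction levels and weights here are all hypothetical), the scalar reward fed to an RL algorithm could simply be a weighted average of the satisfaction levels of the system’s goals:

# Hypothetical goal-satisfaction levels in [0, 1], as might be read off the
# system's goal Atoms at a given moment.
goal_satisfaction = {"novelty": 0.7, "energy": 0.4, "social": 0.9}

# Hypothetical importance weights attached to each goal.
goal_weights = {"novelty": 0.2, "energy": 0.5, "social": 0.3}

# Synthetic scalar reward: weighted average of the goal satisfactions.
reward = (sum(goal_weights[g] * goal_satisfaction[g] for g in goal_weights)
          / sum(goal_weights.values()))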

What we have in OpenPsi are implications of the form (Context & Procedure ⇒ Goal) — obviously just a different way of doing what RL is doing…

That is:

  • In RL one lists Contexts, and for each Context one has a set of (Procedure ⇒ Goal) pairs
  • In Psi one lists (Context & Procedure ⇒ Goal) triples (“Psi implications”)

and these two options are obviously logically equivalent.
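The equivalence is just a matter of how the same information is indexed; here is a toy illustration in Python (placeholder names throughout):

# RL-style view: for each Context, a set of (Procedure, Goal, weight) pairs.
rl_view = {
    "ctx_A": [("proc_1", "goal_X", 0.8), ("proc_2", "goal_X", 0.3)],
    "ctx_B": [("proc_1", "goal_X", 0.1)],
}

# Psi-style view: a flat list of (Context, Procedure, Goal, weight) implications.
psi_view = [(ctx, proc, goal, w)
            for ctx, pairs in rl_view.items()
            for proc, goal, w in pairs]

# Grouping psi_view by its first element recovers rl_view exactly, so the two
# representations carry the same information.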

A policy, in the above sense, could be used to generate a bunch of Psi implications with appropriate weights.   In general a policy may be considered as a concise expression of a large set of Psi implications.

In CBP learning what we have is, often, a set of competing policies (e.g. competing linear functions, or competing neural networks), each of which provides its own mapping from contexts into (Procedure ⇒ Goal) implications.   So, to do action selection in this approach, one would first choose a policy, then use that policy to generate weighted (Context & Procedure ⇒ Goal) implications (where the Context is very concrete, being simply the current situation), and then use the weights on these implications to choose an action.
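A rough Python sketch of this two-stage selection follows.  The policy objects, the running scores and the epsilon-greedy policy choice are illustrative assumptions; real policies might be linear models or neural nets, and the policy-level choice could itself use a proper bandit algorithm such as EXP4:

import random

def select_action(policies, policy_scores, context, epsilon=0.1):
    # Stage 1: choose a policy.  Occasionally explore; otherwise take the
    # policy with the best running score.
    if random.random() < epsilon:
        name = random.choice(list(policies))
    else:
        name = max(policy_scores, key=policy_scores.get)
    policy = policies[name]

    # Stage 2: the chosen policy maps the concrete current context into
    # weights over candidate procedures, i.e. it instantiates weighted
    # (Context & Procedure => Goal) implications for this situation.
    weights = policy(context)                  # dict: procedure -> weight
    procedures = list(weights)
    action = random.choices(procedures, weights=[weights[p] for p in procedures])[0]
    return name, action

def update_policy_score(policy_scores, name, reward, lr=0.1):
    # After acting, nudge the chosen policy's running score toward the
    # observed goal-satisfaction signal.
    policy_scores[name] += lr * (reward - policy_scores[name])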

In OpenCog verbiage, each policy could in fact be considered a context, so we could have

ContextLink
    ConceptNode "policy_5"
    ImplicationLink
        AndLink
            Context
            Procedure
        Goal

and one would then do action selection using the weighting for the current policy.

If, for instance, a policy were a neural network, it could be wrapped up in a GroundedSchemaNode.    A neural net learning algorithm could then be used to manage an ensemble of policies (corresponding behind the scenes to neural networks), and experiment with these policies for action selection.
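As a sketch of what the wrapping might look like, here is a hypothetical two-layer policy network written as a plain Python function; in OpenCog such a function can be referenced from the Atomspace via a GroundedSchemaNode using the "py:" naming convention (the network, its feature extraction and the exact Atomspace wiring are assumptions, not existing code):

import numpy as np

# Hypothetical two-layer network standing in for a learned policy.
rng = np.random.default_rng(5)
W1 = rng.normal(scale=0.1, size=(16, 8))
W2 = rng.normal(scale=0.1, size=(4, 16))

def policy_5(context_features):
    # Map an 8-dimensional context feature vector to a softmax weighting
    # over 4 candidate procedures.  This function could be referenced from
    # the Atomspace as GroundedSchemaNode "py: policy_5" (wiring omitted).
    h = np.tanh(W1 @ context_features)
    logits = W2 @ h
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()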

This does not contradict the use of PLN to learn Psi implications.   PLN would most naturally be used to learn Psi implications with abstract Contexts; whereas in the RL approach, the abstraction goes into the policy, and the policy generates Psi implications that have very specific Contexts.   Both approaches are valid.

In general, the policy-learning-based approach may often be better when the Context consists of a large number of different factors, with fuzzy degrees of relevance.  In this case learning a neural net mapping these contextual factors into weightings across Psi implications may be effective.   On the other hand, when the context consists of a complex, abstract combination of a smaller number of factors, a logical-inference approach to synthesizing Psi implications may be superior.

It may also be useful, sometimes, to learn neural nets for CBP policies, and then abstract patterns from these neural nets using pattern mining; these patterns would then turn into Psi implications with abstract Contexts.

(Somewhat Sketchy) Examples

To make these ideas a little more concrete, let’s very briefly/roughly go through some example situations.

First, consider question-answering.   There may be multiple sources within an OpenCog system, capable of providing an answer to a certain question, e.g.:

  • A hard-wired response, which could be coded into the Atomspace by a human or learned via imitation

  • Fuzzy matcher based QA taking into account the parse and interpretation of the sentence

  • Pattern matcher lookup, if the Atomspace has definite knowledge regarding the subject of the query

  • PLN reasoning

The weight to be given to each method’s answer, in each case, needs to be determined adaptively based on the question and the context.

A “policy” in this case would map some set of features associated with the question and the context, into a weight vector across the various response sources.

One question is how best to quantify the “context” in a question-answering case.  The most obvious approach is to use word-occurrence or bigram-occurrence vectors.  One can also potentially add in, say, extracted RelEx relations or RelEx2Logic relations.

If one has multiple examples of answers provided by the system, and knows which answers were accepted by the questioner and which were not, then this knowledge can be used to drive learning of policies.   Such a policy would tell the system, given a particular question and the words and semantic relationships therein as well as the conversational context, which answer sources to rely on with what probabilities.
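As a rough sketch of such a policy (the source names follow the list above; the bag-of-words featurization and the simple update rule are assumptions, not existing OpenCog code):

import numpy as np
from collections import defaultdict

SOURCES = ["hardwired", "fuzzy_matcher", "pattern_matcher", "pln"]

vocab = defaultdict(lambda: len(vocab))                  # word -> feature id, grows as needed
weight_rows = {s: defaultdict(float) for s in SOURCES}   # sparse linear model per source

def featurize(question):
    # Bag-of-words features; extracted RelEx / RelEx2Logic relations could be
    # added as extra "words" in the same spirit.
    return [vocab[w] for w in question.lower().split()]

def source_weights(question):
    feats = featurize(question)
    scores = np.array([sum(weight_rows[s][f] for f in feats) for s in SOURCES])
    exp = np.exp(scores - scores.max())
    return dict(zip(SOURCES, exp / exp.sum()))            # probability of relying on each source

def record_feedback(question, source_used, accepted, lr=0.1):
    # If the questioner accepted the answer, strengthen this source's weights
    # on the question's features; if not, weaken them.
    target = 1.0 if accepted else 0.0
    current = source_weights(question)[source_used]
    for f in featurize(question):
        weight_rows[source_used][f] += lr * (target - current)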

A rather different example would be physical movement.   Suppose one has a collection of movement patterns (e.g. “animations” moving parts of a robot body, each of which may have multiple parameters).   In this case one has a slate problem, meaning that one can choose multiple movement patterns at the same time.   Further, one has to specify the parameters of each animation chosen; these are part of the action.   Here a neural network will be very valuable as a policy representation, as one’s policy needs to take in floating-point variables quantifying the context, and output floating-point variables representing the parameters of the chosen animations.   Real-time reinforcement data will be easily forthcoming, thus driving the underlying neural net learning.

(If movement is controlled by a deep neural network, these “animations” may be executed via clamping them in the higher-level nodes of the network, and then allowing the lower-level nodes to self-organize into compatible states, thus driving action.)
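A small numpy sketch of such a policy is below; the network shape, the number of animations and parameters, and the threshold rule for forming the slate are all illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
n_context, n_hidden, n_anim, n_params = 12, 32, 6, 3

# Hypothetical single-hidden-layer policy network with two output heads.
W1 = rng.normal(scale=0.1, size=(n_hidden, n_context))
W_select = rng.normal(scale=0.1, size=(n_anim, n_hidden))             # slate-selection head
W_params = rng.normal(scale=0.1, size=(n_anim * n_params, n_hidden))  # parameter head

def movement_policy(context, threshold=0.5):
    # Map context features to (chosen animations, their parameter vectors).
    h = np.tanh(W1 @ context)
    select_prob = 1.0 / (1.0 + np.exp(-(W_select @ h)))   # per-animation probability
    params = (W_params @ h).reshape(n_anim, n_params)     # continuous parameters
    # The slate: every animation whose probability exceeds the threshold may be
    # triggered concurrently, each with its own parameter vector.
    chosen = np.nonzero(select_prob > threshold)[0]
    return [(int(i), params[i]) for i in chosen]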

Obviously a lot of work and detailed thinking will be required to put these ideas into practice.  However, I thought it might be useful to write this post just to clarify the connections between parts of the RL literature and the cognitive modeling approach used in OpenCog (drawn from Dorner, Bach, Psi, etc.).   Often it happens that the close relationships between two different AI approaches or subfields are overlooked, due to “surface level” issues such as different habitual terminologies or different historical roots.

Potentially, the direction outlined in this post could enable OpenCog to leverage code and insights created in the deep reinforcement learning community, and could enable deep reinforcement learning networks to be used in more general-purpose ways via embedding them in OpenCog’s neural-symbolic framework.
