COVID-19 Modelling and Random Social Networks

Small random network

Seems like everyone wants to be an epidemiologist these days, so why not OpenCog? After all, diseases spread through networks, propagating from one node to the next. A network is a graph, the AtomSpace is a graph database, and the value flow subsystem provides a convenient infrastructure for propagating data (e.g. communicable diseases) from one node to the next. How hard could it be to model the progression of disease through a social network? The answer? Pretty easy.

The full working demo is here. It highlights the brand-new network generation codebase. The network generator is an AtomSpace module, specialized for generating random networks that are constrained by grammatical rules. This is in the same sense as “grammar” in linguistics, or computer science: a description of syntax. With an appropriate grammar, one can generate linguistic parse trees. But there’s no need to constrain oneself to trees: one can generate graphs with undirected edges and loops. The network generator is generic. That said, please note: this is version 0.1 of the network generator! It is needed for several projects, but it is completely brand-new, and is not a complete, finished, final product.

Note also: this is a technology demo! This is not a full-scale medical epidemiology tool. It could be used for scientific research — but, for that, it would need to be extended and attached to other scientific tools, such as data analysis, graphing and network visualization tools. Such tools are commonplace and abundant; we won’t dwell on them here. The demo here is a demo of network generation, and of running a state transition machine on a network. It’s a demo of using Atomese for these sorts of computations, playing to the strengths of Atomese.

The demo consists of four parts. The first part demonstrates how to write a finite state machine in Atomese. The second demonstrates the current network generation API. Third: actually run the danged thing, and fourth: collect up and print some stats.  Before starting, though, a few words about Atomese.

Atomese

Atomese is a graphical programming language. Unlike almost all other programming languages, it was designed for automation, and not for humans. Atomese expressions are graphs; they live as graphs in the AtomSpace graph database. Atomese encodes “Abstract Syntax Trees” (see the linked Wikipedia article). Such a representation is useful for all of the reasons that the Wikipedia article describes. But it also means that “coding” in Atomese can sometimes feel unusual. Like: “Why can’t I just code in python? It would be so much easier!” — Well, but that misses the point. And the point is, again: Atomese was designed for automation.

What this really means is that the graphs are meant to be easy for other algorithms to manipulate. It’s designed for ease-of-use by machines, not for humans! The style is more barren, unadorned, without short-cuts. It’s a bit verbose at times. That’s OK, machines don’t mind verbosity and tedium.

Consider, for example, a graphical editor – something where you can drag-n-drop, hand-draw interacting bubble diagrams. Something easy to use – something a medical professional could get the hang of in a few hours – something where a disease model could be sketched out with bubbles and lines and flow-charts. The goal of Atomese is that it becomes easy — really easy — to convert such diagrams into executable code. That’s what Atomese is designed to do.

In this demo, the disease progression model is written in Atomese. This is the model that a graphical GUI WYSIWYG whizz-bang system would generate.

Atomese enables other, more abstract approaches. Consider an automatic model explorer: a system that automatically creates a variety of different disease progression models, and then explores how each model works. The model generator might try to fit an existing dataset. Alternately, it could mutate models so as to obtain certain characteristics. This moves beyond just adjusting some parameters in some hand-created, hypothesized model. This is not just some Monte Carlo search or machine-learning hill-climbing for parameter tuning. This is the wholesale creation of previously non-existent code. Atomese allows software to read, understand, mutate, modify and write programs.

State transitions

There are two state transition rules. One governs the transmission of disease between pairs of individuals, and the other controls the disease progression in an infected individual. Consider first the evolution of the disease state in a single individual.

 (Cond
    (And
       (Equal (ValueOf (Variable "$A") seir-state) exposed)
       (GreaterThan
          (ValueOf (Variable "$A") susceptibility)
          (RandomNumber (Number 0) (Number 1))))
    (SetValue (Variable "$A") seir-state infected))

I’m hoping this is self-explanatory. The (Variable “$A”) is a person, an individual in the network. The seir-state is a location – a lookup key, holding the “SEIR” state: “susceptible”, “exposed”, “infected”, “recovered”. If the individual is exposed, and a random draw falls below that individual’s susceptibility, they will become infected. All of the state transitions can be written in this form.
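
For reference, the scheme names used above (seir-state, exposed, susceptibility, and so on) are just ordinary scheme bindings to Atoms. A minimal sketch of how they might be defined is shown below; the actual demo may use somewhat different names, so treat these as illustrative.

 ; Keys under which per-person state is stored.
 (define seir-state (Predicate "SEIR state"))
 (define susceptibility (Predicate "Susceptibility"))

 ; The four SEIR states themselves.
 (define susceptible (Concept "susceptible"))
 (define exposed (Concept "exposed"))
 (define infected (Concept "infected"))
 (define recovered (Concept "recovered"))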

The above is written in scheme, but you could also use Python — the AtomSpace has Python bindings. Arguing about Python vs. something else kind-of misses the point: the above Atomese is actually a graph (a tree, in this case) that happens to encode, in its nodes and links, an abstract syntax tree (AST) specifying an executable program. Atomese allows you to work directly with the AST, whether in scheme, in python, or just abstractly, at a higher layer. The Wikipedia article on ASTs explains exactly why this is a useful thing to have.

The person-to-person transmission rule looks like this:

(Cond
  (And
     (Equal (ValueOf (Variable "$A") seir-state) susceptible)
     (Equal (ValueOf (Variable "$B") seir-state) infected)
     (Or 
        (And
           (Equal (Variable "$REL") (Concept "friend"))
           (GreaterThan
              (RandomNumber (Number 0) (Number 1)) 
              (Number 0.3)))
        (And
           (Equal (Variable "$REL") (Concept "stranger"))
           (GreaterThan
              (RandomNumber (Number 0) (Number 1)) 
              (Number 0.7)))))
  (SetValue (Variable "$A") seir-state exposed))

Again, this is hopefully obvious. If A is susceptible and B is infected, then maybe A becomes exposed, if A gets close enough/spends enough time with B. A passing stranger is unlikely to lead to exposure; but friends are harder to avoid and thus one is more likely to be exposed to an infected friend.

Network grammar

A single seed

The network grammar in this demo is extremely simple. It specifies two types of edges: “friend” and “stranger”, used for different types of social encounter. The grammar itself just consists of a set of “seeds” or “burrs” – a single vertex (a prototype individual) together with some half-edges on it – potential edges that could form either “friend” or “stranger” relationships.

It’s perhaps most useful to imagine each burr as a jigsaw-puzzle piece – in this case, with only two types of connectors, “friend” and “stranger”, varying in quantity from a few to many. The visuals show a seed with some attached half-edges, and the process of connecting them. It’s painfully obvious, isn’t it?

Connecting puzzle pieces

A single seed, expressed in Atomese, looks like this:

(Section
   (Concept "person-2-3")
   (ConnectorSeq 
      (Connector (Concept "friend") (ConnectorDir "*"))
      (Connector (Concept "friend") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))))

It has two unconnected “friend” connectors, and three unconnected “stranger” connectors. The connectors are unpolarized – they have no directionality, so that the resulting relationships are symmetrical. That is, unlike an actual jigsaw-puzzle piece, the “*” ConnectorDir doesn’t have a “direction”; it allows an any-to-any connection to be made. When a connection is formed, the connector labels must match – thus, in this case, two different kinds of edges are formed in the final graph.
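
To give a feel for what a richer grammar looks like, here is a second seed written in the same style: a more sociable prototype, with three “friend” and five “stranger” half-edges. This particular seed is illustrative only; it is not necessarily one of the seeds used in the demo.

(Section
   (Concept "person-3-5")
   (ConnectorSeq
      (Connector (Concept "friend") (ConnectorDir "*"))
      (Connector (Concept "friend") (ConnectorDir "*"))
      (Connector (Concept "friend") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))))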

A pair of connected seeds

Further below is an example random network, generated by the current system. It’s long, and thin – very long and thin – as controlled by the network generation parameters.  Yes, a network of this shape is not very realistic. Exploring different network shapes is part of the fun! Adjustable parameters include what one might expect: a maximum size, number of graphs to generate, a set of starting points that must be bridged, and so on. Again: this is version 0.1; there is more to come.

Graph generation parameters are also specified in Atomese. For example, the maximum network size:

(State (Member max-network-size params) (Number 2000))

The MemberLink denotes set membership. The StateLink is a link that allows one and only one relation at a time. In this case, the MemberLink can be associated with only one Atom: the NumberNode Atom.
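
Other parameters are set in exactly the same style: a MemberLink names the parameter, and a StateLink holds its single current value. The two settings below are purely illustrative; the parameter names and values actually used in the demo may differ.

(State (Member num-networks params) (Number 1))      ; how many graphs to generate
(State (Member max-solutions params) (Number 1))     ; hypothetical: stop after one solution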

Initializing the network

After generating the network, one now has a number of interconnected individuals. Some initial values and state must be assigned to each. This is straightforward, in Atomese:

 (Bind
    (TypedVariable (Variable "$person") (Type "ConceptNode"))
    (Present (Member (Variable "$person") anchor))
    (SetValue (Variable "$person") seir-state susceptible)
    (SetValue (Variable "$person") susceptibility
       (RandomNumber (Number 0.2) (Number 0.8)))
    (SetValue (Variable "$person") infirmity
       (RandomNumber (Number 0.01) (Number 0.55)))
    (SetValue (Variable "$person") recovery
       (RandomNumber (Number 0.6) (Number 0.95))))

The BindLink is a graph-rewriting link. It searches for all graphs of a certain shape, having a variable region in them, and then takes that variable and generates some new graphs. In this particular case, the variable is obvious: the first stanza is just a type declaration for the variable. The second stanza, the PresentLink, asserts that the indicated graph must be found in the AtomSpace. It’s a very simple graph here: just a MemberLink that must be present. The remaining stanzas are then instantiated and executed, initializing various keys on each individual with random values.

A very long, thin social network

Running the simulation

The spread of the disease across the network can be achieved using the same graph-query and graph-rewriting mechanism. Consider the graph query below.

 (Bind
   (VariableList
      (TypedVariable (Variable "$A") (Type "ConceptNode"))
      (TypedVariable (Variable "$B") (Type "ConceptNode")))
   (And
      (Present
         (Evaluation
            (Concept "friend")
            (UnorderedLink (Variable "$A") (Variable "$B"))))
      (Equal (ValueOf (Variable "$A") seir-state) susceptible)
      (Equal (ValueOf (Variable "$B") seir-state) infected))
   (SetValue (Variable "$A") seir-state exposed))

This locates all “friendship” graphs: one must find, present in the AtomSpace, a very simple pairwise relation between A and B, marked with the label “friend”. Conventionally, EvaluationLinks are used for this purpose. Having found this relationship, and if B is infected, and A is not yet ill, nor has developed immunity (that is, if A is susceptible), then A becomes exposed. The UnorderedLink is a link-type that allows any ordering of its contents; during searches, all permutations are explored. It’s used here to encode a mutual relationship between two friends.

After running such a disease-transmission step, one then runs a state-change step, using the SEIR rules described up top. And then one repeats … looping, until the population has succumbed, or there has been a break in the transmission that keeps a portion of the population isolated and uninfected.
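
As a rough sketch: assuming the transmission rule and the state-transition rule above have been bound to the scheme names trans-rule and state-rule (illustrative names; the demo may differ), the main loop is just a matter of calling cog-execute! on each, over and over.

 (define (run-one-step)
    (cog-execute! trans-rule)    ; spread exposure along the social edges
    (cog-execute! state-rule))   ; advance the SEIR state of each individual

 (define (run-simulation nsteps)
    (when (> nsteps 0)
       (run-one-step)
       (run-simulation (- nsteps 1))))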

Printing Results

The graph query language is an all-purpose tool. Network stats are straightforward to query. Of course! Like so:

 (Get
    (TypedVariable (Variable "$indiv") (Type "ConceptNode"))
    (And
       (Present (Member (Variable "$indiv") anchor))
       (Equal (ValueOf (Variable "$indiv") seir-state) infected)))

The above just searches for all individuals, identified by their membership in a common anchor point, and then checks to see if their state is “infected”. This returns a list of all of those individuals; one need merely count to see how many of them there are.
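
For example, assuming the GetLink above has been bound to the scheme name infected-query (an illustrative name), the count is obtained like so:

 ; Executing a GetLink returns a SetLink wrapping everything that matched;
 ; the number of infected individuals is the size of that set.
 (define num-infected
    (length (cog-outgoing-set (cog-execute! infected-query))))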

Putting it together

The demo is fully functional. Start a guile shell, load the demo file, and off it will go. Network generation may take anywhere from a small fraction of a second to over ten seconds; that’s the only hiccup. The source is heavily documented. If you get into trouble, the wiki links provide additional documentation. If Atomese is confusing, then perhaps a review of the basic AtomSpace examples is in order.

The adventurous may try translating the demo to Python. The AtomSpace has Python bindings. They work. There’s no particular reason that Atomese has to be written in scheme (although, clearly, I find it preferable: it’s easier and simpler than Python. For me, at least.)

Conclusion

The AtomSpace is an in-RAM (hyper-)graph database with a powerful associated graph query language, specifically formulated for working with large, complex graphs, for re-writing and editing those graphs, and for managing the flow of state through the graph. Modelling disease networks is one of the things it should be good at … and I hope, with this blog post, I’ve convinced you that this is indeed the case.


Value Flows

Graphs and graphical databases are now accepted as a very good (the best?) way of capturing the relationship between things. Yet, many of the “things” represented graphically are actually processes.  An example might be the water cycle in nature: rain collects into rivers and lakes, flows into oceans, and evaporates again. Rainwater erodes rocks, is absorbed into soils, is taken up by tree roots. This can be represented as a network graph, with nodes and links representing each of the relationships. To be useful, though, the network must be annotated with time-varying data: how much rain fell, how much was absorbed into the ground, and how much was run-off. These are the “values” that inhabit the network graph.

The OpenCog AtomSpace has long supported value annotation; it now has improved support for dynamically changing values. The goal of this blog post is to describe this a bit. But first, it’s important to draw a sharper distinction between graphs and values. Graphs are searchable — queryable — as typically, one wishes to find all graph patterns of a given shape, to exclude certain sub-patterns from a search, and so on. Performing such searches has a design impact on how data is represented in RAM: indexes must contain lists of nearest neighbors, so that the graph can be rapidly walked. Parallel searches suggest that lock-free operations, via immutable structures, are best. This in turn implies that graph nodes and links are fairly heavy-weight, and that updates – adding and deleting graph nodes and links – require CPU-consuming updates to various indexes. This is not particularly different from any other database: there are always indexes provided to speed up searches of a given kind, and updating those indexes takes time.

Values are different: the key idea behind Values is that they do not need to be indexed. If a rain gauge changes from 0.2 to 0.3, there is no imminent need to remove the old value from an index (a sorted, addressable, ordered data structure: a tree or hash table) and then place the new value into that index. If the time-changing value is sound (audio), or some video pixels, then indexing this data is not just absurd, but impossible in practice. Thus, as a database design, one is driven to think of two distinct classes of data: the graph data, and the ephemeral, changing values.

This is the distinction made in the OpenCog AtomSpace. Nodes and Links – the AtomSpace Atoms – are part of the graph. Each Atom has its own “little” key-value database attached to it: this holds the Values. We use a capital-V Value to distinguish the actual AtomSpace object from the general notion of values. Originally, Values were TruthValues – a binary 0 or 1, T or F, providing a crisp predicate-logic valuation to each graph node or link. The obvious modern extension is to a Bayesian network – where each TruthValue becomes a probability. For uncertain inference, it is convenient to maintain a second number: the certainty. But why stop there? Values can be anything: vectors of numbers, strings of text, records of data, trees and DAGs. Anything that might need to be attached to a graph node or link, anything that seems transient, fleeting, and does not need indexing.
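
For the concretely-minded, here is a small sketch of what attaching Values looks like from the scheme bindings; the atom and key names are made up for illustration.

(cog-set-value! (Concept "rain gauge") (Predicate "reading")
   (FloatValue 0.2 0.3 0.1))                        ; a vector of numbers
(cog-set-value! (Concept "rain gauge") (Predicate "notes")
   (StringValue "rooftop sensor" "recalibrated"))   ; a vector of strings

(cog-value (Concept "rain gauge") (Predicate "reading"))  ; fetch it back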

Fleeting implies flowing, or changing: the metaphor is that of pipes, and fluids that flow in them. The graph is the description of how the pipes are connected to one-another. Like actual plumbing, it can be a bit laborious to assemble this. The Values are like the fluid: flowing effortlessly, now that the graph has been assembled.

But what is this like, in practice?  Suppose one has a relationship: if P and Q then R, and one wants to compute some numeric value: say, “if rain and dry soil then fraction absorbed into soil”. This is represented as a formula. Let’s dive right in. An arithmetic formula, expressed in Atomese, looks like this:

(Lambda (VariableList (Variable "$X") (Variable "$Y"))
    (Divide
        (Times (Variable "$X") (Plus (Variable "$Y") (Number 1)))
        (Number 2)))

This is painfully verbose, but you get the idea. This formula is actually a graph: this is why it is so verbose. The (Number 1) is actually a graph node, and the (Plus …) is a link encompassing the Atoms underneath it. This is stored as a graph, for several reasons. One is that one should not assume that human beings (i.e. software programmers) are writing these formulas: they may be coming out of some machine-learning algorithm. As such, it must be possible to perform abstract symbolic manipulations on the formulas (as you might do in your favorite symbolic math system, e.g. Mathematica, Maxima, Sage, …). It must also be possible to evaluate the formulas: we’ll be plugging actual numbers into those variables, and so the ability to compile the formula down to JIT code is desirable.
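
As a quick sanity check of the arithmetic, with the variables replaced by concrete numbers, executing such an expression reduces it to a NumberNode:

(cog-execute!
   (Divide
      (Times (Number 3) (Plus (Number 5) (Number 1)))
      (Number 2)))
; => (NumberNode "9")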

Values are kept in a key-value database, one per Atom (recall: an Atom is just a graph Node, or a Link). To access a specific value, one may write:

(ValueOf (Concept "rain") (Concept "Austin"))

The idea here being that the graph node “rain” has an associated table; one of the table entries is for “Austin” (again, a graph node!), and the value there might be a FloatValue — a list (vector) of floating-point numbers. The above is a graphical representation of how to access that Value. The inverse of this operation is the setting of Values. Putting this all together, one might have an expression like this:

(SetValue (Concept "runoff") (Concept "Austin")
   (Lambda ...)
   (ValueOf (Concept "rain") (Concept "Austin")))

The (Lambda …) here is an un-named, anonymous formula. Like all good systems, formulas can be named, and so (DefineLink (DefinedSchema “runoff calculation”) (Lambda …)) is how one names such things; the name can then be used wherever the formula is expected.

That’s it! The above demonstrates how to wire up a formula, connecting it to Values held in the AtomSpace. There are two kinds of updates possible: manual, and automatic. A “manual” update requires calling cog-execute! on the particular expression to run. That expression will be evaluated, drawing on whatever Values are currently stored, and updating whatever Values are specified as output. The dynamic update is more interesting. Here, the formula is installed in such a way that any time one examines the Value, the formula is triggered, thus running and updating the Values just before they are presented. The dynamic Values are really what provides the title to this blog entry: they capture the idea of Values flowing about, through the network graph.
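
A minimal sketch of a manual update, assuming the SetValue expression above has been bound to a scheme name (here, runoff-formula, chosen purely for illustration):

; Set the current rainfall reading for Austin ...
(cog-set-value! (Concept "rain") (Concept "Austin") (FloatValue 0.3))

; ... run the formula once, recomputing the runoff ...
(cog-execute! runoff-formula)

; ... and look at the freshly-computed result.
(cog-value (Concept "runoff") (Concept "Austin"))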

… But … But, dynamic Values remain a new, experimental feature of the system. There are good reasons for this. First, it is all too easy to create an infinite loop: rain runs off, evaporates, goes into clouds and comes back down as rain again. Polling the current runoff can potentially trigger an unending weather-system simulation. Worse, there is the risk of combinatoric explosion: the water in the clouds comes not only from Austin, but also from other geographical locales, which in turn come from elsewhere still. Controlling combinatorial explosion in AI is an infamously hard problem. Providing practical tools to do so, in this context, remains an open problem. (Yes, we’re working on it.)

The above used rain as a stand-in for a generic flow process. In practical machine learning, one might instead want to use something like TensorFlow, or some other neural-net package. Indeed, such packages already include small, specific graph-specification languages, describing exactly how the neural net is to be wired up, where its inputs are, and where its outputs go. But each is a special case. Weather simulation systems running on supercomputers also have ways of specifying what is connected to what. Then there is the Systems Biology Markup Language (SBML), which provides a way of specifying a differential equation that describes the kidney clearance rate of some blood component – or the mitochondrial metabolism rate. The goal (or rather, the dream, so far) of the OpenCog AtomSpace value-flows subsystem is to provide a common home to all of these kinds of systemic simulation processes.

This journey has hardly begun. Those who are interested are invited to try out the system.  The examples are here, in flows.scm and flow-formulas.scm, although those examples won’t make much sense if you do not already have a good grasp of the overall system: so best start at the beginning, with the README for the tutorials.  A word of caution, though: the world of the AtomSpace is perhaps shockingly different from what you are used to, and shockingly different from what the rest of the world does. This is a barrier to entry. If you are used to Java or JavaScript, and already know GraphQL or Neo4J, you might struggle to understand what is going on, here. For this, we are truly sorry. It really would be nice to make all this more comfortable, more mundane and accessible. That is a real issue.  And, as with everything, we’d love to have help with this. So … call … or write. We’re waiting on you.


The Status of AGI and OpenCog

AGI stands for Artificial General Intelligence: it is the idea of creating a thinking machine, as smart and clever and powerful as a human being. OpenCog is a project pursuing basic research in AGI. It is attempting to build such a thinking machine.

Let’s compare the search for AGI to Astronomy. An amateur with binoculars can do astronomy and have lots of fun, but will not contribute to scientific advance. For that, you need highly-trained scientists, and you need sophisticated instruments (viz. fancy telescopes.)

OpenCog-the-project is a loose affiliation of about a half-dozen “scientists” of varying talents and varying time devoted to it. It has a loosely articulated research program; some roughly-agreed-upon directions to go in.

OpenCog-the-software is a telescope. It’s the primary tool that I (the author of this blog entry) personally use for research. Others use it, but they struggle to be effective. They lose interest after a while. (Seriously: if you had full-time access to some major scientific instrument: some big telescope, particle accelerator, would you be effective? Would you lose interest after a while? What would you even do with it? Use it to explode cans of soda and post the videos on youtube?)

It’s worth mentioning what OpenCog isn’t: it is not a grand theory of AGI. It is not a master blue-print of a machine, waiting to be built, certain of success once completed. We hope to find some portion of that grand theory. We hope to build some portion of that machine. Bit by bit.

Let’s compare this to mainstream big-science. OpenCog the project is comparable to a small department at a small university. A handful of half-active people. OpenCog the software is… well, I spent a decade building it out. It was not clear what to build, or what it should even do. It’s a bit clearer now. I still think it’s on the right track.

The first particle accelerators were built by people like Ben Franklin, who had no clue that there were particles, or that they were being accelerated. Neither of those ideas gelled until over 100 years later. When they did, the first intentionally-designed particle accelerators were built by single individuals, over the course of a few months or a few years. (The story of early telescopes is similar.) In the meantime, the general idea of why particle accelerators are useful became clear. (It’s not obviously obvious – not like a telescope, which can obviously be used for looking. What would you even look at? The neighbor’s house?)

But once you get the idea of what something is, and why it is useful, you can build bigger and better ones, and when you can’t build it yourself in a few years, you can organize a staff of grad students and assistants. Maybe even a part-time engineer. Eventually, the general idea becomes so very clear, and you are so certain of success, that you can raise funds to pour concrete for a new building to house it in.

Neither OpenCog, nor AGI are at that level yet.  Deep learning is like the invention of the Geiger counter: thousands of people went off and built themselves a Geiger counter (or downloaded one off the net) and started messing with it. And they’re having lots of fun with it; some are making great discoveries, while others are building better Geiger counters.  

OpenCog-the-software is not anywhere near as fun as a Geiger counter.  Is it even on the right track?  Well, ten years ago, OpenCog-the-software was almost the only graph database on the planet. Now there are more than you can easily count, all enjoying huge commercial success. So maybe we did something right, when we designed it to be a graph database.  Meanwhile, OpenCog has grown with new features, capabilities, and stuff that no other graph DB has, or has even contemplated.  So I feel like we’re way ahead in the race. But…

But are these features/functions what are actually needed for AGI?  Maybe it’s not a telescope/particle-accelerator at all, but just a crazy back-yard sculpture? How can you tell?

I would like to attract more people to the project, but we (OpenCog) have a way of under-publicizing it. We consistently fail to write up high-quality research; we fail to even write blog posts (making amends, here). We fail to twitter about it. We fail to keep compelling demos running. (Ten years ago, we had a decent question-answering chatbot that could have attracted attention. It … bit-rotted.)

There’s some excuse, some reason for that: every time you create an experiment, and run it, you will typically get some insightful results. While interesting, those results are not “true AGI”, not by a long-shot. So what do you do with your experiment? Announce to the world “Hey, we once again failed to create AGI?” Do you moth-ball the code? Do you even mention it to newcomers? Perhaps we need to keep a gallery of neat things that we’ve built… or at least, a collection of basic experiments with basic instruments that newcomers need to master.

So this is where the project is at. Come join us!


The AtomSpace: a Typed Graphical Distributed in-RAM Knowledgebase

The OpenCog AtomSpace is a typed graphical distributed in-RAM knowledgebase. The meaning of all those buzzwords is that it is trying to fix many different, related problems that computers (and people) have when working with data. This blog-post discusses some basic issues of working with knowledge, and the various ways that people have tried to work around these issues over the decades. It will illuminate why graph databases are explosively popular these days, and will also show how that’s not the end of the story. I’m trying to start the next chapter already.

But first, let’s start at the very beginning. Everyone calls it “data” instead of “knowledge” because everyone is a bit embarrassed to admit that they don’t really know how to stick “knowledge” onto a computer. The 1970’s saw the invention and development of relational databases and their mathematical underpinnings, the relational algebra. This formalized the idea of “is-a”, “has-a”, “is a part of”, “inherits from”, “is-similar-to” and whatever other relation you could possibly think of. Not a bad start for representing knowledge. The API for this relational knowledge eventually congealed into SQL.

A lot of people don’t like SQL. Those three letters make them run in the opposite direction. What’s wrong with SQL? Many things; here’s one: you have to put all your data into tables. You have to invent the table first, before you can use it. You have to plan ahead. Compare this to DataLog, the data-only subset of ProLog. Here, you could declare any fact, any statement, any relation at any time, without prior planning. Great, right? What’s wrong with that? Well, two things: a lack of indexes for quickly finding the data you want, and some trouble searching for it. Indexes are great: you just jump to where you want, instead of having to crawl all over your immense dataset. Another problem was the lack of a uniform query language that ran in some predictable, non-surprising amount of time.

Sometimes structure is nice. For example (a bad example?) with XML, you can also declare knowledge in a free-form fashion: put a tag here, a tag there, anywhere; you’re all good. Mark it up with your markup language. The people who want some structure in their life use XML schemas, which constrain some of this free-form, anything-goes attitude. The people who want to treat XML as a large distributed database have a query language to search XML from all over the web: XQuery, the XML query language. We should be in seventh heaven: world-wide searchable free-form data? What can possibly be wrong with that?

Well, one thing is that you don’t write XML Schemas in XML; you write them in this other weird language. You don’t write XQuery in XML; it’s got its own new, weird and different syntax and structure. It’s a lot like what’s wrong with SQL: an SQL query is not actually a row or a table. It’s so very different from being a “row in a table” that it’s hard to even imagine that. This is one thing that ProLog got right: a ProLog query to find data stored in ProLog is written in ProLog. There’s no clear distinction between the data itself, and the way you work with it. Yes, that’s a good thing. A very good thing, actually.

These various stumbling blocks have driven the explosive popularity of graph databases. You get to write free-form declarations of relationships, abandoning the strictures of tables. You get something a bit more modern than DataLog, which is a bit old and set in its ways: the syntax isn’t liberating. You get something that is infinitely more readable and writable than XML. And, with the well-designed graph databases (I’m thinking of you, grakn.ai) you get to avoid a lot of arcane markup and obtuse documentation. At least in the beginning. Graph databases are the second coming. We should be in heaven by now, right? What could possibly be wrong with graph databases?

Well, in most cases, graph queries (expressed in graph query languages) are not themselves actually graphs. They still sit “outside the system”, like SQL, but unlike ProLog. But when you represent your query as a graph itself, you can do neat things. You can go recursive. You can query for queries. You can find, edit and update queries with other queries. Create new ones on the fly. Even go backwards: given an “answer”, you can ask “what are the questions (the queries) that would return this answer?” This is actually a big deal in chatbot systems, where a human says something, and the chatbot has to search for all patterns (all templates) that match what was said. Amazingly, pretty much no general-purpose database can do this, even though it’s foundational for certain algorithms.

Working directly with queries is a big deal in rule engines, where each rule is effectively a query, and you want to chain them together. But first you have to find the rule. How do you do that, if you can’t query your queries? Of course you can hack it. But seriously? Hack? And then there’s theorem-proving and inferencing, where the reasoning/inferencing has to pick and choose through a collection of rules to find which ones might be attachable next. And sometimes, you want to learn new rules: statistically, by counting, or randomly, by trial and error and cross-over (aka “genetic algorithms”). Perhaps by hill-climbing or gradient-descent. If you want to learn a new query, it’s best if you can store the query itself in your database.

There’s also a different problem. Graph databases don’t have a type system. It turns out that types are incredibly useful for structuring your data. They’re the foundation-stone of object-oriented programming. A Java class is a type. A Python class/object is a type. C++ classes are types. For a while, object-oriented databases were all the rage. This is in contrast to SQL, which has a poor type system with only a handful of types: a table. A row in a table. An entry in a row in a table. Which could be a string, or a number. That’s pretty much it. Compare that to every object you’ve ever known.

But there’s something way cooler about types that the Java&Python crowd has missed, but the functional-programming crowd (Scala and Haskell) has grokked: type constructors. The true geeks have gone even farther: proofs-as-programs, aka Curry-Howard correspondence. All the hip theorem provers (Agda and Coq) are dependent on types. What makes type constructors neat? Well, you can write programs to compute types, but unlike ordinary programs, they are guaranteed to terminate. This means that many basic tasks become simple, doable, predictable.

Like what? Well, reasoning, inference, learning. The sorts of things you want your AI/AGI system to do. It’s not an accident that theorem provers are fundamentally tangled with type theory – blecch, you might think, but reasoning and theorem proving are kind-of the same thing. Who needs all that when we’ve got deep-learning neural nets? OK, off-topic – my earlier blog post dealt with how neural nets and symbolic reasoning are a lot more similar than they are different. The similarities are a bit hidden, but they are there, and clear, once you see them. Anyway, off-topic, but when deep-learning hits a wall, and it will, it will be because it lacks any notion of types. Not that it matters just right now: neural nets already lack a formal relational database structure. You can’t ask a neural net “How many sales has Joe Smith made this month?”

Here’s my sales pitch: you want a graph database with a sophisticated type system built into it. Maybe you don’t know this yet. But you do. You will. You’ll have trouble doing anything reasonable with your knowledge (like reasoning, inferencing and learning) if you don’t. This is why the OpenCog AtomSpace is a graph database, with types.

There’s one more thing that the AtomSpace does, that no other graph database does (well, there are several things that no one else does, but this is the only one I want to talk about now). They’re called “Values”. Turns out there’s a cost to storing a graph: you have to know which edge is attached to what vertex, if you are going to walk the graph. You have to keep an index to track edges (I can hear the boss now: “You’re telling me we have five guys named Joe Smith in Sales? Which one did we promote?”) Bummer if your database can’t keep track of who’s who. But sometimes you don’t need to stick every last piece of trivia in an index, to place it where it’s instantly searchable. Sometimes, a key-value store is enough. This is kind-of the noSQL ethos: Got the key? Get the value; don’t need a query to do that.

Every vertex, every edge in the AtomSpace has a key-value database built into it. It’s small, it’s lightweight. It enables fast-changing data without the overhead of indexing it. It takes up less RAM, literally. We’re experimenting with using it for streaming data: video streams, audio streams. Hooking them up the way that tensorflow might hook up data with Keras. It’s pretty neat. You can stick truth values, attention values, anything changing quickly over time, into Values. Any floating point number. And if you have a number that almost never changes (like Joe Smith’s salary), and you really do want to query it (so you can pay him), then just store that number in the graph database (and not in the per-edge key-value mini-database). You have a choice. Speed when you need it, searchability where you need it.
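
As a tiny sketch of that choice (the names are illustrative): the fast-changing number goes into a Value attached to the Atom, while the rarely-changing, queryable fact goes into the graph proper.

; Fast-changing and unindexed: stash it in a Value on the Atom.
(cog-set-value! (Concept "Joe Smith") (Predicate "sales this month")
   (FloatValue 17))

; Rarely-changing and queryable: make it part of the graph itself.
(Evaluation (Predicate "salary")
   (List (Concept "Joe Smith") (Number 50000)))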

Well, that’s it for now. The OpenCog AtomSpace is an in-RAM distributed typed graph knowledgebase. I guess I forgot to talk about “distributed”. It is; run the demo. You can share data across many different machines. It’s in-RAM, which mostly means that you can work with things at more-or-less the speed of accessing RAM. Unlike SQL, it has a rich type system. Unlike SQL or pretty much any graph database (but like ProLog), the queries are themselves graphs. So anyway, that’s what we tried to do with the AtomSpace. Everything I talked about above works today. Worked last year. Most of it worked just fine 5 years ago. It is still a bit rough around the edges. It’s got some usability issues. There’s no pretty website (oh, a wet dream). No easy documentation. (The documentation is accurate. There’s lots of it. It’s almost unreadable. I know; I wrote it.) But we’re working on it. Come help us out.



Symbolic and Neural Nets: Two Sides of the Same Coin

Deep learning and neural nets are all the rage, today, and have displaced symbolic AI systems in most applications. It’s commonly believed that the two approaches have nothing to do with each other; that they’re just completely different, and that’s that.  But this is false: there are some profound similarities; they are not only variants of one-another, but, in a deep way, they have commonalities that render them essentially identical. This blog post attempts to explain how. It introduces a much longer and rather mathematically dense writeup articulating these ideas more precisely.

The clearest starting point for seeing this seems to be natural language, where neural net methods have made great strides, but have not surpassed traditional (symbolic) linguistic theory. However, once this similarity is understood, it can be ported over to other domains, including deep-learning strongholds such as vision. To keep the discussion anchored, and to avoid confusing abstractions, what follows will focus entirely on linguistics; it is up to you, the reader, to imagine other, more general settings.

The starting point for probabilistic approaches (including deep learning) is the Bayesian network: a probability P(x1,x2,…,xn) of observing n events. For language, the xk are taken to be words, and n is the length of the sentence, so that P(x1,x2,…,xn) is the “probability” of observing the sequence x1,x2,…,xn of words. The technical problem with this viewpoint is the explosively large space: if one limits oneself to a vocabulary of 10 thousand words (and many people don’t) and sentences of 20 words or less, that’s (10^4)^20 = 10^80 ≈ 2^270 probabilities, an absurdly large number even for Jupiter-sized computers. If n is the length of this blog post, then it really seems impossible. The key, of course, is to realize that almost all of these probabilities are effectively zero; the goal of machine learning is to find a format, a representation for grammar (and meaning) that effortlessly avoids that vast ocean of zero entries.

Traditional linguistics has done exactly that: when one has a theory of syntax, one has a formulation that clearly states which sentences should be considered to be grammatically valid, and which ones not.  The trick is to provide a lexis (a lookup table), and some fairly small number of rules that define how words can be combined; i.e. arranged to the left and right of one-another.  You look up a word in the lexis (the dictionary) to find a listing of what other words are allowed to surround it.  Try every possible combination of rules until you find one that works, where all of the words can hook up to one-another. For the purposes here, the easiest and the best way to visualize this is with Link Grammar, a specific kind of dependency grammar. All theories of grammar will constrain allowed syntax; but Link Grammar is useful for this blog post because it explicitly identifies words to the left, and words to the right.  Each lexical entry is like a template, a fill-in-the-blanks form, telling you exactly what other words are allowed to the left, and to the right, of the given word.  This left-right sequence makes it directly comparable to what neural-net approaches, such as Word2Vec or SkipGram do.

What does Word2Vec do? Clearly, the 2^270 probabilities in a twenty-word sentence are overwhelming; one obvious simplification is to look only at N-grams: that is, to only look at the closest neighboring words, working in a window that is N words wide. For N=5, this gives (10^4)^5 = 10^20 ≈ 2^65, which is still huge, but is bearable. When scanning actual text, almost all of these combinations won’t be observed; this is just an upper bound. In practice, a table of 5-grams fits in latter-day computer RAM. The statistical model is to map each N-gram to a vector, use that vector to define a Boltzmann distribution (P=exp(v⋅w)/Z), and then use gradient ascent (hill-climbing) to adjust the vector coefficients, so as to maximize the probability P.

How are Word2Vec and Link Grammar similar? Well, I hope that the above description of Link Grammar already planted the idea that each lexical entry is a lot like an N-gram. Each lexical entry tells you which words can appear to the right, and which words to the left, of a given word. It’s a bit less constrained than an N-gram: there’s no particular distance limitation on the dependency links. It can also skip over words: a lexical entry is more like a skip-gram. Although there is no distance limitation, lexical entries still have a small-N-like behavior, not in window size, but in attachment complexity. Determiners have a valency of 1; nouns a valency of 2 or 3 (a link to a verb, to an adjective, to a determiner); verbs a valency of 2, 3 or 4 (subject, object, etc.). So lexical entries are like skip-grams: the window size is effectively unbounded, but the size of the context remains small (N is small).

Are there other similarities? Yes, but first, a detour.

What happened to the 2^270 probabilities? A symbolic theory of grammar, such as Link Grammar, is saying that nearly all of these are zero; the only ones that are not zero are the ones that obey the rules of the grammar. Consider the verb “throw” (“Kevin threw the ball”). A symbolic theory of grammar effectively states that the only non-vanishing probabilities are those of the form P(x1,x2,…,xk=noun,…,xm=throw,…,xp=object,…,xn) and that all the others must be zero. Now, since nouns make up maybe 1/2 of all words (the subject and object are both nouns), this constraint eliminates 10^4 × 10^4 / (2×2) ≈ 2^24 possibilities (shrinking 2^270 to 2^(270−24) = 2^246). Nothing to sneeze at, given that it’s just one fairly simple rule. But this is only one rule: there are others, which say things like “singular count nouns must be preceded by a determiner” (so, “the ball”). These constraints are multiplicative: if the determiner is missing, then the probability is exactly zero. There are only a handful of determiners, so another factor of 10^4 ≈ 2^13 is vaporized. And so on. A relatively small lexis quickly collapses the set of possibilities. Can we make a back-of-the-envelope estimate? A noun-constraint eliminates half the words (leaving 5K of 10K possibilities). A determiner constraint removes all but 10 possibilities. Many grammatical classes have only a few hundred members in them (“throw” is like “hit”, but is not like “smile”). So, realistically, each xk will have from 10 to 1000 possibilities, maybe about 100 possibilities “on average” (averages are dicey to talk about, as these are Zipfian distributions). Thus there are only about 100^20 = 10^40 ≈ 2^130 grammatically valid sentences that are 20 words long, and these can be encoded fairly accurately with a few thousand lexical entries.

In essence, a symbolic theory of grammar, and more specifically, dependency grammars, accomplish the holy grail of Bayesian networks: factorizing the Bayesian network. The lexical rules state that there is a node in the network, for example, P(x1,x2,…,xk=noun,…,xm=throw,…,xp=object,…,xn), and that the entire sentence is a product of such nodes. The probability of “Kevin threw the ball” is the product

P(x1=Kevin, x2=verb, …, xn) × P(x1, x2, …, xk=noun, …, xm=throw, …, xp=object, …, xn) × P(x1, x2, …, xk=the, …, xm=noun, …, xn) × P(x1, x2, …, xn=ball)

Stitch them all together, you’ve got a sentence, and its probability. (In Link Grammar, −log P is called the “cost”, and costs are additive, for parse ordering.) To be explicit: lexical entries are exactly the same thing as the factors of a factorized Bayesian network. What’s more, figuring out which of these factors come into play in analyzing a specific sentence is called “parsing”. One picks through the lookup table of possible network factors, and wires them up, so that there are no dangling endpoints. Lexical entries look like fragments of a graph: a vertex, and some dangling edges hanging from the vertex. Pick out the right vertexes (one per word), wire them together so that there are no dangling unconnected edges, and voilà! One has a graph: the graph is the Bayesian network. Linguists use a different name: they call it a dependency parse.

The Word2Vec/SkipGram model also factorizes, in much the same way! First, note that the above parse can be written as a product of factors of the form P(word|context), the product running over all of the words in the sentence. For a dependency grammar, the context expresses the dependency relations. The Word2Vec factorization is identical; the context is simpler. In its most naive form, it’s just a bag of the N nearest words, ignoring the word-order. But the word-order is ignored for practical reasons, not theoretical ones: it reduces the size of the computational problem; it speeds convergence. Smaller values of N mean that long-distance dependencies are hard to discover; the skip-gram model partly overcomes this by keeping the bag small, while expanding the size of the window. If one uses the skip-gram model, with a large window size, and also keeps track of the word-order, and restricts to low valencies, then one very nearly has a dependency grammar, in the style of Link Grammar. The only difference is that such a model does not force any explicit dependency constraints; rather, they are implicit, as the words must appear in the context. Compared to a normal dependency grammar, this might allow some words to be accidentally double-linked, when they should not have been. Dependency grammar constraints are sharper than merely asking for the correct context. To summarize: the notions of context are different, but there’s a clear path from one to the other, with several interesting midway points.
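
In symbols, the factorization common to both approaches is

P(x1, x2, …, xn) = P(x1|context(x1)) × P(x2|context(x2)) × ⋯ × P(xn|context(xn))

where context(xk) stands for whichever notion of context the model uses: the dependency relations, for Link Grammar, or the skip-gram window, for Word2Vec/SkipGram.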

The next obvious difference between Link Grammar and Word2Vec/SkipGram is the mechanism for obtaining the probabilities. But this is naive: in fact, they are much more similar than it first appears. In both systems, the starting point is (conceptually) a matrix, of dimension W×K, with W the size of the vocabulary, and K the size of the context. In general, K is much, much larger than W; for Word2Vec, K could get as large as W^N for a window size of N, although, in practice, only a tiny fraction of that is observed. Both Link Grammar and Word2Vec perform approximate matrix factorization on this matrix. The way the approximations are done is different. In the case of Word2Vec, one picks some much smaller dimension M, typically around M=200, or maybe twice as large; this is the number of “neurons” in the middle layer. Then, all of W is projected down to this M-dimensional space (with a linear projection matrix). Separately, the K-dimensional context is also projected down. Given a word, let its projection be the (M-dimensional) vector u. Given a context, let its projection be the vector v. The probability of a word-in-context is given by a Boltzmann distribution, as exp(u⋅v)/Z, where u⋅v is the dot product and Z is a scale factor (called the “partition function”). The basis elements in this M-dimensional space have no specific meaning; the grand-total vector space is rotationally invariant (only the dot product matters, and dot products are scalars).

The primary task for Word2Vec/SkipGram is to discover the two projection matrices. This can be done by gradient ascent (hill-climbing), looking to maximize the probability. The primary output of Word2Vec is the two projection matrices: one that is W×M-dimensional, the other that is M×K-dimensional. In general, neither of these matrices is sparse (that is, most entries are non-zero).

Link Grammar also performs a dimensional reduction, but not exactly by using projection matrices. Rather, a word can be assigned to several different word-categories (there are de facto about 2300 of these in the hand-built English dictionaries). Associated with each category is a list of dozens to thousands of “disjuncts” (dependency-grammar dependencies), which play a role analogous to “context”. However, there are far, far fewer disjuncts than there are contexts. This is because every (multi-word) context is associated with a handful of disjuncts, in such a way that each disjunct stands for hundreds to as many as millions of different contexts. Effectively, the lexis of Link Grammar is a sparse C×D-dimensional matrix, with C grammatical categories, and D disjuncts, and most entries in this C×D-dimensional matrix being zero. (The upper bound on D is L^V, where L is the number of link types, and V is the maximum valency – about 5 or 6. In practice, D is in the tens of thousands.) The act of parsing selects a single entry from this matrix for each word in the sentence. The probability associated with that word is exp(−c), where c is the “cost”, the numerical value stored at this matrix entry.

Thus, both systems perform a rather sharp dimensional reduction, to obtain a much lower-dimensional intermediate form. Word2Vec is explicitly linear; Link Grammar is not, exactly. However (and this is important, but very abstract) Link Grammar can be described by a (non-symmetric) monoidal category. This category is similar to that of the so-called “pregroup grammar”, and is described in a number of places (some predating both Link Grammar and pregroup grammar). The curious thing is that linear algebra is also described by a monoidal category. One might say that this “explains” why Word2Vec works well: it is using the same underlying structural framework (monoidal categories) as traditional symbolic linguistics. The precise details are too complex to sketch here, and must remain cryptic, for now, although they are open to those versed in category theory. The curious reader is encouraged to explore category-theoretic approaches to grammar, safe in the understanding that they provide a foundational understanding, no matter which detailed theory one works in. At the same time, the category-theoretic approach suggests how Word2Vec (or any other neural-net or vector-based approach to grammar) can be improved upon: it shows how syntax can be restored, with the result still looking like a funny/unusual kind of sparse neural net. These are not conflicting approaches; they have far, far more in common than meets the eye.

A few words about word-senses and semantics are in order. It has been generally observed that Word2Vec seems to encode “semantics” in some opaque way, in that it can distinguish different word-senses, based on context.  The same is true for Link Grammar: when a word is used in a specific context, the result of parsing selects a single disjunct. That disjunct can be thought of as a hyper-fine grammatical category; but these are strongly correlated with meaning.  Synonyms can be discovered in similar ways in both systems: if two different words all share a lot of the same disjuncts, they are effectively synonymous, and can be used interchangeably in sentences.

Similarly, given two different words in Word2Vec/SkipGram, if they both project down to approximately the same vector in the intermediate layer, they can be considered to be synonymous.  This illustrates yet another way that Link Grammar and Word2Vec/SkipGram are similar: the list of all possible disjuncts associated with a word is also a vector, and, if two words have almost co-linear disjunct vectors, they are effectively synonymous.  That is, disjunct-vectors behave almost exactly like neuron intermediate-layer vectors.  They encode similar kinds of information into a vector format.

This is also where we have the largest, and the most important difference between neural net approaches, and Link Grammar. In the neural net approach, the intermediate neuron layer is a black box, completely opaque and un-analyzable, just some meaningless collection of floating-point numbers. In Link Grammar, the disjunct vector is clear, overt, and understandable: you can see exactly what it is encoding, because each disjunct tells you exactly the syntactic relationship between a word, and its neighbors.   This is the great power of symbolic approaches to natural-language: they are human-auditable, human understandable in a way that neural nets are not. (Currently; I think that what this essay describes is an effective sketch for a technique for prying open the lid of the black box of neural nets. But that’s for a different day.)

The arguments presented above are worked out in greater detail, with all the mathematical trimmings, in this linked PDF.  The commonality, by means of category theory, is explored in a different paper, on sheaves.

Many thanks and appreciation to Hanson Robotics for providing the time to think and write about such things. (Much much earlier and kludgier versions of these ideas have been seen in public, in older robot demos.)
