Text Attribution with Link Grammar

A question was recently asked as to whether Link Grammar could be used to attribute text to a specific author. I had fun writing the reply; let me reproduce it below. It starts at square one.

Consider a police detective analyzing a threatening note. At some point in the prior centuries, it becomes common knowledge that hand-written notes are subject to forensic analysis. Criminals switch to typewriters; alas, some famous spy cases from the 1940’s are solved by linking notes to the typewriters that produced them. By the 1970’s, Hollywood shows us films with the bad guy clipping words from newspapers. Aside from looking for fingerprints left on the paper, psychological profilers look for idiosyncrasies in how the criminal expresses ideas: strange wording, odd phrases, punctuation or the lack thereof.

How about computer text? It’s well known that many people consistently mis-spell words (I consistently mis-spell “thier”) and I think there was some murder trial evidence that hinged on this. Moving into the PC era, 1980’s onwards, we get postmodernism and corpus linguistics. One of the fruits is the “bag of words” model: different texts have different ratios of words. Although the idea was spotted much earlier, computers allow it to be applied to a zillion and one problems in text classification. Basically, you have a vector of (word, frequency) pairs, and you can judge the similarity between vectors with assorted distance measures (the dot-product is very popular and also just plain wrong, but I digress.) I don’t think you’d have any particular problem with using this method to attribute a novel to James Joyce, for example.
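In case the mechanics are unfamiliar, here is a minimal sketch of the bag-of-words comparison, in Guile Scheme (the same Scheme used for the AtomSpace demos further down this page). The tokenizer is deliberately naive, the function names are mine, and the similarity shown is the popular normalized dot-product — the very measure grumbled about above.

(use-modules (srfi srfi-1))    ; for fold

;; Build a (word . count) vector from a chunk of text.
;; Tokenization is deliberately naive: lower-case, split on spaces.
(define (word-counts text)
  (fold
    (lambda (word counts)
      (let ((entry (assoc word counts)))
        (if entry
            (begin (set-cdr! entry (+ 1 (cdr entry))) counts)
            (cons (cons word 1) counts))))
    '()
    (string-split (string-downcase text) #\space)))

;; The popular (if arguably wrong) similarity measure: the cosine,
;; i.e. the normalized dot-product of the two count vectors.
(define (cosine-similarity counts-a counts-b)
  (define (norm counts)
    (sqrt (apply + (map (lambda (e) (expt (cdr e) 2)) counts))))
  (define dot
    (apply + (map (lambda (e)
                    (* (cdr e) (or (assoc-ref counts-b (car e)) 0)))
                  counts-a)))
  (/ dot (* (norm counts-a) (norm counts-b))))

;; (cosine-similarity (word-counts "the cat sat on the mat")
;;                    (word-counts "the cat lay on the rug"))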

It becomes subtle, perhaps, if the text is short: say, a letter, and you are comparing it to other letters written in the same era, written by eloquent Irishmen. The words that Joyce might use in a letter might not be the ones he’d use in a novel. It’s reasonable to expect that bag-of-words will fail to provide an unambiguous signal. How about sentence structure, then? (This is what the original question was.) Yes, I agree: that is a good way – maybe the best way(?) – of doing this (at current levels of technology). One might still expect Joyce to construct his sentences in a way that is particular to his mode of thinking, irrespective of the topic that he writes on. Mood and feeling echo on in the grammar.

So, how might this work? Before I dive into that, a short digression. Besides bag-of-words, there is also a bag of word-pairs. Here, you collect not (word, frequency) pairs, but (word-pair, frequency) pairs. One collects not just nearest-neighbor word-pairs, but all word-pairs within some window: say, of length six. The problem is that there are vast numbers of word-pairs, like “the-is” and “you-banana” – hundreds of millions. Most are junk. You can weed most of these out by focusing only on those with a high mutual information, but even so, you’re left with the problem of “overfitting”.
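A rough sketch of the word-pair bookkeeping, under the same caveats (naive tokenization, illustrative function names); the mutual information here is the usual pointwise MI — log2 of p(pair) divided by the product of the word marginals:

(use-modules (srfi srfi-1))    ; for append-map

;; All (left . right) word-pairs whose right word is no more than
;; WINDOW-1 positions away -- i.e. both words fit inside a window of
;; length WINDOW.
(define (window-pairs words window)
  (append-map
    (lambda (i)
      (map (lambda (j) (cons (list-ref words i) (list-ref words j)))
           (iota (min (- window 1) (- (length words) i 1)) (+ i 1))))
    (iota (length words))))

;; Pointwise mutual information of a pair, given its observed
;; probability and the marginal probabilities of its two words.
(define (mutual-information p-pair p-left p-right)
  (/ (log (/ p-pair (* p-left p-right))) (log 2)))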

Enter the n-gram (as in “google n-gram viewer”) or better yet, the skipgram, which is an n-gram with some “irrelevant” words omitted. Effectively all neural-net techniques are skip-gram based. To crudely paraphrase what a neural net does: as you train it on a body of text (say … James Joyce’s complete works…), it develops a collection of (skipgram, frequency) pairs, or rather, a (skipgram, weight) vector. You can then compare this to some unknown text: the neural net will act as a discriminator or classifier, telling you if that other text is sufficiently similar (often using the dot product, which is just plain… but I digress…) The “magic” of the neural net is that it figures out which skip-grams are relevant, and which are noise/junk. (There are millions of trillions of skip-grams; out of these, the neural net picks out 200 to 500 of them. This is a non-trivial achievement.)
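For concreteness, here is one way to enumerate three-word skip-grams over a window of six — just the combinatorics, not what any particular neural-net package actually does internally; the function name is mine:

(use-modules (srfi srfi-1))    ; for append-map

;; All ordered word triples that fit inside a window of length WINDOW,
;; gaps allowed: these are the 3-word skip-grams.
(define (skip-trigrams words window)
  (append-map
    (lambda (i)
      (let ((end (min (length words) (+ i window))))
        (append-map
          (lambda (j)
            (map (lambda (k)
                   (list (list-ref words i) (list-ref words j) (list-ref words k)))
                 (iota (- end j 1) (+ j 1))))
          (iota (- end i 1) (+ i 1)))))
    (iota (length words))))

;; (skip-trigrams '("I" "am" "proud" "to" "be" "an" "emotionalist") 6)
;; includes ("I" "proud" "be") and ("proud" "be" "emotionalist").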

How might this work for one of James Joyce’s letters? Hmm. See the problem? If the classifier is trained on his novels, the vocabulary there might be quite different than the vocabulary in his personal letters, and that difference in vocabulary will mess up the recognition.  Joyce may be using the same sentence constructions in his letters and novels, but with a different vocabulary in each. A skip-gram classifier is blind to word-classes: it’s blind to the grammatical constructions.  Something as basic as a synonym trips it up. (Disclaimer: there is some emerging research into solving these kinds of problems for neural nets, and I am not up on the latest! Anyone who knows better is invited to amplify!)

I’ve said before (many many times) that skip-grams are like Link Grammar disjuncts, and it’s time to make this precise. Let’s try this:

    +---->WV--->+     +-----IV--->+-----Ost-----+
    +->Wd--+-SX-+--Pa-+--TO--+-Ixt+   +--Dsu*v--+
    |      |    |     |      |    |   |         |
LEFT-WALL I.p am.v proud.a to.r be.v an emotionalist.n

Here, an example skipgram might be (I..proud..be) or (proud..be..emotionalist). A sentence like “I was immodestly an emotionalist” would be enough for a police detective to declare that Joyce wrote it. Yet, there is no skip-gram match.

Consider now the Link-grammar word-disjunct pairs. For the above sentence, here’s the complete list:

               I  == Wd- SX+
              am  == SX- dWV- Pa+
           proud  == Pa- TO+ IV+
              to  == TO- I*t+
              be  == Ix- dIV- O*t+
              an  == Ds**v+
    emotionalist  == D*u- Os-

You can double-check this by carefully looking at the diagram above; notice that “proud” links to the left with Pa and to the right with TO and IV.

The original intent of disjuncts is to indicate grammatical structure. So, “Pa” is a “predicative adjective”. “IV” links to “infinitive verb”.  As a side-effect, they work with word-classes: for example, “He was happy to be an idiot” has exactly the same parse, even though the words are quite different.

So, to finally get back to the original question of author attribution: here’s an idea, a “bag of disjuncts”. Let’s collect (disjunct, frequency) pairs from Joyce’s novels, and compare them to his letters. The motivation for this idea is that perhaps the specific vocabulary words are different, but the sentence structures are similar.
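Here’s a minimal sketch of the bookkeeping, again in Guile Scheme. It assumes something else (e.g. the link-parser) has already produced the (word, disjunct) pairs, with the disjuncts written as strings in the same form as the table above; the function name is illustrative only.

(use-modules (srfi srfi-1))    ; for fold

;; Turn a list of (word . disjunct-string) pairs into a bag of
;; disjuncts: an assoc list of (disjunct-string . count), with the
;; words themselves thrown away.
(define (disjunct-counts word-disjunct-pairs)
  (fold
    (lambda (pair counts)
      (let* ((dj (cdr pair))
             (entry (assoc dj counts)))
        (if entry
            (begin (set-cdr! entry (+ 1 (cdr entry))) counts)
            (cons (cons dj 1) counts))))
    '()
    word-disjunct-pairs))

;; The parse above contributes one count each to "Wd- SX+",
;; "SX- dWV- Pa+", "Pa- TO+ IV+", and so on.  Build one such bag from
;; the novels and another from the letter, then compare them with
;; whatever vector-similarity measure you prefer (or distrust least).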

How well does this work? I dunno. No one has ever studied this in any quantitative, scientific setting. Some failings are obvious: there is a 100% match to “He was happy to be an idiot” even though the word-choice might not be Joycean. There is a poor match to “I was immodestly an emotionalist” even though the word “emotionalist” is extremely rare, and a dead giveaway. There’s also a problem with the correspondence “immodestly” <=> “proud to be”, because “immodestly” is an adverb, not a predicative adjective, and it’s a single word, not a word-phrase. Raw, naive Link Grammar is insensitive to synonymy between word-phrases.

There is a two-decade-old paper that explains exactly how to solve the multi-word synonymous-phrases problem. It’s been done. It’s doable. I can certainly point out a half-dozen other tricks and techniques to further refine this process. So, yes, I think that this all provides a good foundation for text-attribution experiments. But I mean what I say: “experiments”. I think it could work, and I think it might work quite well. But to find out, you’d have to actually do it. Try it. It would take a goodly amount of work before any literary critic would accept your results, and even more before a judge would accept it as admissible evidence in a court of law.

As to existing software: I have a large collection of tools for counting things and pairs of things, and for comparing the similarity of vectors. Most enthusiasts would find that code unusable, until it gets re-written in Python. Alas, that is not forthcoming. If you wanted to actually do what I describe above, some very concrete plans would need to be made.

I also have this daydream about *generating text* in the style of a given author: given a corpus, create more sentences and paragraphs, in the style and vocabulary of that corpus. My ideas for this follow along similar lines of thought to the above, but this is … a discussion for some other day.


“Wave function collapse” for procedural generation

I’ve just been notified about an exciting and important algorithm for procedural generation: the “wave function collapse algorithm”, described by Marian Kleineberg in 2019. I want to talk about it here, from the viewpoint of natural language generation. It’s a kind of greedy algorithm, and it’s an entropy-minimizing algorithm, but what makes it different is that it explicitly works with “jigsaw-puzzle pieces”, and it also has a twist, a trick: it selects which slot to fill, rather than what to fill into the slot.

Re-read that last sentence, because that’s the key innovation that excites me. Everyone who has attempted to do procedural generation has encountered and struggled with the issue of combinatoric explosion.  I struggle with it in the opencog/generate code base. The existing prototype there uses an odometer-based approach to enumerate all possibilities. Since the jigsaw pieces potentially combine in a recursive fashion, the odometer will enumerate a countable infinity of possibilities. Mathematically interesting, but not very practical for AI/AGI applications.

Some background. The prior blog post: “Everything is a Network“, argues that all AGI and knowledge representation and reasoning systems can be understood in terms of jigsaw puzzle pieces. Thus, algorithms for working with the same become important.

The jigsaw puzzle piece paradigm figures prominently in the theory of Link Grammar (LG). Each natural language (English, Russian, etc.) is represented in terms of a dictionary (lexis) of pieces, each with one word written on it. Parsing a sentence is then converted into the problem of finding appropriate pieces, and assembling them, without gaps or mismatches. The result is a dependency-grammar parse (which can be converted to a head-driven phrase structure grammar (HPSG) parse, if that is what your heart desires).

Curiously, while many systems distinguish strongly between parsing and generation, this is not the case for the jigsaw assembly paradigm. So, for example, given a fixed sentence, say, “This is a test”, one first finds all jigsaw pieces (aka “disjuncts“) for each word, and then selects among them to find which ones fit. For example, the parse

    +----->WV----->+---Os---+
    +-->Wd---+-Ss--+  +--Ds-+    
    |        |     |  |     |
LEFT-WALL this.p is.v a  test.n

is assembled from five pieces:

LEFT-WALL: Wd+ & WV+;
this: Wd- & Ss+;
is: Ss- & WV- & Os+;
a: Ds+;
test: Ds- & Os-;

The plus and minus signs represent the linkage direction (think tab and slot, on a puzzle piece), while the letters represent the link type (they are different shapes for the connectors; they are type-theoretical types). Snap them all together, and bingo! A parse!

Now, let’s consider natural language generation. In this case, one merely needs to replace specific locations in a sentence with wild-cards. Rather than working with only those puzzle pieces corresponding to a particular word at that location, one can work with any puzzle pieces that might possibly fit in that location. As long as the pieces snap together, one gets a grammatically valid sentence. To find out what sentence one got, one need only read the word off the piece that was selected to fit. Voilà! Parsing and generation are the same thing in the puzzle-piece paradigm!

This is easy to say, and only a little harder to do. However, if one really wants to generate any sentence, one promptly runs into a commonplace issue: combinatorial explosion. The number of possible choices is vast: far too vast to be usable in most practical applications. And so, with that, we’ve laid the groundwork for this blog post. Sorry for the long detour.

The “wave function collapse algo”, which I would rather call the “entropy collapse algo”, might work like so. It starts with Kleineberg’s core idea, and limits it to a one-dimensional grid — a sentence, with grid locations for each word. Part of the generation task is to say “I want a sentence with words A,B,C in it, and I want that sentence to be no more than 20 words long”.

Given a partially generated sentence string, it selects the next word location to fill in. The magic of the algo is in selecting this slot. It computes the entropy for each slot, then picks the one with the lowest entropy. Explicitly, this is

S = - sum_i p_i log p_i

where p_i is the probability of word-disjunct-pair i and -log p_i is the LG cost. In LG, a “word-disjunct pair” is a “jigsaw puzzle piece”. The “cost” is something that LG uses to assign priority by mimicking entropy. The sum_i is a sum over all word-disjunct pairs that meet the current boundary condition, i.e. can connect to the neighboring connectors.

Given all possible word locations in the sentence, the algo selects one location, by being greedy: it picks the one with the smallest entropy. Having selected a location to fill in, the “entropy collapse algo” then selects something (some word) for that slot, and then repeats.
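Here’s a toy sketch of that selection step, where each open slot is represented simply by the list of LG costs of the pieces that could legally fill it (recall that cost plays the role of -log p, up to normalization). This is an illustration of the greedy choice only, and not the actual opencog/generate code:

;; Entropy of one slot: p_i is proportional to exp(-cost_i).
(define (slot-entropy costs)
  (let* ((weights (map (lambda (c) (exp (- c))) costs))
         (total   (apply + weights))
         (probs   (map (lambda (w) (/ w total)) weights)))
    (- (apply + (map (lambda (p) (* p (log p))) probs)))))

;; Greedy step: return the index of the open slot with the lowest
;; entropy.  That is the slot to fill in next.
(define (pick-slot slots)
  (let loop ((i 0) (rest slots) (best-i -1) (best-s +inf.0))
    (if (null? rest)
        best-i
        (let ((s (slot-entropy (car rest))))
          (if (< s best-s)
              (loop (+ i 1) (cdr rest) i s)
              (loop (+ i 1) (cdr rest) best-i best-s))))))

;; (pick-slot '((0.1 0.2) (0.0) (1.0 1.1 1.2)))  => 1
;; The middle slot has only one candidate piece, so zero entropy:
;; fill it first, then recompute and repeat.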

I like this algo because it explicitly avoids a combinatoric explosion, and it also explicitly selects the lowest-cost sentences from the get-go. It can generate more than one sentence, by back-tracking, and doing so will generate sentences in cost-increasing order, more or less. This makes it very nice for practical applications.

This sounds like a great way to generate sentences, until you start thinking about how to write the code, at which point, things get a little dicey. In LG, the constraints are non-local, unlike blocks-world, where all constraints are nearest-neighbor constraints. Also, unlike blocks-world, selecting a word can result in the creation of new spaces (“dark energy” for physics aficionados) — for example, given the partial sentence

   +-->Wd--+--S--
   |       |
LEFT-WALL this

it can be extended as

    +-->Wd--+--S--+--O--
    |       |     |
LEFT-WALL this   is

or as

    +-->Wd--+--S--------+--O--
    |       |    +--Em--+
    |       |    |      |
LEFT-WALL this   *     is

where the star is a new empty slot, which can be filled in with some emphasizing adverb: e.g. “this really is …” or “this surely is …” — so unlike blocks-world, the size of the LG universe can grow.

There are some other subtle points. Suppose I want to generate sentences with specific words in them, but I don’t know the word order, yet. The words might be “Ben” “eats” and “pizza” but of course, one could generate: “the pizza was eaten by Ben” or “we saw Ben eating a pizza” or “he was eating a pizza, when we last saw Ben“. Worse: we need lexical functions to provide the alternatives eats-eating-eaten. So there’s lots of interesting stuff to explore.

Nonetheless, the core idea, of assembling the jigsaw-puzzle pieces in such a way that one is working with the minimum-entropy locations first, is a fantastic idea. (Painfully obvious, in retrospect: when one assembles a real-life jigsaw puzzle, one does exactly this: one starts with the edges, because there aren’t very many of them. Then one organizes by color, and does the ones with the smallest number of choices first. And so on. In physics, there is a related idea, called “diffusion-limited aggregation”.)


Everything is a Network

The goal of AGI is to create a thinking machine, a thinking organism, an algorithmic means of knowledge representation, knowledge discovery and self-expression. There are two conventional approaches to this endeavor. One is the ad hoc assembly of assorted technology pieces-parts, with the implicit belief that, after some clever software engineering, it will just come alive. The other approach is to propose some grand over-arching theory-of-everything that, once implemented in software, will just come alive and become the Singularity.

This blog post is a sketch of the second case. As you read what follows, your eyes might glaze over, and you might think to yourself, “oh this is silly, why am I wasting my time reading this?” The reason for your impatience is that, to say what I need to say, I must necessarily talk in such generalities, and provide such silly, childish examples, that it all seems a bit vapid. The problem is that a theory of everything must necessarily talk about everything, which is hard to do without saying things that seem obvious. Do not be fooled. What follows is backed up by some deep and very abstract mathematics that few have access to. I’ll try to summon a basic bibliography at the end, but, for most readers who have not been studying the mathematics of knowledge for the last few decades, the learning curve will be impossibly steep. This is an expedition to the Everest of intellectual pursuits. You can come at this from any (intellectual) race, creed or color; but the formalities may likely exhaust you. That’s OK. If you have 5 or 10 or 20 years, you can train and work out and lift (intellectual) weights. You can get there. And so… on with the show.

The core premise is that “everything is a network” — By “network”, I mean a graph, possibly with directed edges, usually with typed edges, usually with weights, numbers, and other data on each vertex or edge. By “everything” I mean “everything”. Knowledge, language, vision, understanding, facts, deduction, reasoning, algorithms, ideas, beliefs … biological molecules… everything.

A key real-life, observed fact about the “graph of everything” is that it consists almost entirely of repeating sub-patterns. For example, “the thigh bone is connected to the hip bone” — this is true generically for vertebrates, no matter which animal it might be, whether it’s alive or dead, imaginary or real. The patterns may be trite, or they may be complex. For images/vision, an example might be “select all photos containing a car” — superficially, this requires knowing how cars resemble one another, and what part of the pattern is important (wheels, windshields) and what is not (color, parked in a lot or flying through space).

The key learning task is to find such recurring patterns, both in fresh sensory input (what “the computer” is seeing/hearing/reading right now) and in stored knowledge (when processing a dataset – previously-learned, remembered knowledge – for example, a dataset of medical symptoms). The task is not just “pattern recognition” — identifying a photo of a car — but pattern discovery: learning that there are things in the universe called “cars”, and that they have wheels and windows — extensive and intensive properties.

Learning does not mean “training” — of course, one can train, but AGI cannot depend on some pre-existing dataset, gathered by humans, annotated by humans. Learning really means that, starting from nothing at all, except one’s memories, one’s sensory inputs, and one’s wits and cleverness, one discovers something new, and remembers it.

OK, fine, the above is obvious to all. The novelty begins here: The best way to represent a graph with recurring elements in it is with “jigsaw puzzle pieces”. (and NOT with vertexes and edges!!) The pieces represent the recurring elements, and the “connectors” on the piece indicate how the pieces are allowed to join together. For example, the legbone has a jigsaw-puzzle-piece connector on it that says it can only attach to a hipbone. This is true not only metaphorically, but (oddly enough) literally! So when I say “everything is a network” and “the network is a composition of jigsaw puzzle pieces”, the deduction is “everything can be described with these (abstract) jigsaw pieces.”

That this is the case in linguistics has been repeatedly rediscovered by more than a few linguists. It is explained perhaps the most clearly and directly in the original Link Grammar papers, although I can point at some other writings as well; one from a “classical” (non-mathematical) humanities-department linguist; another from a hard-core mathematician – a category theorist – who rediscovered this from thin air. Once you know what to look for, it’s freakin’ everywhere. Say, in biology, the Krebs cycle (citric acid cycle) – some sugar molecules come in, some ATP goes out, and these chemicals relate to each other not only abstractly as jigsaw-pieces, but also literally, in that they must have the right shapes! The carbon atom itself is of this very form: it can connect, by bonds, in very specific ways. Those bonds, or rather, the possibility of those bonds, can be imagined as the connecting tabs on jigsaw-puzzle pieces. This is not just a metaphor; it can also be stated in a very precise mathematical sense. (My lament: the mathematical abstraction needed to make this precise puts it out of reach of most.)
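To make the “not just a metaphor” claim a bit more tangible, here is what a thigh-bone puzzle piece might look like, written — purely as a hypothetical illustration — in the same Section/Connector notation that the network-generation demo further down this page uses for social networks:

;; A hypothetical, illustrative "burr" or puzzle piece: a thigh bone
;; that can only mate with pieces carrying matching connectors.
(Section
   (Concept "thigh bone")
   (ConnectorSeq
      (Connector (Concept "hip joint") (ConnectorDir "*"))
      (Connector (Concept "knee joint") (ConnectorDir "*"))))

Assemble enough such pieces and you get a skeleton — or, with a different lexis, a sentence, or a metabolic pathway.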

The key learning task is now transformed into one of discerning the shapes of these pieces, given a mixture of “what is known already” plus “sensory data”. The scientific endeavor is then: “How to do this?” and “How to do this quickly, efficiently, effectively?” and “How does this relate to other theories, e.g. neural networks?” I believe the answer to the last question is “yes, it’s related”, and I can kind-of explain how. The answer to the first question is “I have a provisional way of doing this, and it seems to work“. The middle question – efficiency? Ooooof. This part is … unknown.

There is an adjoint task to learning, and that is expressing and communicating. Given some knowledge, represented in terms of such jigsaw pieces, how can it be converted from its abstract form (sitting in RAM, on the computer disk), into communications: a sequence of words, sentences, or a drawing, painting?

That’s it. That’s the meta-background. At this point, I imagine that you, dear reader, probably feel no wiser than you did before you started reading. So … what can I say to impart actual wisdom? Well, let’s try an argument from authority: a jigsaw-puzzle piece is an object in an (asymmetric) monoidal category. The internal language of that category is … a language … a formal language having a syntax. Did that make an impression? Obviously, languages (the set of all syntactically valid expressions) and model-theoretic theories are dual to one another (this is obvious only if you know model theory). The learning task is to discover the structure, the collection of types, given the language. There is a wide abundance of machine-learning software that can do this in narrow, specific domains. There is no machine-learning software that can do this in the fully generic, fully abstract setting of … jigsaw puzzle pieces.

Don’t laugh. Reread this blog post from the beginning, and everywhere that you see “jigsaw piece”, think “syntactic, lexical element of a monoidal category”, and everywhere you see “network of everything”, think “model theoretic language”.  Chew on this for a while, and now think: “Is this doable? Can this be encoded as software? Is it worthwhile? Might this actually work?”. I hope that you will see the answer to all of these questions is yes.

And now for the promised bibliography. The topic is both deep and broad. There’s a lot to comprehend, a lot to master, a lot to do. And, ah, I’m exhausted from writing this; you might be exhausted from reading. A provisional bibliography can be obtained from two papers I wrote on this topic:

The first paper is rather informal. The second invoked a bunch of math. Both have bibliographies. There are additional PDF’s in each of the directories that fill in more details.

This is the level I am currently trying to work at. I invite all interested parties to come have a science party, and play around and see how far this stuff can be made to go.

p.s. Did I forget to talk about semantics and meaning? That would appear to be the case, as these words do not appear above. Fear not! It’s all in there! You can get a glimmer of this by examining the marvelous bridge from mundane syntax to deep semantics provided by Mel’čuk’s Meaning-Text Theory. I’ve convinced myself that it goes much farther than that, but I have not the ability to convince you of it, unless you arrive at that conclusion yourself.

p.p.s. Yes, these kinds of ideas have been circulating for decades, or longer. My overt goal is to convert these ideas into working software. Towards this end, some fair amount of progress has been made; see the learn and the generate github repos. The meta-goal is to recruit enough interested parties to pursue this at something more than a snail’s pace of progress.


COVID-19 Modelling and Random Social Networks

Small random network

Seems like everyone wants to be an epidemiologist these days, so why not OpenCog? After all, diseases spread through networks, propagating from one node to the next. A network is a graph, the AtomSpace is a graph database, and the value flow subsystem provides a convenient infrastructure for propagating data (e.g. communicable diseases) from one node to the next. How hard could it be to model the progression of disease through a social network? The answer? Pretty easy.

The full working demo is here. It highlights the brand-new network generation codebase. The network generator is an AtomSpace module, specialized for generating random networks that are constrained by grammatical rules. This is in the same sense as “grammar” in linguistics, or computer science: a description of syntax. With an appropriate grammar, one can generate linguistic parse trees. But there’s no need to constrain oneself to trees: one can generate graphs with undirected edges and loops. The network generator is generic. That said, please note: this is version 0.1 of the network generator! It really is a brand-new module. It is needed for several projects, but it is not yet a complete, finished, final product.

Note also: this is a technology demo! It is not a full-scale medical epidemiology tool. It could be used for scientific research — but, for that, it would need to be extended and attached to other scientific tools, such as data-analysis tools, graphing tools, and network-visualization tools. Such tools are commonplace and abundant; we won’t dwell on them here. The demo here is a demo of network generation, and of running a state-transition machine on a network. It’s a demo of using Atomese for these sorts of computations, playing to the strengths of Atomese.

The demo consists of four parts. The first part demonstrates how to write a finite state machine in Atomese. The second demonstrates the current network generation API. Third: actually run the danged thing, and fourth: collect up and print some stats.  Before starting, though, a few words about Atomese.

Atomese

Atomese is a graphical programming language. Unlike almost all other programming languages, it was designed for automation, and not for humans. Atomese expressions are graphs; they live as graphs in the AtomSpace graph database. Atomese encodes “Abstract Syntax Trees” (see the linked Wikipedia article.) Such a representation is useful for all of the reasons that the Wikipedia article says they are. But it also means that “coding” in Atomese can sometimes feel unusual. Like: “Why can’t I just code in python? It would be so much easier!” — Well, but that misses the point. And the point is, again: Atomese was designed for automation.

What this really means is that the graphs are meant to be easy for other algorithms to manipulate. It’s designed for ease-of-use by machines, not by humans! The style is more barren, unadorned, without short-cuts. It’s a bit verbose at times. That’s OK; machines don’t mind verbosity and tedium.

Consider, for example, a graphical editor – something where you can drag-n-drop, hand-draw interacting bubble diagrams. Something easy to use – something a medical professional could get the hang of in a few hours – something where a disease model could be sketched out with bubbles and lines and flow-charts. The goal of Atomese is that it becomes easy — really easy — to convert such diagrams into executable code. That’s what Atomese is designed to do.

In this demo, the disease progression model is written in Atomese. This is the model that a graphical GUI WYSIWYG whizz-bang system would generate.

Atomese enables other, more abstract approaches. Consider an automatic model explorer: a system that automatically creates a variety of different disease-progression models, and then explores how each model works. The model generator might try to fit an existing dataset. Alternately, it could mutate models so as to obtain certain characteristics. This moves beyond just adjusting some parameters in some hand-created, hypothesized model. This is not just some Monte Carlo search or machine-learning hill-climbing for parameter tuning. This is the wholesale creation of previously non-existent code. Atomese allows software to read, understand, mutate, modify and write programs.

State transitions

There are two state transition rules. One governs the transmission of disease between pairs of individuals, and the other controls the disease progression in an infected individual. Consider first the evolution of the disease state in a single individual.

 (Cond
    (And
       (Equal (ValueOf (Variable "$A") seir-state) exposed)
       (GreaterThan
          (ValueOf (Variable "$A") susceptibility)
          (RandomNumber (Number 0) (Number 1))))
    (SetValue (Variable "$A") seir-state infected))

I’m hoping this is self-explanatory. The (Variable “$A”) is a person, an individual in the network. The seir-state is a location – a lookup key, holding the “SEIR” state: “susceptible”, “exposed”, “infected”, “recovered”. If the individual is exposed, and that individual’s susceptibility exceeds a random draw between zero and one, they become infected. All of the state transitions can be written in this form.

The above is written in scheme, but you could also use Python — the AtomSpace has Python bindings. Arguing about Python vs. something else kind-of misses the point: the above Atomese is actually a graph (a tree, in this case) that happens to encode, in its nodes and links, an abstract syntax tree (AST) specifying an executable program. Atomese allows you to work directly with the AST, whether in Scheme, in Python, or just abstractly, at a higher layer. The Wikipedia article on ASTs explains exactly why this is a useful thing to have.

The person-to-person transmission rule looks like this:

(Cond
  (And
     (Equal (ValueOf (Variable "$A") seir-state) susceptible)
     (Equal (ValueOf (Variable "$B") seir-state) infected)
     (Or 
        (And
           (Equal (Variable "$REL") (Concept "friend"))
           (GreaterThan
              (RandomNumber (Number 0) (Number 1)) 
              (Number 0.3)))
        (And
           (Equal (Variable "$REL") (Concept "stranger"))
           (GreaterThan
              (RandomNumber (Number 0) (Number 1)) 
              (Number 0.7)))))
  (SetValue (Variable "$A") seir-state exposed))

Again, this is hopefully obvious. If A is susceptible and B is infected, then maybe A becomes exposed, if A gets close enough/spends enough time with B. A passing stranger is unlikely to lead to exposure; but friends are harder to avoid and thus one is more likely to be exposed to an infected friend.

Network grammar

A single seed

The network grammar in this demo is extremely simple. It specifies two types of edges: “friend” and “stranger”, used for different types of social encounter. The grammar itself just consists of a set of “seeds” or “burrs” – a single vertex (a prototype individual) together with some half-edges on it – potential edges that could form either “friend” or “stranger” relationships.

It’s perhaps most useful to imagine each burr as a jigsaw-puzzle piece: in this case, with only two types of connectors, “friend” or “stranger”, varying in quantity from a few to many. The visuals show a seed with some attached half-edges, and the process of connecting them. It’s painfully obvious, isn’t it?

Connecting puzzle pieces

A single seed, expressed in Atomese, looks like this:

(Section
   (Concept "person-2-3")
   (ConnectorSeq 
      (Connector (Concept "friend") (ConnectorDir "*"))
      (Connector (Concept "friend") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))
      (Connector (Concept "stranger") (ConnectorDir "*"))))

It has two unconnected “friend” connectors, and three unconnected “stranger” connectors. The connectors are unpolarized – they have no directionality, so that the resulting relationships are symmetrical. That is, unlike an actual jigsaw-puzzle piece, the “*”  ConnectorDir doesn’t have a “direction”, it allows an any-to-any connection to be made.  When a connection is formed, the connector labels must match – thus, in this case, two different kinds of edges are formed in the final graph.

A pair of connected seeds

Further below is an example random network, generated by the current system. It’s long, and thin – very long and thin – as controlled by the network generation parameters.  Yes, a network of this shape is not very realistic. Exploring different network shapes is part of the fun! Adjustable parameters include what one might expect: a maximum size, number of graphs to generate, a set of starting points that must be bridged, and so on. Again: this is version 0.1; there is more to come.

Graph generation parameters are also specified in Atomese. For example, the maximum network size:

(State (Member max-network-size params) (Number 2000))

The MemberLink denotes set membership. The StateLink is a link that allows one and only one relation at a time. In this case, the MemberLink can be associated to only one Atom: the NumberNode Atom.

Initializing the network

After generating the network, one now has a number of interconnected individuals. Some initial values and state must be assigned to each.  This is straight-forward, in Atomese:

 (Bind
    (TypedVariable (Variable "$person") (Type "ConceptNode"))
    (Present (Member (Variable "$person") anchor))
    (SetValue (Variable "$person") seir-state susceptible)
    (SetValue (Variable "$person") susceptibility
       (RandomNumber (Number 0.2) (Number 0.8)))
    (SetValue (Variable "$person") infirmity
       (RandomNumber (Number 0.01) (Number 0.55)))
    (SetValue (Variable "$person") recovery
       (RandomNumber (Number 0.6) (Number 0.95))))

The BindLink is a graph-rewriting link. It searches for all graphs of a certain shape, having a variable area in them, and then takes that variable and generates some new graphs. In this particular case, the variable is obvious: the first stanza is just a type declaration for the variable. The second stanza, the PresentLink, asserts that the indicated graph must be found in the AtomSpace. It’s a very simple graph here: just a MemberLink that must be present. The remaining stanzas are executed, initializing various keys on the individual with various random values.

A very long, thin social network

Running the simulation

The spread of the disease across the network can be achieved using the same graph-query and graph-rewriting mechanism. Consider the graph query below.

 (Bind
   (VariableList
      (TypedVariable (Variable "$A") (Type "ConceptNode"))
      (TypedVariable (Variable "$B") (Type "ConceptNode")))
   (And
      (Present
         (Evaluation
            (Concept "friend")
            (UnorderedLink (Variable "$A") (Variable "$B"))))
      (Equal (ValueOf (Variable "$A") seir-state) susceptible)
      (Equal (ValueOf (Variable "$B") seir-state) infected))
   (SetValue (Variable "$A") seir-state exposed))

This locates all “friendship” graphs: one must find, present in the AtomSpace, a very simple pairwise relation between A and B, marked with the label “friend”. Habitually, EvaluationLinks are used for this purpose. Having found this relationship, and if B is infected, and A is not yet ill, nor has developed immunity (that is, if A is susceptible), then A becomes exposed. The UnorderedLink is a link-type that allows any ordering of its contents; during searches, all permutations are explored. It’s used here to encode the mutual relationship between two friends.

After running such a disease-transmission step, one then runs a state-change step, using the SEIR rules described up top. And then one repeats … looping, until the population has succumbed, or there has been a break in the transmission that keeps a portion of the population isolated and uninfected.
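The loop itself is short. The sketch below assumes the disease-transmission query above and the SEIR state-change rules from earlier have been given the (hypothetical) names disease-transmission and state-transition; each round runs one exposure step and one state-change step.

;; A sketch of the main loop (hypothetical rule names).
(define (run-simulation n-steps)
  (when (> n-steps 0)
    (cog-execute! disease-transmission)  ; friends/strangers expose each other
    (cog-execute! state-transition)      ; exposed -> infected -> recovered
    (run-simulation (- n-steps 1))))

;; (run-simulation 100)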

Printing Results

The graph query language is an all-purpose tool. Network stats are straightforward to query. Of course! Like so:

 (Get
    (TypedVariable (Variable "$indiv") (Type "ConceptNode"))
    (And
       (Present (Member (Variable "$indiv") anchor))
       (Equal (ValueOf (Variable "$indiv") seir-state) infected)))

The above just searches for all individuals, identified by their membership in a common anchor point, and then checks to see if their state is “infected”. This returns a list of all of those individuals; one need merely count to see how many of them there are.
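As a sketch of that counting step (assuming the GetLink above has been given the hypothetical name query-infected): executing the query returns a SetLink of the matching individuals, and its arity is the count.

;; Hypothetical name query-infected bound to the GetLink above.
(define infected-individuals (cog-execute! query-infected))
(display (cog-arity infected-individuals))   ; the SetLink's arity is the count
(newline)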

Putting it together

The demo is fully functional. Start a guile shell, load the demo file, and off it will go. Network generation may take anywhere from a small fraction of a second to over ten seconds; that’s the only burp. The source is heavily documented. If you get into trouble, the wiki links provide additional documentation. If Atomese is confusing, then perhaps a review of the basic AtomSpace examples is in order.

The adventurous may try translating the demo to Python. The AtomSpace has Python bindings. They work. There’s no particular reason that Atomese has to be written in scheme (although, clearly, I find it preferable: it’s easier and simpler than Python. For me, at least.)

Conclusion

The AtomSpace is an in-RAM (hyper-)graph database with a powerful associated graph query language, specifically formulated for working with large, complex graphs, for re-writing and editing those graphs, and for managing the flow of state through the graph. Modelling disease networks is one of the things it should be good at … and I hope, with this blog post, I’ve convinced you that this is indeed the case.


Value Flows

Graphs and graphical databases are now accepted as a very good (the best?) way of capturing the relationship between things. Yet, many of the “things” represented graphically are actually processes.  An example might be the water cycle in nature: rain collects into rivers and lakes, flows into oceans, and evaporates again. Rainwater erodes rocks, is absorbed into soils, is taken up by tree roots. This can be represented as a network graph, with nodes and links representing each of the relationships. To be useful, though, the network must be annotated with time-varying data: how much rain fell, how much was absorbed into the ground, and how much was run-off. These are the “values” that inhabit the network graph.

The OpenCog AtomSpace has long supported value annotation; it now has improved support for dynamically changing values. The goal of this blog post is to describe this a bit. But first, it’s important to draw a sharper distinction between graphs and values. Graphs are searchable — queryable — as typically, one wishes to find all graph patterns of a given shape, to exclude certain sub-patterns from a search, and so on. Performing such searches has a design impact on how data is represented in RAM: indexes must contain lists of nearest neighbors, so that the graph can be rapidly walked. Parallel searching suggests that lock-free operations, via immutable structures, are best. This in turn implies that graph nodes and links are fairly heavy-weight, and that updates – adding and deleting graph nodes and links – require CPU-consuming updates to various indexes. This is not particularly different from any other database: there are always indexes provided to speed up searches of a given kind, and updating those indexes takes time.

Values are different: the key idea behind Values is that they do not need to be indexed. If a rain gauge changes from 0.2 to 0.3, there is no imminent need to remove the old value from an index (a sorted, addressable, ordered data structure: a tree or hash table) and then place the new value into that index. If the time-changing value is sound (audio), or some video pixels, then indexing this data is not just absurd, but impossible in practice. Thus, as a database design, one is driven to think of two distinct classes of data: the graph data, and the ephemeral, changing values.

This is the distinction made in the OpenCog AtomSpace. Nodes and Links – the AtomSpace Atoms – are part of the graph. Each Atom has its own “little” key-value database attached to it: this holds the Values. We use a capital-V Value to distinguish the actual AtomSpace object from the general notion of values. Originally, Values were TruthValues – a binary 0 or 1, T or F, providing a crisp predicate-logic valuation for each graph node or link. The obvious modern extension is to a Bayesian network – where each TruthValue becomes a probability. For uncertain inference, it is convenient to maintain a second number: the certainty. But why stop there? Values can be anything: vectors of numbers, strings of text, records of data, trees and DAGs. Anything that might need to be attached to a graph node or link, anything that seems transient, fleeting, and does not need indexing.

Fleeting implies flowing, or changing: the metaphor is that of pipes, and fluids that flow in them. The graph is the description of how the pipes are connected to one-another. Like actual plumbing, it can be a bit laborious to assemble this. The Values are like the fluid: flowing effortlessly, now that the graph has been assembled.

But what is this like, in practice?  Suppose one has a relationship: if P and Q then R, and one wants to compute some numeric value: say, “if rain and dry soil then fraction absorbed into soil”. This is represented as a formula. Let’s dive right in. An arithmetic formula, expressed in Atomese, looks like this:

(Lambda (VariableList (Variable "$X") (Variable "$Y"))
    (Divide
        (Times (Variable "$X") (Plus (Variable "$Y") (Number 1)))
        (Number 2)))

This is painfully verbose, but you get the idea. This formula is actually a graph: this is why it is so verbose. The (Number 1) is actually a graph node, and the (Plus …) is a link encompassing the Atoms underneath it.  This is stored as a graph, for several reasons. One is that one should not assume that human beings (i.e. software programmers) are writing these formulas: they may be coming out of some machine-learning algorithm. As such, it must be possible to perform abstract symbolic manipulations on the formulas (as you might do in your favorite symbolic math system, e.g. Mathematica, Maxima, Sage,…). It must also be possible to evaluate the formulas: we’ll be plugging in actual numbers into those variables, and so the ability to compile the formula down to JIT code is desirable.

Values are kept in a key-value database, one per Atom (recall: an Atom is just a graph Node, or a Link). To access a specific value, one may write:

(ValueOf (Concept "rain") (Concept "Austin"))

The idea here being that the graph node “rain” has an associated table; one of the table entries is for “Austin” (again, a graph node!), and the value there might be a FloatValue — a list (vector) of floating-point numbers. The above is a graphical representation of how to access that Value. The inverse of this operation is the setting of Values. Putting this all together, one might have an expression like this:

(SetValue (Concept "runoff") (Concept "Austin")
   (Lambda ...)
   (ValueOf (Concept "rain") (Concept "Austin")))

The (Lambda …) here is an un-named, anonymous formula. Like all good systems, formulas can be named, and so (DefineLink (DefinedSchema “runoff calculation”) (Lambda …)) is how one names such things; the name can be used wherever the formula is expected.

That’s it! The above demonstrates how to wire up a formula, connecting it to Values held in the AtomSpace. There are two kinds of updates possible: manual, and automatic. A “manual” update requires calling cog-execute! on the particular expression to run. That expression will be evaluated, drawing on whatever Values are currently stored, and updating whatever Values are specified as output. The dynamic update is more interesting. Here, the formula is installed in such a way that any time one examines the Value, the formula is triggered, thus running and updating the Values just before they are presented. The dynamic Values are really what provides the title to this blog entry: they capture the idea of Values flowing about, through the network graph.
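Putting the pieces shown above together, a manual update of the Austin runoff might look like this — run it once, and the named formula is applied to whatever the current rain Value happens to be:

;; Manual update: recompute the runoff Value from the current rain Value.
(cog-execute!
   (SetValue (Concept "runoff") (Concept "Austin")
      (DefinedSchema "runoff calculation")
      (ValueOf (Concept "rain") (Concept "Austin"))))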

… But … dynamic Values remain a new, experimental feature of the system. There are good reasons for this. First, it is all too easy to create an infinite loop: rain runs off, evaporates, goes into clouds and comes back down as rain again. Polling the current runoff can potentially trigger an unending weather-system simulation. Worse, there is the risk of combinatoric explosion: the water in the clouds comes not only from Austin, but also from other geographical locales, which in turn come from elsewhere still. Controlling combinatorial explosion in AI is an infamously hard problem. Providing practical tools to do so, in this context, remains an open problem. (Yes, we’re working on it.)

The above used rain as a stand-in for a generic flow process. In practical machine learning, one might instead want to use something like TensorFlow, or some other neural-net package. Indeed, such packages already include small, specific graph-specification languages, describing exactly how the neural net is to be wired up, where its inputs are, and where its outputs go. But it’s a special case. Weather-simulation systems running on supercomputers also have ways of specifying what is connected to what. Then there is the Systems Biology Markup Language (SBML), which provides a way of specifying a differential equation that describes the kidney clearance rate of some blood component – or the mitochondrial metabolism rate. The goal (or rather, the dream, so far) of the OpenCog AtomSpace value-flows subsystem is to provide a common home to all of these kinds of systemic simulation processes.

This journey has hardly begun. Those who are interested are invited to try out the system.  The examples are here, in flows.scm and flow-formulas.scm, although those examples won’t make much sense if you do not already have a good grasp of the overall system: so best start at the beginning, with the README for the tutorials.  A word of caution, though: the world of the AtomSpace is perhaps shockingly different from what you are used to, and shockingly different from what the rest of the world does. This is a barrier to entry. If you are used to Java or JavaScript, and already know GraphQL or Neo4J, you might struggle to understand what is going on, here. For this, we are truly sorry. It really would be nice to make all this more comfortable, more mundane and accessible. That is a real issue.  And, as with everything, we’d love to have help with this. So … call … or write. We’re waiting on you.
