The OpenCog AtomSpace is a typed graphical distributed in-RAM knowledgebase. The meaning of all those buzzwords is that it is trying to fix many different, related problems that computers (and people) have when working with data. This blog-post discusses some basic issues of working with knowledge, and the various ways that people have tried to work around these issues over the decades. It will illuminate why graph databases are explosively popular these days, and will also show how that’s not the end of the story. I’m trying to start the next chapter already.
But first, let’s start at the very beginning. Everyone calls it “data” instead of “knowledge” because everyone is a bit embarrassed to admit that they don’t really know how to stick “knowledge” onto a computer. The 1970s saw the invention and development of relational databases and their mathematical underpinnings, the relational algebra. This formalized the idea of “is-a”, “has-a”, “is a part of”, “inherits from”, “is-similar-to” and whatever other relation you could possibly think of. Not a bad start for representing knowledge. The API for this relational knowledge eventually congealed into SQL.
A lot of people don’t like SQL. Those three letters make them run in the opposite direction. What’s wrong with SQL? Many things; here’s one: you have to put all your data into tables. You have to invent the table first, before you can use it. You have to plan ahead. Compare this to DataLog, the data-only subset of ProLog. Here, you could declare any fact, any statement, any relation at any time, without prior planning. Great, right? What’s wrong with that? Well, two things: a lack of indexes for quickly finding the data you want, and some trouble searching for it. Indexes are great: you just jump to where you want, instead of having to crawl all over your immense dataset. Another problem was the lack of a uniform query language that ran in some predictable, non-surprising amount of time.
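To make the index point concrete, here is a minimal Python sketch. It is not DataLog and not any real database API; the triple layout and the names `facts`, `scan` and `lookup` are assumptions invented for illustration. Facts are declared free-form, with no table designed up front, and are then found either by crawling everything or by jumping through an index.

```python
from collections import defaultdict

# Free-form facts as (relation, subject, object) triples, declared at any
# time with no schema. All names here are illustrative assumptions.
facts = [
    ("is-a", "cat", "mammal"),
    ("is-a", "dog", "mammal"),
    ("has-a", "cat", "tail"),
    ("part-of", "tail", "cat"),
]

def scan(relation, subject):
    """Find objects by crawling over every fact: cost grows with the dataset."""
    return [o for (r, s, o) in facts if r == relation and s == subject]

# An index keyed on (relation, subject) lets us jump straight to the answer.
index = defaultdict(list)
for r, s, o in facts:
    index[(r, s)].append(o)

def lookup(relation, subject):
    """Find objects with one hash lookup, regardless of dataset size."""
    return index.get((relation, subject), [])
```

Both functions return the same answers; the difference is that `scan` touches every fact, while `lookup` pays for its speed with the extra memory and bookkeeping of the index.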
Sometimes structure is nice. For example (a bad example?) with XML, you can also declare knowledge in a free-form fashion: put a tag here, a tag there, anywhere, you’re all good. Mark it up with your markup language. The people who want some structure in their life use XML schemas, which constrain some of this free-form, anything-goes attitude. The people who want to treat XML as a large distributed database have a query language to search XML from all over the web: XQuery, the XML query language. We should be in the seventh heaven: world-wide searchable free-form data? What can possibly be wrong with that?
Well, one thing is that you don’t write XML Schemas in XML; you write them in this other weird language. You don’t write XQuery in XML; it’s got its own new, weird and different syntax and structure. It’s a lot like what’s wrong with SQL: an SQL query is not actually a row or a table. It’s so very different from being a “row in a table” that it’s hard to even imagine that. This is one thing that ProLog got right: a ProLog query to find data stored in ProLog is written in ProLog. There’s no clear distinction between the data itself, and the way you work with it. Yes, that’s a good thing. A very good thing, actually.
These various stumbling blocks have driven the explosive popularity of graph databases. You get to write free-form declarations of relationships, abandoning the strictures of tables. You get something a bit more modern than DataLog, which is a bit old and set in its ways: the syntax isn’t liberating. You get something that is infinitely more readable and writable than XML. And, with the well-designed graph databases (I’m thinking of you, grakn.ai) you get to avoid a lot of arcane markup and obtuse documentation. At least in the beginning. Graph databases are the second coming. We should be in heaven by now, right? What could possibly be wrong with graph databases?
Well, in most cases, graph queries (expressed in graph query languages) are not themselves actually graphs. They still sit “outside the system”, like SQL, but unlike ProLog. But when you represent your query as a graph itself, you can do neat things. You can go recursive. You can query for queries. You can find, edit and update queries with other queries. Create new ones on the fly. Even go backwards: given an “answer”, you can ask “what are the questions (the queries) that would return this answer?” This is actually a big deal in chatbot systems, where a human says something, and the chatbot has to search for all patterns (all templates) that match what was said. Amazingly, pretty much no general-purpose database can do this, even though it’s foundational for certain algorithms.
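Here is a toy Python sketch of what “queries are data” buys you. It is not the AtomSpace API; the nested-tuple representation, the `?variable` convention and the names `match`, `run` and `queries_returning` are all assumptions made up for this example. Queries live in the same store as the facts, so one query can find other queries, and you can go backwards from an answer to the queries that would return it.

```python
def is_var(x):
    """Variables are strings starting with '?' (an illustrative convention)."""
    return isinstance(x, str) and x.startswith("?")

def match(pattern, term, bindings=None):
    """Return a dict of variable bindings if pattern matches term, else None."""
    bindings = dict(bindings or {})
    if is_var(pattern):
        if pattern in bindings and bindings[pattern] != term:
            return None
        bindings[pattern] = term
        return bindings
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        if len(pattern) != len(term):
            return None
        for p, t in zip(pattern, term):
            bindings = match(p, t, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None

# Facts and queries sit side by side in one store.
store = [
    ("likes", "joe", "pizza"),
    ("likes", "ann", "sushi"),
    ("query", ("likes", "?who", "pizza")),
    ("query", ("likes", "joe", "?what")),
    ("query", ("hates", "?who", "?what")),
]

def run(pattern):
    """Run a query: collect the bindings for every store item that matches."""
    return [b for item in store if (b := match(pattern, item)) is not None]

def queries_returning(answer):
    """Go backwards: which stored queries would match this answer?"""
    stored = run(("query", "?q"))          # a query that finds queries
    return [b["?q"] for b in stored if match(b["?q"], answer) is not None]
```

Asking `queries_returning(("likes", "joe", "pizza"))` hands back the two “likes” patterns and skips the “hates” one, which is the chatbot template search in miniature. A real system would index the patterns rather than loop over all of them.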
Working directly with queries is a big deal in rule engines, where each rule is effectively a query, and you want to chain them together. But first you have to find the rule. How do you do that, if you can’t query your queries? Of course you can hack it. But seriously? Hack? And then there’s theorem-proving and inferencing, where the reasoning/inferencing has to pick and choose through a collection of rules to find which ones might be attachable next. And sometimes, you want to learn new rules: statistically, by counting, or randomly, by trial and error and cross-over (aka “genetic algorithms”). Perhaps by hill-climbing or gradient-descent. If you want to learn a new query, it’s best if you can store the query itself in your database.
There’s also a different problem. Graph databases don’t have a type system. It turns out that types are incredibly useful for structuring your data. It’s the foundation-stone of object-oriented programming. A Java class is a type. A Python class/object is a type. C++ classes are types. For a while, object-oriented databases were all the rage. This is in contrast to SQL, which has a poor type system with only a handful of types: A table. A row in a table. An entry in a row in a table. Which could be a string, or a number. That’s pretty much it. Compare that to every object you’ve ever known.
But there’s something way cooler about types that the Java and Python crowd has missed, but the functional-programming crowd (Scala and Haskell) has grokked: type constructors. The true geeks have gone even farther: proofs-as-programs, aka the Curry-Howard correspondence. All the hip theorem provers (Agda and Coq) are dependent on types. What makes type constructors neat? Well, you can write programs to compute types, but unlike ordinary programs, they are guaranteed to terminate. This means that many basic tasks become simple, doable, predictable.
Like what? Well, reasoning, inference, learning. The sorts of things you want your AI/AGI system to do. It’s not an accident that theorem provers are fundamentally tangled with type theory. Blecch, you might think, but reasoning and theorem proving are kind-of the same thing. Who needs all that when we’ve got deep-learning neural nets? OK, off-topic: my earlier blog post dealt with how neural nets and symbolic reasoning are a lot more similar than they are different. The similarities are a bit hidden, but they are there, and clear, once you see them. Anyway, off-topic, but when deep-learning hits a wall, and it will, it will be because it lacks any notion of types. Not that it matters just right now: neural nets already lack a formal relational database structure. You can’t ask a neural net “How many sales has Joe Smith made this month?”
Here’s my sales pitch: you want a graph database with a sophisticated type system built into it. Maybe you don’t know this yet. But you do. You will. You’ll have trouble doing anything reasonable with your knowledge (like reasoning, inferencing and learning) if you don’t. This is why the OpenCog AtomSpace is a graph database, with types.
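To give a flavor of what “a graph database with types” might mean, here is a small Python sketch. It is an illustration only, not how the AtomSpace implements its type system; the type names, the signature table and the helpers `is_subtype` and `make_link` are hypothetical. Every node carries a type, types form a hierarchy, and each kind of edge declares which node types it will accept.

```python
# A tiny type hierarchy: each type names its parent (None for the root).
# All type names are illustrative assumptions.
TYPE_PARENT = {
    "Node": None,
    "Concept": "Node",
    "Person": "Concept",
    "Number": "Node",
}

# Each link (edge) type declares the node types it accepts, by position.
LINK_SIGNATURE = {
    "Inheritance": ("Concept", "Concept"),
    "Salary": ("Person", "Number"),
}

def is_subtype(node_type, ancestor):
    """Walk up the hierarchy to see whether node_type descends from ancestor."""
    while node_type is not None:
        if node_type == ancestor:
            return True
        node_type = TYPE_PARENT[node_type]
    return False

def make_link(link_type, *nodes):
    """Build a link over (type, name) nodes, rejecting ill-typed arguments."""
    signature = LINK_SIGNATURE[link_type]
    if len(nodes) != len(signature):
        raise TypeError(f"{link_type} takes {len(signature)} nodes")
    for (node_type, name), expected in zip(nodes, signature):
        if not is_subtype(node_type, expected):
            raise TypeError(
                f"{link_type} expects {expected}, got {node_type} {name!r}")
    return (link_type, *nodes)
```

A `Person` is accepted wherever a `Concept` is wanted, so `make_link("Inheritance", ("Person", "joe"), ("Concept", "employee"))` goes through, while trying to attach a salary to a plain `Concept` raises a `TypeError` before the bad edge ever reaches the graph.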
There’s one more thing that the AtomSpace does, that no other graph database does (well, there are several things that no one else does, but this is the only one I want to talk about now). They’re called “Values”. Turns out there’s a cost to storing a graph: you have to know which edge is attached to what vertex, if you are going to walk the graph. You have to keep an index to track edges (I can hear the boss now: “You’re telling me we have five guys named Joe Smith in Sales? Which one did we promote?”) Bummer if your database can’t keep track of who’s who. But sometimes you don’t need to stick every last piece of trivia in an index, to place it where it’s instantly searchable. Sometimes, a key-value store is enough. This is kind-of the NoSQL ethos: Got the key? Get the value; don’t need a query to do that.
Every vertex, every edge in the AtomSpace has a key-value database built into it. It’s small, it’s lightweight. It enables fast-changing data without the overhead of indexing it. It takes up less RAM, literally. We’re experimenting with using it for streaming data: video streams, audio streams. Hooking them up the way that TensorFlow might hook up data with Keras. It’s pretty neat. You can stick truth values, attention values; anything changing quickly over time, into Values. Any floating point number. And if you have a number that almost never changes (like Joe Smith’s salary), and you really do want to query it (so you can pay him), then just store that number in the graph database (and not in the per-edge key-value mini-database). You have a choice. Speed when you need it, searchability where you need it.
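The trade-off can be sketched in a few lines of Python. This is a toy model, not the AtomSpace’s actual data structures; the classes `Atom` and `Graph`, the `by_type` index and the `"attention"` key are assumptions chosen for illustration. The graph keeps an index so that atoms can be found by search, while each atom carries a plain dictionary for fast-changing values that nobody needs to search on.

```python
from collections import defaultdict

class Atom:
    """A graph vertex or edge carrying its own small key-value store."""
    def __init__(self, atom_type, name):
        self.atom_type = atom_type
        self.name = name
        self.values = {}   # unindexed: cheap to overwrite, not searchable

class Graph:
    """Holds atoms and maintains an index so they can be found by type."""
    def __init__(self):
        self.atoms = {}
        self.by_type = defaultdict(set)   # the index: costs RAM and upkeep

    def add(self, atom_type, name):
        key = (atom_type, name)
        if key not in self.atoms:
            self.atoms[key] = Atom(atom_type, name)
            self.by_type[atom_type].add(key)
        return self.atoms[key]

graph = Graph()
joe = graph.add("Person", "Joe Smith")

# Fast-changing data goes into the per-atom values: no index is touched.
for sample in (0.1, 0.4, 0.9):
    joe.values["attention"] = sample
```

Overwriting `joe.values["attention"]` is a single dictionary store, however large the graph gets. Anything you want to find by query belongs in the indexed graph structure instead.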
Well, that’s it for now. The OpenCog AtomSpace is an in-RAM distributed typed graph knowledgebase. I guess I forgot to talk about “distributed”. It is; run the demo. You can share data across many different machines. It’s in-RAM, which mostly means that you can work with things at more-or-less the speed of accessing RAM. Unlike SQL, it has a rich type system. Unlike SQL or pretty much any graph database (but like ProLog), the queries are themselves graphs. So anyway, that’s what we tried to do with the AtomSpace. Everything I talked about above works today. Worked last year. Most of it worked just fine 5 years ago. It is still a bit rough around the edges. It’s got some usability issues. There’s no pretty website (oh, a wet dream). No easy documentation. (The documentation is accurate. There’s lots of it. It’s almost unreadable. I know; I wrote it.) But we’re working on it. Come help us out.