I’ve recently made some good progress on something that I’m calling “the Viterbi decoder”, a new parser for the Link Grammar natural language parser. So I guess that means it’s time to talk a bit about the why and how of this parser.
The goal of providing this decoder is to present a flexible, powerful interface for implementing high-level semantic algorithms on top of the low-level link-grammar syntactic parser, and, in particular, for steering the parse based on high-level semantic knowledge. This allows the parser to move beyond being merely a syntactic parser, and to become fully integrated with general semantic artificial intelligence.
A less abstract list of expected benefits includes:
- Incremental parsing: the ability to obtain partial results after providing partial sentences, a word at a time.
- Less sensitivity to sentence boundaries, allowing longer, run-on sentences to be parsed far more quickly.
- Mitigation of the combinatorial explosion of parses.
- Allow grammatically broken/incorrect chat dialog to be parsed; in general, to do better with slang and hip-speak.
- Enable co-reference resolution and anaphora resolution across sentences (resolve pronouns, etc.)
- Enable annotation of the parse graph with word-sense data, entity markers.
- Allow richer state to be passed up to higher layers: specifically, alternate parses for fractions of a sentence, alternative reference resolutions.
- Allow a plug-in architecture, so that plugins employing higher-level semantic (AGI) algorithms can provide parse guidance and parse disambiguation.
- Eliminate many of the hard-coded array sizes in the code.
The data structures used to implement this resemble those of the OpenCog AtomSpace. All data classes inherit from a class called Atom (which is an atomic predicate, in the sense of mathematical logic). Atoms are typed; the two core types are Links and Nodes. Thus, all data is represented in the form of a “term algebra” (aka the “Free Theory”, in the sense of model theory). This structure allows all data to be represented as (hyper-)graphs, which in turn makes graph algorithms easier to implement. All these theoretical considerations provide a natural setting for storing Viterbi state information. Put differently, this provides a generic, uniform way of holding the various partly-finished parses, and effecting state transformations on them.
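A minimal sketch of this term algebra might look as follows in C++. The class and member names here are illustrative only; they mirror the prose above, not the actual source tree.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

enum class AtomType { NODE, LINK };

// The base class: an atomic predicate, in the sense of mathematical logic.
class Atom {
public:
    explicit Atom(AtomType t) : type(t) {}
    virtual ~Atom() = default;
    AtomType type;
};

// A Node is a named leaf: a word, a connector, etc.
class Node : public Atom {
public:
    explicit Node(const std::string& n) : Atom(AtomType::NODE), name(n) {}
    std::string name;
};

// A Link holds an ordered tuple of atoms; nesting links yields the
// (hyper-)graphs in which the parse state is stored.
class Link : public Atom {
public:
    explicit Link(std::vector<std::shared_ptr<Atom>> out)
        : Atom(AtomType::LINK), outgoing(std::move(out)) {}
    std::vector<std::shared_ptr<Atom>> outgoing;
};
```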
Since all of the data is represented dynamically (at run-time) by these (hyper-)graphs composed of atoms, developing custom algorithms to manipulate the parse becomes easy: there are no strange compile-time structures to master. All algorithms can access the data in a uniform, common way.
Making the internal state directly visible allows low-level syntactic algorithms, as well as high-level semantic algorithms, to control parsing. In other words, the intended use of the Viterbi decoder is to provide a framework for parsing that should make it possible to integrate tightly (and cleanly) with high-level semantic analysis algorithms. Thus, reference and anaphora resolution can be done using the same graph structure as used for parsing; it should also allow graphical transformations, such as those currently implemented in RelEx.
One may argue that Viterbi is a more natural, biological way of working with sequences. Some experimental, psychological support for this can be found via the news story “Language Use is Simpler Than Previously Thought”, per Morten Christiansen, Cornell professor of psychology.
Currently, the parser can correctly parse many short sentences. It runs very slowly, however, as no pruning algorithms have yet been implemented. Instructions for enabling it can be found in the viterbi/README file. The code is not in the 4.7.10 tarball; you need something newer: i.e. pull from the svn source tree. It will be in 4.7.11, whenever that comes out.
Here’s an example parse of “this is a test”. First, the usual link-parser output:
```
        +--Ost---+
  +-Ss*b+   +-Ds-+
  |     |   |    |
this.p is.v a test.n
```
or, with the wall words:
```
    +--------------RW--------------+
    |             +--Ost---+       |
    +---Wd--+-Ss*b+   +-Ds-+       |
    |       |     |   |    |       |
LEFT-WALL this.p is.v a test.n RIGHT-WALL
```
The output of viterbi, with some explanatory comments, is this:
```
SEQ :                    # a sequence, an ordered set
  LING :                 # a link-grammar link; naming conflict with opencog link.
    LING_TYPE : Wd       # the type of the link connecting two words.
    WORD_DISJ :          # holds the word and the connector used
      WORD : LEFT-WALL   # all sentences begin with the left-wall.
      CONNECTOR : Wd+    # + means "connect to the right". - means left
    WORD_DISJ :
      WORD : this.p      # word with suffix as it appears in link-grammar dictionary
      CONNECTOR : Wd-
  LING :
    LING_TYPE : Ss*b     # and so on ...
    WORD_DISJ :
      WORD : this.p
      CONNECTOR : Ss*b+
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : Ss-
  LING :
    LING_TYPE : Ds
    WORD_DISJ :
      WORD : a
      CONNECTOR : Ds+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Ds-
  LING :
    LING_TYPE : Ost
    WORD_DISJ :
      WORD : is.v
      CONNECTOR : O*t+
    WORD_DISJ :
      WORD : test.n
      CONNECTOR : Os-
```
Oh, and I suppose it’s appropriate to answer the question “why is it called the Viterbi parser?” I’m calling it that because it is inspired by (and vaguely resembles) the Viterbi algorithm famous from signal processing. A characteristic feature of that algorithm is that it maintains a set of states in parallel. As each new bit is received, some of the states become inherently inconsistent (e.g. because some checksum is violated), while other new states become possible. Once a certain number of bits have been received, the ones that can be consistently interpreted within the checksum constraints can be output. The process then repeats with each new bit streaming in.
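The state-pruning idea can be illustrated with a toy sketch. The constraint used below (“no two adjacent 1s”) is an invented stand-in for a real code’s checksum; it is only meant to show states being extended in parallel and discarded as soon as they become inconsistent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each state is one candidate decoded prefix, kept in parallel with the rest.
using Bits = std::vector<int>;

// An invented stand-in for a checksum: the code forbids two adjacent 1s.
bool consistent(const Bits& b) {
    for (std::size_t i = 1; i < b.size(); ++i)
        if (b[i - 1] == 1 && b[i] == 1) return false;
    return true;
}

// On each received bit, extend every surviving state; a value of -1 marks an
// erased bit, for which both 0 and 1 are tried.  Inconsistent states die here.
std::vector<Bits> step(const std::vector<Bits>& states, int received) {
    std::vector<Bits> next;
    for (const Bits& s : states) {
        for (int b : {0, 1}) {
            if (received != -1 && b != received) continue;
            Bits t = s;
            t.push_back(b);
            if (consistent(t)) next.push_back(t);
        }
    }
    return next;
}
```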
In link-grammar, a “disjunct” can be thought of as a puzzle piece with a word printed on it. There are many different puzzle pieces with the same word on them. As each word comes in, one tries to find a piece that fits (this is like the Viterbi checksum). Sometimes more than one fits, so one has multiple ‘alternatives’ (this is like the Viterbi state-vector). The algo keeps a set of these alternatives (of assembled pieces), and, as words come in, alternatives are either discarded (because nothing fits) or elaborated on.
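Here is a toy sketch of that puzzle-piece assembly. It assumes a drastically simplified disjunct, with at most one left-pointing and one right-pointing connector; real link-grammar disjuncts carry whole lists of connectors, so this only illustrates the keep-or-discard dynamic.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A simplified disjunct: one optional left and one optional right
// connector ("" means none).  Invented for this sketch.
struct Disjunct { std::string left, right; };

// An alternative is the stack of connectors still waiting to be connected.
using Alt = std::vector<std::string>;

// As each word arrives, try all of its puzzle pieces against every
// alternative: pieces that fit elaborate it; if nothing fits, it is dropped.
std::vector<Alt> extend(const std::vector<Alt>& alts,
                        const std::vector<Disjunct>& pieces) {
    std::vector<Alt> next;
    for (const Alt& a : alts) {
        for (const Disjunct& d : pieces) {
            Alt n = a;
            if (d.left.empty()) {
                if (!n.empty()) continue;   // no left connector: start only
            } else {
                if (n.empty() || n.back() != d.left) continue;  // no fit
                n.pop_back();               // the piece snaps into place
            }
            if (!d.right.empty()) n.push_back(d.right);
            next.push_back(n);
        }
    }
    return next;
}
```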
Unlike the Viterbi algorithm, in natural language processing it is useful to keep some of these alternatives or ambiguities around until much later stages of processing, when the disambiguation can finally be performed. As a famous example: “I saw the man with the telescope” has two valid syntactic parses, and two valid semantic interpretations. Who was holding the telescope: me, or the man? Resolving this would be like applying a checksum to two different paths very late in the Viterbi game.
I like this analogy because it is vaguely biological as well: or perhaps I should say “neural net-ish”. The multiple, provisional states that are kept around are sort of like the activation states of a feed-forward artificial neural network. But this is not very deep: the feed-forward neural net looks like a Hidden Markov Model (HMM), and the Viterbi algorithm is essentially an HMM algorithm. No surprise!
But all this talk of algorithms hides the true reason for this work. The above algo is not strong enough to reproduce the old parser’s behavior: it can create islands, and it ignores post-processing. The original algorithm uses an integer-valued “cost” to rank parses; I want to replace this with floating-point values (probabilities! maximum entropy!).
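One plausible way to relate the two, an assumption on my part rather than anything the current code does, is to treat cost as a negative log-probability, so that adding costs corresponds to multiplying probabilities:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical correspondence: cost as negative log-probability.
// Zero cost means certainty; larger costs mean less likely parses.
double cost_to_prob(double cost) { return std::exp(-cost); }
double prob_to_cost(double p)    { return -std::log(p); }
```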
I also want to implement an “algorithm plug-in” API — basically, a way of offering “here’s the current state, go and modify it” — to have ‘mind-agents’, in OpenCog terminology. The above puzzle-piece assembly algo would be the first to run, but clearly others are needed, to prevent islands, or to re-order states by probability/likelihood. Some of these may be clearly distinct algos; others may end up as tangled balls of complexity. Factorization into distinct algos is clearly possible: RelEx already had a list of algos that were applied in sequential order. First, some POS tagging was done, then some head-word verb extraction, then some entity extraction, etc. Algorithms can be layered.
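The plug-in API might take a shape roughly like this; ParseState and MindAgent are invented names for the sketch, and a real state would hold the atom graphs of the partly-finished parses rather than strings.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Stand-in for the real parse state (which would hold atom graphs).
struct ParseState {
    std::vector<std::string> alternatives;
};

// The whole plug-in contract: "here's the current state, go and modify it."
class MindAgent {
public:
    virtual ~MindAgent() = default;
    virtual void run(ParseState& state) = 0;
};

// Example agent: prune everything but the top-ranked alternative.
class PruneAgent : public MindAgent {
public:
    void run(ParseState& state) override {
        if (state.alternatives.size() > 1) state.alternatives.resize(1);
    }
};

// Agents are layered: each runs in turn over the previous one's output,
// much as RelEx applied its algorithms in sequential order.
void run_pipeline(ParseState& state,
                  const std::vector<std::shared_ptr<MindAgent>>& agents) {
    for (const auto& a : agents) a->run(state);
}
```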
So really, the core issue I’m hoping to solve here is that of having a uniform development environment: link-grammar is in C, has no probability (besides cost), and no internal API. RelEx is in Java, is explicitly graphical, but is not a hypergraph, has no probabilities, and can’t provide parse feed-back to control link-grammar. RelEx output was pumped into OpenCog, which is in C++; it cannot feed back into RelEx or Link Grammar. The Link Grammar dictionaries are files: how can an automated system learn a new word, and stick it into a file?
At the moment, there aren’t really any new or novel algorithms: I’ve day-dreamed up some in the past, but the fractured approach halts progress. All these boundaries are barriers; the hope here is to get past all of them. The work is really an architectural re-design, and not a new whiz-bang algo.