Meaning-Text Theory

During some recent reading, it struck me that a useful framework for thinking and talking about sentence generation is the MTT, or “meaning-text theory”, of Igor Mel’čuk et al. Here is one readable reference:

Igor A. Mel’čuk and Alain Polguère, (1987) “A Formal Lexicon in Meaning-Text Theory”, Computational Linguistics, vol. 13, pp. 261-275.

portal.acm.org/citation.cfm?id=48160.48166
www.aclweb.org/anthology/J/J87/J87-3006.pdf

Within the context of that theory, the output of the Stanford parser is strictly at the SSynR or “surface syntactic representation” level, while, as a general rule, RelEx attempts to generate the DSynR or “deep syntactic representation” structure. Some of what I’ve been trying to do with OpenCog is aimed at the “SemR” structure, as described in that paper.

The more I read about MTT, the more it seems to capture some of what we are trying to do (and de facto are doing) with NLP within OpenCog. In particular, the MTT concept of a “lexical function” (which, as far as I can tell, is not really described in that paper) could be a particularly strong way of guaranteeing correct syntactic output for segsim, nlgen or NLGen2.

– Linas Vepstas

Posted in Theory | Leave a comment

An Update

Time that we post a status update!

OpenCog has been a little quieter than usual over the last couple of months. The developers list is still sporadically active, but some of the main developers are having to spend time on other work-related projects, meaning less AGI-driven focus (want to change that? donate here). We’re pursuing several options for establishing further funding for the end of 2009 and through 2010, but we’ll see how that goes.

Instead of writing a long summary post, I’ll just give some bullet points:

  • Dr. Ben Goertzel spoke on building beneficial AGI at the Singularity Summit last month (video here).
  • Cassio Pennachin and Dr. Joel Pitt attended the GSoC Mentor’s Summit at the Googleplex in Mountain View, which led to meeting FOSS developers from around the world. This also allowed them to meet up with Moshe Looks (MOSES and PLOP author) for dinner and discussions around AGI with a foray into Newcomb’s Paradox.
  • Dr. Linas Vepstas released RelEx 1.2.1, an affiliated OpenCog project, along with the related project/dependency Link Grammar 4.6.5.

I’m sure there are other items of note, so to the other contributors reading this, please feel free to comment and I’ll update this post ;-)

Posted in Development | Leave a comment

Semantic dependency relations

I spent the weekend comparing the Stanford parser to RelEx, and learned a lot. RelEx really does deserve to be called a “semantic relation extractor”, and not just a “dependency relation extractor”. It provides a more abstract, more semantic output than the Stanford parser, which sticks very narrowly to the syntactic structure of a sentence.

I wrote up a few paragraphs on the most prominent differences; most of my updates were to the RelEx dependency relations page.

Here are the main bullet points:

  • RelEx attempts basic entity extraction, and thus avoids generating nn noun modifier relations for named entities.
  • RelEx will collapse the object and complement of a preposition into one. Stanford will do this for some, but not all relationships.
  • RelEx will convert passive subjects into objects, and instead indicate passiveness by tagging the verb with a passive tense feature.
  • RelEx avoids generating copulas, if at all possible, and instead indicates copular relations as predicative adjectives, or in other ways.
  • RelEx extracts semantic variables from questions, with the intent of simplifying question answering. For example, “Where is the ball?” generates _pobj(_%atLocation, _$qVar) _psubj(_%atLocation, ball), which can then pattern-match a plausible answer: _pobj(under, couch).
  • RelEx attempts to extract comparison variables.

It’s also clear to me that I could split up the RelEx processing into two stages: one which generates Stanford-style syntactic relations, and a second stage that generates the more abstract stuff. This might be a wise move … Since RelEx is already more than 3x faster than the Stanford parser, this could attract new users.

– Linas Vepstas

Posted in Design, Development, Documentation, Theory | Leave a comment

Sentence Patterns

I’ve recently resumed work on the question-answering chatbot, and am trying to get it to comprehend a broader range of questions and statements. The “big idea” is to create a number of “sentence patterns” that the pattern matcher can recognize and respond to. The reason this is a “big” idea is that I am trying to avoid anything algorithmic or procedural: everything is to be done by specifying OpenCog hypergraphs, and NOT by writing C++ code, or scheme code (or python code… etc.). The reason for working entirely with patterns and hypergraphs, rather than with C++ or scheme, is that this puts the “knowledge” of the system into a form that AI routines can manipulate: learning algos can learn new hypergraphs; statistical algos can gather usage information on which hypergraphs get triggered, and so on. This is all easier said than done: although I’ve eliminated a fair amount of question-answering code previously written in C++, I’ve also had to write some new scheme code. Bummer. :-(

Pattern matching is now used throughout the OpenCog NLP pipeline, although not in a unified manner. The Link Grammar parser uses patterns (called “disjuncts”) to determine how the words in a sentence can link to one another, thus “parsing”, or pulling the grammatical structure out of a sentence (this paper provides an excellent overview). The RelEx dependency relation extractor applies patterns to the Link Grammar output to extract syntactic relations. For example, the sentence “John threw a rock” becomes

_obj(throw, rock)
_subj(throw, John)

after RelEx gets done with it. And now, there are a dozen patterns inside of OpenCog that can pick out certain kinds of questions and statements from RelEx output, and pattern-match questions to find answers to them.

For example, the new OpenCog patterns convert “The capital of France is Paris” into

capital_of(France, Paris)

and similarly, “What is the capital of France?” into

capital_of(France, what)

Treating “what” as a variable, there is yet another pattern that matches up the form of the question to the form of the answer, thus deducing that “what” must be “Paris”.
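The matching step described above can be sketched in a few lines of Python. This is only an illustration, not the OpenCog pattern matcher: the tuple representation, the `VARIABLES` set and the `match` function are all assumptions made for this example.

```python
# Illustrative sketch only: relations are plain (predicate, arg1, arg2) tuples,
# and a small set of question words is treated as variables. None of these
# names come from OpenCog itself.
VARIABLES = {"what", "who", "where"}

def match(question, fact):
    """Return the variable bindings if `fact` answers `question`, else None."""
    if len(question) != len(fact):
        return None
    bindings = {}
    for q_term, f_term in zip(question, fact):
        if q_term in VARIABLES:
            # A variable may bind to anything, but must bind consistently.
            if bindings.get(q_term, f_term) != f_term:
                return None
            bindings[q_term] = f_term
        elif q_term != f_term:
            # Constants must agree exactly.
            return None
    return bindings

facts = [("capital_of", "France", "Paris"), ("capital_of", "Italy", "Rome")]
question = ("capital_of", "France", "what")

answers = [b["what"] for b in (match(question, f) for f in facts) if b is not None]
print(answers)  # ['Paris']
```

The second fact is rejected because the constant “France” does not agree with “Italy”, so only “Paris” is bound to “what”.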

Somewhat harder is using patterns to distinguish similar from dissimilar concepts, so that sentences like “John threw a green ball” aren’t used as answers to questions such as “Did John throw a red ball?”: the word “ball” with modifier “green” has to be detected as a different entity than the word “ball” with modifier “red”: these are two different entities (called “semes” in the code). In fact, out of laziness, I’ve punted on this one: the promotion of word-instances to “semes” is done by code, rather than by pattern matching. But soon, I hope, this will change. In the meantime, the README file provides a more detailed discussion.
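One way to picture the distinction is to treat a seme as a head word plus its set of modifiers. The sketch below is a toy model under that assumption: the `Seme` type, the `satisfies` rule (a queried seme is satisfied when its modifiers are a subset of the candidate’s) and the `who_threw` helper are all hypothetical, and are not how the OpenCog code actually promotes word-instances.

```python
from typing import NamedTuple

class Seme(NamedTuple):
    """Toy seme: a head word plus the modifiers attached to it (assumed model)."""
    head: str
    modifiers: frozenset

def seme(head, *modifiers):
    return Seme(head, frozenset(modifiers))

def satisfies(candidate, queried):
    """A candidate answers a query if the heads agree and the candidate
    carries every modifier the query asks for."""
    return candidate.head == queried.head and queried.modifiers <= candidate.modifiers

# Who threw what, using the throw-a-ball statements from this post.
statements = [
    ("John", seme("ball", "green")),
    ("Fred", seme("ball", "red")),
    ("Mary", seme("rock", "blue")),
]

def who_threw(queried):
    return sorted(who for who, thing in statements if satisfies(thing, queried))

print(who_threw(seme("ball")))         # ['Fred', 'John']
print(who_threw(seme("ball", "red")))  # ['Fred']
```

Under this rule a “green ball” never answers a question about a “red ball”, while an unmodified “ball” matches both.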

Here are some patterns that work these days:

<me>         John threw a green ball.
<me>         Fred threw a red ball
<me>         Mary threw a blue rock
<me>         who threw a ball?
<cogita-bot> Syntax pattern match found: Fred John
<me>         who threw a red ball?
<cogita-bot> Syntax pattern match found: Fred

<me>         Did Fred throw a ball?
<cogita-bot> Truth query determined “yes”: throw

<me>         Did Fred throw a red ball?
<cogita-bot> Truth query determined “yes”: throw

<me>         The color of the book is red.
<me>         What is the color of the book?
<cogita-bot> Triples abstraction found: red

<me>         the cat sat on the mat
<me>         what did the cat sit on?
<cogita-bot> Triples abstraction found: mat

And here are some that don’t yet work. “Did Fred throw a green ball?” gets no reply, because the system can’t find an answer, and doesn’t make the common-sense leap of “can’t find answer -> answer must be no”. Another common-sense problem is illustrated by “Did Fred throw a round ball?”: the system doesn’t know that balls are round, and simply assumes that a “round ball” is some special kind of “ball”. Oh well. There’s work to be done.

You can try out the chatbot yourself (when it’s up, and not broken!) on the IRC chat channel #opencog on the freenode.net chat servers.

– Linas Vepstas

Posted in Design, Introduction, Theory | 2 Comments

Frequency of grammatical disjuncts

The link-grammar parser uses labeled links to connect together pairs of words.  In order to capture the idea of proper grammatical construction, any given word is only allowed to have very specific links to its right or left: for example, verbs have their subject on the left, and an object on the right.  Link-grammar defines hundreds of different link types, and there are typically dozens or even hundreds of ways that these can attach to a word. Each allowed set of links is called a “disjunct”. So, for example:

MVp- Js+

is a disjunct that says “there must be an MVp link from this word, going to the left, and a Js link, going to the right”. This disjunct commonly connects prepositions to a verb on their left (the MV- link) and the object of the preposition on the right (the J+ link).
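To make the notation concrete, here is a small Python sketch that splits a disjunct string such as “MVp- Js+” into its connectors. It assumes only the simple form shown in this post (a link type followed by “+” or “-”, separated by spaces); the `Connector` type and `parse_disjunct` function are names invented for this illustration and are not part of the link-grammar API.

```python
from typing import NamedTuple

class Connector(NamedTuple):
    link_type: str  # e.g. "MVp" or "Js"
    direction: str  # "left" for a trailing "-", "right" for a trailing "+"

def parse_disjunct(text):
    """Split a disjunct like 'MVp- Js+' into a list of Connectors.

    Only handles the simple 'TYPE+' / 'TYPE-' form used in this post.
    """
    connectors = []
    for token in text.split():
        link_type, sign = token[:-1], token[-1]
        if not link_type or sign not in ("+", "-"):
            raise ValueError(f"not a connector: {token!r}")
        connectors.append(Connector(link_type, "right" if sign == "+" else "left"))
    return connectors

for connector in parse_disjunct("MVp- Js+"):
    print(connector.link_type, "links to the", connector.direction)
# MVp links to the left
# Js links to the right
```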

A good way to think about disjuncts is to imagine them as very fine-grained part-of-speech tags. Thus, when one sees “MVp- Js+” associated with a word, one knows not only that the word is a preposition, but even a bit more: it’s a preposition that took a singular object. Disjuncts classify words not just into crude part-of-speech categories, but into much finer ones: thus verbs are not just transitive or intransitive verbs, but might be transitive verbs that take both direct and indirect objects, or participles, etc.

Siva Reddy, a GSoC 2009 summer student, prepared a table of the frequency of occurrence of different disjuncts in a large collection of text. The top six entries are:

Ds+          950275.635843
Xp-          838569.90527
A+           616522.664867
AN+          566658.997313
MVp- Js+     563082.649325
MVp- Jp+     446487.310222

and these are exactly what one might expect:

  • Ds+ connects the determiner “the” to nouns: and of course, “the” is the most frequent word in the English language.
  • Xp- connects the period at the end of the sentence to the start of the sentence, so of course it’s frequently observed.
  • A+ connects adjectives to nouns, AN+ connects noun modifiers to nouns.
  • As noted above, MV connects verbs to modifying phrases, and J connects prepositions to objects, so that MV- J+ is the disjunct that most prepositions will get. Js connects to a singular object, Jp connects to a plural count or mass noun.

A graph of rank vs. frequency is shown below:

[Figure: Disjunct rank vs. frequency of occurrence]

As can be seen, the distribution is more or less Zipfian, with a power-law exponent of 1.5. The fact that the long tail appears to be linear indicates that grammatical construction in the English language appears to be more or less scale-free: difficult and awkward constructions are increasingly rare. The fact that the graph is not purely Zipfian, but instead has a knee for the most common grammatical connections, suggests that the most common grammatical constructions are “less common than they should be”: almost as if English speakers are resisting the use of formulaic sentence constructions. So, for example, since adjectives and noun-modifiers appear near the top of the rank, this suggests that English speakers “could have” used more adjectives and noun-modifiers, but didn’t. Quite why this is so is not clear. Perhaps the use of anaphora and references in general helps decrease the need for lots of modifiers.
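For readers who want to check an exponent like this on their own counts, one common approach is a least-squares fit of log(frequency) against log(rank). The sketch below assumes that approach; it is not necessarily how the figure above was analyzed, the function name is made up for this example, and the data is synthetic (an exact power law), not the disjunct table.

```python
import math

def fit_power_law_exponent(frequencies):
    """Estimate a in frequency ~ rank**(-a) by a least-squares line
    through (log rank, log frequency)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # the slope is negative; report the exponent as positive

# Synthetic frequencies that follow an exact power law with exponent 1.5.
synthetic = [1e6 * rank ** -1.5 for rank in range(1, 1001)]
exponent = fit_power_law_exponent(synthetic)
print(round(exponent, 3))  # 1.5
```

On real data with a knee at the top ranks, the fit would presumably have to be restricted to the tail, since the first few ranks would otherwise pull the slope down.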

The open questions are then:

  1. Why a power law of 1.5?
  2. Why is there a knee?
  3. Does this result hold for other languages?

The corpus used here consists of approximately 1 million sentences, obtained by parsing entire Wikipedia articles, Voice of America news stories, and 10 books from Project Gutenberg, including War and Peace, Jane Austen, and some scientific or medical texts.

– Linas Vepstas

Posted in Development, Theory | 4 Comments

Visualizing PLN inference

Recently Jared Wigmore, a student at Waikato University, New Zealand, created a tool for visualizing PLN inference as part of a visualisation project.

[Figure: BIT visualizer]

In my opinion, the BIT visualiser shows great promise as a tool for understanding the complexities of BIT expansion. In particular, the cross joins between sub-trees make it much clearer how sharing of sub-trees is occurring. The size of each BITNode reflects its fitness evaluation, which determines which node of the inference tree will be expanded next; this will inevitably be useful when we get to the stage of tuning the fitness heuristic.

As this is a prototype, there is plenty of scope for continued development. A couple of the many ideas that immediately come to mind are:

  • expansion of BITNodes by clicking on them (this would require OpenCog to provide an XML-RPC interface first however), and
  • thematic colouring of rules so that it’s easier to distinguish between the subtrees.

This is part of the bigger challenge of general AtomSpace visualisation. How do we convey knowledge about the processes that are going on in a digital mind to humans in a meaningful way?

Posted in Development | Leave a comment

proto-chatbot at last!

Hands-on tutorials are planned for the next month or so; we’ve already had a few on PLN, and my turn is coming up, for the OpenCog NLP pipeline. So I thought I’d wire up a cute demo for the occasion: a rough, crude IRC chatbot, “La Cogita”. It can answer simple questions about straightforward statements. Nothing fancy … it doesn’t do any reasoning at all … but it can work off of the basic syntactic structure of English sentences to find answers. Thus, for example:


<linas> Mary ate a mango
<cogita-bot> Hello linas, parsing ...
<cogita-bot> linas, you made a statement: Mary ate a mango
<linas> what did Mary eat?
<cogita-bot> Hello linas, parsing ...
<cogita-bot> linas, you asked a question: what did Mary eat?
<cogita-bot> The answer to your question is: mango

It’s meant to be a demo of the basic NLP pipeline within OpenCog. It takes input text, runs it through the Link Grammar + RelEx parser, imports the results into the OpenCog atomspace, sucks in a small common-sense database, and waits for questions to be posed. The common-sense database is derived from MIT’s ConceptNet (OpenMind/CommonSense project), and so one can have interactions like the following:


<linas> cogita-bot: what is a saxophone?
<cogita-bot> Hello linas, parsing ...
<cogita-bot> linas, you asked a question: what is a saxophone?
<cogita-bot> No answer was found to your question.
<linas> hmm
<linas> cogita-bot: what is an instrument?
<cogita-bot> Hello linas, parsing ...
<cogita-bot> linas, you asked a question: what is an instrument?
<cogita-bot> The answer to your question is: woodwind r bass harmonica An_Oboe Oboe megaphone saxophone chronometer drum scale ukulele cymbal instrument
<linas> Heh. Complete with assorted linguistic garbage :-)

You get the idea. Don’t ask it anything more complicated than the above examples: it will fail to find any answer. Again, it does no reasoning at all. It’s as thick as a brick. You can test-drive it at the #opencog channel on the freenode.net IRC network, assuming it’s not down for development.

Next up: wire in NLGen for natural-language output, and start taking baby steps in actual reasoning. Anyway, I’m pretty excited, as this means that a lot of the basic bits and pieces are working, and I can now dive into the deep end, and start working on the hard stuff.

– Linas Vepstas

Posted in Development, Meta, Theory | Leave a comment

GSoC 2009 project list

The decision process for the 2009 GSoC projects has been completed. You can read Ben’s announcement on the opencog-soc Google group.

The accepted projects are:

  • Joel Lehman – Extending MOSES to evolve Recurrent Neural Networks
  • David Kilgore – Python Interfaces For OpenCog Framework API
  • Ruiting Lian – Natural Language Generation using RelEx and the Link Parser
  • Rui Liu – Application of Pleasure Algorithm Project
  • Samir Souza – Integration of Language Comprehension with Virtual Agent Control in OpenCog
  • Siva Reddy – Statistical Learning and Refinement of RelEx Graph Transformation Rules
  • Jeremy Schlatter – Distributed and Persistent AtomSpace
  • Kemal Eren – Neurobiological data analysis in OpenBioMind
  • Xiaohui Liu – Improved hBOA by integrating the BBHC and implement the simulated annealing algorithm

More detail is available for each on the GSoC OpenCog home page.

Posted in Development | 1 Comment

OpenCog and Google Summer of Code 2009

We are happy to announce that the SIAI has been selected again this year to participate in the Google Summer of Code program as a mentoring organization. GSoC is an annual program that awards successful student contributors a 4500 USD summer stipend to work on open source and free software projects for three months. Around one thousand students worldwide participated in GSoC 2008, with eleven students working on OpenCog-related projects. Students may apply for GSoC 2009, beginning at the SIAI organization page. The student application period closes on April 3, 2009 at 19:00 UTC.

Posted in Development | Leave a comment

Distribution of Mutual Information

I’ve been playing NLP statistics games for a long time now, and got to thinking that I had no clue as to the statistical distribution of some of the things I work with.  So below follow some graphs.

[Figure: Mutual information of nearby words]

Above is a graph showing the distribution of the mutual information of word pairs that occur in the same sentence. A number of texts were analyzed, including a portion of Wikipedia, some books from Project Gutenberg, etc. A collection of all possible pairs of words was created, where each word in the pair occurs in the same sentence (with the left word of the pair having occurred in the sentence to the left of the right word in the pair). These were counted (about 10 million word pairs were observed) and their mutual information was calculated.

Mutual information is a measure of the likelihood of seeing two words occur together: thus, for example “Northern Ireland” will have a high mutual information, since the words “Northern” and “Ireland” are used together frequently.  By contrast, “Ireland is” will have negative mutual information, mostly because the word “is” is used with many, many other words besides “Ireland”; there is no special relationship between these words. High-mutual-information word pairs are typically noun phrases, often idioms and “collocations”, and almost always embody some concept (so, for example, “Northern Ireland” is the name of a place — the name of the conception of a particular country).

In mathematical terms, the mutual information of a word pair (x,y) is defined as:

M(x,y) = log_2 [ P(x,y) / ( P(x,*) P(*,y) ) ]

where P(x,y) is the probability of seeing the word pair (x,y), P(x,*) is the probability of seeing a word pair where the left word is x, and P(*,y) is the probability of seeing a word pair where the right word is y.
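As a concrete illustration of this definition, the following Python sketch computes M(x,y) from a table of ordered word-pair counts. The counts are made up for the example (they are not taken from the corpus described here), and the function name is an assumption of this sketch.

```python
import math
from collections import Counter

def mutual_information(pair_counts):
    """Compute M(x,y) = log_2 [ P(x,y) / ( P(x,*) P(*,y) ) ] for every
    observed ordered word pair."""
    total = sum(pair_counts.values())
    left = Counter()   # counts of pairs with x as the left word, for P(x,*)
    right = Counter()  # counts of pairs with y as the right word, for P(*,y)
    for (x, y), n in pair_counts.items():
        left[x] += n
        right[y] += n
    return {
        (x, y): math.log2((n / total) / ((left[x] / total) * (right[y] / total)))
        for (x, y), n in pair_counts.items()
    }

# Made-up counts, chosen only to show the sign of the result.
pair_counts = Counter({
    ("Northern", "Ireland"): 8,
    ("Northern", "lights"): 2,
    ("Ireland", "is"): 1,
    ("Ireland", "has"): 9,
    ("it", "is"): 40,
    ("this", "is"): 40,
})

mi = mutual_information(pair_counts)
print(mi[("Northern", "Ireland")] > 0)  # True: the pair is seen more often than chance
print(mi[("Ireland", "is")] < 0)        # True: "is" mostly follows other words
```

In this toy table “is” appears as the right word in 81 of 100 pairs, but follows “Ireland” only once in ten occurrences, so the pair scores below zero.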

The graph shows M(x,y) on the horizontal axis, and the probability of seeing such a value of M on the vertical axis. This is a bin-count of the distribution of possible values of mutual information, over all word pairs.  This is *NOT* a scatterplot of M(x,y) vs. P(x,y).

Here’s another graph: same as above, except that this time, only pairs of words that occur immediately next to one another are considered. The sample size is much smaller: only about 2.4M word pairs were collected.

[Figure: Mutual Information of Neighboring Word Pairs]

The blue and green exponential lines are located in exactly the same place as in the previous graph. The distribution is humped in a different way than in the previous graph. What is the shape of this hump? Are the slopes characteristic, or do they vary from one corpus sample to another? If anyone knows the answers to these questions, please let me know!

Notice the peaks off to the right, at high MI values, in the first graph. I think these are word pairs which are heavily used (topics/terms that are discussed) in one single contributing text, but in none of the others. That’s the hypothesis; I don’t know for sure.

Here is a more detailed discussion, with many other additional figures.

– Linas Vepstas

Posted in Development, Theory | 3 Comments