OpenCog and Google Summer of Code 2009

We are happy to announce that the SIAI has been selected again this year to participate in the Google Summer of Code program as a mentoring organization. GSoC is an annual program that awards successful student contributors a 4500 USD summer stipend to work on open source and free software projects for three months. Around one thousand students worldwide participated in GSoC 2008, with eleven students working on OpenCog related projects. Students may apply for GSoC 2009, beginning at the SIAI organization page. The student application period closes on April 3, 2009 at 19:00 UTC.

Posted in Development | Tagged , | Leave a comment

Distribution of Mutual Information

I’ve been playing NLP statistics games for a long time now, and got to thinking that I had no clue as to the statistical distribution of some of the things I work with.  So below follow some graphs.

Mutual information of nearby words

Mutual information of nearby words

Above is a graph showing the distribution of the mutual information of word pairs that occur in the same sentence. A number of texts were analyzed, including a portion of Wikipedia, some books from project Gutenberg, etc. A collection of all possible pairs of words was created, where each word in the pair occurs in the same sentence (with the left word of the pair having occurred in the sentence to the left of the right word in the pair). These were counted — about 10 million word pairs were observed — and their mutual information was calculated.

Mutual information is a measure of the likelihood of seeing two words occur together: thus, for example “Northern Ireland” will have a high mutual information, since the words “Northern” and “Ireland” are used together frequently.  By contrast, “Ireland is” will have negative mutual information, mostly because the word “is” is used with many, many other words besides “Ireland”; there is no special relationship between these words. High-mutual-information word pairs are typically noun phrases, often idioms and “collocations”, and almost always embody some concept (so, for example, “Northern Ireland” is the name of a place — the name of the conception of a particular country).

In mathematical terms, the mutual information of a word pair (x,y) is defined as:

M(x,y) = log_2  P(x,y) / P(x,*) P(*,y)

where P(x,y) is the probability of seeing the word pair (x,y), P(x,*) is the probability of seeing a word pair where the left word is x, and P(*,y) is the probability of seing a word pair where the right word is y.

The graph shows M(x,y) on the horizontal axis, and the probability of seeing such a value of M on the vertical axis. This is a bin-count of the distribution of possible values of mutual information, over all word pairs.  This is *NOT* a scatterplot of M(x,y) vs. P(x,y).

Here’s another graph: same as above, except that this time, only pairs of words that occur immediately next to one-another are considered.  The sample size is much smaller: only about 2.4M word-pairs were collected.

Mutual Information of Neighboring Word Pairs

Mutual Information of Neighboring Word Pairs

The blue and green exponential lines are located in exactly the same place as in the previous graph. It’s humped in a different way than the previous graph. What is the shape of this hump?  Are the slopes characteristic, or do they vary from one corpus sample to another?  If anyone knows the answers to these questions, please let me know!

Notice the peaks off to the right, at high MI values, in the first graph. I think these are word pairs which are heavily used (topics/terms that are discussed) in one single contributing text, but in none of the others. That’s the hypothesis, I don’t know.

Here is a more detailed discussion, with many other additional figures.

– Linas Vepstas

Posted in Development, Theory | Tagged , , , , , , | 2 Comments

OpenCog at KiwiFoo

I’m currently at the tail end of KiwiFoo, a version of O’Reilly’s Foocamp in New Zealand. I hosted a talk about OpenCog which, as inevitably happens, turned into some interesting philosophical discussions about learning vs. memory.

I was also interviewed for the TVNZ7 show Media7 about OpenCog, so hopefully it will show up on national TV (even if it’s just in NZ ;) ).

Posted in Events | Leave a comment

Visualization with UbiGraph

Jared Wigmore has just finished implementing a prototype connection to the Ubigraph dynamic visualisation tool. It’s really neat! Currently available only in his personal bzr branch ( lp:~jared-wigmore/opencog/misc ), but it should be pushed to staging eventually.

This follows on from the visualisation stuff I did for Ben’s presentation at the Singularity Summit demonstrating first order PLN inference on word pairs.

(I’m going to try to make more regular posts here, even if they are short posts like this one).

Ubigraph demo

Posted in Development | Leave a comment

Determining word senses from grammatical usage

I’ve recently been tinkering with a mechanism for determining word senses based on their grammatical usage.  This has me pretty excited, because, so far, it seems to be reasonably accurate (i.e. not terrible), and lightning-fast.  I’m doing this by doing some heavy statistical NLP work, computing statistical correlations between word senses and syntax — specifically, link-grammar disjuncts.

The basic observation driving this work is that fairly often, one can identify the meaning of a word (or at least narrow it down) simply by observing how it is used in a sentence. To do this, one needs accurate, fine-grained syntactical information about a sentence: and Link Grammar provides an abundance of it. Link Grammar generates very fine-grained syntactic linkage information for every word in a sentence. Every dictionary word is associated with dozens, or even hundreds, of different linkage patterns, or ‘disjuncts’. These disjuncts indicate how a word in a sentence can be connected to other words to its left or right.  One may imagine that these disjuncts provide highly detailed grammatical information about a word. For example, they not only distinguish between a noun and a verb; they not only distinguish between present, past and future tenses of a verb, they not only distinguish between a transitive and an intransitive verb, but they also distinguish between a wide variety of relationships that are so fine that most are not even given formal names by linguists. This fine-grained information is a rich source of information, and is ideal for correlating with word senses.

For word-sense tagging, it turns out that simply picking the Most Frequent Sense (MFS) already gives a rather strong indication of what the correct word-sense assignment should be — it gives the correct answer about half the time– See [Mi04] below. So the idea here is that supplementing the MFS with additional data– how the word was used syntactically — will improve accuracy even further. That is, the idea is to compute the “MFS for a given syntactic context”.

Before one can perform statistical correlations between syntax and sense, one must first assign a sense to each word.  This is, of course, the really hard part to modern NLP. For now, I’m taking a fairly easy, straightforward, yet strong approach to this: I’m using  an algorithm due to Rada Mihalcea[Mi05].  This algorithm is currently, as far as I know, the most accurate algorithm known for tagging words with word senses. Unfortunately, it is also fairly slow and CPU intensive, making practical deployment tricky.  It performs this tagging by associating, with each word in a sentence, a list of its possible senses.  Then, senses of nearby words are linked together with a similarity measure. This forms a network, a graph, over the sentence, with vertices being word-senses, and edges being weighted by the similarity between senses. Such a network is formally a Markov chain, and can be solved as such. There are many ways of solving a Markov chain; Mihalcea proposes, and I’ve implemented, the Google (Page-Brin) page-rank algorithm[PB]. The result is that each vertex (each word-sense) is assigned a probability; The highest-probability senses do indeed appear to be the linguistically correct senses most of the time; the correct sense will almost always appear in the top three probabilities.

Once a word sense is identified, it can be correlated with the Link Grammar disjunct in play for the particular sentence.  This is done simply by processing a lot of sample text. A database then stores a frequency count for each (word, disjunct, word-sense) triple. After a reasonable amount of data is accumulated the unconditional probability p(w,d,s) can be calculated (w==word, d==disjunct, s==sense), and, from this, various marginal and conditional probabilities and entropies. To make use of this information in a new sentence, one first parses the sentence using the Link Grammar parser, thus obtaining (word,disjunct) pairs. It is then a straightforward (and fast) database lookup to obtain the conditional probability p(s|w,d), the probability of observing the sense s given the pair (w,d).  The exciting result of this effort is that, quite often, the conditional probability p(s|w,d) identifies one sense more or less uniquely (i.e. there is one sense for which p(s|w,d) is about 1).

Although this result is quite exciting, its based on the inspection of a small handful of nouns and verbs. I believe that the result holds well in a broad setting, but I don’t have any quantitative measure for the extent of the setting. Clearly, there will be *some* words for which the sense will be obvious from the grammatical usage. But, on average, how many of these are there per sentence?  In some cases, the sense won’t be unique, but there will be many senses that are ruled out. How often does this happen?  Is it possible that the accuracy results are equal to, or even improve, on the Mihalcea accuracy results? (They may improve on them by averaging over and eliminating false-positives, eliminating them because of the various different semantic contexts a word might appear in).  A quantitative measure of the recall and accuracy can, in principle, be done, as there is a database (the SemCor database) of text that has been hand-annotated with the correct senses.  I’ve not yet given any serious thought to performing this quantitative analysis.

Still, I’m pretty excited. It seems to work pretty well; I like that. I’ve already roughed in some basic infrastructure into the Link Grammar parser so that it will return the sense tags for each parse, assuming you have the database installed.  The tags returned are WordNet 3.0 sense keys — strings like “run%2:38:04::” which can be used to look up specific senses from WordNet.

To expose this function, database support has been added to the link-grammar parser.  This has been added to the parser itself, as opposed to a layer built on top of it, because database support is needed for other reasons — specifically, for parse ranking (Gee, I haven’t talked about parse ranking, have I?).  The database support is provided by sqllite[SQLLITE]. This was picked for two reasons: (1) its license is public domain, and is thus compatible with the link-grammar BSD license, and (2) it is an embedded database, requiring zero administration by the user. This second point is quite unlike traditional SQL databases, which typically require trained database administrator to configure and operate. One reason that zero administration is possible is because the database is used in a read-only fashion: the data it holds is static. Code integrating this database is in the link-grammar SVN repository now, and will be available in version 4.4.2.

Creating the dataset is a good bit tricker. Currently, the Mihalcea algorithm is implemented within OpenCog.  Was this a good technology choice? I dunno, but it seemed like a reasonable experiment at the time. Parsed sentences are fed to OpenCog, where word senses are assigned, and then frequency counts are updated.  I’ve been feeding it a diet consisting solely of parsed Wikipedia articles — not very healthy, but maybe OK for now.  The Markov chain network used to solve for word senses is four sentences wide, as a window sliding across an article. That is, a given word sense is influenced by other words occurring in sentences as far as four sentences away.  This should keep accuracy up, without bogging down in solving a Markov chain across an entire article. The current OpenCog implementation uses the Page-Brin PageRank algorithm; however, I’m thinking tht it might be faster simple to use the linpack subroutine library to solve for the eigenvectors directly. (The Page-Brin algorithm shows its power when the Markov chain has  billions or trillions of nodes, e.g. as used by Google.  By contrast, with a four-sentence sliding window, the Markov matrix connects at most thousands of senses, and thus should be rapidly solvable by ordinary linear equation techniques.)

So far, I’ve put a few CPU-months of data crunching into this. Its not much. It’s slow; it takes minutes per sentence — and this gives only one word-sense, disjunct pair. To build up a reasonable statistical dataset will require cpu-hours or more per word — and there’s not really all that many cpu-hours in a cpu-month. So its slow slogging. The  database coverage remains quite thin, as most disjuncts have been observed only a handful of times. There are maybe about a thousand or so words for which an ‘adequate’ amount of statistics have been collected; I’d like it better if the  database was deep enough to cover at least 20K words and 100K senses.  Since the preliminary results look so promising, I’m corssing my fingers, and am slowly been tinkering with ways to improve performance.

I’ll let you know …
References
==========
[Mi04] Rada Mihalcea, Paul Tarau, Elizabeth Figa, “PageRank on Semantic Networks, with Application to Word Sense Disambiguation” (2004) COLING ’04: Proceedings of the 20th international conference on Computational Linguistics DOI
[Mi05]  Rada Mihalcea, “Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data  Labeling”, (2005) HLT ’05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing DOI

[SQLLITE] http://www.sqlite.org/

[PB] http://en.wikipedia.org/wiki/PageRank

– Linas Vepstas

Posted in Design, Theory | Tagged , , , , | 6 Comments

Fun with first-order inference

Joel Pitt has done some experiments testing first-order PLN inference in OpenCog, on some very simple data.

These experiments don’t use the indefinite probability formulas but rather the good old fashioned SimpleTruthValue PLN formulas.

What they involve is using PLN to extrapolate indirect word associations, from direct words associations mined from text (by some statistical text mining software created for OpenCog by Linas Vepstas).

This obviously does not stress the generality of PLN as an inference framework (no VariableNodes! no quantifiers! no intension! no fuzzy MemberLinks!).  There is nothing particularly revolutionary AI-wise here … it’s just some fairly straightforward, state-of-the-art statistical NLP … Hebbian learning on a neural net, among many other techniques, could do basically the same thing … but this is a reasonable “smoke test” of the ability to load a bunch of nodes and links into OpenCog and perform some basic inference processes on them.  One nice point about PLN is that it can handle relatively simple, associative-neural-netty stuff like this, as well as more complex reasoning involving variables and quantifiers and such, all seamlessly within the same mathematical, conceptual and software approach.

The reason I decided to write a blog post on this is that Joel produced some nifty pictures based on his work, using the open-source graph visualization package Tulip.

Here is a big nasty network of nodes and links in OpenCog, before inference:

Here is the same network, after some first-order PLN inference, with the inferred links in green:

Obviously the above don’t tell you too much.  Tulip was configured so that nodes representing more greatly similar words (in terms of their statistical association) would generally be placed closer together in the visualization.  Slightly more insight is given by zooming in, using Tulip, to see some of the nodes and links close up.  Again, the links in green are the products of inference:

Note that in the immediately above example, to build the associative link between “foreign” and “administration” requires the system to make two inferences in sequence:

Posted in Development | Tagged , , | 1 Comment

OpenCog tutorial sessions

A few weeks back Ben announced he’d be running IRC tutorial sessions on OpenCogPrime. Last night was the second tutorial, and was on the topic of knowledge representation – introducing people to the basic concepts of the AtomSpace, such as Atoms, Nodes, and Links and how various types of each represent things in OCP. If you missed out, there are logs linked from the wiki for both sessions (and future session should also end up logged and available from the wiki).

Lastly, Ben also donned a wizard’s hat for the event:

Posted in Theory | Tagged , , | Leave a comment

Progress update

[cross posted from The Singularity Institute Blog]

This blog post constitutes an update on the current state of work on the OpenCog open-source AI project.

No particular event occasioned me writing the post — no dramatic milestone has been reached — it just seemed like a good time for an update, as a lot of things are going on and not many people know about most of it.

While the OpenCog project is still at an early stage, progress has been exciting on a variety of fronts. After reviewing the work that’s been done and is underway, I’ll make a few comments on where I hope things will be by the end of the year, OpenCog-wise … and then (always future-oriented!) look ahead a bit to next year and beyond.

On the most practical front, Gustavo Gama, on the SIAI/OpenCog team, has been working on getting the core OpenCog Framework codebase in shape for an official release later this fall. The codebase has already been opened up to the public and made available to developers interested in participating in the early-stage development of the platform; the official release will signify that the code is ready for a wider variety of developers to participate, including those who want to use OpenCog as a platform for their own work rather than contributing to the development of the framework.

On the theoretical side, Dr. Ben Goertzel, SIAI Director of Research, released a wikibook comprising the equivalent of several hundred pages, outlining a detailed and specific design for an advanced AGI system based on the OpenCog framework. This design is called OpenCog Prime and is heavily inspired by the Novamente Cognition Engine. A weekly series of online tutorial sessions on OpenCogPrime is being offered, the first one of which was held on September 10, and scheduled to continue until early 2009.

Dr. Joel Pitt, on the SIAI/OpenCog team, has been working on adding AI functionality to OpenCog, with the OpenCogPrime design as well as more general AGI utility in mind. A port of the Probabilistic Logic Networks framework from the proprietary Novamente Cognition Engine codebase into OpenCog is underway and should be completed by mid-fall. Also, earlier this year Joel successfully implemented an initial version of an artificial-economics-based system for the allocation of attention within OpenCog, and leadership on the development of this code has now been taken over by Dr. Matthew Ikle’ from Adams State College.

Dr. Linas Vepstas, a Novamente LLC researcher, has integrated a number of natural language processing tools into OpenCog, based on various statistical algorithms, as well as the Carnegie-Mellon link parser and the related RelEx language processing framework. This code provides powerful mechanisms for turning English sentences into logical relationships residing in OpenCog’s knowledge base.

Novamente LLC researchers Dr. Predrag Janicic and Dr. Nil Geissweiller, working with Google researcher Dr. Moshe Looks, have integrated a new version of the MOSES probabilistic evolutionary learning framework (initially described in Dr. Looks’ 2006 PhD thesis, available at metacog.org) into OpenCog.

A team of Novamente LLC engineers led by Cassio Pennachin (and including Welter Silva, Carlos Lopes and Samir Araujo), in collaboration with Jani Pirkola and others on the RealXTend team, have been working on porting to OpenCog a substantial amount of Novamente LLC code concerned with the control of intelligent agents in 3D virtual worlds. Scheduled for completion in October, this project will initially result in an OpenCog system capable of controlling intelligent, adaptive virtual dogs in the open-source online world RealXTend (a modification of the OpenSim codebase, which began as an open-source analogue to the proprietary Second Life virtual world platforms). However it is intended for extensibility beyond virtual dogs to enable the general OpenCog-based control of virtual agents in virtual worlds.

Two Chinese PhD students, Rui Liu at Wuhan University and Lian Ruiting at Xiamen University, have been working with Novamente LLC and SIAI on creating OpenCog-based code for natural language generation: that is, for taking knowledge in an OpenCog knowledge base and translating it into English. A prototype system exists that works for simple sentences, and Ruiting is currently figuring out how to generalize it.

Last but definitely not least, 11 interns were funded to work on OpenCog for Summer 2008 via the Google Summer of Code project (thanks, Google!!). Their projects covered a variety of areas, and some were dramatically successful. To name just two among many deserving examples: Filip Maric designed and implemented a new approach to grammar parsing based on Boolean satisfaction, and integrated it with the Carnegie-Mellon link parser which is integrated into OpenCog’s RelEx natural language processing framework; and Cesar Maracondes refactored the implementation of key portions of the Probabilistic Logic Networks framework. A full recounting of the GSoC work may be found at http://brainwave.opencog.org/2008/09/, which links to another page that in turn links to individual student project pages.

There are a lot of things going on with OpenCog and I’m aware I’ve left some interesting things out … but I hope I’ve given a reasonable overall flavor of the state of progress.

Where do we hope to be by the end of the year? An officially-released OpenCog Framework … with PLN, MOSES, attention allocation, virtual-agent control in Multiverse and RealXTend, and basic natural language comprehension and generation integrated. All these things are underway, under intense development, and not too far from completion. With all these things in place, we’ll be poised for a really exciting 2009 for OpenCog.

With luck, during 2009 we will make serious progress toward creating a “virtually embodied artificial infant” (which may take humanoid, animal or virtual-robot form: that’s not the point) based on the OpenCogPrime design, and will also see a variety of other AGi approaches implemented within OpenCog by a diversity of researchers.

Also, the vague and sketchy OpenCogPrime roadmap recently posted is scheduled to be turned into a more thorough and precise document with careful attention to evaluation and metrics at each envisioned stage.

As the roadmap indicates, there is a clear and definite plan in place, that seems to have a plausible chance of leading from the current early stage of OpenCog development through a series of more and more advanced stages, defined loosely by reference to human childhood cognitive development (although the specifics of OpenCogPrime and OpenCog more generally are not closely tied to human biopsychology, nevertheless it seems that human development may be used as a rough analogue for the developmental progress of virtually or physically embodied AI systems based on OpenCogPrime and some other OpenCog-based designs).

This year the focus is on getting the basic AI mechanisms in place … next year (though for sure more work on AI mechanisms will continue!) we hope to segue into more of a focus on artificial baby-building … and with hard work and just a little luck, a few years down the road we’ll have a robust and highly intelligent OpenCog-based AGi dude on our hands. But I don’t want to diverge too far onto the glorious future … I’ve written enough on that already elsewhere … the point of this post was supposed to be to summarize the state of current progress. So, that’s all for now!

“May you live in interesting times.” ;-)

Posted in Development, Meta | Tagged , | Leave a comment

OpenCog Google-Summer-Of-Code Roundup

This summer OpenCog was chosen by Google to participate in the Google Summer of Code project: Google funded 11 students from around the world to work on OpenCog coding projects under the supervision of experienced mentors associated with the OpenCog project, and the associated OpenBiomind project

Applying for GSoC was David Hart’s idea originally; and, David and the Singularity Institute did a lot of the work needed to make it happen — so I need to extend very hearty thanks to the both of them, as all in all it worked out wonderfully well.

There were plenty of ups and downs over the summer, but overall the GSoC projects went extremely well and a lot of fabulous work got done.  Furthermore, a number of the projects are going to be continued during the fall and beyond, either via students continuing them as course or thesis projects, or via students continuing to work on them in their spare time … and in one case, via a student being funded to continue the project by a commercial organization interested in using their OpenCog work.

OpenCog is a large AI software project with hugely ambitious goals (you can’t get much more ambitious than “creating powerful AI at the human level and beyond”) and a lot of “moving parts” — and the most successful OpenCog GSoC projects seemed to be the ones that successfully split off “summer sized chunks” from the whole project, which were meaningful and important in themselves, and yet also formed part of the larger OpenCog endeavor … moving toward greater and greater general intelligence.

Should OpenCog be chosen to participate in GSoC next year, I believe the projects will take a quite different flavor because OpenCog will be more mature then: I would hope to see more 2009 OpenCog GSoC projects involving the integrated functionality of the OpenCog system.  But this year OpenCog is young so the best approach was to have students work on various important pieces of the overall system, and that’s what happened, generally to quite good effect.

This page

http://opencog.org/wiki/GSoCProjects2008

contains brief summaries of the projects that were done, and links to Web pages, blogs and code repositories allowing you to dig in more detail into the work if you’re interested.  Here I’ll just give an extremely high level summary.

Many of the projects were outstanding but perhaps the most dramatically successful (in my own personal view) was Filip Maric’s project (mentored by Predrag Janicic) which involved pioneering an entirely new approach to natural language parsing technology.  The core parsing algorithm of the link parser, a popular open-source English parser (that is used within OpenCog’s RelEx language processing subsystem), was replaced with a novel parsing algorithm based on a Boolean satisfaction solver: and the good news is, it actually works … getting the best parses of a sentence faster than the old, standard parsing algorithm; and, most importantly, providing excellent avenues for future integration of NL parsing with semantic analysis and other aspects of language-utilizing AI systems.  This work was very successful but needs a couple more months effort to be fully wrapped up and Filip will continue working on it during September and October.

Cesar Maracondes, working with Joel Pitt, made a lot of progress on porting the code of the Probabilistic Logic Networks (PLN) probabilistic reasoning system from a proprietary codebase to the open-source OpenCog codebase, resolving numerous software design issues along the way.  This work was very important as PLN is a key aspect of OpenCog’s long-term AI plans.   Along the way Cesar helped with porting OpenCog to MacOS.

There were two extremely successful projects involving OpenBiomind a sister project to OpenCog:

  • Bhavesh Sanghvi (working with Murilo Queiroz) designed and implemented a Java user interface to the OpenBiomind bioinformatics toolkit, an important step which should greatly increase the appeal of the toolkit within the biological community (not all biologists are willing to use command-line tools, no matter how powerful)
  • Paul Cao (working with Lucio Coelho) implemented a new machine learning technique within OpenBiomind, in which recursive feature selection is combined with OpenBiomind’s novel “model ensemble based important features analysis.”  The empirical results on real bio datasets seem good.  This is novel scientific research embodied in working open-source code, and should be a real asset to scientists doing biological data analysis.

Two projects dealt with improvements to OpenCog’s probabilistic program learning system: Shuo Chen (working with Moshe Looks) experimented with ways of improving the internals of the MOSES algorithms; whereas Alesis Novik (working with Nil Geissweiller) implemented an initial version of the PLEASURE algorithm, an alternative to MOSES that shares some of the latter’s code infrastructure.  Both these difficult research-coding projects yielded promising though preliminary results and will be continued into the fall.

And the list goes on and on: in this short post I can’t come close to doing justice to all that was done, but please see the above page and the links in it for more details!

Costa Ciprian worked with Boris Iordanov on designing and creating a distributed version of the HypergraphDB, a persistent store for OpenCog; and Rich Jones worked with David Hart on creating a distributed web crawler suitable for massively distributed text parsing using OpenCog’s RelEx language parser.

In a different direction, Kino High Coursey (working with Andre Senna) designed and implemented a very elegant approach for interfacing between OpenCog and online simulation worlds such as OpenSim, implementing a framework using LISP to execute OpenCog-originated actions in simulation worlds.  There is (conceptual and code-level) work to be done integrating this with other OpenCog work that involves OpenCog control of agents in simulated worlds, but Kino has introduced some excellent code and ideas into the project that is sure to be of value as things unfold.

Junfei Guo (working with Ben Goertzel) attacked a problem deep in the heart of OpenCog: mapping OpenCog’s unique AtomTable hypergraph knowledge representation into the more standard graph format used by the standard open-source Boost Graph Library.  This opened up some important new discussions regarding the extent to which various graph algorithms (applied to the graph derived from a hypergraph) can serve as heuristic approximations to less-tractable hypergraph algorithms.

Elizabeth Dawn Alpert (working with Luke Kaiser) investigated the problem of making the link parser (used within OpenCog’s RelEx language framework) better handle ungrammatical text as seen in chats, IM, Twitter and so forth.  This proved a thorny issue and the most progress was made on the level of cleaning up ungrammatical formats of individual words.

All in all, we are very grateful to Google for creating the GSoC program and including us in it.  

Thanks to Google, and most of all to the students and mentors involved.

Onward!

Ben G

Posted in Development | Tagged , , | Leave a comment

Hacking on Link-Grammar

I hack, heads-down, on link-grammar every now and then. Yesterday, I fixed another round of broken parse rules: making sure that sentences like “John is altogether amazingly quick.” “That one is marginally better” “I am done working” “I asked Jim a question” “I was told that crap, too” all parse correctly.

Solving these required adding new rules to the link-grammar dictionary: so, for example, adding the rule “O+ & {@MV+}” to the dictionary entry for “asked” — thus allowing the word “asked” to take a direct object (“…asked Jim…”) and an indirect object (“… a qeustion”).

Patching dictionary entries by hand is tedious and time consuming. The long term goal is to get to the point where the system can learn new entries on its own. So I’ve been day-dreaming about how to do that, and some of the baby-steps in that direction.

The first step is to put all of the rules into a database where they can be easily modified, added-to, and deleted. Currently, the rules all live in a file, which can only be hand-edited.

A second step is to assign probabilities to the rules: often, multiple parses are possible, yet not each rule is equally likely to lead to a correct parse. So it would be better to have a probability assigned to each rule, indicating how often it leads to a good parse. These probabilities are already needed for the GSoC project of adding a SAT solver to link-grammar. I’ve got a database of tens of millions of word pairs and link types, ready to go for this step.

A third step is to distinguish good parses from bad ones: this is the parse-ranking step: to judge parsees for the likely-hood of being good, or bad. There are superficial things one can do for parse ranking, and quite complex ones. I am hoping to someday-soon reinforce the parse-ranking scores with the results of word-sense disambiguation. Which brings us to the next point:

Step four — some grammar parsing rules are appropriate only for some senses of a word, but not for others. Link grammar already accounts for this at a rough scale: it has distinct rules for different parts of speech. These are indicated by tacking a single extra letter to the end of a word: walk.v (I walk.v) versus walk.n (I took.v a walk.n) This basic idea can be further refined, for finer divisions than just parts of speech: some word senses can only be used in certain ways, and not others. The technical problem is that 26 letters is not enough… already, link-grammar is using some 20 or so of these suffixes. So this mechanism needs to be expanded, somehow.

Step five — where teh rubber hits the road — actually learning new rules. This is much more vague; my ideas are still swimming. There are two distinct problems: new words, and fixing rules for existing words. New words should not be *much* of a problem: try to find synonyms for a word, and assume the new word can be used like the synonym. Modifying existing rules is much much harder…

… so hard, in fact, that I’m not sure I’m ready to engage in that, just yet. Some steps can be taken in that direction, by looking at minimum-spanning-tree dependency grammars (MST grammars): that is, computing the mutual information of word pairs, and comparing them to the output of link-grammar. This should suggest where new links can be created. Then comes the question: what should the appropriate link type be, for this new link? For some words, perhaps synonyms can help, but other words are so unique, that they have no synonyms: the word “to be” is so central to the English language that trying to discover information about it is very difficult: it has no synonyms, even as it has many.

I’ve got some infrastructure set up to run this last experiment: I’ve got a fairly large collection of parsed sentences, from which I can build mutual information pairs, and compare these to link-grammar parses. In fact, I’ve already done so: I use these for parse ranking. (that is, if a link-grammar parse has a high mutual information content, then I assume its a good parse).

I could turn this around: given a sentence that is parsing badly, I can look for high mutual-info parses. Perhaps broaden the coverage by comparing to parses of approximately synonymous words, perhaps reinforced by word-sense disambiguation. But this generally leads to a combinatorial explosion — which, I guess is expected. We know that general intelligence requires a lot of CPU. The trick is to have the patience to set up an experiment, run the experiment, wait for its results, and then do it again…

Enough for now. I’m off to work on “related words” — another part of the puzzle.

Posted in Design, Development, Theory | Tagged , , , , | 6 Comments