<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>OpenCog Brainwave &#187; linguistics</title>
	<atom:link href="http://blog.opencog.org/tag/linguistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.opencog.org</link>
	<description>The latest developments in building an open-source mind</description>
	<lastBuildDate>Wed, 21 Mar 2012 16:44:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Meaning-Text Theory</title>
		<link>http://blog.opencog.org/2009/11/08/meaning-text-theory/</link>
		<comments>http://blog.opencog.org/2009/11/08/meaning-text-theory/#comments</comments>
		<pubDate>Sun, 08 Nov 2009 04:58:49 +0000</pubDate>
		<dc:creator>Linas Vepstas</dc:creator>
				<category><![CDATA[Theory]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mel'cuk]]></category>
		<category><![CDATA[MTT]]></category>
		<category><![CDATA[Natural Language Generation]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[RelEx]]></category>
		<category><![CDATA[semantics]]></category>
		<category><![CDATA[Syntax]]></category>

		<guid isPermaLink="false">http://brainwave.opencog.org/?p=154</guid>
		<description><![CDATA[During some recent reading, it struck me that a useful framework for thinking about and talking about sentence generation is the MTT or "meaning-text theory" of Igor Mel'cuk, et al. Here is one readable reference:

Igor A. Mel'čuk and ...]]></description>
			<content:encoded><![CDATA[<p>During some recent reading, it struck me that a useful framework for thinking and talking about sentence generation is MTT, the &#8220;meaning-text theory&#8221; of Igor Mel&#8217;čuk <em>et al.</em> Here is one readable reference:</p>
<p>Igor A. Mel&#8217;čuk and Alain Polguère, (1987) &#8220;A Formal Lexicon in Meaning-Text Theory&#8221;, Computational Linguistics, vol. 13, pp. 261-275.</p>
<p><a href="http://portal.acm.org/citation.cfm?id=48160.48166" target="_blank">portal.acm.org/citation.cfm?id=48160.48166</a><br />
<a href="http://www.aclweb.org/anthology/J/J87/J87-3006.pdf" target="_blank">www.aclweb.org/anthology/J/J87/J87-3006.pdf</a></p>
<p>Within the context of that theory, the output of the Stanford parser is strictly at the SSynR or &#8220;surface syntactic representation&#8221; level, while, as a general rule, RelEx attempts to generate the DSynR or &#8220;deep syntactic representation&#8221; structure.  Some of what I&#8217;ve been trying to do with OpenCog is aimed at the &#8220;SemR&#8221; structure, as described in that paper.</p>
<p>The more I read about MTT, the more it seems to capture some of what we are trying to do (and de facto are doing) with NLP within OpenCog.  In particular, the MTT concept of a &#8220;lexical function&#8221; (which, oddly, is not really described in that paper) could be a particularly strong way of guaranteeing correct syntactic output for <a href="http://opencog.org/wiki/SegSim">segsim</a>, <a href="https://launchpad.net/nlgen">nlgen</a> or <a href="http://www.louisiana.edu/~bal2277/NLGen2.doc">NLGen2</a>.</p>
<p>&#8211; Linas Vepstas</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.opencog.org/2009/11/08/meaning-text-theory/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Semantic dependency relations</title>
		<link>http://blog.opencog.org/2009/10/05/semantic-dependency-relations/</link>
		<comments>http://blog.opencog.org/2009/10/05/semantic-dependency-relations/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 03:51:03 +0000</pubDate>
		<dc:creator>Linas Vepstas</dc:creator>
				<category><![CDATA[Design]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Documentation]]></category>
		<category><![CDATA[Theory]]></category>
		<category><![CDATA[dependency grammar]]></category>
		<category><![CDATA[grammar]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Rel]]></category>
		<category><![CDATA[Semantic relations]]></category>
		<category><![CDATA[Syntax]]></category>

		<guid isPermaLink="false">http://opencog.wordpress.com/?p=145</guid>
		<description><![CDATA[I spent the weekend comparing the Stanford parser to RelEx, and learned a lot.  RelEx really does deserve to be called a "semantic relation extractor", and not just a "dependency relation extractor".  It provides a more abstract, ...]]></description>
			<content:encoded><![CDATA[<p>I spent the weekend comparing the <a href="http://nlp.stanford.edu/software/lex-parser.shtml">Stanford parser</a> to <a href="http://opencog.org/wiki/RelEx_Semantic_Relationship_Extractor">RelEx</a>, and learned a lot.  RelEx really does deserve to be called a &#8220;semantic relation extractor&#8221;, and not just a &#8220;dependency relation extractor&#8221;.  It provides a more abstract, more semantic output than the Stanford parser, which sticks very narrowly to the syntactic structure of a sentence.</p>
<p>I wrote up a few paragraphs on the most prominent differences; most of my updates were to the <a href="http://opencog.org/wiki/Dependency_relations">RelEx dependency relations</a> page.</p>
<p>Here are the main bullet points:</p>
<ul>
<li>RelEx attempts basic entity extraction, and thus avoids generating nn noun modifier relations for named entities.</li>
<li>RelEx will collapse the object and complement of a preposition into one. Stanford will do this for some, but not all relationships.</li>
<li>RelEx will convert passive subjects into objects, and instead indicate passiveness by tagging the verb with a passive tense feature.</li>
<li>RelEx avoids generating copulas if at all possible, and instead expresses copular relations as predicative adjectives, or in other ways.</li>
<li>RelEx extracts semantic variables from questions, with the intent of simplifying question answering. For example, &#8220;<em>Where is the ball?</em>&#8221; generates <em>_pobj(_%atLocation, _$qVar) _psubj(_%atLocation, ball)</em>, which can then pattern-match a plausible answer: <em>_pobj(under, couch)</em>.</li>
<li>RelEx attempts to extract <a href="http://opencog.org/wiki/Comparison_variables">comparison variables</a>.</li>
</ul>
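<p>To make the question-answering bullet concrete, here is a minimal sketch of how such a pattern match could work. It is a toy matcher, not the actual RelEx or OpenCog code: the triple representation, the function names, and the relations assumed for the answer sentence &#8220;<em>The ball is under the couch</em>&#8221; are all illustrative assumptions.</p>

```python
# Toy matcher for RelEx-style relations; NOT the actual RelEx/OpenCog code.
# A relation is a (name, head, dependent) triple. Terms starting with
# "_%" or "_$" are treated as variables, mirroring _%atLocation and _$qVar.

def is_variable(term):
    return term.startswith("_%") or term.startswith("_$")

def unify(pattern, fact, bindings):
    """Match one pattern triple against one fact triple.

    Returns the extended bindings, or None if they do not match."""
    if pattern[0] != fact[0]:
        return None
    result = dict(bindings)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_variable(p):
            if result.setdefault(p, f) != f:
                return None  # variable already bound to something else
        elif p != f:
            return None
    return result

def match_all(patterns, facts, bindings=None):
    """Yield every set of bindings that satisfies all patterns at once."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    for fact in facts:
        extended = unify(patterns[0], fact, bindings)
        if extended is not None:
            yield from match_all(patterns[1:], facts, extended)

# "Where is the ball?"
question = [("_pobj", "_%atLocation", "_$qVar"),
            ("_psubj", "_%atLocation", "ball")]

# Assumed relations for "The ball is under the couch."
facts = [("_pobj", "under", "couch"),
         ("_psubj", "under", "ball")]

answers = list(match_all(question, facts))
print(answers)
```

<p>The single result binds <em>_%atLocation</em> to <em>under</em> and <em>_$qVar</em> to <em>couch</em>, which is the plausible answer mentioned above.</p>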
<p>It&#8217;s also clear to me that I could split up the RelEx processing into two stages: one that generates Stanford-style syntactic relations, and a second that generates the more abstract relations.  This might be a wise move &#8230; since RelEx is already more than 3x faster than the Stanford parser, this could attract new users.</p>
<p>&#8211; Linas Vepstas</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.opencog.org/2009/10/05/semantic-dependency-relations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Frequency of grammatical disjuncts</title>
		<link>http://blog.opencog.org/2009/07/06/frequency-of-grammatical-disjuncts/</link>
		<comments>http://blog.opencog.org/2009/07/06/frequency-of-grammatical-disjuncts/#comments</comments>
		<pubDate>Mon, 06 Jul 2009 18:14:56 +0000</pubDate>
		<dc:creator>Linas Vepstas</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Theory]]></category>
		<category><![CDATA[frequency]]></category>
		<category><![CDATA[grammar]]></category>
		<category><![CDATA[GSoC]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[link-grammar]]></category>
		<category><![CDATA[NLP]]></category>

		<guid isPermaLink="false">http://opencog.wordpress.com/?p=123</guid>
		<description><![CDATA[The link-grammar parser uses labeled links to connect together pairs of words.  In order to capture the idea of proper grammatical construction, any given word is only allowed to have very specific links to its right or left: for ...]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.abisource.com/projects/link-grammar/">link-grammar parser</a> uses labeled links to connect together pairs of words.  In order to capture the idea of proper grammatical construction, any given word is only allowed to have very specific links to its right or left: for example, verbs have their subject on the left, and an object on the right.  Link-grammar defines hundreds of different link types, and there are typically dozens or even hundreds of ways that these can attach to a word. Each allowed set of links is called a &#8220;disjunct&#8221;. So, for example:</p>
<p style="text-align:center">MVp- Js+</p>
<p>is a disjunct that says &#8220;there must be an MVp link from this word, going to the left, and a Js link, going to the right&#8221;. This disjunct commonly connects prepositions to a verb on their left (the MV- link) and the object of the preposition on the right (the J+ link).</p>
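<p>As a rough illustration of the notation, the sketch below splits a disjunct string into its connectors. The <em>Connector</em> type and <em>parse_disjunct</em> function are hypothetical names, not part of the link-grammar API, and the sketch handles only the simple &#8220;type plus direction sign&#8221; form shown here.</p>

```python
from typing import NamedTuple

class Connector(NamedTuple):
    link_type: str   # e.g. "MVp" or "Js"
    direction: str   # "left" for "-", "right" for "+"

def parse_disjunct(text):
    """Split a disjunct such as 'MVp- Js+' into its connectors."""
    connectors = []
    for token in text.split():
        sign = token[-1]
        if sign not in "+-":
            raise ValueError("connector must end in + or -: " + token)
        direction = "right" if sign == "+" else "left"
        connectors.append(Connector(token[:-1], direction))
    return connectors

# One link going left (to the verb), one going right (to the object).
print(parse_disjunct("MVp- Js+"))
```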
<p>A good way to think about disjuncts is to imagine them as very fine-grained part-of-speech tags. Thus, when one sees &#8220;MVp- Js+&#8221; associated with a word, one knows not only that the word is a preposition, but even a bit more: it&#8217;s a preposition that took a singular object.  Disjuncts classify words not just into crude part-of-speech categories, but into much finer ones: thus verbs are not just transitive or intransitive, but might be transitive verbs that take both direct and indirect objects, or participles, etc.</p>
<p>Siva Reddy, a GSOC 2009 summer student, prepared a table of the frequency of occurrence of different disjuncts in a large collection of text. The top six entries are</p>
<p style="text-align:center">Ds+           950275.635843<br />
Xp-           838569.90527<br />
A+          616522.664867<br />
AN+        566658.997313<br />
MVp- Js+       563082.649325<br />
MVp- Jp+      446487.310222</p>
<p style="text-align:left">and these are exactly what one might expect:</p>
<ul>
<li>Ds+ connects the determiner &#8220;the&#8221; to nouns: and of course, &#8220;the&#8221; is the most frequent word in the English language.</li>
<li>Xp- connects the period at the end of the sentence to the start of the sentence, so of course it&#8217;s frequently observed.</li>
<li>A+ connects adjectives to nouns, AN+ connects noun modifiers to nouns.</li>
<li>As noted above, MV connects verbs to modifying phrases, and J connects prepositions to objects, so that MV- J+ is the disjunct that most prepositions will get. Js connects to a singular object, Jp connects to a plural count or mass noun.</li>
</ul>
<p>A graph of rank vs. frequency is shown below:</p>
<div id="attachment_132" class="wp-caption alignnone" style="width: 490px"><img class="size-full wp-image-132" src="http://blog.opencog.org/files/2009/07/disjunct-true-rank2.png" alt="Disjunct rank vs. frequency of occurrence" width="480" height="360" /><p class="wp-caption-text">Disjunct rank vs. frequency of occurrence</p></div>
<p>As can be seen, the distribution is more or less Zipfian, with a power-law exponent of 1.5.  The fact that the long tail appears to be linear indicates that grammatical construction in the English language is more or less scale-free: difficult and awkward constructions are increasingly rare.  The fact that the graph is not purely Zipfian, but instead has a knee for the most common grammatical connections, suggests that the most common grammatical constructions are &#8220;less common than they should be&#8221;: almost as if English speakers are resisting the use of formulaic sentence constructions. So, for example, since adjectives and noun-modifiers appear near the top of the ranking, this suggests that English speakers &#8220;could have&#8221; used more adjectives and noun-modifiers, but didn&#8217;t. Quite why this is so is not clear.  Perhaps the use of anaphora, and of references in general, decreases the need for lots of modifiers.</p>
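<p>For readers who want to estimate such an exponent themselves, here is a minimal sketch: a least-squares fit of log(frequency) against log(rank). It is not necessarily how the exponent quoted above was obtained, and the input is synthetic data constructed to follow an exact power law, not the measured disjunct counts.</p>

```python
import math

def fit_power_law_exponent(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank).

    Returns the exponent s in frequency ~ rank**(-s)."""
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance

# Synthetic counts following an exact power law with exponent 1.5;
# these are made-up numbers, not the measured disjunct table.
synthetic = [1.0e6 * rank ** -1.5 for rank in range(1, 1001)]
exponent = fit_power_law_exponent(synthetic)
print(round(exponent, 3))
```

<p>On real data with a knee, a fit like this would presumably need to be restricted to the tail, since the top-ranked entries do not lie on the straight line.</p>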
<p>The open questions are then:</p>
<ol>
<li>Why a power law of 1.5?</li>
<li>Why is there a knee?</li>
<li>Does this result hold for other languages?</li>
</ol>
<p>The corpus used here consists of approximately 1 million sentences, obtained by parsing entire Wikipedia articles, Voice of America news stories, and 10 books from Project Gutenberg, including War and Peace, Jane Austen, and some scientific or medical texts.</p>
<p>&#8211; Linas Vepstas</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.opencog.org/2009/07/06/frequency-of-grammatical-disjuncts/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Distribution of Mutual Information</title>
		<link>http://blog.opencog.org/2009/03/11/distribution-of-mutual-information/</link>
		<comments>http://blog.opencog.org/2009/03/11/distribution-of-mutual-information/#comments</comments>
		<pubDate>Wed, 11 Mar 2009 22:28:55 +0000</pubDate>
		<dc:creator>Linas Vepstas</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[Theory]]></category>
		<category><![CDATA[corpus linguistics]]></category>
		<category><![CDATA[frequency]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[mutual information]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[word pair]]></category>

		<guid isPermaLink="false">http://brainwave.opencog.org/?p=89</guid>
		<description><![CDATA[A bit of corpus linguistics is performed to examine the mutual information distribution of word pairs. <a href="http://blog.opencog.org/2009/03/11/distribution-of-mutual-information/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been playing NLP statistics games for a long time now, and got to thinking that I had no clue as to the statistical distribution of some of the things I work with.  So below follow some graphs.</p>
<div id="attachment_95" class="wp-caption alignnone" style="width: 490px"><img class="size-full wp-image-95" src="http://blog.opencog.org/files/2009/03/mi-nearby1.png" alt="Mutual information of nearby words" width="480" height="360" /><p class="wp-caption-text">Mutual information of nearby words</p></div>
<p>Above is a graph showing the distribution of the mutual information of word pairs that occur in the same sentence. A number of texts were analyzed, including a portion of Wikipedia, some books from Project Gutenberg, etc. A collection of all possible pairs of words was created, where both words of a pair occur in the same sentence (the left word of the pair occurring earlier in the sentence than the right word). These were counted (about 10 million word pairs were observed) and their mutual information was calculated.</p>
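<p>The pair-collection step can be sketched as follows. This is an illustration, not the code used to produce the graph: the function name is hypothetical, and splitting on whitespace is a deliberately naive stand-in for real tokenization.</p>

```python
from collections import Counter
from itertools import combinations

def count_sentence_pairs(sentences):
    """Count ordered (left, right) word pairs within each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()  # naive tokenizer, an assumption
        # combinations() preserves input order, so the left word of
        # each pair always precedes the right word in the sentence.
        counts.update(combinations(words, 2))
    return counts

pair_counts = count_sentence_pairs(["the ball is under the couch"])
print(pair_counts[("the", "couch")], pair_counts[("couch", "the")])
```

<p>Note that a six-word sentence already yields 15 pairs, which is why the pair count grows so much faster than the sentence count.</p>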
<p>Mutual information is a measure of the likelihood of seeing two words occur together: thus, for example &#8220;Northern Ireland&#8221; will have a high mutual information, since the words &#8220;Northern&#8221; and &#8220;Ireland&#8221; are used together frequently.  By contrast, &#8220;Ireland is&#8221; will have negative mutual information, mostly because the word &#8220;is&#8221; is used with many, many other words besides &#8220;Ireland&#8221;; there is no special relationship between these words. High-mutual-information word pairs are typically noun phrases, often idioms and &#8220;collocations&#8221;, and almost always embody some concept (so, for example, &#8220;Northern Ireland&#8221; is the name of a place &#8212; the name of the conception of a particular country).</p>
<p>In mathematical terms, the mutual information of a word pair (x,y) is defined as:</p>
<p>M(x,y) = log<sub>2</sub> [ P(x,y) / ( P(x,*) P(*,y) ) ]</p>
<p>where P(x,y) is the probability of seeing the word pair (x,y), P(x,*) is the probability of seeing a word pair where the left word is x, and P(*,y) is the probability of seeing a word pair where the right word is y.</p>
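<p>As a worked example of this definition, the sketch below computes M(x,y) from a table of pair counts, estimating each probability as a relative frequency. The function name and the counts are made up for illustration; the counts were chosen only to show the sign of M for the two examples discussed above.</p>

```python
import math
from collections import Counter

def mutual_information(pair_counts):
    """Return {(x, y): M(x, y)} in bits for every observed pair."""
    total = sum(pair_counts.values())
    left = Counter()   # counts behind P(x,*)
    right = Counter()  # counts behind P(*,y)
    for (x, y), n in pair_counts.items():
        left[x] += n
        right[y] += n
    return {
        (x, y): math.log2((n / total) / ((left[x] / total) * (right[y] / total)))
        for (x, y), n in pair_counts.items()
    }

# Made-up counts, not corpus data.
toy_counts = Counter({
    ("Northern", "Ireland"): 8,
    ("Ireland", "is"): 1,
    ("it", "is"): 9,
    ("Ireland", "has"): 9,
    ("Northern", "lights"): 1,
})
mi = mutual_information(toy_counts)
print(mi[("Northern", "Ireland")], mi[("Ireland", "is")])
```

<p>With these toy counts, &#8220;Northern Ireland&#8221; comes out positive and &#8220;Ireland is&#8221; comes out negative, matching the qualitative description above.</p>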
<p>The graph shows M(x,y) on the horizontal axis, and the probability of seeing such a value of M on the vertical axis. This is a bin-count of the distribution of possible values of mutual information, over all word pairs.  This is *NOT* a scatterplot of M(x,y) vs. P(x,y).</p>
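<p>To spell out what &#8220;bin-count&#8221; means here, a small sketch: each word pair contributes one value of M, and the graph plots the fraction of pairs landing in each bin. The <em>bin_count</em> function, the bin width of 0.5, and the input values are all illustrative assumptions, not the settings used for the graph.</p>

```python
import math
from collections import Counter

def bin_count(values, bin_width=0.5):
    """Return {bin_lower_edge: fraction of values falling in that bin}."""
    bins = Counter(math.floor(v / bin_width) * bin_width for v in values)
    total = len(values)
    return {edge: n / total for edge, n in sorted(bins.items())}

# Made-up M(x,y) values, one per word pair.
mi_values = [-1.8, -0.2, 0.1, 0.3, 0.4, 1.6, 1.7, 4.9]
histogram = bin_count(mi_values)
for edge, fraction in histogram.items():
    print(edge, fraction)
```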
<p>Here&#8217;s another graph: same as above, except that this time, only pairs of words that occur immediately next to one another are considered.  The sample size is much smaller: only about 2.4M word-pairs were collected.</p>
<div id="attachment_97" class="wp-caption alignnone" style="width: 490px"><img class="size-full wp-image-97" src="http://blog.opencog.org/files/2009/03/mi-pair.png" alt="Mutual Information of Neighboring Word Pairs" width="480" height="360" /><p class="wp-caption-text">Mutual Information of Neighboring Word Pairs</p></div>
<p>The blue and green exponential lines are located in <strong>exactly</strong> the same place as in the previous graph. It&#8217;s humped in a different way than the previous graph. What is the shape of this hump?  Are the slopes characteristic, or do they vary from one corpus sample to another?  If anyone knows the answers to these questions, please let me know!</p>
<p>Notice the peaks off to the right, at high MI values, in the first graph. I think these are word pairs which are heavily used (topics/terms that are discussed) in one single contributing text, but in none of the others. That&#8217;s only a hypothesis; I don&#8217;t know for sure.</p>
<p>Here is a <a href="http://linas.org/nlp/word-pairs.pdf">more detailed discussion, with many additional figures</a>.</p>
<p>&#8211; Linas Vepstas</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.opencog.org/2009/03/11/distribution-of-mutual-information/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

