Frequency of grammatical disjuncts

The link-grammar parser uses labeled links to connect pairs of words. To capture the idea of proper grammatical construction, any given word is allowed only very specific links to its right or left: for example, verbs have their subject on the left and an object on the right. Link-grammar defines hundreds of different link types, and there are typically dozens or even hundreds of ways that these can attach to a word. Each allowed set of links is called a “disjunct”. So, for example:

MVp- Js+

is a disjunct that says “there must be an MVp link from this word, going to the left, and a Js link, going to the right”. This disjunct commonly connects prepositions to a verb on their left (the MV- link) and the object of the preposition on the right (the J+ link).
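To make the notation concrete, here is a minimal Python sketch that splits a disjunct string into its connectors and their directions. The `Connector` type and `parse_disjunct` function are hypothetical names made up for this illustration; they are not part of the link-grammar API, and the sketch only handles the simple space-separated form shown above.

```python
from typing import List, NamedTuple


class Connector(NamedTuple):
    """One connector of a disjunct (hypothetical helper type)."""
    link_type: str   # link name, e.g. "MVp"
    direction: str   # "-" links to the left, "+" links to the right


def parse_disjunct(text: str) -> List[Connector]:
    """Split a string such as "MVp- Js+" into its connectors."""
    connectors = []
    for token in text.split():
        direction = token[-1]
        if direction not in ("+", "-"):
            raise ValueError(f"connector {token!r} has no +/- direction")
        connectors.append(Connector(token[:-1], direction))
    return connectors


disjunct = parse_disjunct("MVp- Js+")
left = [c.link_type for c in disjunct if c.direction == "-"]
right = [c.link_type for c in disjunct if c.direction == "+"]
print("links to the left: ", left)
print("links to the right:", right)
```

For the preposition example, the left list holds the MVp link to the verb and the right list holds the Js link to the singular object.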

A good way to think about disjuncts is to imagine them as very fine-grained part-of-speech tags. Thus, when one sees “MVp- Js+” associated with a word, one knows not only that the word is a preposition, but even a bit more: it's a preposition that took a singular object. Disjuncts classify words not just into crude part-of-speech categories, but into much finer ones: thus verbs are not just transitive or intransitive, but might be transitive verbs that take both direct and indirect objects, or participles, etc.

Siva Reddy, a GSOC 2009 summer student, prepared a table of the frequency of occurrence of different disjuncts in a large collection of text. The top six entries are

Ds+         950275.635843
Xp-         838569.90527
A+          616522.664867
AN+         566658.997313
MVp- Js+    563082.649325
MVp- Jp+    446487.310222

and these are exactly what one might expect:

  • Ds+ connects the determiner “the” to nouns, and of course “the” is the most frequent word in the English language.
  • Xp- connects the period at the end of the sentence to the start of the sentence, so of course it's frequently observed.
  • A+ connects adjectives to nouns, AN+ connects noun modifiers to nouns.
  • As noted above, MV connects verbs to modifying phrases, and J connects prepositions to objects, so that MV- J+ is the disjunct that most prepositions will get. Js connects to a singular object, Jp connects to a plural count or mass noun.
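The remark that MV- J+ is the generic preposition disjunct can be checked against the six rows quoted above. The following Python sketch drops the lowercase subscripts from each connector and sums the counts of disjuncts that become identical. The helper names are hypothetical, the treatment of subscripts is simplified (it assumes only lowercase letters, with no wildcard characters), and the totals cover only the six rows shown here, not the full table.

```python
import re
from collections import defaultdict

# The six rows quoted in this post: disjunct, then observed frequency.
TOP_SIX = """\
Ds+ 950275.635843
Xp- 838569.90527
A+ 616522.664867
AN+ 566658.997313
MVp- Js+ 563082.649325
MVp- Jp+ 446487.310222
"""


def strip_subscripts(disjunct):
    """Drop lowercase subscripts: "MVp- Js+" becomes "MV- J+"."""
    return " ".join(re.sub(r"[a-z]", "", conn) for conn in disjunct.split())


def coarse_counts(table):
    """Sum the frequencies of disjuncts that agree once subscripts are dropped."""
    totals = defaultdict(float)
    for line in table.splitlines():
        *connectors, count = line.split()
        totals[strip_subscripts(" ".join(connectors))] += float(count)
    return dict(totals)


coarse = coarse_counts(TOP_SIX)
for name, count in sorted(coarse.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {count:15.2f}")
```

Among these six rows, merging the singular and plural variants puts MV- J+ ahead of D+, which fits the observation that most prepositions get this disjunct.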

A graph of rank vs. frequency is shown below:

[Figure: Disjunct rank vs. frequency of occurrence]

As can be seen, the distribution is more or less Zipfian, with a power-law exponent of 1.5. The fact that the long tail appears to be linear indicates that grammatical construction in the English language is more or less scale-free: difficult and awkward constructions are increasingly rare. The fact that the graph is not purely Zipfian, but instead has a knee for the most common grammatical connections, suggests that the most common grammatical constructions are “less common than they should be”: almost as if English speakers are resisting the use of formulaic sentence constructions. So, for example, since adjectives and noun-modifiers appear near the top of the rank, this suggests that English speakers “could have” used more adjectives and noun-modifiers, but didn’t. Quite why this is so is not clear. Perhaps the use of anaphora, and of references in general, helps decrease the need for lots of modifiers.
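For readers who want to estimate such an exponent themselves, here is a minimal Python sketch that fits a straight line to log(frequency) against log(rank) by ordinary least squares. This is only one common way to do it and is not necessarily how the figure of 1.5 was obtained. The function name and the `min_rank` parameter are hypothetical, and the data below is synthetic, generated to follow frequency ∝ rank^(-1.5) exactly; it is not the real disjunct table.

```python
import math


def fit_power_law_exponent(frequencies, min_rank=1):
    """Estimate s in frequency ~ rank**(-s) by a least-squares fit in log-log space.

    Ranks below min_rank are skipped, so that a knee at the top of the
    distribution can be left out and only the tail is fitted.
    """
    ranked = sorted(frequencies, reverse=True)
    points = [
        (math.log(rank), math.log(freq))
        for rank, freq in enumerate(ranked, start=1)
        if rank >= min_rank
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx  # the fitted slope is negative; report the positive exponent


# Synthetic frequencies only: an exact power law with exponent 1.5.
synthetic = [1.0e6 * rank ** -1.5 for rank in range(1, 1001)]
exponent = fit_power_law_exponent(synthetic, min_rank=10)
print(f"fitted exponent: {exponent:.3f}")
```

On real data, the choice of `min_rank` matters: including the knee would pull the fitted slope away from that of the tail.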

The open questions are then:

  1. Why a power law of 1.5?
  2. Why is there a knee?
  3. Does this result hold for other languages?

The corpus used here consists of approximately 1 million sentences, obtained by parsing entire Wikipedia articles, Voice of America news stories, and 10 books from Project Gutenberg, including War and Peace, Jane Austen, and some scientific or medical texts.

— Linas Vepstas

  • ftyers

    Two comments really… the first is that perhaps the “resisting the use of formulaic sentences” comes from using more formal text as opposed to e.g. speech or chat type text. When I write for Wikipedia, I often find myself re-wording sentences.

    The second is, have you considered trying the same with the Persian Link grammar ? [1]

    1. http://www.ling.ohio-state.edu/~jonsafari/#projects

  • http://linas.org linasv

    Martin Reynaert wrote to say: “From what I have learned from the work of mainly Ramon Ferrer i Cancho ( http://www.lsi.upc.edu/~rferrericancho/publications_by_year.html ), I would say that adding syntactic patterns to the words turns the natural language into more formal language. For more formal language a power law exponent well above 1 is ‘natural’.”
