A question was recently asked as to whether Link Grammar could be used to attribute text to a specific author. I had fun writing the reply; let me reproduce it below. It starts at square one.
Consider a police detective analyzing a threatening note. At some point in the prior centuries, it becomes common knowledge that hand-written notes are subject to forensic analysis. Criminals switch to typewriters; alas, some famous spy cases from the 1940s are solved by linking notes to the typewriters that produced them. By the 1970s, Hollywood shows us films with the bad guy clipping words from newspapers. Aside from looking for fingerprints left on the paper, psychological profilers look for idiosyncrasies in how the criminal expresses ideas: strange wording, odd phrases, punctuation or lack thereof.
How about computer text? It’s well known that many people consistently misspell words (I consistently write “thier” for “their”), and I think there was some murder trial evidence that hinged on this. Moving into the PC era, 1980s onwards, we get postmodernism and corpus linguistics. One of the fruits is the “bag of words” model: different texts have different ratios of words. Although this was spotted much earlier, computers allow it to be applied to a zillion and one problems in text classification. Basically, you have a vector of (word, frequency) pairs, and you can judge the similarity between vectors with assorted distance measures (the dot-product is very popular and also just plain wrong, but I digress). I don’t think you’d have any particular problem with using this method to attribute a novel to James Joyce, for example.
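Here is a minimal sketch of the bag-of-words comparison, assuming nothing more than whitespace-and-punctuation tokenization. The function names and the toy texts are made up for illustration. The similarity used is the cosine (a normalized dot product), chosen only because it is the popular measure mentioned above; any other distance could be swapped in.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy texts, invented for illustration only.
known = bag_of_words("the cat sat on the mat and the cat slept")
candidate = bag_of_words("the cat sat on the sofa")
unrelated = bag_of_words("stock prices fell sharply today")

print(cosine_similarity(known, candidate))
print(cosine_similarity(known, unrelated))
```

The first score is higher than the second simply because the first two texts share words; that is all the model sees.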
It becomes subtle, perhaps, if the text is short: say, a letter, and you are comparing it to other letters written in the same era, written by eloquent Irishmen. The words that Joyce might use in a letter might not be the ones he’d use in a novel. It’s reasonable to expect that bag-of-words will fail to provide an unambiguous signal. How about sentence structure, then? (This is what the original question was.) Yes, I agree: that is a good way, maybe the best way(?), of doing this (at current levels of technology). One might still expect Joyce to construct his sentences in a way that is particular to his mode of thinking, irrespective of the topic that he writes on. Mood and feeling echo on in the grammar.
So, how might this work? Before I dive into that, a short digression. Besides bag-of-words, there is also a bag of word-pairs. Here, you collect not (word, frequency) pairs, but (word-pair, frequency) pairs. One collects not nearest-neighbor word-pairs, but word-pairs in some window: say, of length six. The problem is that there are vast numbers of word-pairs, like “the-is” and “you-banana” – hundreds of millions. Most are junk. You can weed most of these away by focusing only on those with a high mutual information, but even so, you’re left with the problem of “overfitting”.
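To make the word-pair idea concrete, here is a rough sketch that counts ordered pairs inside a sliding window and scores each pair by its mutual information. The function names are hypothetical, and the estimate is the plain pointwise form log2( p(l, r) / (p(l, *) p(*, r)) ) computed from the pair counts themselves, which is one reasonable choice among several.

```python
import math
from collections import Counter


def window_pairs(words, window=6):
    """Yield ordered (left, right) pairs that fit in a window of `window` words."""
    for i, left in enumerate(words):
        # The window includes the left word, so look at most window-1 ahead.
        for right in words[i + 1 : i + window]:
            yield (left, right)


def pair_mutual_information(words, window=6):
    """Map each observed pair to its pointwise mutual information in bits."""
    pair_counts = Counter(window_pairs(words, window))
    total = sum(pair_counts.values())
    left_counts = Counter()
    right_counts = Counter()
    for (left, right), n in pair_counts.items():
        left_counts[left] += n    # marginal count of `left` on the left side
        right_counts[right] += n  # marginal count of `right` on the right side
    return {
        (left, right): math.log2(n * total / (left_counts[left] * right_counts[right]))
        for (left, right), n in pair_counts.items()
    }


words = "you can judge the similarity between vectors".split()
scores = pair_mutual_information(words, window=6)
```

Weeding out junk then amounts to keeping only the pairs whose score exceeds some cutoff; the cutoff itself has to be tuned on a real corpus, since a toy sentence like this one says nothing about where it should sit.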
Enter the n-gram (as in “google n-gram viewer”) or better yet, the skip-gram, which is an n-gram with some “irrelevant” words omitted. Effectively all neural-net techniques are skip-gram based. To crudely paraphrase what a neural net does: as you train it on a body of text (say … James Joyce’s complete works…), it develops a collection of (skip-gram, frequency) pairs, or rather, a (skip-gram, weight) vector. You can then compare this to some unknown text: the neural net will act as a discriminator or classifier, telling you if that other text is sufficiently similar (often using the dot product, which is just plain… but I digress…). The “magic” of the neural net is that it figures out which skip-grams are relevant, and which are noise/junk. (There are millions of trillions of skip-grams; out of these, the neural net picks out 200 to 500 of them. This is a non-trivial achievement.)
How might this work for one of James Joyce’s letters? Hmm. See the problem? If the classifier is trained on his novels, the vocabulary there might be quite different than the vocabulary in his personal letters, and that difference in vocabulary will mess up the recognition. Joyce may be using the same sentence constructions in his letters and novels, but with a different vocabulary in each. A skip-gram classifier is blind to word-classes: it’s blind to the grammatical constructions. Something as basic as a synonym trips it up. (Disclaimer: there is some emerging research into solving these kinds of problems for neural nets, and I am not up on the latest! Anyone who knows better is invited to amplify!)
I’ve said before (many many times) that skip-grams are like Link Grammar disjuncts, and it’s time to make this precise. Let’s try this:
    +---->WV--->+     +-----IV--->+-----Ost-----+
    +->Wd--+-SX-+--Pa-+--TO--+-Ixt+   +--Dsu*v--+
    |      |    |     |      |    |   |         |
LEFT-WALL I.p am.v proud.a to.r be.v an emotionalist.n
Here, an example skip-gram might be (I..proud..be) or (proud..be..emotionalist). A sentence like “I was immodestly an emotionalist” would be enough for a police detective to declare that Joyce wrote that. Yet neither of these skip-grams matches it.
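The mismatch can be checked by brute force. The sketch below treats a skip-gram as any ordered n-word subsequence of a sentence, which is a simplifying assumption (real systems usually bound the gaps, hence the optional `max_span`); the function name is made up for illustration.

```python
from itertools import combinations


def skipgrams(words, n=3, max_span=None):
    """Return the set of ordered n-word subsequences of `words`.

    If `max_span` is given, the first and last word of each skip-gram
    must fit inside a window of `max_span` words.
    """
    grams = set()
    for idx in combinations(range(len(words)), n):
        if max_span is None or idx[-1] - idx[0] < max_span:
            grams.add(tuple(words[i] for i in idx))
    return grams


joyce = "I am proud to be an emotionalist".split()
other = "I was immodestly an emotionalist".split()

# Skip-grams that occur in both sentences.
shared = skipgrams(joyce) & skipgrams(other)

print(("I", "proud", "be") in skipgrams(other))
print(("proud", "be", "emotionalist") in skipgrams(other))
print(shared)
```

Both example skip-grams are absent from the second sentence. The only three-word skip-gram the two sentences have in common is (I, an, emotionalist), which carries the rare word but none of the sentence structure.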
Consider now the Link-grammar word-disjunct pairs. For the above sentence, here’s the complete list:
am == SX- dWV- Pa+
proud == Pa- TO+ IV+
to == TO- I*t+
be == Ix- dIV- O*t+
an == Ds**v+
emotionalist == D*u- Os-
You can double-check this by carefully looking at the diagram above; notice that “proud” links to the left with Pa and to the right with TO and IV.
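The same double-check can be done mechanically. The sketch below splits each disjunct into its left-pointing (`-`) and right-pointing (`+`) connectors and merges a matching pair into a link label. The merge rule is a simplified assumption about how connectors match, not the parser's actual code: a leading lowercase `h` or `d` is ignored, the upper-case parts must be identical, and subscripts are compared position by position with `*` as a wildcard. The function names are hypothetical.

```python
from itertools import zip_longest

# Word-disjunct pairs copied from the list above.
DISJUNCTS = {
    "am": "SX- dWV- Pa+",
    "proud": "Pa- TO+ IV+",
    "to": "TO- I*t+",
    "be": "Ix- dIV- O*t+",
    "an": "Ds**v+",
    "emotionalist": "D*u- Os-",
}


def split_disjunct(disjunct):
    """Split a disjunct into (left-pointing, right-pointing) connector names."""
    connectors = disjunct.split()
    left = [c[:-1] for c in connectors if c.endswith("-")]
    right = [c[:-1] for c in connectors if c.endswith("+")]
    return left, right


def _parts(connector):
    """Split a connector into its upper-case type and lower-case subscripts."""
    connector = connector.lstrip("hd")  # assumed head/dependent marker
    i = 0
    while i < len(connector) and connector[i].isupper():
        i += 1
    return connector[:i], connector[i:]


def link_label(right_conn, left_conn):
    """Merge a '+' connector with a '-' connector; None if they do not match."""
    head_r, sub_r = _parts(right_conn)
    head_l, sub_l = _parts(left_conn)
    if head_r != head_l:
        return None
    merged = []
    for a, b in zip_longest(sub_r, sub_l, fillvalue="*"):
        if a == "*":
            merged.append(b)
        elif b == "*" or a == b:
            merged.append(a)
        else:
            return None  # two different non-wildcard subscripts clash
    return head_r + "".join(merged).rstrip("*")


print(split_disjunct(DISJUNCTS["proud"]))
print(link_label("Ds**v", "D*u"), link_label("I*t", "Ix"), link_label("O*t", "Os"))
```

Under this reading, the connectors on “an” and “emotionalist” merge to Dsu*v, those on “to” and “be” to Ixt, and those on “be” and “emotionalist” to Ost, which are the link labels shown in the diagram.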
The original intent of disjuncts is to indicate grammatical structure. So, “Pa” is a “predicative adjective”. “IV” links to “infinitive verb”. As a side-effect, they work with word-classes: for example, “He was happy to be an idiot” has exactly the same parse, even though the words are quite different.
To finally get back to the original question of author attribution. Well, here’s an idea: “bag of disjuncts”. Let’s collect (disjunct, frequency) pairs from Joyce’s novels, and compare them to his letters. The motivation for this idea is that perhaps the specific vocabulary words are different, but the sentence structures are similar.
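A bag of disjuncts is easy to sketch once the parses exist. In the example below the disjuncts are hand-copied from the list above rather than produced by a parser, the second sentence's disjuncts simply take at face value the statement that “He was happy to be an idiot” has exactly the same parse, and the overlap score (a multiset Jaccard index) is an illustrative choice, not an established method. All function names are hypothetical.

```python
from collections import Counter


def bag_of_disjuncts(parsed_sentences):
    """Count disjuncts over an iterable of parsed sentences.

    Each sentence is a list of (word, disjunct) pairs. The words are
    deliberately ignored, so only grammatical structure is counted.
    """
    bag = Counter()
    for sentence in parsed_sentences:
        bag.update(disjunct for _word, disjunct in sentence)
    return bag


def overlap(bag_a, bag_b):
    """Multiset Jaccard overlap: shared counts divided by combined counts."""
    shared = sum((bag_a & bag_b).values())    # element-wise minimum
    combined = sum((bag_a | bag_b).values())  # element-wise maximum
    return shared / combined if combined else 0.0


# "I am proud to be an emotionalist", disjuncts as listed in the text.
proud = [
    ("am", "SX- dWV- Pa+"), ("proud", "Pa- TO+ IV+"), ("to", "TO- I*t+"),
    ("be", "Ix- dIV- O*t+"), ("an", "Ds**v+"), ("emotionalist", "D*u- Os-"),
]
# "He was happy to be an idiot": assumed to carry the same disjuncts,
# per the text's statement that the parse is identical.
idiot = [
    ("was", "SX- dWV- Pa+"), ("happy", "Pa- TO+ IV+"), ("to", "TO- I*t+"),
    ("be", "Ix- dIV- O*t+"), ("an", "Ds**v+"), ("idiot", "D*u- Os-"),
]

novels = bag_of_disjuncts([proud])
letter = bag_of_disjuncts([idiot])
print(overlap(novels, letter))
```

Because the words are thrown away, the two sentences score a perfect overlap even though they share almost no vocabulary.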
How well does this work? I dunno. No one has ever studied this in any quantitative, scientific setting. Some failings are obvious: There is a 100% match to “He was happy to be an idiot” even though the word-choice might not be Joycean. There is a poor match to “I was immodestly an emotionalist” even though the word “emotionalist” is extremely rare, and a dead giveaway. There’s also a problem with the correspondence “immodestly” <=> “proud to be” because “immodestly” is an adverb, not a predicative adjective, and it’s a single word, not a word-phrase. Raw, naive Link Grammar is insensitive to synonymy between word-phrases.
There is a two-decade-old paper that explains exactly how to solve the multi-word synonymous-phrases problem. It’s been done. It’s doable. I can certainly point out a half-dozen other tricks and techniques to further refine this process. So, yes, I think that this all provides a good foundation for text attribution experiments. But I mean what I say: “experiments”. I think it could work, and I think it might work quite well. But, to do better, you’d have to actually do it. Try it. It would take a goodly amount of work before any literary critic would accept your results, and even more before a judge would accept it as admissible evidence in a court of law.
As to existing software: I have a large collection of tools for counting things and pairs of things, and comparing the similarity of vectors. Most enthusiasts would find that code unusable, until it gets re-written in Python. Alas, that is not forthcoming. If you wanted to actually do what I describe above, some very concrete plans would need to be made.
I also have this daydream about *generating text* in the style of a given author: given a corpus, create more sentences and paragraphs, in the style and vocabulary of that corpus. My ideas for this follow along similar lines of thought to the above, but this is … a discussion for some other day.