December 25, 2010

Antoine Casanova's research

2000/10/16, posted by Adam McLean

I have just reread Antoine Casanova's posting on 6th March 2000, based on his thesis, which reveals a structure within the individual 'tokens' of the Voynich language. He presents this structure as a series of rules, and from these he concludes that the language of the Voynich is not a natural language but has the characteristic signature of an artificial language.

2000/10/17, posted by Jorge Stolfi

In my view, the most significant feature of Antoine's substitution patterns is that the first letter of a Voynichese word seems to have more "inflectional freedom", while the final letters are relatively invariant. These patterns are precisely opposite to what we would expect to see in Indo-European languages (at least Romance and Germanic), where grammatical inflection usually modifies letters near the end of the word.
Presumably this is what Antoine has in mind when he says that Voynichese words are "built from synthetic rules which exclude ... natural language". Anyway, I think that this conclusion is unwarranted. After all, there are non-IE natural languages, which I do not dare to mention by name 8-), that do seem to have `substitution patterns' similar to those of Voynichese.
Thus I don't accept Antoine's conclusion that Voynichese must be an artificial language, or at best a code based on "progressive modification [similar to] the discs of Alberti". It cannot be just some IE language with a funny alphabet, sure; but we already knew that.
I also find it interesting that his analysis yields a very anomalous pattern for n = 8, namely P_8 = ( 6 8 1 2 3 4 7 5 ). While that pattern may be just a noise artifact, it may also be telling us that the rare 8-letter words are mostly the result of joining a 2-letter word to a 6-letter one.
I am not sure what to make of Antoine's rules for generating P_n from P_{n+1}. For one thing, they seem to be a bit too complicated given the limited amount of data that they have to explain. Moreover, the counts s_2, ..., s_{n-2} seem to be fairly similar, and the differences seem to be mostly statistical noise; therefore, their relative ranks do not seem to be very significant. Indeed, applying Antoine's method to Currier's transcription we get P_6 = ( 1 4 2 6 5 3 ), whereas from Friedman's we get P_6 = ( 1 5 2 4 6 3 ). Moreover, the latter would change to P_6 = ( 1 5 3 4 6 2 ) if we omitted just two words from the input text.
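As a rough illustration of the kind of per-position statistic involved, here is a small Python sketch (my own, not Antoine's actual procedure): it ranks the letter positions of n-letter words by how many distinct letters occur in each position, one simple stand-in for counts like the s_i above. The transcription file name is hypothetical.

# Sketch only: rank the letter positions of n-letter words by how many
# distinct letters appear in each position (one simple stand-in for the
# counts s_i; not necessarily Antoine's definition).
from collections import Counter

def position_ranking(words, n):
    # keep only the n-letter words
    words = [w for w in words if len(w) == n]
    # letters observed at each position (0-based)
    counts = [Counter(w[i] for w in words) for i in range(n)]
    # number of distinct letters per position
    s = [len(c) for c in counts]
    # positions 1..n, from most to least variable
    return [i + 1 for i in sorted(range(n), key=lambda i: -s[i])]

# Hypothetical usage with a one-word-per-line transcription file:
# words = open("voynich_words.txt").read().split()
# print(position_ranking(words, 6))   # a permutation comparable to P_6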
But the main limitation I see in Antoine's method is that he considers the absolute position of each letter in the word to be a significant parameter for statistical analysis. I.e., he implicitly assumes that an n-letter word contains exactly n "inflectional" slots, each of them containing exactly one letter. This view seems too simplistic when one considers the patterns of inflection of natural languages, where each morphological "slot" can usually be filled by strings of different lengths, including zero. To uncover the inflection rules of English, for example, one would have to compare words of different lengths, because the key substitution patterns are

dog / dogs / dog's / dogs'
dance / dances / danced / dancing / dancer / dancers / ...
strong / stronger / strongest / strongly

and so on.
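To make the point concrete, here is a small sketch of the opposite approach (my own illustration, not something proposed in the thread): group word forms around a shared stem and collect their variable-length endings, i.e. align by morphology rather than by absolute letter position.

# Illustrative sketch: collect the (variable-length) endings that each form
# adds to a stem; alignment is by morphology, not by absolute letter position.
def endings(stem, forms):
    out = {}
    for f in forms:
        if f.startswith(stem):
            out[f] = f[len(stem):] or "-"       # "-" marks the bare stem
        elif f.startswith(stem[:-1]):           # crude allowance for dance -> dancing
            out[f] = f[len(stem) - 1:]
        else:
            out[f] = "?"                        # no simple alignment
    return out

print(endings("dog", ["dog", "dogs", "dog's", "dogs'"]))
print(endings("dance", ["dance", "dances", "danced", "dancing", "dancer"]))
print(endings("strong", ["strong", "stronger", "strongest", "strongly"]))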

Another problem of Antoine's method is that the most important structural features of words in natural languages are usually based on *relative* letter positions, and may not be visible at all in an analysis based on absolute positions. For example, in Spanish there is a particularly strong alternation of vowels and consonants, so that if words were aligned by syllables one would surely find that the "even" letter slots have very different substitution properties than the "odd" slots. But since Spanish words may begin with either vowel or consonant, and may contain occasional VV and CC clusters, the 3rd and 4th letters in a 6-letter word should be about as likely to be VC as CV; and, therefore, will probably have very similar substitution statistics.
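A handful of 6-letter Spanish words makes this visible (a toy example of mine, not from the original post): their consonant/vowel patterns fall out of phase when the words are aligned letter by letter.

# Toy example: consonant/vowel patterns of 6-letter Spanish words are out of
# phase under letter-by-letter alignment, so per-position statistics mix C and V.
VOWELS = set("aeiouáéíóú")

def cv_pattern(word):
    return "".join("V" if ch in VOWELS else "C" for ch in word)

for w in ["camino", "abrazo", "fuerte", "animal", "blanco"]:
    print(w, cv_pattern(w))
# camino CVCVCV
# abrazo VCCVCV
# fuerte CVVCCV
# animal VCVCVC
# blanco CCVCCV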

Indeed, aligning words letter-by-letter is a bit like classifying fractional numeric data like 3.15 and -0027 into classes by the number of characters, and then analyzing the statistics of the i-th character within each class, without regard for leading zeros, omitted signs, or the position of the decimal point. While some statistical features of the data may still have some visible manifestation after such mangling, we cannot expect to get reliable and understandable results unless we learn to align the data by the decimal point before doing the analysis.
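In code, the "align by the decimal point" step amounts to indexing digits by decimal place instead of by raw character position (a sketch of the analogy only, nothing more):

# Sketch of the analogy: index digits by decimal place (0 = units, -1 = tenths, ...)
# instead of by raw character position, ignoring signs and leading zeros.
def digits_by_place(s):
    s = s.lstrip("+-").lstrip("0") or "0"
    whole, _, frac = s.partition(".")
    places = {}
    for k, d in enumerate(reversed(whole)):
        places[k] = d
    for k, d in enumerate(frac, start=1):
        places[-k] = d
    return places

print(digits_by_place("3.15"))    # {0: '3', -1: '1', -2: '5'}
print(digits_by_place("-0027"))   # {0: '7', 1: '2'}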

Meaningful vs. meaningless

2000/1/23, posted by Rene Zandbergen

In order to have real, strong evidence that the VMs contains meaningful text, we need to know how one can create a 'meaningless' text that still exhibits the same properties as meaningful text. More to the point: we need to find a mechanism that could have been applied 400-500 years ago.

Jacques already pointed out that we don't actually know how to define meaningful and meaningless. This may well prove to be a serious problem. When trying to generate meaningless texts which the LSC would classify as meaningful, or vice versa, we're likely to end up in the no-man's land between the two.... Take a meaningful text and start removing words (every 10th, every 2nd, at random...). When does the text stop being meaningful? How does the LSC curve behave?
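One way to run that experiment (my own sketch; Rene only describes it in words) is to generate the degraded texts mechanically and feed them to an LSC-style test:

# Sketch of the proposed experiment: delete every k-th word, or a random
# fraction p of the words, from a meaningful text; the resulting texts can
# then be fed to an LSC-style test.
import random

def drop_every_kth(text, k):
    words = text.split()
    return " ".join(w for i, w in enumerate(words) if (i + 1) % k != 0)

def drop_random(text, p, seed=0):
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() >= p)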

2000/1/24, posted by Jorge Stolfi

Consider that an ideal text compression algorithm should take "typical" texts and turn them into random-looking strings of bits. Of course this transformation preserves meaning (as long as one has the decompression algorithm!); but, for maximum compression, the program should equalize the bit probabilities and remove any correlations. Modern compressors like PKZIP go a long way in that direction. The compressed text, being shorter than the original, will actually have more meaning per unit length; but it will look like perfect gibberish to LSC-like tests.
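The effect is easy to see with zlib's DEFLATE (the same family of compressor as PKZIP) by comparing byte-frequency entropy before and after compression; the sample text and the exact numbers below are only indicative.

# Compare byte-value entropy (bits per byte) before and after DEFLATE compression.
import math, zlib
from collections import Counter

def entropy_bits_per_byte(data):
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

text = b"the quick brown fox jumps over the lazy dog " * 2000
packed = zlib.compress(text, 9)
print(len(text), len(packed))             # the compressed stream is much shorter
print(entropy_bits_per_byte(text))        # roughly 4-5 bits/byte for English-like text
print(entropy_bits_per_byte(packed))      # noticeably higher, approaching the 8-bit maximum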

Or, consider a meaningful plaintext XORed with the binary expansion of pi. The result will have uniform bit probabilities, and no visible correlations; but it will still carry the original meaning, which can be easily recovered. It would take a very sophisticated algorithm (one that knows that pi is a "special" number) to notice that the text is not an entirely random string of bits.
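For instance (a toy sketch: the keystream below is just the first sixteen bytes of pi's fractional part in hexadecimal, hard-coded; a real test would generate as many digits as needed):

# Toy sketch: XOR a plaintext with a pi-derived keystream. XORing a second
# time with the same keystream recovers the original text.
from itertools import cycle

PI_BYTES = bytes.fromhex("243f6a8885a308d313198a2e03707344")   # hex digits of pi's fractional part

def xor_with_pi(data):
    return bytes(b ^ k for b, k in zip(data, cycle(PI_BYTES)))

msg = b"a meaningful plaintext"
ct = xor_with_pi(msg)
print(ct.hex())          # looks like random bytes
print(xor_with_pi(ct))   # b'a meaningful plaintext'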

So the LSC and possible variants are not tests of `meaning' but rather of `naturalness.' They work because natural language uses its medium rather inefficiently, but in a rather peculiar way: it uses symbols with unequal frequencies (a feature that mechanical monkeys can imitate), but changes those frequencies over long distances (something which simple monkeys won't do).
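A rough sketch of an LSC-like statistic (the general idea only, not necessarily the exact definition used on the list): chop the letter stream into chunks of n letters and sum the squared differences in letter counts between adjacent chunks; how this sum behaves as n grows is what separates natural text from frequency-matched gibberish.

# LSC-like statistic (sketch): sum of squared differences in letter counts
# between adjacent n-letter chunks of the text.
from collections import Counter

def lsc_like(text, n):
    letters = [c for c in text.lower() if c.isalpha()]
    chunks = [letters[i:i + n] for i in range(0, len(letters) - n + 1, n)]
    total = 0
    for a, b in zip(chunks, chunks[1:]):
        ca, cb = Counter(a), Counter(b)
        for letter in set(ca) | set(cb):
            total += (ca[letter] - cb[letter]) ** 2
    return total

# Hypothetical usage: compare lsc_like(real_text, n) with lsc_like(scrambled_text, n)
# over a range of chunk sizes n.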

However, with slightly smarter monkeys one *can* generate meaningless texts that fool the LSC; and the same applies to any "meaning detector" that looks only at the message. Conversely, one can always encode a meaningful text so as to make it look "random" to the LSC. In short, a naturally produced (and natural-looking) text can be quite meaningless, while a meaningful text may be (and look) quite unnatural.
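One such smarter monkey, sketched here under my own assumptions rather than any scheme discussed on the list, simply shuffles the words of a real text within largish blocks: word and letter frequencies, and much of their slow drift across the text, survive, but the meaning does not.

# Sketch of a "smarter monkey": shuffle the words of a real text within blocks.
# Local word/letter frequencies and their long-range drift are largely kept,
# but the output is meaningless.
import random

def scramble_within_blocks(text, block_words=500, seed=0):
    rng = random.Random(seed)
    words = text.split()
    out = []
    for i in range(0, len(words), block_words):
        block = words[i:i + block_words]
        rng.shuffle(block)
        out.extend(block)
    return " ".join(out)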