December 25, 2010

Antoine Casanova's research

2000/10/16, posted by Adam McLean

I have just reread Antoine Casanova's posting on 6th March 2000, based on his thesis, which reveals a structure within the individual 'tokens' in the Voynich language. These he shows as a series of rules, and from these he concludes that the language of the Voynich is not a natural language but has the characteristic signature of an artificial language.

2000/10/17, posted by Jorge Stolfi

In my view, the most significant feature of Antoine's substitution patterns is that the first letters of a Voynichese word seem to have more "inflectional freedom", while the final letters are relatively invariant. These patterns are precisely opposite to what we would expect to see in Indo-European languages (at least Romance and Germanic), where grammatical inflection usually modifies letters near the end of the word.
Presumably this is what Antoine has in mind when he says that Voynichese words are "built from synthetic rules which exclude ... natural language". Anyway, I think that this conclusion is unwarranted. After all, there are non-IE natural languages, which I do not dare to mention by name 8-), that do seem to have `substitution patterns' similar to those of Voynichese.
Thus I don't accept Antoine's conclusion that Voynichese must be an artificial language, or at best a code based on "progressive modification [similar to] the discs of Alberti". It cannot be just some IE language with a funny alphabet, sure; but we already knew that.
I find it interesting also that his analysis yields a very anomalous pattern for n = 8, namely P_8 = ( 6 8 1 2 3 4 7 5 ). While that pattern may be just a noise artifact, it may also be telling us that the rare 8-letter words are mostly the result of joining a 2-letter word to a 6-letter one.
I am not sure what to make of Antoine's rules for generating P_n from P_{n+1}. For one thing, they seem to be a bit too complicated given the limited amount of data that they have to explain. Moreover, the counts s_2, ..., s_{n-2} seem to be fairly similar, and the differences among them seem to be mostly statistical noise; therefore, their relative ranks do not seem to be very significant. Indeed, applying Antoine's method to Currier's transcription we get P_6 = ( 1 4 2 6 5 3 ), whereas from Friedman's we get P_6 = ( 1 5 2 4 6 3 ). Moreover, the latter would change to P_6 = ( 1 5 3 4 6 2 ) if we omitted just two words from the input text.
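The fragility of such rankings can be illustrated with a small sketch. It assumes, as a guess not confirmed by the posts above, that a pattern like P_n is obtained by ordering letter positions by some per-position count s_i. The function name `rank_permutation` and the counts are invented for illustration and are not taken from any transcription.

```python
def rank_permutation(counts):
    """Return 1-based positions ordered by decreasing count (ties: lower position first)."""
    order = sorted(range(len(counts)), key=lambda i: (-counts[i], i))
    return tuple(i + 1 for i in order)

# Invented counts for a 6-letter word class; the middle slots are nearly equal.
counts_a = [120, 61, 60, 58, 59, 40]
# Same data after a single occurrence moves from slot 5 to slot 4.
counts_b = [120, 61, 60, 59, 58, 40]

print(rank_permutation(counts_a))
print(rank_permutation(counts_b))
```

When neighbouring counts differ by only a unit or two, one changed word is enough to swap two entries of the resulting permutation, which is the kind of instability described above.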
But the main limitation I see in Antoine's method is that he considers the absolute position of each letter in the word to be a significant parameter for statistical analysis. I.e., he assumes implicitly that an n-letter word contains exactly n "inflectional" slots, each of them containing exactly one letter. This view seems too simplistic when one considers the patterns of inflection of natural languages, where each morphological "slot" can usually be filled by strings of different lengths, including zero. To uncover the inflection rules of English, for example, one would have to compare words of different lengths, because the key substitution patterns are

dog / dogs / dog's / dogs'
dance / dances / danced / dancing / dancer / dancers / ...
strong / stronger / strongest / strongly

and so on.
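As a rough illustration of why fixed-length classes hide these patterns, the sketch below splits each of the three word families above into a shared stem and a variable ending. Taking the longest common prefix as the "stem" is a crude stand-in for real morphological analysis, and the helper name `split_stem_and_endings` is hypothetical.

```python
import os

def split_stem_and_endings(forms):
    """Split related word forms into their longest common prefix and the endings that vary."""
    stem = os.path.commonprefix(forms)
    return stem, [form[len(stem):] for form in forms]

families = [
    ["dog", "dogs", "dog's", "dogs'"],
    ["dance", "dances", "danced", "dancing", "dancer", "dancers"],
    ["strong", "stronger", "strongest", "strongly"],
]

for forms in families:
    stem, endings = split_stem_and_endings(forms)
    # The ending "slot" holds strings of several different lengths, including zero.
    print(stem, sorted({len(e) for e in endings}), endings)
```

In each family the single ending slot is filled by strings of several lengths, so the members land in different word-length classes and are never compared with one another in a position-by-position analysis.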

Another problem with Antoine's method is that the most important structural features of words in natural languages are usually based on *relative* letter positions, and may not be visible at all in an analysis based on absolute positions. For example, in Spanish there is a particularly strong alternation of vowels and consonants, so that if words were aligned by syllables one would surely find that the "even" letter slots have very different substitution properties from the "odd" slots. But since Spanish words may begin with either a vowel or a consonant, and may contain occasional VV and CC clusters, the 3rd and 4th letters in a 6-letter word should be about as likely to be VC as CV; and, therefore, they will probably have very similar substitution statistics.
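The point about the 3rd and 4th letters can be made concrete with a minimal sketch. The four Spanish words are hand-picked to show both cases and say nothing about real frequencies; the vowel set, the treatment of every other letter as a consonant, and the helper names are simplifying assumptions.

```python
from collections import Counter

# Simplification: accented vowels count as vowels, everything else as a consonant.
VOWELS = set("aeiouáéíóúü")

def cv_pattern(word):
    """Map each letter of a word to 'V' (vowel) or 'C' (anything else)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

def slot_pair_counts(words, first, second):
    """Count the C/V classes found at two 1-based absolute letter positions."""
    pairs = (cv_pattern(w)[first - 1] + cv_pattern(w)[second - 1] for w in words)
    return Counter(pairs)

sample = ["camino", "pelota", "amigos", "animal"]  # hand-picked 6-letter words
for word in sample:
    print(word, cv_pattern(word))
print(slot_pair_counts(sample, 3, 4))
```

All four words alternate strictly, yet absolute slots 3 and 4 come out as CV for the consonant-initial words and VC for the vowel-initial ones, so the alternation is washed out once words are pooled by absolute position.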

Indeed, aligning words letter-by-letter is a bit like classifying fractional numeric data like 3.15 and -0027 into classes by the number of characters, and then analyzing the statistics of the ith character within each class, without regard for leading zeros, omitted signs, or the position of the decimal point. While some statistical features of the data may still have some visible manifestation after such mangling, we cannot expect to get reliable and understandable results unless we learn to align the data by the decimal point before doing the analysis.
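The analogy can be sketched directly. The code below normalizes numeric strings so that signs are explicit, leading zeros are dropped, and the decimal point sits in the same column for every value. The function names and the third sample value are illustrative assumptions; only 3.15 and -0027 come from the text above.

```python
def split_number(text):
    """Split a numeric string into (sign, integer digits, fractional digits)."""
    sign = "-" if text.startswith("-") else "+"
    body = text.lstrip("+-")
    integer, _, fraction = body.partition(".")
    # Drop leading zeros, but keep a single "0" for values like "0.5".
    return sign, integer.lstrip("0") or "0", fraction

def align_by_decimal_point(values):
    """Pad every value so that sign, integer part, and fraction occupy fixed columns."""
    parts = [split_number(v) for v in values]
    int_width = max(len(integer) for _, integer, _ in parts)
    frac_width = max(len(fraction) for _, _, fraction in parts)
    return [
        sign + integer.rjust(int_width, "0") + "." + fraction.ljust(frac_width, "0")
        for sign, integer, fraction in parts
    ]

raw = ["3.15", "-0027", "12.5"]
for before, after in zip(raw, align_by_decimal_point(raw)):
    print(f"{before:>6}  ->  {after}")
```

In the raw strings, character 2 is a decimal point in one value and a digit in the others; after alignment each column holds digits of the same weight, so per-column statistics become meaningful. The corresponding step for Voynichese words would be finding the right "decimal point" to align on, which is exactly what remains unknown.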

posted by ぶらたん at 21:07 | Comment(0) | Other