December 10, 2010

Word frequencies in labels vs. running text

1997/4/10, posted by Karl Kluge

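Because character usage frequencies differ like this between the labels and the running text, one is tempted to reject the idea that the Voynich is random, meaningless gibberish.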

label-initial/final and running text line-initial/final stats differ: using a label corpus from f68r1, f70v2, f72r2, f88r, and f100r

4O is line initial 17.9% in the mss, < 1% in the labels
AM is line final 11.5% in the mss, 3.6% in the labels
OF is line initial 2.1% in the mss, 22.2% in the labels
OP is line initial 3.5% in the mss, 33.9% in the labels
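
A comparison of this kind can be sketched in Python. This is only an illustration, not the script used in the post: it assumes each figure means "the percentage of lines (or labels) that begin or end with the given glyph group", the helper name `edge_share` is hypothetical, and the strings below are invented toy data rather than real transcription.

```python
def edge_share(lines, group, position="initial"):
    """Percentage of non-empty lines that start (or end) with `group`."""
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    if not cleaned:
        return 0.0
    if position == "initial":
        hits = sum(ln.startswith(group) for ln in cleaned)
    elif position == "final":
        hits = sum(ln.endswith(group) for ln in cleaned)
    else:
        raise ValueError("position must be 'initial' or 'final'")
    return 100.0 * hits / len(cleaned)


# Toy strings invented for illustration only (NOT real Voynich transcription).
toy_text = ["4OFAM", "4OPCC89", "OFAM", "SCOE"]
toy_labels = ["OPAM", "OFCC89", "OPOE", "SOE89"]

for group, pos in [("4O", "initial"), ("AM", "final"), ("OP", "initial")]:
    print(group, pos,
          f"text {edge_share(toy_text, group, pos):.1f}%",
          f"labels {edge_share(toy_labels, group, pos):.1f}%")
```

Running the same function over a running-text corpus and a label corpus gives the two columns being compared.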


1997/4/28, posted by Denis Mardle

I have also produced some interesting counts on positions of labels in words. OE89, for instance (a label on f99v), is always the end of a word or, more often, a word on its own: of 16 occurrences, 10 are at the end of a line and one at the end of para 2 on f99v, whereas text (less plant labels) ends the page with ZOE89 on f82v; SOE89 similarly on f89v1 and OEFCCOE89 on f99v.
posted by ぶらたん at 21:18 | Comment(0) | Text properties

December 9, 2010

Compare the Entropies of Known Repetitive and Non-repetitive texts

1997/3/23, posted by Dennis Stallings

One of the most striking characteristics of the VMs is the text's repetitiousness. From time to time, some have suggested that it is simply a very repetitious text.

I decided to take samples of known repetitious texts (food recipes, religious texts, catalogs) and compare their second-order entropies with those of known texts that should be less repetitious (prose fiction, essays).

I looked at the following measures:

h0: zero-order entropy (log2 of the number of different characters)
h1: first-order entropy
h2: second-order entropy
h1 - h2: difference between first- and second order entropies
% rel h2: (h1 - h2) as a percentage of h1, that is

% rel h2 = (h1 - h2) / h1 * 100
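
The measures listed above can be sketched in Python. This is a minimal sketch and not the MONKEY program mentioned in the post: it assumes that h2 means the conditional entropy of a character given the preceding character (digraph entropy minus the entropy of the leading characters), and the names `shannon_entropy` and `entropy_profile` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of a frequency table (Counter or dict)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(text):
    """Return h0, h1, h2, h1 - h2 and % rel h2 for a string."""
    if not text:
        raise ValueError("text must not be empty")
    singles = Counter(text)
    pairs = Counter(zip(text, text[1:]))   # adjacent character pairs
    firsts = Counter(text[:-1])            # leading characters of those pairs
    h0 = math.log2(len(singles))
    h1 = shannon_entropy(singles)
    # Assumption: h2 is taken as H(pair) - H(leading character).
    h2 = shannon_entropy(pairs) - shannon_entropy(firsts)
    rel = (h1 - h2) / h1 * 100 if h1 > 0 else 0.0
    return {"h0": h0, "h1": h1, "h2": h2, "h1-h2": h1 - h2, "rel_h2": rel}


profile = entropy_profile("abababab")
print(profile)
```

In the strictly alternating toy string, every character is fully determined by the one before it, so h2 is zero and % rel h2 is 100. A text in which each character is independent of its predecessor would give a % rel h2 near 0.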

*Discussion of Results*

Some runs were on several portions of a large text, because of MONKEY's size limitation - 1 Kings, both KJV and Vulgate, and F. Bacon's Essays. The ranges of rel % h2 were:

23.7 - 22.7 = 1.0, King James 1 Kings
17.7 - 17.3 = 0.4, Latin Vulgate 1 Kings
20.6 - 20.4 = 0.2, Francis Bacon Essays.

These figures give some idea of how reproducible MONKEY results are on a text sample of 32,000 characters.

In the Jacobean English texts, the difference in percentage points of rel % h2 between the highest and lowest values was 24.5 - 20.5 = 4.0. The same difference for modern English texts was 23.8 - 20.1 = 3.7. (The presumably repetitious cajun.txt was slightly less so (20.1) than the short story crane.txt (20.3).) The numbers for Jacobean and modern English do not seem significantly different. The rel % h2's of the Latin Vulgate Bible texts were not significantly different from that of the presumably less repetitious Boethius text boecons.lat. The difference between non-repetitious English texts and Latin texts is about 20.4 - 17.5 = 2.9.

The really significant differences are between Voynich and natural language texts. Even taking the most repetitious English text (KJV Joshua, 24.5) and the least repetitious Voynich text (Herbal-A in Currier, 34.7) gives a difference of 10.2, much greater than the range of repetitiousness between various English texts (3.7 or 4.0). Equally significant is the difference between the various Voynich transcription alphabets. The range from Frogguy to Currier is:

Voynich A: 43.9 - 34.7 = 9.2
Voynich B: 42.1 - 34.9 = 7.2

Once again, this is greater than the range of repetitiousness seen in English texts. EVA is more repetitious than Frogguy, which I would not have expected. Both Frogguy and EVA use combinations of characters to represent single Currier letters. The difference between Latin and English may be due to the same sort of thing: Latin represents far fewer phonemes by multiple characters than English does.


1997/11/7, posted by Rene Zandbergen

It has been pointed out in the recent past that the value of h2 by itself is a very incomplete 'characteristic' of the language in question, unlike, say, the values of mean and standard deviation for a normal distribution, which essentially tell you all you need. Now if one were to compute (sum(x^^3)/N)^^(1/3) for a normal distribution, one would not obtain a value that 'completely describes' it. However, if one finds a value which does not fit together with mean and sigma, one knows that 'something is wrong' and one should not discard this information, even if it's incomplete.

So, whereas I agree that entropy is not a very good measure of the properties of a particular text.
posted by ぶらたん at 22:29 | Comment(0) | Entropy

December 8, 2010

My theory

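The Voynich is not a cipher.

It is not a hoax.

It is a natural language.

However, it is doubtful that it carries any grand meaning.

A string of characters that looks meaningless at first glance

Poetry, musical scores, account books, compounding recipes, and so on?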


Jim Reeds's hypothesis

"I think:
(1) the VMS was written in Europe by a literate European, and
(2) if it has a plain text, it is in a widely used European language such as Latin or Italian. Why by a "literate European"? Because the author clearly knows the ordinary Latin alphabet, a distorted and elaborated version of which forms the VMS character set. If he usually only wrote in Arabic or Hebrew, say, his letters would not look the way they do. I suppose
(3) the author must have had some contact with cryptography, which in 1470 (to make up a date) meant he had some contact with some potentate's secretary.

However:
(4) the book was not written by a non-European,
(5) was not written in a non-European language, and
(6), on the grounds of anachronism, was not written in a deliberately invented artificial language (but I don't mean to rule out a kind of spontaneously generated glossolalic sort of writing, or "outsider art" writing, etc)."

That possibility is also quite plausible.
posted by ぶらたん at 22:58 | Comment(0) | Language of the text

December 7, 2010

Graphs to be drawn later

Relative frequencies of initial letters of lines

Relative frequencies of initial letters of paragraphs

Overall proportions,

Language A,

Language B,

By section

Deviation from the mean, plotted as graphs
posted by ぶらたん at 22:28 | Comment(0) | Text properties

December 5, 2010

Traces of corrections

It is said that the Voynich shows no traces of corrected mistakes,
but there are a few odd spots.

> Likewise the strange thing that ends line 6 of folio 24v,
> which I would write in advanced Frogguy:

> s
>c-lj a 2 A-2

I would say: a correction. The writer forgot the s and inserted it later.
posted by ぶらたん at 18:04 | Comment(0) | Other

Entropy of Japanese

1997/2/1, posted by Dennis Stallings

If you analyze Japanese written in Latin characters (romaji), you get a low entropy. This is because of the severe phonotactic constraints of Japanese. It's close to true that a Japanese syllable may begin with zero or one consonant, have one vowel, and end with -n or nothing.

However, Japanese can also be written in hiragana or katakana (syllabic characters), due to the very fact of the severe phonotactic constraints. You have about 26 Latin characters, plus perhaps the long vowels, giving 25-30 characters for romaji. You have 50-60 characters in each kana set (although as I recall the kana don't indicate vowel length).

With romaji I'm sure even the normalized second-order entropy would be low. With kana I'm sure it would be higher. How much higher depends on word frequencies in Japanese and any rules Japanese might have for combining syllables.

1999/7/23, posted by Dennis Stallings

"Understanding the Second-Order Entropies of Voynich Text", May 11, 1998: http://www2.micro-net.com/~ixohoxi/voy/mbpaper.htm

... I struggled with the entropy concept. My reasoning went: Voynichese ought to "mean the same thing" in Frogguy as in Currier, despite the fact that there's a big difference in their entropy profiles. Likewise, Japanese and Hawaiian "say the same thing" whether they are written in phonemic (romaji in Japanese) or syllabic notation (hiragana or katakana in Japanese).

I've finally realized that my reasoning was false. You *can* transmit more information per character by using a larger character set, as with 71 characters for Japanese kana versus 22 characters for romaji. The question is: how well is a character set of a given size being used?

I still think that the comparison of h1 and h2, which I used in my paper, is useful. I also think that one could define the size of the character set by taking the characters that constitute 0.5% or 1.0% and subtracting the number of such characters from the total number of characters. From that number, one could calculate an h0 that would be meaningful.

1999/7/23, posted by Gabriel Landini

There are further problems in Japanese. One thing is "reading" Japanese, and the other is "listening" to Japanese. For example, the word "shu" (2 characters in hira/katakana) has many different Kanji, all with different meanings. So while reading gives you the exact character, the phonetic alphabets do not (and of course the spoken language doesn't either). This is the "context sensitive" aspect of Japanese: how on earth do you know which of the tens of "shu" you are referring to? Answer: because of what comes before and after it. Of course, this does not concern us because we do not know what Voynichese should sound like.

1999/8/16, posted by Karl Kluge

My understanding is that cryptographers don't use entropy because it doesn't have clean distributional results unlike the various standard statistics such as Index of Coincidence. Modulo that, while you may not be able to use entropy to determine what the letters, words, or language are, that doesn't mean that given a specific cryptographic hypothesis regarding the alphabet and cipher system that entropy can't serve as a test of such a hypothesis.
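posted by ぶらたん at 15:15 | Comment(0) | Entropy

Malayo-Polynesian languages, Filipino

So, after English and French, I (無頼) am now studying a little Filipino (Tagalog). If the Voynich is not a cipher, then things like the repetition of sounds do feel similar.

It would be fun if it turned out to be a medieval missionary's attempt
to write down the local language in an alphabet.

Incidentally, as for other languages, I have studied Spanish and Classical Greek for one year each. I have forgotten most of it, though.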

1996/12/10, posted by Bob Richmond

The language appears to have a small number of phonemes. The languages of the Malayo-Polynesian family, as many observers have suggested, are the most likely possibility. The many languages of the Philippines make that area I think a very likely candidate, since the Spanish began extensive colonization there around 1565, with many Roman Catholic priests in isolated outposts.

Imagine then a young priest posted to the Philippines in the late 16th century. He reduced a local language to writing, as was becoming a widespread practice then - he would have known, for instance, about literary Nahuatl (since the Philippines were a province of Mexico!). Isolated, though, he went native, succumbed to the pleasures of the flesh, and kept some sort of record, using his invented alphabet and the local language. Perhaps he simply recorded his amorous doings with his wife - such records can become very repetitious.

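Then again, come to think of it, there is no reason anyone would need to go to the trouble of inventing a new alphabet just to write down an unknown language.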

1997/10/31, posted by Rene Zandbergen
When Malay was mentioned in two contexts, I did not realise just how many features Malay written in an Arabic script would have in common with Voynichese. The prefixes and suffixes, the short words, the full-word repetitions, the absence of repeated characters.

1997/10/31, posted by Dennis Stallings
Jacques discussed the Jawi (Arabic) script used for Malay in the Voylist archives. Jawi does represent vowels, but in a complex manner.

However, I think you would see the same thing with Malay even in Latin orthography. And I'm pretty sure Malay would be a low-entropy language. It's in the Malayo-Polynesian group, and visually it looks low-entropy.
posted by ぶらたん at 13:56 | Comment(0) | Language of the text

December 4, 2010

The Cathar theory

Let's ignore it.

Leo Levitov published his purported solution of the Voynich Manuscript in *Solution Of The Voynich Manuscript: A Liturgical Manual For The Endura Rite Of The Cathari Heresy, The Cult Of Isis* (Aegean Park Press, 1987). Levitov claims that Catharism was actually a survival of the Greco-Roman-Egyptian cult of Isis and that the Voynich Manuscript is a liturgical manual of this cult. He further claims that the Voynich nymphs in the tubs are undergoing a Cathar sacrament called *Endura* - group suicide by opening veins in warm water.

1996/12/29, posted by CLARY Olivier

Niel says the association between Catharism and suicide has been propagated by Catholic sources and novel writers. The main origin of this claim is that groups of Perfects preferred to throw themselves into the fire singing psalms rather than make the smallest act against the wishes of the consolamentum, like pronouncing an oath or eating meat, and this could be viewed as a suicide. Also, Inquisition registers do mention endura ordered to some people, mainly women, by the deacon of their community, in very late Catharism (14th century), when Cathar churches had already disappeared long ago.
posted by ぶらたん at 21:26 | Comment(0) | Language of the text

Voynich mini-FAQ

December 8, 1996 by Dennis Stallings

In 1912, Wilfrid M. Voynich (a book collector) bought a medieval manuscript (235 pages) written in an unknown script and what appears to be an unknown language or a cipher from the Jesuit College at the Villa Mondragone, Frascati, in Italy (near Rome). However, despite the efforts of many well known cryptologists and scholars, the book remains unread. Since 1969, it has been at Yale University, in the Beinecke Rare Book Library, with catalogue number MS 408.

It is known (from a letter of J. M. Marci in 1665/6) that the manuscript was bought by Emperor Rudolph II of Bohemia (1552-1612) for 600 ducats (an exorbitant sum in those days). The manuscript somehow passed to Jacobus de Tepenecz, the director of Rudolph's botanical gardens (his signature is present in folio 1r) and it is speculated that this must have happened after 1608, when Jacobus Horcicki received his title "de Tepenecz". Thus 1608 is the earliest definite date for the Manuscript.

The Voynich Manuscript, as it has come to be known, contains many drawings of plants, but the plants have not been identified, nor have the drawings been identified with known fanciful or distorted drawings of plants from the Middle Ages. There are what look like astrological drawings. There are curious drawings of little nude women bathing in baths with convoluted plumbing; nothing else like these drawings is known. The persons and costumes look generally European. The script seems to have been developed from early Arabic numerals and medieval Latin abbreviations, but composed of these elements in a unique manner; no other examples of the script or any like it are known. Nothing else about the Manuscript is even this definite; it is a completely unique artifact.

Computer analysis of the Voynich Manuscript has only deepened the mystery. One finding has been that there are two "languages" or "dialects" of Voynichese, which are called Voynich A and Voynich B. The repetitiousness of the text is obvious to casual inspection. Entropy is a numerical measure of the randomness of text. The lower the entropy, the less random and the more repetitious it is. The entropy of samples of Voynich text is lower than that of most human languages; only some Polynesian languages are as low.
posted by ぶらたん at 20:46 | Comment(0) | Other

EKT Hypothesis

1996/8/6, posted by Dennis Stallings

My hypothesis is that the concealment system for the VMs is a word game, like Pig Latin. I have devised a homophonic word game that would be less detectable than Pig Latin and would account for the presence of Voynich A and B, the low variety of digraphs (the low second-order entropy of the text), and the (relative) absence of long repeated phrases.

*King Tut*

The system that interests me the most is called King Tut. One makes the following substitutions:

A - a      I - i      R - rur
B - bub    J - jug    S - sus
C - cut    K - kam    T - tut
D - dud    L - lul    U - u
E - e      M - mum    V - vuv
F - fuf    N - num    W - wuv
G - gug    O - o      Y - yec
H - hush   P - pup    Z - zuz

"The sunflower is a marvellous plant with powerful virtues that must needs be concealed from the ignorant and uninitiated."

becomes:

"Tuthushe susunumfuflulowuverur isus a mumarurvuvelullulousus puplulanumtut wuvituthush pupowuverurfufulul vuvirurtutuesus tuthushatut mumusustut numeedudsus bube cutonumcutealuledud fufruromum tuthushe igugnumoruranumtut anumdud unuminumitutiatutedud."
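
The substitution above is mechanical enough to sketch in Python. This is a minimal sketch of plain King Tut only, not of EKT. The function name `king_tut` is hypothetical, and two behaviours are assumptions: letters missing from the table (Q and X) are passed through unchanged, and an upper-case letter yields a capitalized syllable, as "The" becoming "Tuthushe" suggests.

```python
# Consonant substitutions from the table; vowels map to themselves.
KING_TUT = {
    "b": "bub", "c": "cut", "d": "dud", "f": "fuf", "g": "gug", "h": "hush",
    "j": "jug", "k": "kam", "l": "lul", "m": "mum", "n": "num", "p": "pup",
    "r": "rur", "s": "sus", "t": "tut", "v": "vuv", "w": "wuv", "y": "yec",
    "z": "zuz",
}


def king_tut(text):
    """Encode text with the King Tut word game, one character at a time."""
    out = []
    for ch in text:
        sub = KING_TUT.get(ch.lower())
        if sub is None:
            # Vowels, spaces, punctuation, and (by assumption) Q and X.
            out.append(ch)
        elif ch.isupper():
            out.append(sub.capitalize())
        else:
            out.append(sub)
    return "".join(out)


print(king_tut("The sunflower is a marvellous plant"))
```

Because every consonant expands into a fixed three- or four-letter syllable, the set of digraphs in the output is much smaller than in the plain text, which is the property the hypothesis relies on.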

*Extended King Tut (EKT)*

With modifications, the King Tut system can account for other properties of the Voynich text. I shall call this modified system Extended King Tut (EKT).
posted by ぶらたん at 20:27 | Comment(0) | Language of the text