December 9, 2010

Compare the Entropies of Known Repetitive and Non-repetitive texts

1997/3/23, posted by Dennis Stallings

One of the most striking characteristics of the VMs is the text's repetitiousness. From time to time, some have suggested that it is simply a very repetitious text.

I decided to take samples of known repetitious texts (food recipes, religious texts, catalogs) and compare their second-order entropies with those of known texts that should be less repetitious (prose fiction, essays).

I looked at the following measures:

h0: zero-order entropy (log2 of the number of different characters)
h1: first-order entropy
h2: second-order entropy
h1 - h2: difference between first- and second-order entropies
% rel h2: (h1 - h2) as a percentage of h1, that is

% rel h2 = (h1 - h2) / h1 * 100
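
For concreteness, here is a minimal Python sketch of how these measures can be computed from a character sequence. The function and file names are my own illustration; MONKEY itself may count digraphs or handle spaces somewhat differently.

```python
import math
from collections import Counter

def entropy_measures(text):
    """Return h0, h1, h2, h1 - h2 and % rel h2 for a character sequence."""
    n = len(text)
    counts = Counter(text)

    # h0: log2 of the number of different characters
    h0 = math.log2(len(counts))

    # h1: first-order entropy from single-character frequencies
    h1 = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # h2: second-order entropy, i.e. the average uncertainty of a character
    # given the character immediately before it, estimated from digraph counts
    digraphs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    h2 = -sum((c / (n - 1)) * math.log2(c / firsts[a])
              for (a, _), c in digraphs.items())

    rel_h2 = (h1 - h2) / h1 * 100
    return h0, h1, h2, h1 - h2, rel_h2

if __name__ == "__main__":
    # "sample.txt" is a placeholder for any plain-text sample of ~32,000 characters
    text = open("sample.txt", encoding="utf-8").read()
    h0, h1, h2, diff, rel = entropy_measures(text)
    print(f"h0={h0:.2f}  h1={h1:.2f}  h2={h2:.2f}  h1-h2={diff:.2f}  % rel h2={rel:.1f}")
```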

*Discussion of Results*

Because of MONKEY's size limitation, some runs were on several portions of a large text: 1 Kings (both KJV and Vulgate) and Francis Bacon's Essays. The ranges of % rel h2 were:

23.7 - 22.7 = 1.0, King James 1 Kings
17.7 - 17.3 = 0.4, Latin Vulgate 1 Kings
20.6 - 20.4 = 0.2, Francis Bacon Essays.

These figures give some idea of how reproducible MONKEY results are on a text sample of 32,000 characters.

In the Jacobean English texts, the difference in percentage points of % rel h2 between the highest and lowest values was 24.5 - 20.5 = 4.0. The same difference for modern English texts was 23.8 - 20.1 = 3.7. (The presumably repetitious cajun.txt was slightly less so (20.1) than the short story crane.txt (20.3).) The numbers for Jacobean and modern English do not seem significantly different.

The % rel h2 values of the Latin Vulgate Bible texts were not significantly different from that of the presumably less repetitious Boethius text boecons.lat. The difference between non-repetitious English texts and Latin texts is about 20.4 - 17.5 = 2.9.

The really significant differences are between Voynich and natural-language texts. Even taking the most repetitious English text (KJV Joshua, 24.5) and the least repetitious Voynich text (Herbal-A in Currier, 34.7) gives a difference of 10.2, much greater than the range of repetitiousness among the various English texts (3.7 or 4.0). Equally significant is the difference between the various Voynich transcription alphabets. The range from Frogguy to Currier is:

Voynich A: 43.9 - 34.7 = 9.2
Voynich B: 42.1 - 34.9 = 7.2

Once again, this is greater than the range of repetitiousness seen in English texts. EVA is more repetitious than Frogguy, which I would not have expected. Both Frogguy and EVA use combinations of characters to represent single Currier letters. The difference between Latin and English may be due to the same sort of thing: Latin represents far fewer phonemes by multiple characters than English does.
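
The effect of spelling one letter with several characters can be illustrated with a small sketch. The mapping below is invented purely for illustration (it is not an actual Currier-to-EVA or Currier-to-Frogguy conversion), "sample.txt" is again a placeholder file name, and the entropy_measures() function from the earlier sketch is reused.

```python
# A made-up "transcription change": rewrite a few single letters as digraphs,
# loosely analogous to the way Frogguy and EVA spell one Currier letter with
# two characters.  The mapping is purely illustrative.
expand = {"t": "ct", "k": "ck", "d": "cd"}

original = open("sample.txt", encoding="utf-8").read().lower()
expanded = "".join(expand.get(ch, ch) for ch in original)

for label, txt in (("single characters", original), ("digraph spelling", expanded)):
    *_, rel = entropy_measures(txt)
    print(f"{label:18s} % rel h2 = {rel:.1f}")
```

Because the inserted character makes the character that follows it highly predictable, the digraph version should show a noticeably higher % rel h2, which is the direction of the Currier-to-Frogguy/EVA differences above.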


1997/11/7, posted by Rene Zandbergen

It has been pointed out in the recent past that the value of h2 by itself is a very incomplete 'characteristic' of the language in question, unlike, say, the values of mean and standard deviation for a normal distribution, which essentially tell you all you need. Now if one were to compute (sum(x^3)/N)^(1/3) for a normal distribution, one would not obtain a value that 'completely describes' it. However, if one finds a value which does not fit together with mean and sigma, one knows that 'something is wrong' and one should not discard this information, even if it's incomplete.

So, whereas I agree that entropy is not a very good measure of the properties of a particular text, it is still information that should not be discarded.
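
The analogy can be checked numerically with a short sketch of my own: two samples with the same mean and standard deviation, one normal and one skewed, give clearly different values of (sum(x^3)/N)^(1/3), so the extra statistic, incomplete as it is, still flags that something is wrong with the normality assumption.

```python
import math
import random
import statistics

def cube_root_third_moment(xs):
    """(sum(x^3)/N)^(1/3), with the sign preserved if the sum is negative."""
    m3 = sum(x ** 3 for x in xs) / len(xs)
    return math.copysign(abs(m3) ** (1 / 3), m3)

random.seed(0)
n = 100_000

# Both samples have mean 1 and standard deviation 1,
# but only the first is normally distributed.
normal = [random.gauss(1.0, 1.0) for _ in range(n)]
skewed = [random.expovariate(1.0) for _ in range(n)]

for name, xs in (("normal", normal), ("skewed", skewed)):
    print(f"{name:6s} mean={statistics.fmean(xs):.2f} "
          f"sd={statistics.pstdev(xs):.2f} "
          f"(sum(x^3)/N)^(1/3)={cube_root_third_moment(xs):.2f}")
```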

December 5, 2010

The Entropy of Japanese

1997/2/1, posted by Dennis Stallings

If you analyze Japanese written in Latin characters (romaji), you get a low entropy. This is because of the severe phonotactic constraints of Japanese. It's close to true that a Japanese syllable may begin with zero or one consonant, have one vowel, and end with -n or nothing.

However, Japanese can also be written in hiragana or katakana (syllabic characters), precisely because of those severe phonotactic constraints. You have about 26 Latin characters, plus perhaps the long vowels, giving 25-30 characters for romaji. You have 50-60 characters in each kana set (although, as I recall, the kana don't indicate vowel length).

With romaji I'm sure even the normalized second-order entropy would be low. With kana I'm sure it would be higher. How much higher depends on word frequencies in Japanese and any rules Japanese might have for combining syllables.
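
One way to test this would be to run the same measures on the same Japanese passage written out both ways. The sketch below reuses entropy_measures() from the first post above; the two file names are placeholders for parallel romaji and kana versions of the same text.

```python
# Placeholder file names: the same Japanese passage in romaji and in kana.
for name in ("japanese_romaji.txt", "japanese_kana.txt"):
    text = open(name, encoding="utf-8").read()
    h0, h1, h2, diff, rel = entropy_measures(text)
    print(f"{name:22s} h0={h0:.2f}  h1={h1:.2f}  h2={h2:.2f}  % rel h2={rel:.1f}")
```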

1999/7/23, posted by Dennis Stallings

"Understanding the Second-Order Entropies of Voynich Text", May 11, 1998: http://www2.micro-net.com/~ixohoxi/voy/mbpaper.htm

... I struggled with the entropy concept. My reasoning went: Voynichese ought to "mean the same thing" in Frogguy as in Currier, despite the fact that there's a big difference in their entropy profiles. Likewise, Japanese and Hawaiian "say the same thing" whether they are written in phonemic (romaji in Japanese) or syllabic notation (hiragana or katakana in Japanese).

I've finally realized that my reasoning was false. You *can* transmit more information per character by using a larger character set, as with 71 characters for Japanese kana versus 22 characters for romaji. The question is: how well is a character set of a given size being used?

I still think that the comparison of h1 and h2, which I used in my paper, is useful. I also think that one could define the effective size of the character set by counting the characters that each constitute less than 0.5% or 1.0% of the text and subtracting that count from the total number of distinct characters. From that number, one could calculate an h0 that would be meaningful.
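
A sketch of that suggestion, under my own reading of the 0.5%/1.0% cut-off (drop the characters that are too rare to matter, then take log2 of what remains); the file name in the usage comment is a placeholder:

```python
import math
from collections import Counter

def effective_h0(text, threshold=0.005):
    """h0 computed over the characters that each make up at least
    `threshold` (e.g. 0.5% or 1.0%) of the text; rare stragglers are ignored."""
    counts = Counter(text)
    n = len(text)
    effective = [ch for ch, c in counts.items() if c / n >= threshold]
    return len(effective), math.log2(len(effective))

# Example usage:
# size, h0 = effective_h0(open("voynich_eva.txt", encoding="utf-8").read(), 0.01)
```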

1999/7/23, posted by Gabriel Landini

There are further problems in Japanese. One thing is "reading" Japanese, and the other is "listening" to Japanese. For example, the word "shu" (two characters in hiragana/katakana) corresponds to many different kanji, all with different meanings. So while the written kanji gives you the exact word, the phonetic alphabets do not (and of course the spoken language doesn't either). This is the "context-sensitive" aspect of Japanese: how on earth do you know which of the tens of "shu" you are referring to? Answer: because of what comes before and after it. Of course, this does not concern us, because we do not know what Voynichese should sound like.

1999/8/16, posted by Karl Kluge

My understanding is that cryptographers don't use entropy because it doesn't have clean distributional results, unlike the various standard statistics such as the Index of Coincidence. Modulo that, while you may not be able to use entropy to determine what the letters, words, or language are, that doesn't mean that, given a specific cryptographic hypothesis regarding the alphabet and cipher system, entropy can't serve as a test of such a hypothesis.
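
For reference, the Index of Coincidence mentioned here is straightforward to compute; a minimal sketch (conventions vary on whether the result is further multiplied by the alphabet size):

```python
from collections import Counter

def index_of_coincidence(text):
    """Probability that two characters drawn at random from the text
    (without replacement) are the same."""
    counts = Counter(text)
    n = len(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```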