December 9, 2010

Compare the Entropies of Known Repetitive and Non-repetitive texts

1997/3/23, posted by Dennis Stallings

One of the most striking characteristics of the VMs is the text's repetitiousness. From time to time, some have suggested that it is simply a very repetitious text.

I decided to take samples of known repetitious texts (food recipes, religious texts, catalogs) and compare their second-order entropies with those of known texts that should be less repetitious (prose fiction, essays).

I looked at the following measures:

h0: zero-order entropy (log2 of the number of different characters)
h1: first-order entropy
h2: second-order entropy
h1 - h2: difference between first- and second-order entropies
% rel h2: (h1 - h2) as a percentage of h1, that is

% rel h2 = (h1 - h2) / h1 * 100
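
For concreteness, here is a minimal Python sketch of these measures. It is not MONKEY itself; the function name is mine, and it assumes that h2 here means the conditional entropy of a character given its predecessor (computed as the bigram entropy minus h1), a reading that matches the English figures quoted below.

    import math
    from collections import Counter

    def entropies(text):
        # h0: log2 of the number of distinct characters
        # h1: single-character entropy
        # h2: conditional entropy of a character given the previous one,
        #     computed as (entropy of the bigram distribution) - h1
        n = len(text)
        singles = Counter(text)
        pairs = Counter(zip(text, text[1:]))
        h0 = math.log2(len(singles))
        h1 = -sum(c / n * math.log2(c / n) for c in singles.values())
        m = n - 1
        h_pair = -sum(c / m * math.log2(c / m) for c in pairs.values())
        h2 = h_pair - h1
        rel_h2 = (h1 - h2) / h1 * 100          # % rel h2
        return h0, h1, h2, h1 - h2, rel_h2

On a sizable English sample this gives a % rel h2 somewhere in the low twenties, consistent with the English figures discussed below.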

*Discussion of Results*

Because of MONKEY's size limitation, some runs were made on several portions of a large text: 1 Kings (both KJV and Vulgate) and F. Bacon's Essays. The ranges of rel % h2 were:

23.7 - 22.7 = 1.0, King James 1 Kings
17.7 - 17.3 = 0.4, Latin Vulgate 1 Kings
20.6 - 20.4 = 0.2, Francis Bacon Essays.

These figures give some idea of how reproducible MONKEY results are on a text sample of 32,000 characters.

In the Jacobean English texts, the difference in percentage points of rel % h2 between the highest and lowest values was 24.5 - 20.5 = 4.0. The same difference for modern English texts was 23.8 - 20.1 = 3.7. (The presumably repetitious cajun.txt was slightly less so (20.1) than the short story crane.txt (20.3).) The numbers for Jacobean and modern English do not seem significantly different.

The rel % h2's of the Latin Vulgate Bible texts were not significantly different from that of the presumably less repetitious Boethius text boecons.lat. The difference between non-repetitious English texts and Latin texts is about 20.4 - 17.5 = 2.9.

The really significant differences are between Voynich and natural-language texts. Even taking the most repetitious English text (KJV Joshua, 24.5) and the least repetitious Voynich text (Herbal-A in Currier, 34.7) gives a difference of 10.2, much greater than the range of repetitiousness between various English texts (3.7 or 4.0).

Equally significant is the difference between the various Voynich transcription alphabets. The range from Frogguy to Currier is:

Voynich A: 43.9 - 34.7 = 9.2
Voynich B: 42.1 - 34.9 = 7.2

Once again, this is greater than the range of repetitiousness seen in English texts. EVA is more repetitious than Frogguy, which I would not have expected. Both Frogguy and EVA use combinations of characters to represent single Currier letters. The difference between Latin and English may be due to the same sort of thing: Latin spells far fewer phonemes with multiple characters than English does.
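
A quick way to see the effect of such multi-character spellings is to collapse a frequent English digraph into a single symbol and recompute rel % h2 with the entropies() sketch given earlier; the digraph 'th' and the replacement symbol '+' are arbitrary choices for illustration.

    # uses entropies() from the sketch above; sample.txt is any long English text
    sample = open("sample.txt").read()
    collapsed = sample.replace("th", "+")   # one symbol where the spelling used two
    print(entropies(sample)[-1], entropies(collapsed)[-1])

The spelled-out version should show the higher rel % h2, since the second character of a fixed digraph is nearly determined by the first; this is the same mechanism suggested here for the Frogguy/EVA vs. Currier and English vs. Latin gaps.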


1997/11/7, posted by Rene Zandbergen

It has been pointed out in the recent past that the value of h2 by itself is a very incomplete 'characteristic' of the language in question, unlike, say, the mean and standard deviation of a normal distribution, which essentially tell you all you need. Now, if one were to compute (sum(x^3)/N)^(1/3) for a normal distribution, one would not obtain a value that 'completely describes' it. However, if one finds a value which does not fit together with the mean and sigma, one knows that 'something is wrong', and one should not discard this information even if it is incomplete.
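
A small numeric check of this point (my own sketch, with arbitrary mu and sigma): for a normal distribution the third moment is fixed by the mean and sigma, E[x^3] = mu^3 + 3*mu*sigma^2, so a sample value of (sum(x^3)/N)^(1/3) that disagrees with the value predicted from mean and sigma tells you 'something is wrong', even though the statistic by itself describes nothing completely.

    import random, statistics

    random.seed(1)
    mu, sigma, N = 2.0, 0.5, 100_000
    xs = [random.gauss(mu, sigma) for _ in range(N)]

    m = statistics.fmean(xs)                    # sample mean
    s = statistics.stdev(xs)                    # sample sigma
    observed = (sum(x ** 3 for x in xs) / N) ** (1 / 3)
    predicted = (m ** 3 + 3 * m * s ** 2) ** (1 / 3)
    print(observed, predicted)                  # nearly equal for normal data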

So, whereas I agree that entropy is not a very good measure of the properties of a particular text, it is still information that should not be discarded.
posted by ぶらたん at 22:29 | Comment(0) | Entropy