July 22, 2011

Numbercrunching "word" tuples

From: "Anders, Claus"
Date: Mon, 25 Mar 2002 15:22:59 +0100

2-Tuples (only the first of 2469 shown):

8 chol daiin
32 or aiin
27 shedy qokaiin
25 shey qokaiin
24 daiin daiin
23 chedy qokaiin
22 chol chol
21 qotey chedy
20 qokaiin chedy
20 ol shedy
20 ol chedy
20 daiin chey
19 shedy qokeedy
19 shedy qokedy
19 chedy qokeey
19 ar aiin
18 daiin chedy
17 qokaiin shedy
17 ar al
15 qokeedy qokeedy
15 qokal chedy
15 ol daiin

3-Tuples (only the first of 164 shown):

4 shey qokaiin chedy
4 ol shedy qokedy
4 chey qotey chedy
3 sheedy qokedy chedy
3 shedy qokedy qokeedy
3 shedy qokaiin shey
3 shedy qokaiin chedy
3 qotey chedy qokaiin
3 qokedy qokedy qokedy
3 ol shedy qokeey
3 daiin dy daiin
3 chol chol daiin
3 chedy qokeey qokeey
3 ar al saiin
2 ykeey okeey cheor
2 shy qokal chdy
2 shey qokar shedy
2 shey daiin chey
2 sheey qokol cheol
2 sheey or or
2 shedy qotey shedy
2 shedy qotaiin oteedy
2 shedy qokeedy qotedy
2 shedy qokeedy qokedy
2 shedy qokeedy dal
2 shedy qokedy shedy
2 shedy qokar shedy
2 shedy qokal shedy
2 shedy qokal chedy
2 shedy qokain dar
2 shedy qoeedy ol
2 shedy ol shedy
2 shedy chedy qotey

4-Tuples (all shown)

2 shedy qotey shedy qokaiin
2 ol shedy qokedy qokeedy

5-Tuples:
none (0) zero nothing ....
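
These counts are easy to reproduce. A minimal sketch in Python, assuming a plain one-word-per-token transcription file (the file name is a placeholder, and Anders' own tokenization, e.g. whether tuples may cross line breaks, may differ):

from collections import Counter
from itertools import islice

def word_tuples(words, n):
    # count n-tuples of consecutive words (here across line breaks,
    # which the original count may or may not have allowed)
    return Counter(zip(*(islice(words, i, None) for i in range(n))))

words = open("voynich_eva.txt").read().split()  # hypothetical transcription file
for n in (2, 3, 4, 5):
    repeats = [(c, t) for t, c in word_tuples(words, n).items() if c > 1]
    print(f"{n}-tuples occurring more than once: {len(repeats)}")
    for c, t in sorted(repeats, reverse=True)[:5]:
        print(c, " ".join(t))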

From: "Petr Kazil"
Date: Mon, 25 Mar 2002 18:20:57 +0100

Your list is very interesting. I
can't but wonder about coincidences like the following:

23 chedy qokaiin *
20 qokaiin chedy

24 daiin daiin **
22 chol chol
15 qokeedy qokeedy

19 shedy qokeedy ***
19 shedy qokedy

June 7, 2011

The word <qokeey>

# From: "Philip Neal"
# Date: Fri, 08 Feb 2002 11:09:44 +0000

If a paragraph contains the word qokeey, there is a 38% chance that the next paragraph will contain the word qokeey.
If a paragraph contains the word qokeey, there is a 40% chance that qokeey occurs more than once.
If the current word is qokeey, there is a 6% chance that the next word will be qokeey.

Looking at the concordance results, the occurrences were indeed interesting.
It is because things like this happen that you can tell it is not meaningless random text.
Then what is it?
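
Neal's three percentages can be checked mechanically. A sketch, assuming the transcription has already been split into paragraphs of words (the segmentation convention is an assumption):

def qokeey_stats(paragraphs, word="qokeey"):
    # paragraphs: list of paragraphs, each a list of words
    has = [word in p for p in paragraphs]
    nxt = [b for a, b in zip(has, has[1:]) if a]
    p_next_par = sum(nxt) / len(nxt)                  # ~38% per Neal
    rep = [p.count(word) > 1 for p in paragraphs if word in p]
    p_repeat = sum(rep) / len(rep)                    # ~40%
    flat = [w for p in paragraphs for w in p]
    nx = [b == word for a, b in zip(flat, flat[1:]) if a == word]
    p_next_word = sum(nx) / len(nx)                   # ~6%
    return p_next_par, p_repeat, p_next_word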

# From: Rene Zandbergen
# Date: Mon, 11 Feb 2002

--- Philip Neal wrote:
> The pattern is statistically highly significant.
> I attach a file analysing the distribution of qokeey
> in the paragraphs of folios 103r-116r
> [ ... ]
> There is a rough symmetry between folios on the
> same sheet of the quire, but this is hard to
> quantify.


Indeed. And I like the clear presentation.
My observation at the time was more a set of impressions:
- the bifolio boundaries are roughly observed
- the switch between frequent and infrequent qokeey seems to be not quite on page boundaries (also apparent from your table).
- in the first few occurrences in this section, the word tends to occur towards the end of each paragraph.

The second bullet led me to hypothesise that perhaps the paragraphs have been transcribed from an original document that had the pages in a wrong order, but maybe this is too farfetched.

In general, I have been thinking that this section might actually be a 'geography', each paragraph being a short description of a city. The word qokeey could have some geographical or political meaning that only belongs with some cities....
Note that there are further statistical discrepancies between the sets of pages that have either many qokeey or only few of them. The web page describing them is presently out of order, but one thing I remember is the ratio between occurrences of aiin and daiin.

Note further, again, that the odd distribution of qokeey is also seen _very_ clearly on f58r and f58v.
The word rarely occurs elsewhere.

# Subject: Doubled words
# From: Jorge Stolfi
# Date: Wed, 13 Feb 2002

The repetitions of "qokeey" are indeed exceptional, but they don't prove the conclusion. After all, only a few VMS words behave like that. Moreover, repetitive names *do* occur in some languages: "Sing Sing", "Bora Bora", "Ping-Ping" (the name of a Chinese friend of mine), ...


Notes:

($) Here are some "chol" doublets in voyn/tak:

f1r.P3.15;H chor shey kol chol chol kor chal sho
f8v.P.5;H shealy daiin chary chol chol dar otchar etaiin
f8v.P.8;H ry okchol ksh chol chol chol cthaiin dain
f8v.P.8;H okchol ksh chol chol chol cthaiin dain shol
f15v.P.9;H shol daiin otcholocthol chol chol chody kan sor
f93v.P.4;H shdchy qokchol qokchody chol chol cty ykchy dar

Here are the 10 most common doublet words in voyn/tak, if I can
believe my scripts:

count word
----- --------
22 chol
20 daiin
19 qokeedy
14 qokedy
12 qokeey
11 chedy
10 ar
9 ol
8 dy
8 shedy
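
Stolfi's doublet table takes only a few lines to recompute. A sketch using the same placeholder transcription file as above; note that it counts across line breaks, which his scripts may not have done:

from collections import Counter

words = open("voynich_eva.txt").read().split()  # hypothetical transcription file
doublets = Counter(a for a, b in zip(words, words[1:]) if a == b)
for w, c in doublets.most_common(10):
    print(c, w)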

# From: Nick Pelling
# Date: Thu, 14 Feb 2002 12:25:38 +0000

My belief is that - for the most part - a word start frequently indicates either a change in encoding mechanism, or an actual word start in the plaintext, or just obfuscation (though I think the first two probably dominate the third).

So: EVA "ot-" could well be using one code mechanism, "qo-" another, etc.

Given that I also believe that we're looking at a verbose cipher (with many plaintext letters encoded as ciphertext pairs), this would also have the effect of reducing the average length of a ciphertext word, which would be desirable on the part of the encoder - nothing would betray a verbose cipher quicker than double-length words. :-)
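
A verbose cipher of the kind Pelling describes is easy to demonstrate. A toy sketch; the letter-to-pair table below is invented purely for illustration, not a claimed Voynich mapping:

# Invented letter-to-pair table, purely for illustration (not a claimed
# Voynich mapping): every plaintext letter becomes a ciphertext pair.
TABLE = {"a": "qo", "b": "ch", "c": "ol", "d": "ai", "e": "dy"}

def verbose_encipher(plaintext):
    return "".join(TABLE[c] for c in plaintext)

print(verbose_encipher("bead"))  # chdyqoai: 4 plaintext letters -> 8 glyphs
# Re-spacing the output into shorter "words" would hide the doubled
# word lengths that would otherwise betray the verbose cipher.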

April 29, 2011

ZIP algorithm

This is interesting.

There is someone at an Italian university who claims to identify an author and/or his language by using the ZIP algorithm.
1. Take any text greater than n bytes and compress it with ZIP - this is the "known" text.
2. Add more text and compress it too - this is the "unknown" text.
3. Compare the difference in length of the compressed texts from steps 1 and 2. If you get a minimal difference, they claim, the "unknown" text is derived from the "known" text's language, or even from the same author.
This procedure reminds me of the "entropy test" which was done on the VMS years ago.
Any comments?
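
The procedure described (it sounds like the compression-based classification experiments circulating at the time) can be sketched with zlib; the file names are placeholders:

import zlib

def zipped_len(data):
    return len(zlib.compress(data, 9))

def cross_distance(known, unknown):
    # extra compressed bytes needed for `unknown` after seeing `known`;
    # small when the two texts share letter statistics (language/author)
    return zipped_len(known + unknown) - zipped_len(known)

known = open("known_author.txt", "rb").read()    # hypothetical files
unknown = open("mystery_text.txt", "rb").read()
print(cross_distance(known, unknown))
# caveat: zlib's 32 KB window limits how much of `known` actually helps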

# From: Nick Pelling
# Date: Tue, 29 Jan 2002 11:27:57 +0000

ZIP-era algorithms typically comprise two stages:-
(1) a pattern-matching stage, which converts an input stream into an output stream of both copy(-offset, length) commands and uncompressed literals; and
(2) a statistical (or entropy) encoder (like a Huffman or arithmetical encoder), which tries to compress the output of the first stage down to the entropy of that process' output stream.

Thus, the use of the ZIP algorithm in this "identify-the-author-and-his-language" algorithm you mention would carry out not only an entropy calculation, but also a pattern-matching calculation.

# From: Jacques Guy
# Date: Tue, 29 Jan 2002 23:01:42 -0000

The question: how small is "minimum"?

I would also say that producing the zipped files is unnecessary and, in fact, amounts to throwing out a great deal of information, since you end up with a single figure. It would be far more informative to compare the two Huffman trees computed in the first stage of the algorithm.


December 24, 2010

LSC (Letter Serial Correlation)

2000/1/23, posted by Mark Perakh

The LSC test revealed in the VMS features identical with those of the meaningful texts we explored. On the other hand, if we assume that each Voynichese symbol is a letter, then the letter frequency distribution in the VMS is much more non-uniform than in any of the 12 languages we tested. Furthermore, in one of my papers you can see the LSC results obtained for gibberish which I created by hitting (supposedly randomly) the keys on a keyboard. It has some features of meaningful texts, but also some subtle differences from meaningful texts. You probably noticed that my conclusion was that, if we rely on the LSC data, the VMS can be either meaningful or the result of a very sophisticated effort to imitate a meaningful text, in which even the relative frequencies of vowels and consonants have been skilfully faked. I can hardly imagine such an extraordinarily talented and diligent forger, so I am inclined to guess the VMS is a meaningful text, but some doubts remain. Moreover, if the VMS symbols are not individual letters, all the LSC results hang in the air.

2000/1/15, posted by Gabriel Landini

I think that the LSC depends heavily on the construction of words, but I also think that word construction (because of Zipf's law) depends heavily on a sub-set of the word pool.

Long-range correlations in codes were discussed for DNA a couple of years ago in very prestigious journals like Nature and Science, but to date I do not think that anybody has a convincing theory or explanation of the meaning and validity of the results.

If you think about it, what really is the relation (in any terms) between a piece of text and another many characters away? What is the large-scale structure of a text? That would mean that there are events at small scales and also at larger scales. I can imagine that up to the sentence level or so there may be patterns or correlations (what we call grammar?), but beyond that, I am not sure. Think of a dictionary: there may not be any structure beyond one sentence or definition (still, Roget's Thesaurus conforms to Zipf's law for the more frequent words). Consequently I see no reason why there should be any large-scale structures in texts. (I may be very wrong.)

2000/1/16, posted by Mark Perakh

My comments related only to the question whether or not we can expect the LSC to distinguish between meaningful and monkey texts. I believe the behavior of monkey texts from the standpoint of the LSC is expected to be quite similar to that of permuted texts; therefore the LSC is expected to work for monkeys as well as for permutations. I do not think the LSC will distinguish between permuted and monkey texts. This is based, of course, on the assumption that the texts are long enough that the actual frequencies of letter occurrences are quite close to their probabilities.

2000/1/17, posted by Rene Zandbergen

I agree with Gabriel that using a 3rd order word monkey would be even more interesting in terms of checking the capabilities of the LSC method in detecting meaningful text. On the other hand, getting meaningful word entropy statistics is even more difficult than getting 3rd order character entropy values, so the text from a 3rd order word monkey will repeat the source text from which the statistics have been drawn much more closely than should be the case. As before, a 1st order word monkey will be equivalent to a random permutation of words, and if it is true (in a statistically significant manner) that the LSC test distinguishes between one and the other, we do have another useful piece of evidence w.r.t. the Voynich MS text.
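
For concreteness, a "kth-order letter monkey" in this discussion is a Markov generator trained on k-character contexts. A minimal sketch (a word monkey is the same idea over word tokens):

import random
from collections import Counter, defaultdict

def train(text, k):
    # frequency of each next character after every k-character context
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model

def monkey(model, k, length):
    context = random.choice(list(model))
    out = list(context)
    for _ in range(length):
        counts = model.get(context)
        if not counts:  # dead end: restart from a random context
            context = random.choice(list(model))
            continue
        nxt = random.choices(list(counts), weights=list(counts.values()))[0]
        out.append(nxt)
        context = "".join(out)[-k:]
    return "".join(out)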

2000/1/20, posted by Mark Perakh

I believe we have to distinguish between four situations, to wit:

1) Texts generated by permutations of the above elements (as was the case in our study). In this case there is a limited stock of the above elements, hence there is a negative correlation between the elements' distributions in chunks, and therefore it is a case without replacement (hypergeometric distribution). Our formula for Se was derived for that situation.

2) Monkey texts generated by using the probabilities of elements (letters, digraphs, etc.) and also assuming that the stock of those elements is the same as that available for the original meaningful text. In this case we have again negative correlation and it is a no-replacement case (hypergeometric), so our formula is to be used without modification.

3) The text generated as in item 2) but assuming the stock of letters is much, much larger (say 100,000 times larger) than that available in the original text, preserving though the ratios of element occurrences as in the original text. This is a case with replacement (approximately, but with increasing accuracy as the size of the stock increases). In this case our formula has to be modified (as indicated in paper 1) using the multinomial variance. Quantitatively the difference is only in the L/(L-1) coefficient, which at L>>1 is negligible.

4) The text generated assuming the stock of elements is infinitely large. In this case the distribution of elements is uniform, i.e. the probabilities of all elements become equal to each other (each equal to 1/z, where z is the number of all possible elements (letters, or digrams, etc.) in the original text). In this case the formula for Se simplifies (I derived it in paper 1 for that case as an approximation to roughly estimate Se for n>1).

Quantitatively, cases 1 through 3 are very close, but case 4 produces quantities measurably (but not very much) differing from cases 1 through 3 (see examples in paper 1).

2000/1/21, posted by Jorge Stolfi

Why should the LSC work?

In a very broad sense, the LSC and the nth-order character/word entropies are trying to measure the same thing, namely the correlation between letters that are a fixed distance apart.

People have observed before that correlation between samples n steps apart tends to be higher for "meaningful" signals than for "random" ones, even for large n. The phenomenon has been observed in music, images, DNA sequences, etc. This knowledge has been useful for, among other things, designing good compression and approximation methods for such signals. Some of the buzzwords one meets in that context are "fractal", "1/f noise", "wavelet", "multiscale energy", etc. (I believe that Gabriel has written papers on fractals in the context of medical imaging. And a student of mine just finished her thesis on reassembling pottery fragments by matching their outlines, which turn out to be "fractal" too.)

As I try to show below, one can understand the LSC as decomposing the text into various frequency bands, and measuring the `power' contained in each band. If we do that to a random signal, we will find that each component frequency has roughly constant expected power; i.e. the power spectrum is flat, like that of ideal white light (hence the nickname `white noise'). On the other hand, a `meaningful' signal (like music or speech) will be `lumpier' than a random one, at all scales; so its power spectrum will show an excess of power at lower frequencies. It is claimed that, in such signals, the power tends to be inversely proportional to the frequency; hence the moniker `1/f noise'. If we lump the spectrum components into frequency bands, we will find that the total power contained in the band of frequencies between f and 2f will be proportional to f for a random signal, but roughly constant for a `meaningful' signal whose spectrum indeed follows the 1/f profile.

Is the LSC better than nth-order entropy?

In theory, the nth-order entropies are more powerful indicators of structure. Roughly speaking, *any* regular structure in the text will show up in some nth-order entropy; whereas I suspect that one can construct signals that have strong structure (hence low entropy) but the same LSC as a purely random text.

However, the formula for nth-order entropy requires one to estimate z**n probabilities, where z is the size of the alphabet. To do that reliably, one needs a corpus whose length is many times z**n. So the entropies are not very meaningful for n beyond 3 or so.

The nth-order LSC seems to be numerically more stable, because it maps blocks of n consecutive letters into a single `super-letter' which is actually a vector of z integers; and compares these super-letters as vectors (with difference-squared metric) rather than symbols (with simple 0-1 metric). I haven't done the math --- perhaps you have --- but it seems that computing the n-th order LSC to a fixed accuracy requires a corpus whose length L is proportional to z*n (or perhaps z*n**2?) instead of z**n.

Moreover, one kind of structure that the LSC *can* detect is any medium- and long-range variation in word usage frequency along the text. (In fact, the LSC seems to have been designed specifically for that purpose.) As observed above, such variations are present in most natural languages, but absent in random texts, even those generated by kth-order monkeys. Specifically, if we take the output of a kth-order `letter monkey' and break it into chunks whose length n >> k, we will find that the number of times a given letter occurs in each chunk is fairly constant (except for sampling error) among all chunks. For kth-order `word monkeys' we should have the same result as long as n >> k*w, where w is the average word length. On the other hand, a natural-language text will show variations in letter frequencies, which are due to changes of topic and hence vocabulary changes, that extend for whole paragraphs or chapters.

Thus, although the LSC may not be powerful enough to detect the underlying structure in non-trivial ciphers, it seems well suited to distinguishing natural language from monkey-style random text.

In conclusion, my understanding of the Perakh-McKay papers is that computing the LSC is an indirect way of computing the power spectrum of the text. The reason why the LSC distinguishes meaningful texts from monkey gibberish is that the former have variations in letter frequencies at all scales, and hence a 1/f-like power spectrum; whereas the latter have uniform letter frequencies, at least over scales of a dozen letters, and therefore have a flat power spectrum.

Looking at the LSC in the context of multiscale analysis suggests many possible improvements, such as using scales in geometric progression, and kernels which are smoother, orthogonal, and unitary. Even if these changes do not make the LSC more sensitive, they should make the results easier to evaluate.

In retrospect, it is not surprising that the LSC can distinguish the original Genesis from a line-permuted version: the spectra should be fairly similar at high frequencies (with periods shorter than one line), but at low frequencies the second text should have an essentially flat spectrum, like that of a random signal. The same can be said about monkey-generated texts.

On the other hand, I don't expect the LSC to be more effective than simple letter/digraph frequency analysis when it comes to identifying the language of a text. The most significant influence on the LSC is the letter frequency histogram --- which is sensitive to topic (e.g. "-ed" is common when talking about the past) and to spelling rules (e.g. whether one writes "ue" or "ü"). The shape of the LSC (or Fourier) spectrum at high frequencies (small n) must be determined mainly by these factors. The shape of the spectrum at lower frequencies (higher n) should be determined chiefly by topic and style.
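
For reference, a sketch of the measured LSC sum in one common formulation (squared differences of per-letter counts between consecutive n-letter chunks); Perakh and McKay's exact definition and normalization may differ:

from collections import Counter

def lsc_sum(text, n):
    # measured LSC sum S_m(n), in one common formulation: squared
    # differences of per-letter counts between consecutive n-letter chunks
    letters = [c for c in text.lower() if c.isalpha()]
    chunks = [Counter(letters[i:i + n])
              for i in range(0, len(letters) - n + 1, n)]
    alphabet = set(letters)
    return sum((a[x] - b[x]) ** 2
               for a, b in zip(chunks, chunks[1:])
               for x in alphabet)

Comparing this measured sum against its expected value for a permuted text, over a range of n, gives the profiles discussed above.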

2000/1/22, posted by Jorge Stolfi

For one thing, while the LSC can unmask ordinary monkeys, it too can be fooled with relative ease, once one realizes how it works. One needs only to build a `multiscale monkey' that varies the frequencies of the letters along the text, in a fractal-like manner.

Of course, it is hard to imagine a medieval forger being aware of fractal processes. However, he could have used such a process without knowing it. For instance, he may have copied an Arabic book, using some fancy mapping of Arabic letters to the Voynichese alphabet. The mapping would not have to be invertible, or consistently applied: as long as the forger maintained some connection between the original text and the transcript, the long-range frequency variations of the former would show up in the latter as well.

Moreover, I suspect that any nonsense text that is generated `by hand' (i.e. without the help of dice or other mechanical devices) will show long-range variations in letter frequencies at least as strong as those seen in meaningful texts.

Thus Mark's results do not immediately rule out random but non-mechanical babble or glossolalia. However, it is conceivable that such texts will show *too much* long-range variation, instead of too little. We really need some samples...

December 22, 2010

Could the difference between A and B be due to different subject matter?

1998/11/25, posted by Rene Zandbergen

> Another thought. Could the difference between A and B be due to
> different subject matter?

It could be, but then we know that the text has nothing to do with the illustrations. Herbal-A and Herbal-B are the most different of all 'dialects' in my scatterplots (based on digraphs). Pharma and Astro/Cosmo are very similar, despite their probably different subject-matter.

But more importantly, the same scatterplots show that the difference between A and B is of the type of a continuous change. As if the writer's style (spelling, cypher characteristics) gradually changed with time. From these plots, I can think of three possibilities.

1) One author, who started writing in B-style and gradually developed A-style. This could mean that the Herbal-A section is a cleaned-up copy of earlier Herbal-B-type scribbles, but this task was not completed. It would also mean that the zodiac section was written backwards.

2) One author, who started writing in A-style and gradually 'degraded' into B-style. This would mean that the Herbal-B pages in the first half of the Ms have been misplaced during the binding by an illiterate (like us :-) ). Note that the Herbal-B handwriting is the only part which is visibly different from the rest.

3) Two authors ('A' and 'B'). They started with a common style, 'A' doing pharma and 'B' doing astro-cosmo. 'A' then did herbal-A and 'B' then did the stars and bio sections. 'B' also did some herbal pages perhaps when 'A' was no longer able or willing to continue. The nice aspect of this theory is that 'A' did all the plant drawings, and 'B' did all the drawings involving stars and nymphs. The odd part is that the sections on which they started are all on foldout pages and only later they went to normal pages in normal-sized quires.

(3) is the more fascinating option, which explains most of the observed features, but (2) is the simpler explanation, which is also worth something.

December 16, 2010

frequency

1998/3/1, posted by Karl Kluge

Here is the full frequency data for the transformed text using the mapping from Tiltman structures to individual characters given above:

Vowels identified by Sukhotin's algorithm: I J K E A L B G 4
Letter  Global  Line-Initial  Line-Final  Word-Initial  Word-Final
0 5.84927 0.52506 0.89153 1.54647 21.70076
1 7.56907 0.90692 2.00594 2.49814 15.81762
2 1.87020 0.23866 0.14859 1.14498 7.03303
3 2.68105 0.71599 0.37147 1.44238 7.83527
4 0.06866 0.04773 0.07429 0.05948 0.25404
5 0.26484 0.33413 0.00000 0.17844 0.45461
6 0.00000 0.00000 0.00000 0.00000 0.00000
7 0.00000 0.00000 0.00000 0.00000 0.00000
8 0.00000 0.00000 0.00000 0.00000 0.00000
9 0.00000 0.00000 0.00000 0.00000 0.00000
A 3.48210 5.25060 1.26300 4.90706 1.60449
B 2.89031 4.72554 1.04012 4.32714 1.55101
C 0.39235 0.14320 0.00000 0.75836 0.13371
D 0.12424 0.04773 0.07429 0.14870 0.05348
E 4.78339 10.54893 0.89153 14.98885 1.36382
F 0.05231 0.00000 0.00000 0.16357 0.01337
G 1.77538 5.34606 0.14859 4.72862 0.82899
H 0.15694 0.14320 0.00000 0.46097 0.04011
I 16.53752 16.84964 3.26895 20.66914 11.49886
J 7.42194 10.73986 0.52006 8.28253 3.65022
K 11.84241 19.18854 13.15007 14.92937 15.48335
L 3.28592 11.31265 6.53789 3.58364 1.29696
M 4.51855 3.15036 7.20654 2.24535 1.93876
N 9.12539 6.49165 11.73848 7.83643 2.68753
O 0.24849 0.00000 2.52600 0.04461 0.05348
P 2.39333 0.28640 6.61218 0.65428 0.45461
Q 5.25421 1.24105 20.20802 1.56134 1.39056
R 0.14059 0.00000 0.74294 0.08922 0.02674
S 3.48210 1.24105 9.21248 1.47212 1.23011
T 0.40543 0.19093 1.26300 0.16357 0.20056
U 0.12424 0.04773 0.14859 0.07435 0.09360
V 0.00000 0.00000 0.00000 0.00000 0.00000
W 3.21726 0.28640 9.88113 1.01115 1.31034
X 0.02616 0.00000 0.00000 0.01487 0.00000
Y 0.01635 0.00000 0.07429 0.01487 0.00000
Z 0.00000 0.00000 0.00000 0.00000 0.00000

Entropy 4.00770 3.47030 3.59539 3.65581 3.46832
- ------------------------------------------------------
Digraphs whose maximum frequency (global, line-initial, etc.) exceeds 2.5%:

digraph  global  line-initial  line-final  word-initial  word-final  wf/wi
0E 1.4085 0.0000 0.0000 0.0000 0.0000 6.2592
0I 0.7781 0.0513 0.0000 0.0000 0.0321 3.3754
0N 0.6196 0.0000 0.0899 0.0000 0.0160 2.6217
1E 0.6880 0.0000 0.0000 0.0000 0.0000 2.9985
1I 0.9726 0.1540 0.0000 0.2967 0.5451 2.7691
1K 1.1131 0.1027 0.1799 0.3894 1.6194 2.5397
E0 0.7745 1.0267 0.1799 3.1337 3.2227 0.0164
E2 0.6268 1.3347 0.0000 2.5032 2.6615 0.0000
I0 2.4063 2.4641 0.4496 3.8569 9.6841 0.0164
I1 3.8940 3.0287 0.7194 5.8780 9.2512 0.2622
II 1.0410 0.8727 0.1799 0.3894 0.4489 2.9330
IK 2.3558 2.1561 1.3489 2.9112 5.0024 2.4742
J0 1.6282 1.1807 0.0899 2.1509 6.5416 0.0328
J1 2.1181 2.1047 0.5396 2.6516 5.0505 0.0983
KI 1.4805 5.1848 0.2698 0.9271 0.9139 2.8347
KJ 0.6952 2.8747 0.0000 0.4636 0.2565 1.2125
KP 0.6664 2.0534 2.7878 0.8159 0.2245 0.0655
KQ 2.4675 4.4661 12.8597 4.3019 0.8498 0.1147
KS 1.0807 1.3860 4.2266 2.0026 0.4008 0.2458
KW 0.9474 0.5133 3.9568 2.0026 0.5451 0.0655
LN 0.4431 2.8747 0.3597 0.3894 0.0802 0.0164
NK 1.7543 0.6160 2.9676 1.0013 2.2607 0.4752

h2 3.57014 3.23198 2.96356 3.27572 2.79567 3.47274

Any suggestions on how to proceed with testing this hypothesis regarding the nature of the encoding (and, more to the point, finding the correct mappings of Voynich character combinations to plaintext characters, if this is the type of cipher we're dealing with)?
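
The vowel list at the top comes from Sukhotin's algorithm, which is simple enough to sketch; this is the standard textbook formulation, not necessarily Kluge's exact implementation:

from collections import defaultdict

def sukhotin(words):
    # Sukhotin's vowel-identification algorithm: repeatedly promote the
    # letter with the largest positive adjacency sum to 'vowel', then
    # discount the sums of its neighbours
    adj = defaultdict(int)
    letters = set()
    for w in words:
        letters.update(w)
        for a, b in zip(w, w[1:]):
            if a != b:               # same-letter pairs are ignored
                adj[a, b] += 1
                adj[b, a] += 1
    sums = {x: sum(adj[x, y] for y in letters) for x in letters}
    vowels = []
    while sums:
        v = max(sums, key=sums.get)  # most 'connected' remaining letter
        if sums[v] <= 0:
            break
        vowels.append(v)
        del sums[v]
        for x in sums:               # discount neighbours of the new vowel
            sums[x] -= 2 * adj[x, v]
    return vowels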

C89 ratio

1998/2/9, posted by Denis Mardle

My remark implied that Herbal B1 and B2 were the same language/hand despite the pretty fit to the quires from Karl's work; it could still be valid, since I was only looking at the Currier O89 to C89 ratio. This ratio is very good at sorting out sets. For instance, Herbal B1 has 23.7% O89 (41 to 132) and Herbal B2 is 20.0% (67 to 269). These figures fit into the range for the Stars B sets f104r, f105r, f106r, f107r (see my later "quires ...") which are in the 16-26% range. The f103r and f108r sets have only 4.7%, closer to the Bio-B figure of only 0.5%, but I suspect significantly different. Herbal A is very different again, with the ratio at 98.5% (270 to 4). My conclusion is that neither Herbal B1 nor B2 can go with Bio-B, and the O89 to C89 ratio test does not split them. I will accept another statistic to show a B1 to B2 significance, but I need to see the figures. The O89 to C89 ratio (at 98.5%) will not split the Herbal A sets.
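
Whether B1's 23.7% really differs from B2's 20.0% can be checked with a two-proportion z-test (a swapped-in standard technique, not Mardle's own). On his counts this gives z of about 1.0 and p of about 0.33, consistent with his conclusion that the ratio does not split B1 from B2:

from math import erf, sqrt

def two_proportion_z(k1, n1, k2, n2):
    # z-test for the difference of two proportions, two-sided p-value
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)        # pooled proportion
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Herbal B1: 41 O89 vs 132 C89; Herbal B2: 67 vs 269
print(two_proportion_z(41, 41 + 132, 67, 67 + 269))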

December 13, 2010

Subject matter and the nature of the text

1997/10/31, posted by Rene Zandbergen

> Couldn't the differences in vocabulary and style between the
> astronomical and herbal sections be due to the differing
> subject matter?

Very unlikely. Just remember that Herbal-A and Herbal-B have presumably the same subject-matter, yet are very different in style. Or put the other way around: (ob-smiley omitted)

Theorem 1: if the differences are due to subject-matter, then the pictures do not belong with the text.

And this leads to another important theorem:

Theorem 2: if the differences are due to subject-matter, then the person who collated the pages, bound the Ms and numbered it *could not read* the VMs.

Proof: He went by the pictures.

I would favour the 'different dialect' explanation. Note that subject-matter-related differences may still be identified, but I expect them to be of a smaller scale than the A-vs-B differences.

December 10, 2010

Word frequencies in labels vs. running text

1997/4/10, posted by Karl Kluge

Since the frequencies with which characters are used differ between labels and running text like this, one is tempted to reject the idea that the Voynich is random, meaningless gibberish.

label-initial/final and running text line-initial/final stats differ: using a label corpus from f68r1, f70v2, f72r2, f88r, and f100r

4O is line initial 17.9% in the mss, < 1% in the labels
AM is line final 11.5% in the mss, 3.6% in the labels
OF is line initial 2.1% in the mss, 22.2% in the labels
OP is line initial 3.5% in the mss, 33.9% in the labels


1997/4/28, posted by Denis Mardle

I have also produced some interesting counts on the positions of labels in words. OE89, for instance (a label on f99v), is always the end of a word, or more often a word on its own; of 16 occurrences, 10 are at the end of a line and one at the end of paragraph 2 on f99v, whereas text (less plant labels) ends the page with ZOE89 on f82v, SOE89 similarly on f89v1, and OEFCCOE89 on f99v.

December 7, 2010

Graphs to draw later

Relative frequencies of initial letters of lines

Relative frequencies of initial letters of paragraphs

Overall proportions,

Language A,

Language B,

By section

Deviation from the mean, to be graphed

November 29, 2010

Grumbling

1996/5/14, posted by Guy Thibault

I ran many statistical analyses on the VMS (as everyone does) and it does seem that there are too many repetitions of letters too close together to be somehow meaningful... NULLS maybe...

In other words: he ran many statistical analyses on the Voynich manuscript, saw how very often similar letters repeat, and concluded that even if there is some meaning in it... no, there is probably nothing.

Harsh, isn't it.

July 27, 2010

THE DISTRIBUTION OF LETTERS AND IN THE VOYNICH MANUSCRIPT: EVIDENCE FOR A REAL LANGUAGE?

1994/3/24, posted by Jacques Guy

It is also strange that the frequencies differ between Language A and Language B, isn't it?
I myself sometimes wonder whether it is simply the result of a fabrication.

A is characterized by the extreme frequency of letter , which occurs more than eight times as often as in B. B, on the other hand, is characterized by the very high frequency of letter , which occurs three times as often as in A. We observe 2393 occurrences of in A, and 3053 of in B. Corpus B being 1.5 times the size of Corpus A, we would expect about 3590 occurrences of if there were an exact, systematic correspondence between and in those environments. The correspondence is not exact, only very strong. But note that letter occurs with the same relative frequency as . It consists of two linked and is considered by Currier and most to be a single letter. It is probably so, but the linkage of its two strokes is loose and it may, at least sometimes, be in fact two consecutive occurrences of . In that case the correspondence might yet be exact.

Sukhotin's algorithm identifies letters and as vowels. They resemble our Roman o and e, and Greek omicron and epsilon, and we may hypothesize that they have been used by the Voynich authors with similar phonetic values, perhaps substituted one for the other in a simple cipher. The substitution of e for o is common in natural languages. Thus Standard English and Scots English (lord, laird), Japanese and Japanese teenagers' slang (uso yo, use ye "that's a lie!")

We may, then, have in the Voynich Manuscript either the same language in two different ciphers (very probably simple substitution), or two dialects of the same language. Since the frequency counts above do not show any such strong correspondences for other letter pairs I would incline towards the second hypothesis. William Friedman came to think that the manuscript was not a cipher proper, but a text in an artificial language such as were elaborated by George Dalgarno, or Bishop John Wilkins. It is hardly conceivable that such a language could develop a dialect, as none is known to have been used extensively in writing, let alone spoken. The Voynich Manuscript is perhaps, then, written in a natural language.

July 22, 2010

Hand A and B

1992/02/23, posted by jbaez

Do scribes A and B really exist?
And do hands A and B correspond exactly to languages A and B?

Hand A writes rather large letters, loosely spaced, and is given to fancy flourishes in his gallows letters. Hand B, on the other hand :-), writes in a rather cramped way, with smaller letters, and is not so much given to flourishes. The final folios, 103-116, which are the only ones completely devoted to text (unless one counts the pictures of stars, which seem more like paragraph headers than illustrations), are in hand B. (Note: this is what it looks like to me; I haven't checked the "official" record, since I want to learn the difference myself.) Hand A seems much more eager to leave lots of blank space on the page.

All this would seem to portray A as a free-wheeling, expansive fellow and B as a perfectionistic, constipated sort, BUT it seems to be the case that ALL THE NYMPHS OCCUR ON B's part of the text. One especially dramatic instance of this is on the last page, with its mysterious "key" -- I guess this is folio 117v -- which occurs right after the folios completely devoted to text. It was probably written by B, since the writing looks like B's and it occurs after a bunch of B. AND, it has one last little nymph on it!

July 21, 2010

Opinions on Currier's statistical results

1992/01/28, posted by Robert Firth

Naturally, ordinary natural languages also show correlations between adjacent words, so this is not a property peculiar to the Voynich; if anything, I think it is actually evidence that the Voynich is not meaningless gibberish.

1. Letter correlations.

Currier, and now we, have found correlations between the final "letter" of a "word" (if that is what they are, letters and words) and the initial letter of the next word. Granted. However, I have some problems with the interpretation. Currier claims he knows of no language with this feature; I think he's very wrong.

First, note that in many languages (Welsh, for instance) the phoneme (sound) at the start of a word is modified by the previous word. Some systems of writing reflect this change (modern Welsh, I believe, does), and some do not. Secondly, note that in some languages there are grammatical rules that lead to such correlations. In English the chain of causation runs from right to left (a possibility Currier overlooks): "a" changes to "an" before a vowel, and possessives in "-y" change to "-ine". Likewise, both French and Italian elide heavily, and some writing systems reflect this. Finally, I struggled through enough Dante at one time to know that in the Italian poetry of the time, endings and beginnings were highly correlated, not because of orthography or grammar, but because of euphony.

So, yes, the statistical patterns exist, and they are real, but I have two problems: (a) are they unusual? Would we not find the same with known European languages? And (b) can we reason back from the effect to the cause, given that there might be many or multiple causes, at very different levels of language?

I am similarly skeptical of preferred initial letters. After all, in Latin "qu" is never final and "x" is never initial, and I'm sure parallels can be found in many languages. So this isn't unusual. Preferred paragraph-initial letters, frankly, impress me even less. In Euclid's "Elements", for example, paragraphs usually begin with "Axiom", "Theorem", "Corollary", "Lemma", ... in other words a very small set. The peculiar letter pattern is a consequence of the peculiar word pattern, and that is a consequence of the style of the author. Nothing can be deduced about the language from this effect.

July 20, 2010

EFFECTS OF THE ENDINGS OF ONE "WORD" ON THE BEGINNING OF THE NEXT "WORD"

You remember I mentioned that some "word"-finals have an
obvious and statistically-significant effect on the initial
symbol of a following "word." This is almost exclusively to
be found in "Language" B, and especially in "Biological B"
material.

--------------------End of Quote----------------------------

One answer to the hypothesis that the final letter of one word influences the initial letter of the next.

1992/01/28, posted by Jacques Guy

Let me give you an example. Imagine that I were to write a space after every t when I writ e in English, and merge t he rest oft heremain ingwo rdsrat her ran domlyli keI've just beendo ing
right now. Since "th" is a very frequent digraph, you would
observe a strong correlation between words ending with "t" and words beginning with "h". In fact, I would not have to write a space after each and every single "t": a strong tendency to do so
would be enough to bring out the type of pattern observed by
Currier.
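
Guy's scenario is easy to simulate: force a space after every "t", break the rest of the text at random, and the correlation between word-final "t" and word-initial "h" appears. A sketch on any English sample (the file name is a placeholder):

import random

def guy_respace(text, trigger="t", p_break=0.2):
    # write a space after every `trigger`, otherwise break words at random
    out = []
    for c in (c for c in text.lower() if c.isalpha()):
        out.append(c)
        if c == trigger or random.random() < p_break:
            out.append(" ")
    return "".join(out).split()

words = guy_respace(open("english_sample.txt").read())  # any English text
pairs = [(a, b) for a, b in zip(words, words[1:]) if a.endswith("t")]
print(sum(b.startswith("h") for _, b in pairs) / len(pairs))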

July 19, 2010

letter frequencies

1991/01/26, posted by Jacques Guy

The point is that qo, in, and iin may each be single letters.
There is also a lot of repetition, and if those are single letters, the information content shrinks even further.

Quite true. The trouble is: we really do *not* know what the *true* letters of the Voynich alphabet are. For instance, <4> is almost always followed by <o> (about 90% of the time). Perhaps <4o> is a single Voynich letter. We just don't know. Another example: everybody (including me) believes that Currier's <N> and <M> (my <iv> and <iiv>) are single letters. Well, it seems obvious that they are, but is it true?

In a few minutes I shall send an article entitled "pronounceable Voynich". Have a look at it. To make the language pronounceable I hit upon this idea: let Currier's <D> be "u", but consider his <N> as made up of <I> and <D>, and read it "iu", and his <M> as made up of <II> and <D>, and read it "nu" (<II>, in the Voynich letters, does look a lot like German cursive "n" after all). Of course, I just wanted to make it pronounceable and did not believe for one moment at the time that it could be anything more than a convenient way of recording things. But it turned out to work so well, and to wipe out so many of the problems I had had in trying to make the Voynich pronounceable, that I am starting to wonder: Currier's <D> does look like a "v", "v" and "u" are not distinguished in medieval manuscripts, so...? And then I have found in Bischoff's treatise on Latin paleography an example of Beneventan "n" that looks surprisingly like two Voynich <I>'s. Could it be? Could it be?

9=A?

1991/01/24, posted by Jacques Guy

In EVA terms, these are dy and da.
A note so I can look into it later.

Currier has remarked: "Final 89 is very high in language B,
almost non-existent in Language A".

That had me worried: "89" being extremely frequent, it would
mean that we have two very different "dialects" in Languages
A and B.

I split the Voynich file into VOYNICH.A and VOYNICH.B, and
did a frequency count, disregarding spaces but not
end-of-lines:

A B
89 Observed: 446 2844
Expected: 144 387
Ratio: 3.10 7.35

Indeed. But I also noticed that the discrepancies were
reversed for 8a:

A B
8a Observed: 993 768
Expected: 103 223
Ratio: 9.64 3.44

I remembered that, in my article in Cryptologia, I had
hypothesized that <9> could well be a word-final variant of
<a>, <o>, <c>, or <cc>. And that, in one of my postings to
this group, I wrote that Currier's finding that the ending
of one word strongly affects the beginning of the next
suggested to me that spaces between words had only an
aesthetic function. What if <9> were but a variant of <a>?

Let us see:

A B
8[a9] Observed: 1439 3612
Expected: 247 610
Ratio: 5.83 5.92

That confirmed my suspicion. But was there any additional
evidence? Yes. Looking at my frequency tables, I found that
<9> and <a> occurred in nearly perfect mutually exclusive
distribution, conditioned by the following letter.

Here are the statistics for Language A (my transcription
again):

a o i v c x z 2
a - 2 1358 60 12 245 338 7
9 8 222 4 1 713 11 6 80

4 8 9 g q l = #
a 1 7 5 - 3 5 4 -
9 230 425 72 - 302 353 375 63


And for Language B:

a o i v c x z 2
a 1 8 1506 16 22 739 728 7
9 36 790 8 2 757 332 125 140

4 8 9 g q l = #
a 1 8 6 - 5 10 2 -
9 1431 440 101 - 253 315 547 26

<a> occurs before <i>, <v>, <x> and <z>, <9> before other
letters, and at the end of lines (=) and paragraphs (#).
But note that the constraint is considerably relaxed in
Language B before <x> and somewhat before <z>.

This pretty well convinces me that <a> and <9> are two
variants of the same letter conditioned by the shape of the
following letter.

There is another conditioning factor: <9> occurs
line-initially. Here are the statistics:

a 9
Language A 2 149
Language B 8 134

Again, we observe a slight relaxation of this "rule" in
Language B. This makes me think that Author B wrote less
confidently than Author A.
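
Guy's Observed/Expected/Ratio rows can be recomputed as follows; "expected" here is the count a digraph would have if successive letters were independent (a sketch; his handling of end-of-line symbols may differ):

from collections import Counter

def digraph_ratio(text, digraph):
    # observed digraph count vs. the count expected if successive
    # letters were independent (spaces dropped, as in Guy's count)
    letters = [c for c in text if not c.isspace()]
    pairs = list(zip(letters, letters[1:]))
    observed = sum(1 for a, b in pairs if a + b == digraph)
    freq = Counter(letters)
    n = len(letters)
    expected = len(pairs) * (freq[digraph[0]] / n) * (freq[digraph[1]] / n)
    return observed, expected, observed / expected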

From Currier's 1976 paper

Currier observed that letter frequencies varied accordingly
to their position in the line, being quite different
line-finally from word-finally, line-initially from
word-initially. He gives an instance of this phenomenon,
based on frequency counts from Herbal A (roughly 6500 words
in 1000 lines):

"word"-initial total frequency
symbols "word"-initially line-initially

<cqpt> 118 3
<coqpt> 212 26
<c'qpt> 24 0
<c'oqpt> 45 10


There is indeed something quite strange going on there. With
an average of 6.5 words or so per line, we should expect to
see about 15% of those word-initials occurring at the
beginning of a line. Thus:

"word"-initial total frequency
symbols "word"-initially line-initially

<cqpt> 118 18
<coqpt> 212 33
<c'qpt> 24 4
<c'oqpt> 45 7

The discrepancies between expected and observed frequencies
do not worry me much, except in the case of <cqpt>: 3
occurrences observed when 18 are expected is enough to catch
my attention too. Consider also that we should expect those
four common word-initials, totalling 399 cases, to occur 61
times or so line-initially. We find them there only 39
times.

Currier also found that "some "word"-finals have an obvious
and statistically-significant effect on the initial symbol
of a following "word." Thus:


"word" beginning with:

is preceded by <4o> <x> <ct>
"word" ending in: or <2> or <c't>

<x> series 13 7 91
<2> series 10 2 68
<v> series 23 0 275
<9> series 592 184 168


Currier comments:

"Words" ending in the <9> sort of symbol, which is very
frequent, are followed about four times as often by
"words" beginning with <4o>. That is a fact, and it
holds true throughout the entire twenty pages of
"Biological B." It's something that has to be considered
by anyone who does any work on the manuscript. These
phenomena are *consistent*, *statistically significant*,
and hold true throughout those areas of text where they
are found. I can think of no linguistic explanation for
this sort of phenomenon, not if we are dealing with
words or phrases, or the syntax of a language where
suffixes are present.

July 18, 2010

Identifying Currier's Language A/B

posted by Jim Gillogly, 1992/1/10

Currier's rules are:

a) Final 89 (dy in EVA) is very high in Language B; almost non-existent in Language A.
b) SOE and SOR (chol and chor in EVA) are very high in A, often repeated; low in B.
c) The symbol groups SAN and SAM (chain and chaiin in EVA) rarely occur in B; medium frequency in A.
d) Initial SOP (chot in EVA) high in A, rare in B.
e) Initial Q (cth in EVA) very high in A, very low in B.
f) Unattached finals scattered throughout Language B.

I didn't know how to quantify f), so I ignored it for now. For the others, I calculated an ad hoc A Language score as follows:
a) Score 1 if final 89 < 10% of total words
b) Score 1 if SOE and SOR together > 5% of total words
Score 1 more if either SOE or SOR is repeated on the page
c) Score 1 if SAN and SAM together > 2% of total words
d) Score 1 if initial SOP > 2% of total words
e) Score 1 if initial Q > 2% of total words

Here's the output for f1r (page 001):

Page 001
nwords = 215.
Test a) word-final 89 is 3 of 215
Test b) SOE is 10 of 215.
Test b) SOR is 2 of 215.
Test b) SOE repeated: 1.
Test b) SOR repeated: 0.
Test c) SAM or SAN separately: 0.
Test d) Initial SOP 2.
Test e) Initial Q 16.
Overall score for page 001: 5.

Currier's claim is that there are clear statistical differences between Language A and Language B, and that they correspond to scribes A and B.
If you score the pages and run a program, A or B becomes clear, but were there really two scribes? If that were proven, though, it would amount to a proof of the gibberish theory.
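
A sketch of the scoring rules as stated (rule f skipped, as in the post). Gillogly's actual script evidently differs in some detail, since these literal thresholds give f1r a score of 4 rather than the 5 he prints:

def currier_a_score(words):
    # Gillogly's ad hoc Language-A score, rules a)-e) as stated above;
    # `words` is one page as a list of Currier-transcription words
    n = len(words)
    score = 0
    score += sum(w.endswith("89") for w in words) < 0.10 * n        # a)
    chol_chor = sum(w in ("SOE", "SOR") for w in words)
    score += chol_chor > 0.05 * n                                   # b)
    score += any(a == b and a in ("SOE", "SOR")                     # b) repeat
                 for a, b in zip(words, words[1:]))
    score += sum(w in ("SAN", "SAM") for w in words) > 0.02 * n     # c)
    score += sum(w.startswith("SOP") for w in words) > 0.02 * n     # d)
    score += sum(w.startswith("Q") for w in words) > 0.02 * n       # e)
    return score  # a high score suggests Language A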

July 16, 2010

D'Imperio's book, section 4.4.1

Against the hypothesis that the Voynich is a fabrication, the point always raised is that the Voynich is not random text but text that follows rules.


4.4.1 Textual phenomena that must be explained by any theory

(1) The basic alphabet of frequently occurring symbols is small (according to one researcher, a minimum of 15 and probably no more than 25).
(2) Basic forms combine to build more complex symbols.
(3) Symbols are combined into "words" separated by spaces (though some researchers question the reality of these spaces).
(4) Some "words" are surprisingly restricted in their occurrence.
(5) "Words" are short, averaging four or five letters; words of more than 7-8 letters, and one-letter words, are very rare, and two-letter words are also few. (It should be noted that for English text the average word is also about five letters, but ordinary English contains very many words of one or two letters as well as of 10-15 letters; that is, it shows a completely different pattern from the Voynich text.)
(6) The same "word" is often repeated two, three, or more times in immediate succession.
(7) "Words" differing from one another by only one or two letters often occur in succession.
(8) Certain symbols characteristically appear at the beginning, middle, or end of words, and line up in certain preferred orders.
(9) Certain symbols occur very rarely and only on particular pages, which suggests they have some special function or meaning.
(10) Doubled letters (the same letter repeated twice in succession) are very rare. Those that do double are mainly "e" and "i", and sometimes "y", "d", or "o".
(11) Symbols standing alone in the text (one-letter "words") are very rare; they are mainly "s" and "y".
(12) Prefix-like elements are attached to the beginning of some "words" which also occur without them; such elements include "qo", "o", and "y".
(13) The symbol "q" is always followed by "o"; they are joined by extending the crossbar of "q", and the resulting composite symbol appears mostly in word-initial position.
(14) On most herbal-section pages, the first line of the first paragraph begins with one of a few symbols, mainly "t", "k", "p", or "f". These are usually followed by "ch", "Sh", "o", "y", "aiin", or "dy". Although many early herbals arranged their plants in alphabetical order of name, no trace of the expected alphabetical ordering is seen here.
(15) Words are written as labels next to stars, "pharmaceutical jars", plant drawings, and other pictures. These hardly ever begin with the four looped (gallows) symbols; instead they often begin with "o", "d", or "y", and occasionally with "s" or "ch".