December 26, 2010

medieval French would be a good candidate

2001/4/26, posted by Dennis Stallings

Jorge once said, "Many of us believe that Voynichese is a monosyllabic language in a complex script". To that I add that it may be such a representation of a common European language broken into syllables, i.e. the Voynichese "words" are actually syllables of a common European language.

I think medieval French would be a good candidate for this. Consider:

1) In spoken medieval and modern French, words are not stressed individually; each syllable receives roughly the same stress.

2) French poetry is not the weak-STRONG, weak-STRONG, etc. iambic pentameter of English, nor the LONG-short-short, LONG-short-short, etc. dactylic hexameter of ancient Greek (and of ancient Latin under Greek influence); no, a French verse is a fixed number of *syllables*! The alexandrine, the rough equivalent of the heroic couplet in English, is a rhymed couplet of two twelve-syllable lines.

3) Louis XIV's Royal Cipher was never broken in his lifetime, and records of it were lost afterward. When the late-19th-century cryptanalyst Étienne Bazeries finally broke the cipher, which was expressed in groups of three numerals, he found that French *syllables* had been enciphered, not single letters.

4) I believe that by the time of the VMs' origin (ca. 1480), French had become the language of communication of Europe's upper classes. In 1298, Marco Polo dictated the story of his travels - in French.

Of course, it could be a dialect of Italian. After all, the Renaissance was under way there at the time. René noted that Vat. 1291, showing the nymphs Voynich-style, was in northern Italy at about that time. Toresella suggests Venice because of the prevalence of the "alchemical herbals" around there. Certainly Venice was a crossroads of many cultures then.

Jorge has done admirable work on the structure of all Voynich words. But let's not forget that Tiltman came up with a paradigm that explains 55-60% of Voynich "words". Even better, Robert Firth came up with a paradigm that explains 75-80% of Voynich "words":

http://www.research.att.com/~reeds/voynich/firth/24.txt

So: choose three French texts of ca. 1480 (Rabelais and Montaigne are later, ~1550, so perhaps Marco Polo?), manually break them down into syllables, and get counts of the syllables. Then compare the top 280 syllables to the 280 Voynichese "words" that fit the Firth paradigm.

Yes, Voynichese may be homophonic, offering several alternatives for a given syllable; thus the top 280 Voynichese words may represent only the top 100 French syllables. Yes, 8000 other Voynichese words represent the remaining 20%. But, instead of an empty volume, what I suggest might give us a piece of Swiss cheese whose holes we could fill later.
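
A minimal sketch of that experiment (the syllabifier is a crude vowel-group heuristic of my own, not a proper French hyphenator, and the corpus file name is hypothetical):

    import re
    from collections import Counter

    V = "aeiouyàâéèêëîïôùûü"                      # rough French vowel set
    SYL = re.compile(rf"[^{V}]*[{V}]+[^{V}]*$|[^{V}]*[{V}]+")

    def rough_syllables(word):
        # Each "syllable" = leading consonants + vowel group; trailing
        # consonants attach to the last syllable.  Crude, but enough to
        # get approximate syllable counts.
        return SYL.findall(word.lower())

    def top_syllables(text, n=280):
        words = re.findall(rf"[{V}a-zç]+", text.lower())
        return Counter(s for w in words for s in rough_syllables(w)).most_common(n)

    # Hypothetical usage:
    #   french = [s for s, _ in top_syllables(open("marco_polo.txt").read())]
    #   voynich = [w for w, _ in Counter(voynich_words).most_common(280)]
    #   ...then compare the two rank lists.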
posted by ぶらたん at 17:43 | Comment(0) | Written language

Repetitions of the same plant

2001/3/5, posted by Gabriel Landini

I did; most of the repetitions involve the recipes section:
f1v # Same plant as f102r1[3,2] ? (Stolfi)
f18v # Same plant as f102r2[3,1] ? (FSG, Stolfi)
f19r # Same plant as f102v1[2,2] ? (Stolfi)
f23r # Same plant as f102r2[3,1] (FSG, GL)
f32v # Same plant as f102r2[1,2] ? (FSG, Stolfi)
f37v # Same plant as f102r1[3,1] ? (Stolfi)
f39r # Same plant as f95r2 ? (Stolfi)
f47v # Same plant as f102r2[1,1] ? (Petersen)
f48r # Same plant as f89v2[3,4] ? (Stolfi)
f47v # Same plant as f89v[1,4] ? (Stolfi)
f90v1 # Same plant as f100r[1,3] ? (Stolfi)
f96v # Same plant as f99r[4,1] ? (GL)
posted by ぶらたん at 16:13 | Comment(0) | Plants

Back to basics

2001/2/22, posted by Adam McLean

It seems to me that many of the skilled cryptographers on this group have puzzled and worked over the Voynich now for many years and yet seem no nearer to cracking the code.

It also seems unlikely to me that someone in the 16th century could devise a code able to defeat 21st-century methods.

But how else can we proceed?

I know I must sound like an old bore, always coming back to the same theme, but it seems to me that we have not yet exhausted an approach based on seeing the context of the manuscript and relating it to other similar material. There may not be a Rosetta stone for the Voynich, but there may be some manuscripts out there that could help us see its context. Recently Dana Scott seems to have spent many hours surfing the net looking for images and parallels in manuscripts. A valiant effort; however, I suspect only 0.01% or less of medieval manuscript material has been scanned and placed on web sites. We really need some primary research done in libraries and special collections of such material, or to tap the knowledge of someone who has studied such material in depth.
posted by ぶらたん at 14:33 | Comment(0) | Other

December 25, 2010

Antoine Casanova's research

2000/10/16, posted by Adam McLean

I have just reread Antoine Casanova's posting of 6th March 2000, based on his thesis, which reveals a structure within the individual 'tokens' of the Voynich language. He presents this structure as a series of rules, and from these he concludes that the language of the Voynich is not a natural language but has the characteristic signature of an artificial language.

2000/10/17, posted by Jorge Stolfi

In my view, the most significant feature of Antoine's substitution patterns is that the first letter of a Voynichese word seems to have more "inflectional freedom", while the final letters are relatively invariant. These patterns are precisely opposite to what we would expect in Indo-European languages (at least Romance and Germanic), where grammatical inflection usually modifies letters near the end of the word.
Presumably this is what Antoine has in mind when he says that Voynichese words are "built from synthetic rules which exclude ... natural language". Anyway, I think that this conclusion is unwarranted. After all, there are non-IE natural languages, which I do not dare to mention by name 8-), that do seem to have `substitution patterns' similar to those of Voynichese.
Thus I don't accept Antoine's conclusion that Voynichese must be an artificial language, or at best a code based on "progressive modification [similar to] the discs of Alberti". It cannot be just some IE language with a funny alphabet, sure; but we already knew that.
I find it interesting also that his analysis yields a very anomalous pattern for n = 8, namely P_8 = ( 6 8 1 2 3 4 7 5 ). While that pattern may be just a noise artifact, it may also be telling us that the rare 8-letter words are mostly the result of joining a 2-letter word to a 6-letter one.
I am not sure what to make of Antoine's rules for generating P_n from P_{n+1}. For one thing, they seem a bit too complicated given the limited amount of data that they have to explain. Moreover, the counts s_2, ..., s_{n-2} seem to be fairly similar, and the differences seem to be mostly statistical noise; therefore their relative ranks do not seem to be very significant. Indeed, applying Antoine's method to Currier's transcription we get P_6 = ( 1 4 2 6 5 3 ), whereas from Friedman's we get P_6 = ( 1 5 2 4 6 3 ). Moreover, the latter would change to P_6 = ( 1 5 3 4 6 2 ) if we omitted just two words from the input text.
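
(Antoine's exact counting procedure is not spelled out in this thread, so the sketch below is only one plausible reconstruction: s_i counts the pairs of attested n-letter words that differ at exactly position i, and P_n ranks the positions by those counts. Names and details are mine.)

    from collections import Counter
    from itertools import combinations

    def substitution_ranks(words, n):
        # s[i] = number of pairs of attested n-letter words differing at
        # exactly position i (1-based); P_n = positions ranked by s[i].
        ws = sorted(set(w for w in words if len(w) == n))
        s = Counter()
        for a, b in combinations(ws, 2):
            diffs = [i for i in range(n) if a[i] != b[i]]
            if len(diffs) == 1:
                s[diffs[0] + 1] += 1
        return [pos for pos, _ in s.most_common()]

    # e.g. P_6 for a transcription: substitution_ranks(voynich_words, 6)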
But the main limitation I see in Antoine's method is that he considers the absolute position of each letter in the word to be a significant parameter for statistical analysis. That is, he implicitly assumes that an n-letter word contains exactly n "inflectional" slots, each of them containing exactly one letter. This view seems too simplistic when one considers the patterns of inflection of natural languages, where each morphological "slot" can usually be filled by strings of different lengths, including zero. To uncover the inflection rules of English, for example, one would have to compare words of different lengths, because the key substitution patterns are

dog / dogs / dog's / dogs'
dance / dances / danced / dancing / dancer / dancers / ...
strong / stronger / strongest / strongly

and so on.

Another problem with Antoine's method is that the most important structural features of words in natural languages are usually based on *relative* letter positions, and may not be visible at all in an analysis based on absolute positions. For example, Spanish has a particularly strong alternation of vowels and consonants, so that if words were aligned by syllables one would surely find that the "even" letter slots have very different substitution properties from the "odd" slots. But since Spanish words may begin with either vowel or consonant, and may contain occasional VV and CC clusters, the 3rd and 4th letters in a 6-letter word should be about as likely to be VC as CV; and therefore the two slots will probably have very similar substitution statistics.

Indeed, aligning words letter by letter is a bit like classifying fractional numeric data like 3.15 and -0027 into classes by the number of characters, and then analyzing the statistics of the ith character within each class, without regard for leading zeros, omitted signs, or the position of the decimal point. While some statistical features of the data may still have some visible manifestation after such mangling, we cannot expect to get reliable and understandable results unless we learn to align the data by the decimal point before doing the analysis.
posted by ぶらたん at 21:07 | Comment(0) | Other

Meaningful vs. meaningless

2000/1/23, posted by Rene Zandbergen

In order to have real, strong evidence that the VMs contains meaningful text, we need to know how one can create a 'meaningless' text that still exhibits the same properties as meaningful text. More to the point: we need to find a mechanism that could have been applied 400-500 years ago.

Jacques already pointed out that we don't actually know how to define meaningful and meaningless. This may well prove to be a serious problem. When trying to generate meaningless texts which the LSC would classify as meaningful, or vice versa, we're likely to end up in the no-man's land between the two. Take a meaningful text and start removing words (every 10th, every 2nd, at random...). When does the text stop being meaningful? How does the LSC curve behave?

2000/1/24, posted by Jorge Stolfi

Consider that an ideal text compression algorithm should take "typical" texts and turn them into random-looking strings of bits. Of course this transformation preserves meaning (as long as one has the decompression algorithm!); but, for maximum compression, the program should equalize the bit probabilities and remove any correlations. Modern compressors like PKZIP go a long way in that direction. The compressed text, being shorter than the original, will actually have more meaning per unit length; but it will look like perfect gibberish to LSC-like tests.

Or, consider a meaningful plaintext XORed with the binary expansion of pi. The result will have uniform bit probabilities, and no visible correlations; but it will still carry the original meaning, which can be easily recovered. It would take a very sophisticated algorithm (one that knows that pi is a "special" number) to notice that the text is not an entirely random string of bits.
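
A toy illustration of the pi-XOR point (the hexadecimal digits of pi used as keystream are genuine; everything else is my own sketch):

    # Keystream: the start of pi's fractional part in hexadecimal
    # (243F6A88 85A308D3 ...).  A stand-in for the full binary
    # expansion, so this toy keystream repeats after 24 bytes.
    PI_HEX = "243F6A8885A308D313198A2E03707344A4093822299F31D0"
    KEY = bytes.fromhex(PI_HEX)

    def xor_pi(data: bytes) -> bytes:
        # XOR with a fixed keystream is its own inverse: the same call
        # both enciphers and deciphers.
        return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(data))

    ciphertext = xor_pi(b"a meaningful plaintext")
    assert xor_pi(ciphertext) == b"a meaningful plaintext"
    # The ciphertext looks statistically flat, yet carries the meaning.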

So the LSC and its possible variants are not tests of `meaning' but rather of `naturalness'. They work because natural language uses its medium inefficiently, but in a peculiar way: it uses symbols with unequal frequencies (a feature that mechanical monkeys can imitate), but changes those frequencies over long distances (something which simple monkeys won't do).

However, with slightly smarter monkeys one *can* generate meaningless texts that fool the LSC; and the same applies to any "meaning detector" that looks only at the message. Conversely, one can always encode a meaningful text so as to make it look "random" to the LSC. In short, a naturally produced (and natural-looking) text can be quite meaningless, while a meaningful text may be (and look) quite unnatural.
posted by ぶらたん at 10:54 | Comment(0) | Other

December 24, 2010

LSC (Letter Serial Correlation)

2000/1/23, posted by Mark Perakh

The LSC test revealed in the VMS features identical to those of the meaningful texts we explored. On the other hand, if we assume that each Voynichese symbol is a letter, then the letter frequency distribution in the VMS is much more non-uniform than in any of the 12 languages we tested. Furthermore, in one of my papers you can see the LSC results obtained for gibberish which I created by hitting (supposedly at random) the keys on a keyboard. It has some features of a meaningful text, but also some subtle differences from meaningful texts. You probably noticed that my conclusion was that, if we rely on LSC data, the VMS can be either meaningful or the result of a very sophisticated effort to imitate a meaningful text, in which even the relative frequencies of vowels and consonants have been skilfully faked. I can hardly imagine such an extraordinarily talented and diligent forger, so I am inclined to guess that the VMS is a meaningful text, but some doubts remain. Moreover, if the VMS symbols are not individual letters, all the LSC results hang in the air.

2000/1/15, posted by Gabriel Landini

I think that the LSC depends heavily on the construction of words, but I also think that word construction (because of Zipf's law) depends heavily on a sub-set of the word pool.

Long-range correlations in codes were discussed for DNA a couple of years ago in very prestigious journals like Nature and Science, but to date I do not think anybody has produced a convincing theory or explanation of the meaning and validity of the results.

Really, what is the relation (in any terms) between a piece of text and another piece many characters away? What is the large-scale structure of a text? Long-range correlation would mean that there are events at small scales and also at larger scales. I can imagine that up to the sentence level or so there may be patterns or correlations (what we call grammar?), but beyond that I am not sure. Think of a dictionary: there may not be any structure beyond one sentence or definition (still, Roget's Thesaurus conforms to Zipf's law for the more frequent words). Consequently I see no reason why there should be any large-scale structure in texts. (I may be very wrong.)

2000/1/16, posted by Mark Perakh

My comments related only to the question of whether or not we can expect the LSC to distinguish between meaningful and monkey texts. I believe the behavior of monkey texts from the standpoint of the LSC is expected to be quite similar to that of permuted texts; therefore the LSC is expected to work for monkeys as well as for permutations. I do not think the LSC will distinguish between permuted and monkey texts. This is based, of course, on the assumption that the texts are long enough that the actual frequencies of letter occurrences are quite close to their probabilities.

2000/1/17, posted by Rene Zandbergen

I agree with Gabriel that using a 3rd order word monkey would be even more interesting in terms of checking the capabilities of the LSC method in detecting meaningful text. On the other hand, getting meaningful word entropy statistics is even more difficult than getting 3rd order character entropy values, so the text from a 3rd order word monkey will repeat the source text from which the statistics have been drawn much more closely than should be the case. As before, a 1st order word monkey will be equivalent to a random permutation of words, and if it is true (in a statistically significant manner) that the LSC test distinguishes between one and the other, we do have another useful piece of evidence w.r.t. the Voynich MS text.
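
For concreteness, a minimal sketch of the kind of word monkey under discussion; the construction and names are mine, not code from the list:

    import random
    from collections import Counter, defaultdict

    def word_monkey(text, k=3, n_out=100):
        # kth-order word monkey: each word is drawn with the source's
        # probabilities conditioned on the previous k-1 words.  k=1 is
        # plain unigram sampling, statistically the same as a random
        # permutation of the source's words (as noted above).
        words = text.split()
        m = k - 1                                  # context length
        table = defaultdict(Counter)
        for i in range(m, len(words)):
            table[tuple(words[i - m:i])][words[i]] += 1
        out = list(random.choice(list(table)))     # seed with a real context
        while len(out) < n_out:
            ctx = tuple(out[-m:]) if m else ()
            nxt = table.get(ctx)
            if nxt is None:                        # dead end: restart context
                nxt = table[random.choice(list(table))]
            out.append(random.choices(*zip(*nxt.items()))[0])
        return " ".join(out)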

2000/1/20, posted by Mark Perakh

I believe we have to distinguish between four situations, to wit:

1) Texts generated by permutations of the above elements (as was the case in our study). Here there is a limited stock of the elements, hence a negative correlation between the elements' distributions in chunks, and therefore it is a case without replacement (hypergeometric distribution). Our formula for Se was derived for that situation.

2) Monkey texts generated by using the probabilities of elements (letters, digraphs, etc.) and also assuming that the stock of those elements is the same as that available for the original meaningful text. In this case we again have negative correlation and a no-replacement case (hypergeometric), so our formula is to be used without modification.

3) Texts generated as in item 2) but assuming the stock of letters is very much larger (say 100,000 times larger) than that available in the original text, while preserving the ratios of element occurrences found in the original text. This is a case with replacement (approximately, but with increasing accuracy as the size of the stock increases). In this case our formula has to be modified (as indicated in paper 1) using the multinomial variance. Quantitatively the difference is only in the L/(L-1) coefficient, which at L >> 1 is negligible.

4) Texts generated assuming the stock of elements is infinitely large. In this case the distribution of elements is uniform, i.e. the probabilities of all elements become equal to each other (each equal to 1/z, where z is the number of all possible elements, letters or digraphs etc., in the original text). In this case the formula for Se simplifies (I derived it in paper 1 for that case, as an approximation to roughly estimate Se for n > 1). Quantitatively, cases 1 through 3 are very close, but case 4 produces quantities measurably (though not very much) different from cases 1 through 3 (see examples in paper 1).

2000/1/21, posted by Jorge Stolfi

Why should the LSC work?

In a very broad sense, the LSC and the nth-order character/word entropies are trying to measure the same thing, namely the correlation between letters that are a fixed distance apart.

People have observed before that correlation between samples n steps apart tends to be higher for "meaningful" signals than for "random" ones, even for large n. The phenomenon has been observed in music, images, DNA sequences, etc. This knowledge has been useful for, among other things, designing good compression and approximation methods for such signals. Some of the buzzwords one meets in that context are "fractal", "1/f noise", "wavelet", "multiscale energy", etc. (I believe that Gabriel has written papers on fractals in the context of medical imaging. And a student of mine just finished her thesis on reassembling pottery fragments by matching their outlines, which turn out to be "fractal" too.)

As I try to show below, one can understand the LSC as decomposing the text into various frequency bands, and measuring the `power' contained in each band. If we do that to a random signal, we will find that each component frequency has roughly constant expected power; i.e. the power spectrum is flat, like that of ideal white light (hence the nickname `white noise'). On the other hand, a `meaningful' signal (like music or speech) will be `lumpier' than a random one, at all scales; so its power spectrum will show an excess of power at lower frequencies. It is claimed that, in such signals, the power tends to be inversely proportional to the frequency; hence the moniker `1/f noise'. If we lump the spectrum components into frequency bands, we will find that the total power contained in the band of frequencies between f and 2f will be proportional to f for a random signal, but roughly constant for a `meaningful' signal whose spectrum indeed follows the 1/f profile.

Is the LSC better than nth-order entropy?

In theory, the nth-order entropies are more powerful indicators of structure. Roughly speaking, *any* regular structure in the text will show up in some nth-order entropy; whereas I suspect that one can construct signals that have strong structure (hence low entropy) but the same LSC as a purely random text.

However, the formula for nth-order entropy requires one to estimate z**n probabilities, where z is the size of the alphabet. To do that reliably, one needs a corpus whose length is many times z**n. So the entropies are not very meaningful for n beyond 3 or so.
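
In code, the block-entropy estimate in question is simply (a minimal sketch; names mine):

    import math
    from collections import Counter

    def block_entropy(text, n):
        # Empirical nth-order (block) entropy in bits per n-gram.  With
        # alphabet size z there are z**n possible blocks, so the estimate
        # is only trustworthy when the corpus is many times z**n long;
        # hence the "n beyond 3 or so" caveat above.
        blocks = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total = sum(blocks.values())
        return -sum(c / total * math.log2(c / total) for c in blocks.values())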

The nth-order LSC seems to be numerically more stable, because it maps blocks of n consecutive letters into a single `super-letter' which is actually a vector of z integers; and compares these super-letters as vectors (with difference-squared metric) rather than symbols (with simple 0-1 metric). I haven't done the math --- perhaps you have --- but it seems that computing the nth-order LSC to a fixed accuracy requires a corpus whose length L is proportional to z*n (or perhaps z*n**2?) instead of z**n.

Moreover, one kind of structure that the LSC *can* detect is any medium- and long-range variation in word usage frequency along the text. (In fact, the LSC seems to have been designed specifically for that purpose.) As observed above, such variations are present in most natural languages, but absent in random texts, even those generated by kth-order monkeys. Specifically, if we take the output of a kth-order `letter monkey' and break it into chunks whose length n >> k, we will find that the number of times a given letter occurs in each chunk is fairly constant (except for sampling error) among all chunks. For kth-order `word monkeys' we should have the same result as long as n >> k*w, where w is the average word length. On the other hand, a natural-language text will show variations in letter frequencies, due to changes of topic and hence of vocabulary, that extend over whole paragraphs or chapters.

Thus, although the LSC may not be powerful enough to detect the underlying structure in non-trivial ciphers, it seems well suited to distinguishing natural language from monkey-style random text.

In conclusion, my understanding of the Perakh-McKay papers is that computing the LSC is an indirect way of computing the power spectrum of the text. The reason why the LSC distinguishes meaningful texts from monkey gibberish is that the former have variations in letter frequencies at all scales, and hence a 1/f-like power spectrum; whereas the latter have uniform letter frequencies, at least over scales of a dozen letters, and therefore have a flat power spectrum.

Looking at the LSC in the context of multiscale analysis suggests many possible improvements, such as using scales in geometric progression, and kernels which are smoother, orthogonal, and unitary. Even if these changes do not make the LSC more sensitive, they should make the results easier to evaluate.

In retrospect, it is not surprising that the LSC can distinguish the original Genesis from a line-permuted version: the spectra should be fairly similar at high frequencies (with periods shorter than one line), but at low frequencies the second text should have an essentially flat spectrum, like that of a random signal. The same can be said about monkey-generated texts.

On the other hand, I don't expect the LSC to be more effective than simple letter/digraph frequency analysis when it comes to identifying the language of a text. The most significant influence on the LSC is the letter frequency histogram --- which is sensitive to topic (e.g. "-ed" is common when talking about the past) and to spelling rules (e.g. whether one writes "ue" or "ü"). The shape of the LSC (or Fourier) spectrum at high frequencies (small n) must be determined mainly by these factors. The shape of the spectrum at lower frequencies (higher n) should be determined chiefly by topic and style.
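
For reference, a minimal sketch of the measured LSC sum as I read the Perakh-McKay definition (the chunking and squared-difference form are from their papers; the code and names are mine):

    from collections import Counter

    def lsc_sum(text, n):
        # Measured LSC sum for chunk size n: cut the text into consecutive
        # n-letter chunks and add up, over every adjacent pair of chunks
        # and every letter, the squared difference of that letter's counts.
        chunks = [Counter(text[i:i + n])
                  for i in range(0, len(text) - n + 1, n)]
        letters = set(text)
        return sum((a[c] - b[c]) ** 2
                   for a, b in zip(chunks, chunks[1:]) for c in letters)

    # In natural text letter frequencies drift slowly, so adjacent chunks
    # resemble each other and the measured sum falls below the expectation
    # for independent (permuted or monkey) chunks.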

2000/1/22, posted by Jorge Stolfi

For one thing, while the LSC can unmask ordinary monkeys, it too can be fooled with relative ease, once one realizes how it works. One needs only to build a `multiscale monkey' that varies the frequencies of the letters along the text, in a fractal-like manner.
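
To make that concrete, here is a minimal sketch of such a monkey, under my own simplifying assumption that a single slow random walk on the letter weights is enough to produce long-range frequency drift (a true 1/f process would superpose walks at several scales):

    import random

    def multiscale_monkey(alphabet, length, drift=0.02):
        # Letter weights follow a slow multiplicative random walk, so the
        # output's letter frequencies vary over long ranges, much as topic
        # changes make a natural text's frequencies vary.
        w = {c: 1.0 for c in alphabet}
        out = []
        for _ in range(length):
            for c in w:                            # nudge every weight
                w[c] = max(1e-6, w[c] * (1 + random.uniform(-drift, drift)))
            out.append(random.choices(list(w), list(w.values()))[0])
        return "".join(out)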

Of course, it is hard to imagine a medieval forger being aware of fractal processes. However, he could have used such a process without knowing it. For instance, he may have copied an Arabic book, using some fancy mapping of Arabic letters to the Voynichese alphabet. The mapping would not have to be invertible, or consistently applied: as long as the forger maintained some connection between the original text and the transcript, the long-range frequency variations of the former would show up in the latter as well.

Moreover, I suspect that any nonsense text that is generated `by hand' (i.e. without the help of dice or other mechanical devices) will show long-range variations in letter frequencies at least as strong as those seen in meaningful texts.

Thus Mark's results do not immediately rule out random but non-mechanical babble or glossolalia. However, it is conceivable that such texts will show *too much* long-range variation, instead of too little. We really need some samples...
posted by ぶらたん at 20:26 | Comment(0) | Nature of the text

December 23, 2010

Exploring the possibility of decipherment from repeated words (3)

1999/1/18, posted by Jorge Stolfi

> [Takeshi:] Isn't it problematic to assume that plants and humans
> have common properties?
>
> e.g.
> zod f70v2.S2.13 ACKV =otaldy=
> pha f101v2.R1.2 AHV =otaldy=
>
> By the way, I thought one person represented one day in the zodiac
> calendars. But that is not true, right? (I mean, there are 30
> women in each zodiac calendar, but some women have the same
> label.)

Well, have you seen my "Chinese theory" page? Perhaps the correct
reading of <otaldy> (as the VMS author intended it) is <chang>, but
one of those <otaldy>s is <chàng> and the other is <cháng>...

> What do their labels mean in the zodiac calendars? What kind of
> property do you think they have? Their name? Their birthday?
> Where they live? Do they share the same kind of star? Who and who
> are relatives by blood and marriage? etc.

I have no satisfactory theory for what the "zodiac" diagrams and the
nymphs are supposed to be. If they indeed represent the zodiac signs,
why do they all have 30 "stars"? Why are Aries and Taurus split in
two?

Even the zodiac symbols at the center are a bit suspect; it is
possible (although, I admit, unlikely) that the central circles were
originally empty, and the signs were added later, by someone who just
guessed they were related to the zodiac. Or perhaps the guess was made
by the VMS author himself, as he copied the diagrams from some other
book.

If the nymphs are real or imaginary individuals (not just decoration),
then the labels are likely to be their names; in which case it is not
that strange to see repetitions.

> occurrence count
> ---------------------
> okaly H 4 (A 5)
> okoly 2
> otal dar 2
> okam 2
> okaldy 2
> okeoly 2
> okalar 2
> oteolar 2
> okeey ary 2
> otaly 2
> okal 2
> otaraldy 2

I hadn't noticed that there were so many repetitions in the Zodiac.
Very strange! Why is no label repeated three times? Is there any
pattern to these repetitions (such as the position of the labels in
the diagram, etc.)?

> Is it possible to think that <okal> or <otal> itself has a
> meaning, with +<y> or +<dy> added?

I wish I knew the answer....

If the language is Chinese, this is somewhat unlikely (although the
<y> or <dy> could be tone marks, and I believe that in Chinese
there are some rules that say that tone X changes to tone Y when
it comes before a word with tone Z.)

On the other hand, if the language is Chinese then those resemblances
are not surprising, and they do not mean anything: "ching" and "chi"
are not related...
posted by ぶらたん at 23:28 | Comment(0) | Other

Exploring the possibility of decipherment from repeated words (2)

1999/1/15, posted by Jorge Stolfi

The repetitions reflect properties of the plants and the like, not names.

Besides <otoldy> we can look at the similar words <otaldy>, <opaldy>,
<ytaldy>, <ytoldy>, etc., which could be alternative spellings of the
same word.

I count 9 occurrences of those words as labels, and 17 as words in the text.

Here are those occurrences, extracted from the concordance I posted
recently. I have split them into labels and text, then sorted by
section and page. (I have kept only the "majority" version ("A")
of each occurrence. There were only a few dissenting votes, usually
by the FSG/SSG transcriptions.)

Labels:

sec location trans occurrence
--- ------------ ------ ----------------------------------------------------
cos f67r1.S.1 ACHV =otaldy=
zod f70v2.S2.13 ACKV =otaldy=
bio f82v.L3.14 AHV =otoldy=
pha f88r.m.1 AHLV =otaldy=
pha f89r1.t.4 AHKLV =otoldy=
pha f89r2.L2.0 AHUV =otoldy=
pha f99r.L1.12 AHU =otoldy=
pha f99v.L1.1 AHUV =otoldy=
pha f101v2.R1.2 AHV =otaldy=

Text:

sec location trans occurrence
--- ------------ ------ ----------------------------------------------------
bio f78v.P.26 ACFHV shedy sol fchedy otaldy/lol *ar shr r ol
bio f79r.P.28 AFHV dai*n yteey chyteey otoldy lchey/lcheey qochey
bio f79v.P.1 AFHV olk*ry qotolol otaldy otedol or olorol/
hea f22r.P.11 ACFH yckhody qokchy oky otoldy yty dol or-dachy
hea f28r.P.7 AFH shockhy shocthy otoldy-dshor dol dar/oschotshl
hea f2r.P.6 ACFH daind-dkol sor-ytoldy-dchol dchy cthy/
hea f44v.P.1 ACFH shol tol qotshol otoldy/yolkol cheol qokchain
hea f52r.P.2 AH dar yty/oty shor ytoldy qoky koldal oteees
hea f52r.P.3 ACFH tchody qotam oky-ytoldy/lshopchy qoky qotchy
hea f53v.P.13 ACH -*dam/ycthodaiin otoldy=
hea f9v.P.9 ACFHU tor chyty dary-ytoldy/oty kchol chol
heb f43v.P.1 ACFGU r araiin otedy opoldy/shedy octhy otedy
heb f48r.P.1 AFH ykeeody olaiin opaldy/daiin yteeol choody
heb f48v.P.9 ACFGH loldy lol-otchdy otoldy ytam otedy/tol
heb f95v1.P.4 AFH qokal oty shekshey otaldy okshey ytshedy
pha f89r2.P3.8 AFHLU che* oldy sheodal ytoldy/daiin cheok o keol
str f58r.P.30 AFH chody cheol okolchy otaldy/odshchol taiin

Note that almost all occurrences of <otoldy> and friends *as labels*
are in the pharma section, and almost all occurrences *as text* are in
the herbal section. Of these, 8 are in herbal-A (all <otoldy> or
<ytoldy>) and 4 in herbal-B (one each of <otoldy>, <opoldy>, <otaldy>,
and <opaldy>; so these may be bogus).

Note also that <ytoldy> tends to occur right after gaps in the text
due to intruding plants (marked by "-" above). I take this fact as
evidence that <y> is (always? often? sometimes?) a calligraphic
variant of <o>, used at end-of-word and sometimes at
beginning-of-line.

There is one occurrence of <ytoldy> in the *text* of pharma page f89r2.
Coincidentally that is the only page with *two* occurrences of
<otoldy> as a label.

Moreover, there is some evidence that <k> and <t>, while distinct, were
interchangeable to some extent. Indeed the distribution of <okoldy>
and its variants is somewhat similar to that of <otoldy>:

Labels:

sec location trans occurrence
--- ------------ ------ ----------------------------------------------------
cos f68r1.S.14 AHUV =okoldy=
zod f72r2.S2.5 AHV =okaldy=
zod f72v3.S1.18 AHUV =okaldy=
bio f82r.L2.5 AUV =okaldy=
bio f82v.L3.14 U =okoldy=
pha f88r.b.3 AHKV =ofaldo=
pha f89r2.L2.0 L =okoldy=

Text:

sec location trans occurrence
--- ------------ ------ ----------------------------------------------------
hea f18r.P.8 AFH qokchor ckhol olody okaldy-dary/chol chcthal
hea f36r.P.7 ACH -dan/qotol cthol okol dy okchy-ytorory-sold/
hea f3v.P.4 ACH **s eey kcheol okal do r chear een/y**ear
hea f54v.P.9 ACH qockhey qodal ytam okal dy/kol c*kaiin chckhy
heb f33r.P.2 ACFH ytchedy qokar cheky okaldy qokaldy otor oldar
heb f40r.P.4 ACFH okaiin okar oky okoldy ol/lokar qokar
heb f43r.P.2 ACFGHU chety dar aiir okaldy daral otchdy daiin
heb f43v.P.2 ACFHU ches***y okeody oky okaldy kchdy okar/tody
cos f57v.R1.1 AHU daram qokar okal okal d o l shkeal dydchs
bio f75r.P.27 ACFHV qoty pshar shedy okaldy-dar otar otedy
bio f75r.P.44 ACFHV okedy qokedy otedy okoldy otar otam olaiin
bio f82v.P.12 ACFHV qokchey qokain okal dy lchedam/orain shedy
pha f88r.P3.11 AHL lkeey cthol poldy s-okoldy/qokol chol qokol
pha f89v1.P1.12 AHL kaiin ykchol qockhy okalda otal dal chodar
pha f99r.P2.7 AFH cheody qokol okoly okoldy qokoly qokal okchol
pha f99v.P1.2 AFH qokchol qokeol okoldy-q*kholdy t*ly daiin/
str f105v.P.3 AFHT okair qotol dol okoldy qokedy opched oteedy
str f113v.P.48 AGH qokeeedy lkaiin okal dy/yshey teeo oteedy
str f58v.P1.10 AHU qokal* qokaiin okal okaldy ory/tchol shol
str f58v.P2.27 AHU okey okal o*aly okaldy okeor sheey=

Note that there are three <okoldy>s in the *text* of the pharma pages,
all of them on pages where <otoldy> occurs as a label (f88r, f99r,
f99v) --- one more bit of evidence for a close relationship between
<k> and <t>. (Grammatical inflection, perhaps?)

> Is it evidence that the manuscript itself is meaningless?

On the contrary, I think that the highly skewed distributions of
<otoldy> and <okoldy> confirm (once more) that the VMS is *not*
random text.

> I don't think these four plants are the same. Why do different
> plants have same name? ... After someone succeeds in identifying
> what language or code is in the Voynich MS, can we explain these
> repetitions? I don't think so...

The labels may indicate properties of the plants, not their names. The
properties could be the plant's usage (e.g. "poison", "tonic",
"emetic", "diuretic", "too strong", "doesn't work"), its
smell/flavor/color/size, which parts of the plant are used, how it is
prepared ("infusion", "poultice", etc.), the season for picking, the
place or date of the finding, the country where it grows, the dealer
who sells the plant, the sympathetic star, the name of the daemon
summoned by consuming the plant, etc.

Someone suggested that the labels may be meaningless tags, used
just as we would use (a), (b), (c) or (1), (2), (3), etc.

Or perhaps (ahem!...) those <okoldy>s were distinct but
similar-sounding words in an unfamiliar language, and the author was
unable to hear the difference.

In any case, I am almost convinced that the drawings in the pharma
section are "field notes", where the author recorded plants as he
"found" them; and the herbal and bio pages are later elaborations on
those notes. The pattern of <otoldy> occurrences above seems at least
compatible with this theory.

The main evidence for this theory is the fact that some pharma
drawings are repeated in the herbal pages --- enlarged and done with
more care, sometimes with fancy flowers, but in the same pose and with
the same details (i.e. definitely a copy of the same drawing, not just
an independent drawing of the same plant.)

Note that the pharma plants may have been "found" in a pharmacist's
shop, in a library, in the teachings of a master/guru/shaman/explorer,
in conversations with natives, etc. However, since many pharma plants
are unlabeled, or bear repetitive labels, I think it is slightly more
likely that they were found in the wild by the author, and he did not
know their names (except for a few plants, e.g. the maidenhair fern).

Since we are on the subject: I think that the "containers" in the
pharma pages were added after the whole section was complete, as an
afterthought. Note that they are all squeezed in the margin (except in
one instance where the container lies between two plants).

The containers could represent plant categories, of course; but
perhaps they are (also?) "thumb marks" for quick page finding...

In any case, the plants in the pharma section seem to have been
sorted by some criterion; we can infer this not only from the
presence of the containers, but also from looking at the drawings
themselves. So they probably aren't the primary field sketches, but
clean copies made sometime later at the "office".

However the relative realism of the pharma drawings says to me that
they were made by someone who had seen the plants --- which cannot be
said of the herbal drawings. In fact I would bet that the herbal
drawings were done by assistants or hired illustrators.
posted by ぶらたん at 22:59 | Comment(0) | Other

Exploring the possibility of decipherment from repeated words

1998/11/25, posted by Jorge Stolfi

--- okal (VERY COMMON) -------------------------------------------

A very common word in the VMS. It could mean "Sun", or perhaps "Moon". (Or "water"; the planet Mercury is literally "Water-Star" in Chinese... 8-)

--- opchol dy (RARE) ---------------------------------------------

The words "opchol dy" or "opcholdy" do not seem to occur elsewhere, but there are half a dozen near misses:

"otshol dy" occurs in Herbal-A text (f7v).

"qokchol dy" ditto (f18r).

"okchaldy" ditto (f23v).

"opchaldy" ditto (f45r).

"okchol do" ditto (f52r).

"ofsholdy" is a Zodiac star label (Cancer, f72r3).

"ypcholdy" is a Pharma plant label (f102v1).

"yteeoldy" mentioned in Pharma text (f101r1).

--- ytoaiin (RARE) -----------------------------------------------

My concordance finds no exact recurrences, but does find half a dozen near misses:

"ykoaiin" in the text under the same diagram (f67r2, line 1), and in an early herbal-A page (f3v, line 1).

"qokoaiin" in the text under the same diagram (f67r2, line 3), and part of a label in the Cosmo diagram overleaf (f67v2).

"otoaiin" in text around a "Sun face" on a nearby page (f68r2), and in an early herbal-A page (f1v, line 5).

"okoaiin" in a Pharma text (f89v1, line 10).

"opyaiin" in a herbal-A page (f23r, line 1).

--- dolchsody (VERY RARE) ------------------------------------------------

Occurs just once more (split and without the "s" plume), on page f66r, line 19:

...daiin daiin dal DOL CHEODY dairaly dairal...

--- okain am (VERY RARE?) ------------------------------------------------

Occurs once more (with the "q" prefix), on f111v, line 9:

...okeey qokeey qokey qOKAIN AM- soiin shed qoksheo...

But "am", like most words ending with "m", is almost surely an abbreviation (note its occurrence at end-of-line). The word "okain" alone is extremely common.

--- yfain (VERY RARE? VERY COMMON?) -------------------------------------

The word "yfain" itself does not seem to occur elsewhere. On the other hand some `equivalent' words like "okain", "ykain", "otain", "ytain" etc. are exceedingly common.

--- ofar oeoldan (VERY RARE) --------------------------------------------

The word "ofar" is very common, but "oeoldan" does not seem to occur elsewhere, not even in disguise ("ysoldon", "araldan", etc.).

(BTW, the whole phrase "ofar oeoldan" did not make it into the index because of a bug in my code, in the handling of comma-spaces. One more thing to fix for the next release...)

--- doaro (VERY RARE) ---------------------------------------------------

The concordance shows no other occurrences of this word or its `equivalents' ("dyary", "doary", "daosy", etc.).
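
A minimal sketch of how such `near miss' hunts can be mechanized, assuming a near miss is any attested word within edit distance 1 of the target (letter equivalences like <y> ~ <o> would have to be layered on top):

    def near_misses(target, vocabulary):
        # Attested words one edit (substitution, insertion, or deletion)
        # away from the target word.
        alphabet = set("".join(vocabulary))
        splits = [(target[:i], target[i:]) for i in range(len(target) + 1)]
        subs = {a + c + b[1:] for a, b in splits if b for c in alphabet}
        dels = {a + b[1:] for a, b in splits if b}
        ins = {a + c + b for a, b in splits for c in alphabet}
        return sorted((set(vocabulary) & (subs | dels | ins)) - {target})

    # e.g. near_misses("otoldy", voynich_words) might list otaldy, ytoldy, ...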
posted by ぶらたん at 20:57 | Comment(0) | Other

December 22, 2010

Could the difference between A and B be due to different subject matter?

1998/11/25, posted by Rene Zandbergen

> Another thought. Could the difference between A and B be due to
> different subject matter?

It could be, but then we would have to accept that the text has nothing to do with the illustrations. Herbal-A and Herbal-B are the most different of all 'dialects' in my scatterplots (based on digraphs), while Pharma and Astro/Cosmo are very similar, despite their probable subject matter.

But more importantly, the same scatterplots show that the difference between A and B is of the type of a continuous change. As if the writer's style (spelling, cypher characteristics) gradually changed with time. From these plots, I can think of three possibilities.

1) One author, who started writing in B-style and gradually developed A-style. This could mean that the Herbal-A section is a cleaned-up copy of earlier Herbal-B-type scribbles, but that this task was not completed. It would also mean that the zodiac section was written backwards.

2) One author, who started writing in A-style and gradually 'degraded' into B-style. This would mean that the Herbal-B pages in the first half of the Ms have been misplaced during the binding by an illiterate (like us :-) ). Note that the Herbal-B handwriting is the only part which is visibly different from the rest.

3) Two authors ('A' and 'B'). They started with a common style, 'A' doing pharma and 'B' doing astro/cosmo. 'A' then did herbal-A, and 'B' then did the stars and bio sections. 'B' also did some herbal pages, perhaps when 'A' was no longer able or willing to continue. The nice aspect of this theory is that 'A' did all the plant drawings, and 'B' did all the drawings involving stars and nymphs. The odd part is that the sections on which they started are all on foldout pages, and only later did they move to normal pages in normal-sized quires.

(3) is the more fascinating option, which explains most of the observed features, but (2) is the simpler explanation, which is also worth something.
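
For reference, the kind of digraph scatterplot René describes can be sketched as follows; this is my own minimal reconstruction (per-page digraph frequencies projected to two dimensions by PCA), not his actual code:

    import numpy as np

    def digraph_scatter(pages):
        # pages: {page_id: transcribed text}.  Returns {page_id: (x, y)},
        # a 2-D PCA projection of per-page digraph frequency vectors,
        # the kind of plot in which the A and B 'dialects' separate and
        # the change from one to the other looks gradual.
        ids = list(pages)
        digraphs = sorted({t[i:i + 2] for t in pages.values()
                           for i in range(len(t) - 1)})
        col = {d: j for j, d in enumerate(digraphs)}
        M = np.zeros((len(ids), len(digraphs)))
        for r, pid in enumerate(ids):
            t = pages[pid]
            for i in range(len(t) - 1):
                M[r, col[t[i:i + 2]]] += 1
            M[r] /= max(M[r].sum(), 1.0)           # counts -> frequencies
        M -= M.mean(axis=0)                        # center, then PCA by SVD
        U, S, _ = np.linalg.svd(M, full_matrices=False)
        xy = U[:, :2] * S[:2]
        return {pid: tuple(xy[r]) for r, pid in enumerate(ids)}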
posted by ぶらたん at 23:04 | Comment(0) | Nature of the text