December 24, 2010

LSC (Letter Serial Correlation)

2000/1/23, posted by Mark Perakh

The LSC test revealed in the VMS features identical to those of the meaningful texts we explored. On the other hand, if we assume that each Voynichese symbol is a letter, then the letter frequency distribution in the VMS is much more non-uniform than in any of the 12 languages we tested. Furthermore, in one of my papers you can see the LSC results obtained for gibberish which I created by hitting (supposedly at random) the keys on a keyboard. It has some features of a meaningful text, but also some subtle differences from meaningful texts. You probably noticed that my conclusion was that, if we rely on the LSC data, the VMS can be either meaningful or the result of a very sophisticated effort to imitate a meaningful text, in which even the relative frequencies of vowels and consonants have been skilfully faked. I can hardly imagine such an extraordinarily talented and diligent forger, so I am inclined to guess the VMS is a meaningful text, but some doubts remain. Moreover, if the VMS symbols are not individual letters, all LSC results hang in the air.
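
A minimal sketch in Python (my own illustration, not Perakh's method; the entropy-based flatness measure and the file names are placeholders) of one way to compare how uniform or skewed the letter frequency distributions of two texts are:

```python
from collections import Counter
from math import log2

def frequency_flatness(text):
    """Normalized entropy of the letter frequency distribution:
    1.0 means perfectly uniform frequencies, lower means more skewed."""
    counts = Counter(text)
    total = sum(counts.values())
    probs = [n / total for n in counts.values()]
    entropy = -sum(p * log2(p) for p in probs)
    return entropy / log2(len(counts))   # divide by the maximum possible entropy

# vms   = open("voynich_transcription.txt").read()   # placeholder file names
# latin = open("latin_sample.txt").read()
# print(frequency_flatness(vms), frequency_flatness(latin))
```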

2000/1/15, posted by Gabriel Landini

I think that the LSC depends heavily on the construction of words, but I also think that word construction (because of Zipf's law) depends heavily on a sub-set of the word pool.

Long-range correlations in codes were discussed for DNA a couple of years ago in very prestigious journals like Nature and Science, but to date I do not think that anybody has a convincing theory or explanation of the meaning and validity of the results.

If you think about it, what really is the relation (in any terms) between a piece of text and another that is many characters away? What is the large-scale structure of a text? That would mean that there are events at small scales and also at larger scales. I can imagine that up to the sentence level or so there may be patterns or correlations (what we call grammar?), but beyond that, I am not sure. Think of a dictionary: there may not be any structure beyond one sentence or definition (still, Roget's Thesaurus conforms to Zipf's law for the more frequent words). Consequently I see no reason why there should be any large-scale structure in texts. (I may be very wrong.)

2000/1/16, posted by Mark Perakh

My comments related only to the question of whether or not we can expect the LSC to distinguish between meaningful and monkey texts. I believe the behavior of monkey texts from the standpoint of the LSC is expected to be quite similar to that of permuted texts; therefore the LSC is expected to work for monkeys as well as for permutations. I do not think the LSC will distinguish between permuted and monkey texts. This is based, of course, on the assumption that the texts are long enough that the actual frequencies of letter occurrences are quite close to their probabilities.
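
A minimal sketch (my own illustration; the file name is a placeholder) of the two kinds of random text being compared here: a random permutation of a source text versus a first-order letter monkey that draws letters independently with the source's letter probabilities. For long texts both reproduce the letter frequencies almost exactly, which is the assumption behind expecting the LSC to treat them alike:

```python
import random
from collections import Counter

def permuted_text(text):
    """Random permutation of the source letters (sampling without replacement)."""
    letters = list(text)
    random.shuffle(letters)
    return "".join(letters)

def letter_monkey(text, length=None):
    """First-order letter monkey: independent draws with the source's
    letter probabilities (sampling with replacement)."""
    counts = Counter(text)
    symbols, weights = zip(*counts.items())
    return "".join(random.choices(symbols, weights=weights, k=length or len(text)))

# source = open("genesis.txt").read()            # placeholder file name
# print(Counter(permuted_text(source)).most_common(5))
# print(Counter(letter_monkey(source)).most_common(5))
```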

2000/1/17, posted by Rene Zandbergen

I agree with Gabriel that using a 3rd order word monkey would be even more interesting in terms of checking the capabilities of the LSC method in detecting meaningful text. On the other hand, getting meaningful word entropy statistics is even more difficult than getting 3rd order character entropy values, so the text from a 3rd order word monkey will repeat the source text from which the statistics have been drawn much more closely than should be the case. As before, a 1st order word monkey will be equivalent to a random permutation of words, and if it is true (in a statistically significant manner) that the LSC test distinguishes between one and the other, we do have another useful piece of evidence w.r.t. the Voynich MS text.
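
A minimal sketch (my own; the file name and the dead-end handling are arbitrary choices) of a k-th order word monkey: the next word is drawn according to the frequencies observed after the preceding k-1 words of the source, so k = 1 reduces to drawing words independently from the word pool:

```python
import random
from collections import defaultdict

def word_monkey(words, k=3, length=200):
    """Generate `length` words from a k-th order word monkey: the next word is
    drawn according to the frequencies observed after the previous (k-1) words."""
    order = max(k - 1, 0)                       # size of the conditioning context
    table = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context].append(words[i + order])
    out = list(random.choice(list(table.keys()))) if order else []
    while len(out) < length:
        context = tuple(out[-order:]) if order else ()
        choices = table.get(context)
        if not choices:                         # dead end: restart with a random context
            context = random.choice(list(table.keys()))
            choices = table[context]
        out.append(random.choice(choices))
    return " ".join(out)

# source_words = open("genesis.txt").read().split()   # placeholder file name
# print(word_monkey(source_words, k=3, length=50))
```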

2000/1/20, posted by Mark Perakh

I believe we have to distinguish between four situations, to wit (a numerical sketch of these cases follows below):

1) Texts generated by permutations of the above elements (as was the case in our study). In this case there is a limited stock of the above elements, hence there is a negative correlation between the elements' distributions in chunks, and therefore it is a case without replacement (hypergeometric distribution). Our formula for Se was derived for that situation.

2) Monkey texts generated by using the probabilities of elements (letters, digraphs, etc.) and also assuming that the stock of those elements is the same as that available for the original meaningful text. In this case we have again negative correlation and it is a no-replacement case (hypergeometric), so our formula is to be used without modification.

3) The text generated as in item 2) but assuming the stock of letters is much, much larger (say 100,000 times larger) than that available in the original text, preserving though the ratios of element occurrences as in the original text. This is a case with replacement (approximately, but with increasing accuracy as the size of the stock increases). In this case our formula has to be modified (as indicated in paper 1) using the multinomial variance. Quantitatively the difference is only in the L/(L-1) coefficient, which at L>>1 is negligible.

4) The text generated assuming the stock of elements is infinitely large. In this case the distribution of elements is uniform, i.e. the probabilities of all elements become equal to each other (each equal to 1/z, where z is the number of all possible elements (letters, or digrams, etc.) in the original text). In this case the formula for Se simplifies (I derived it in paper 1 for that case as an approximation to roughly estimate Se for n>1).

Quantitatively, cases 1 through 3 are very close, but case 4 produces quantities measurably (but not very much) differing from cases 1 through 3 (see examples in paper 1).
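
A minimal numerical sketch (my own; it does not reproduce the Se formula from the papers, only the sampling schemes) of cases 1, 3 and 4: it estimates by simulation the variance of a letter's count per chunk under sampling without replacement (hypergeometric), sampling with replacement at the source's letter probabilities (multinomial), and sampling with uniform letter probabilities:

```python
import random
from collections import Counter
from statistics import pvariance

def chunk_count_variance(make_text, letter, chunk_len=100, trials=200):
    """Average variance of `letter`'s count per chunk over simulated texts."""
    results = []
    for _ in range(trials):
        text = make_text()
        chunks = [text[i:i + chunk_len]
                  for i in range(0, len(text) - chunk_len + 1, chunk_len)]
        results.append(pvariance([chunk.count(letter) for chunk in chunks]))
    return sum(results) / trials

source = "a meaningful source text would go here ... " * 200   # placeholder
counts = Counter(source)
symbols, weights = zip(*counts.items())
L = len(source)

# case 2 (a monkey drawing from the same finite stock) behaves like case 1 here
case1 = lambda: "".join(random.sample(source, L))                        # 1) permutation: no replacement
case3 = lambda: "".join(random.choices(symbols, weights=weights, k=L))   # 3) huge stock: with replacement
case4 = lambda: "".join(random.choices(symbols, k=L))                    # 4) infinite stock: uniform symbols

for name, gen in (("case 1", case1), ("case 3", case3), ("case 4", case4)):
    print(name, chunk_count_variance(gen, "e"))
```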

2000/1/21, posted by Jorge Stolfi

Why should the LSC work?

In a very broad sense, the LSC and the nth-order character/word entropies are trying to measure the same thing, namely the correlation between letters that are a fixed distance apart.

People have observed before that correlation between samples n steps apart tends to be higher for "meaningful" signals than for "random" ones, even for large n. The phenomenon has been observed in music, images, DNA sequences, etc. This knowledge has been useful for, among other things, designing good compression and approximation methods for such signals. Some of the buzzwords one meets in that context are "fractal", "1/f noise", "wavelet", "multiscale energy", etc. (I believe that Gabriel has written papers on fractals in the context of medical imaging. And a student of mine just finished her thesis on reassembling pottery fragments by matching their outlines, which turn out to be "fractal" too.)

As I try to show below, one can understand the LSC as decomposing the text into various frequency bands, and measuring the `power' contained in each band. If we do that to a random signal, we will find that each component frequency has roughly constant expected power; i.e. the power spectrum is flat, like that of ideal white light (hence the nickname `white noise'). On the other hand, a `meaningful' signal (like music or speech) will be `lumpier' than a random one, at all scales; so its power spectrum will show an excess of power at lower frequencies. It is claimed that, in such signals, the power tends to be inversely proportional to the frequency; hence the moniker `1/f noise'. If we lump the spectrum components into frequency bands, we will find that the total power contained in the band of frequencies between f and 2f will be proportional to f for a random signal, but roughly constant for a `meaningful' signal whose spectrum indeed follows the 1/f profile.
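
A minimal sketch (my own, not from the papers; the single-letter indicator signal and the file name are arbitrary choices) of the band-power idea: turn the text into a numerical signal, take its Fourier transform, and sum the power over octave bands from f to 2f. Under the 1/f picture above, a shuffled text should give band powers growing roughly in proportion to f, while natural text should give roughly constant band powers:

```python
import numpy as np

def octave_band_powers(text, letter="e", n_bands=8):
    """Power of the letter-indicator signal summed over octave bands [f, 2f)."""
    signal = np.array([1.0 if c == letter else 0.0 for c in text])
    signal -= signal.mean()                      # remove the DC component
    power = np.abs(np.fft.rfft(signal)) ** 2     # power spectrum
    bands = []
    f = 1
    for _ in range(n_bands):
        bands.append(power[f:2 * f].sum())
        f *= 2
    return bands

# text = open("genesis.txt").read().lower()      # placeholder file name
# shuffled = "".join(np.random.permutation(list(text)))
# print(octave_band_powers(text))
# print(octave_band_powers(shuffled))
```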

Is the LSC better than nth-order entropy?

In theory, the nth-order entropies are more powerful indicators of structure. Roughly speaking, *any* regular structure in the text will show up in some nth-order entropy; whereas I suspect that one can construct signals that have strong structure (hence low entropy) but the same LSC as a purely random text.

However, the formula for nth-order entropy requires one to estimate z**n probabilities, where z is the size of the alphabet. To do that reliably, one needs a corpus whose length is many times z**n. So the entropies are not very meaningful for n beyond 3 or so.
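
A minimal sketch (my own; it uses the plain n-gram block entropy, which may differ in detail from the definition used on the list) that makes the z**n problem concrete: the table being estimated has up to z**n entries, so a single text populates it far too sparsely once n exceeds 3 or so:

```python
from collections import Counter
from math import log2

def nth_order_entropy(text, n):
    """Entropy of the n-gram distribution, in bits per n-gram.
    Reliable only when the corpus is many times z**n characters long."""
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())
    return -sum((c / total) * log2(c / total) for c in ngrams.values())

# With z = 26 and n = 4 there are already 26**4 = 456,976 possible n-grams,
# far more than the distinct n-grams found in most single texts.
# text = open("genesis.txt").read().lower()     # placeholder file name
# for n in range(1, 5):
#     print(n, nth_order_entropy(text, n))
```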

The nth-order LSC seems to be numerically more stable, because it maps blocks of n consecutive letters into a single `super-letter' which is actually a vector of z integers, and compares these super-letters as vectors (with a difference-squared metric) rather than as symbols (with a simple 0-1 metric). I haven't done the math --- perhaps you have --- but it seems that computing the nth-order LSC to a fixed accuracy requires a corpus whose length L is proportional to z*n (or perhaps z*n**2?) instead of z**n.

Moreover, one kind of structure that the LSC *can* detect is any medium- and long-range variation in word usage frequency along the text. (In fact, the LSC seems to have been designed specifically for that purpose.) As observed above, such variations are present in most natural languages, but absent in random texts, even those generated by kth-order monkeys. Specifically, if we take the output of a kth-order `letter monkey' and break it into chunks whose length n >> k, we will find that the number of times a given letter occurs in each chunk is fairly constant (except for sampling error) among all chunks. For kth-order `word monkeys' we should have the same result as long as n >> k*w, where w is the average word length. On the other hand, a natural-language text will show variations in letter frequencies, which are due to changes of topic and hence vocabulary changes, that extend for whole paragraphs or chapters.

Thus, although the LSC may not be powerful enough to detect the underlying structure in non-trivial ciphers, it seems well suited to distinguishing natural language from monkey-style random text.
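
A minimal sketch (my reading of the description above, not necessarily the exact statistic of the Perakh-McKay papers; the file name is a placeholder) of an LSC-style measurement: represent each chunk of n letters as a vector of letter counts and sum the squared differences between consecutive chunk vectors. Because consecutive chunks of natural text share topic and vocabulary, the sums should differ measurably from those of a shuffled version of the same text over a range of chunk sizes:

```python
from collections import Counter
import random

def lsc_sum(text, chunk_len, alphabet):
    """Sum of squared differences between the letter-count vectors
    of consecutive chunks of length `chunk_len`."""
    chunks = [text[i:i + chunk_len]
              for i in range(0, len(text) - chunk_len + 1, chunk_len)]
    vectors = [Counter(c) for c in chunks]
    total = 0
    for v1, v2 in zip(vectors, vectors[1:]):
        total += sum((v1[a] - v2[a]) ** 2 for a in alphabet)
    return total

# text = open("genesis.txt").read().lower()            # placeholder file name
# alphabet = sorted(set(text))
# shuffled = "".join(random.sample(text, len(text)))
# for n in (10, 100, 1000):
#     print(n, lsc_sum(text, n, alphabet), lsc_sum(shuffled, n, alphabet))
```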

In conclusion, my understanding of the Perakh-McKay papers is that computing the LSC is an indirect way of computing the power spectrum of the text. The reason why the LSC distinguishes meaningful texts from monkey gibberish is that the former have variations in letter frequencies at all scales, and hence a 1/f-like power spectrum; whereas the latter have uniform letter frequencies, at least over scales of a dozen letters, and therefore have a flat power spectrum.

Looking at the LSC in the context of multiscale analysis suggests many possible improvements, such as using scales in geometric progression, and kernels which are smoother, orthogonal, and unitary. Even if these changes do not make the LSC more sensitive, they should make the results easier to evaluate.

In retrospect, it is not surprising that the LSC can distinguish the original Genesis from a line-permuted version: the spectra should be fairly similar at high frequencies (with periods shorter than one line), but at low frequencies the second text should have an essentially flat spectrum, like that of a random signal. The same can be said about monkey-generated texts.

On the other hand, I don't expect the LSC to be more effective than simple letter/digraph frequency analysis when it comes to identifying the language of a text. The most significant influence on the LSC is the letter frequency histogram --- which is sensitive to topic (e.g. "-ed" is common when talking about the past) and to spelling rules (e.g. whether one writes "ue" or "ü"). The shape of the LSC (or Fourier) spectrum at high frequencies (small n) must be determined mainly by these factors. The shape of the spectrum at lower frequencies (higher n) should be determined chiefly by topic and style.

2000/1/22, posted by Jorge Stolfi

For one thing, while the LSC can unmask ordinary monkeys, it too can be fooled with relative ease, once one realizes how it works. One needs only to build a `multiscale monkey' that varies the frequencies of the letters along the text, in a fractal-like manner.
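
A minimal sketch (my own interpretation; the random-walk drift is a crude stand-in for a true fractal/1/f process, and the parameters are arbitrary) of such a `multiscale monkey': letters are drawn independently, but the letter weights drift slowly along the text, so chunk letter counts vary at long range:

```python
import random
from collections import Counter

def multiscale_monkey(source, length, step=0.02):
    """Draw letters independently, but let each letter's weight drift
    multiplicatively (a slow random walk) as the text is generated."""
    counts = Counter(source)
    letters = list(counts)
    weights = [float(counts[a]) for a in letters]
    out = []
    for _ in range(length):
        out.append(random.choices(letters, weights=weights, k=1)[0])
        # small multiplicative drift of each letter's weight
        weights = [w * (1.0 + step * random.uniform(-1, 1)) for w in weights]
    return "".join(out)

# source = open("genesis.txt").read().lower()          # placeholder file name
# fake = multiscale_monkey(source, 50_000)
```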

Of course, it is hard to imagine a medieval forger being aware of fractal processes. However, he could have used such a process without knowing it. For instance, he may have copied an Arabic book, using some fancy mapping of Arabic letters to the Voynichese alphabet. The mapping would not have to be invertible, or consistently applied: as long as the forger maintained some connection between the original text and the transcript, the long-range frequency variations of the former would show up in the latter as well.

Moreover, I suspect that any nonsense text that is generated `by hand' (i.e. without the help of dice or other mechanical devices) will show long-range variations in letter frequencies at least as strong as those seen in meaningful texts.

Thus Mark's results do not immediately rule out random but non-mechanical babble or glossolalia. However, it is conceivable that such texts will show *too much* long-range variation, instead of too little. We really need some samples...