April 29, 2011

ZIP algorithm

This is interesting.

There is someone at an Italian university who claims to identify an author and/or his language by using the ZIP algorithm.
1. Take any text greater than n bytes and compress it with ZIP - this is the "known" text.
2. Append more text (the "unknown" text) and compress the result too.
3. Compare the lengths of the compressed texts from steps 1 and 2. If the difference is minimal, they claim, the "unknown" text is in the same language as the "known" text, or even by the same author. This procedure reminds me of the "entropy test", which was done on the VMS years ago.
Any comments?
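
The three steps above can be sketched in a few lines of Python. This is only a rough illustration, not the Italian group's actual code: it assumes that zlib (which uses DEFLATE, the compression method inside ZIP archives) is a fair stand-in for "ZIP", and that "add more text" means appending the unknown text to the known one. The function names are made up for this sketch.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (zlib, level 9)."""
    return len(zlib.compress(data, 9))


def extra_bytes(known: bytes, unknown: bytes) -> int:
    """Steps 1-3: growth in compressed size when `unknown` is appended to `known`."""
    return compressed_size(known + unknown) - compressed_size(known)


def closest_reference(references: dict, unknown: bytes) -> str:
    """Label of the reference ("known") text that yields the smallest growth."""
    return min(references, key=lambda label: extra_bytes(references[label], unknown))
```

To try it, one would pass a dictionary such as `{"english": ..., "italian": ...}` holding reference texts as bytes, plus the unknown fragment. Note that DEFLATE only looks back 32 KB, so with very long reference texts only the tail of the known text can influence the result.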

# From: Nick Pelling
# Date: Tue, 29 Jan 2002 11:27:57 +0000

ZIP-era algorithms typically comprise two stages:-
(1) a pattern-matching stage, which converts an input stream into an output stream of both copy(-offset, length) commands and uncompressed literals; and
(2) a statistical (or entropy) encoder (like a Huffman or arithmetic encoder), which tries to compress the output of the first stage down to the entropy of that process' output stream.

Thus, the use of the ZIP algorithm in this "identify-the-author-and-his-language" algorithm you mention would carry out not only an entropy calculation, but also a pattern-matching calculation.
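
To make the two stages concrete, here is a toy sketch. It is not the real ZIP/DEFLATE code: the greedy brute-force matcher, the window size, and the token format are all simplifying assumptions, and the second function only computes the entropy that a stage-2 encoder would aim for rather than doing any actual encoding.

```python
import math
from collections import Counter


def lz77_tokens(data: bytes, window: int = 4096, min_len: int = 3, max_len: int = 258):
    """Stage 1: greedy pattern matching into literals and copy(offset, length) commands."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Brute-force search of the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            k = 0
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_len:
            tokens.append(("copy", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decode(tokens) -> bytes:
    """Inverse of lz77_tokens, used to check that no information was lost."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])  # byte-by-byte so overlapping copies work
    return bytes(out)


def entropy_bits(tokens) -> float:
    """Stage 2 target: order-0 Shannon entropy of the token stream, in total bits."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c * math.log2(c / total) for c in counts.values())
```

The point of the sketch is that the final size depends both on which repeated patterns stage 1 finds and on the symbol statistics of what is left over, so the "ZIP test" is not a pure entropy measurement.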

# From: Jacques Guy
# Date: Tue, 29 Jan 2002 23:01:42 -0000

The question: how small is "minimum"?

I would also say that producing the zipped files is unnecessary, and, in fact, amounts to throwing out a great deal of information, since you end up with a single figure. It would be far more informative to compare the two Huffman trees computed in the first stage of the algorithm.
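
As a rough idea of what comparing Huffman trees could look like, the sketch below builds a plain byte-level Huffman code for each text and lists the symbols whose code lengths differ. This is an assumption-laden simplification: real ZIP/DEFLATE builds its Huffman trees over literal/length and distance symbols rather than raw bytes, and the function names here are invented for illustration.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict:
    """Map each byte value in `data` to its Huffman code length in bits."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        return {sym: 1 for sym in counts}
    # Heap entries: (weight, smallest symbol in subtree, {symbol: depth}).
    # The second field is a unique tie-breaker, so dicts are never compared.
    heap = [(weight, sym, {sym: 0}) for sym, weight in counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, t1, d1 = heapq.heappop(heap)
        w2, t2, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, min(t1, t2), merged))
    return heap[0][2]


def tree_differences(a: bytes, b: bytes) -> dict:
    """Symbols whose code length differs between the two texts (None = absent)."""
    la, lb = huffman_code_lengths(a), huffman_code_lengths(b)
    return {
        chr(sym): (la.get(sym), lb.get(sym))
        for sym in sorted(set(la) | set(lb))
        if la.get(sym) != lb.get(sym)
    }
```

Instead of one number, this gives a per-symbol picture of where the two texts' statistics diverge, which seems to be the kind of extra information the comment above is asking for.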

posted by ぶらたん at 02:19 | Comment(0) | Properties of the text

Kircher correspondence

So the Kircher correspondence has been made public...

Athanasius Kircher Correspondence Project at Stanford University:
(contains copies of correspondence from Marci and Baresch to Kircher)
http://www-sul.stanford.edu/depts/hasrg/hdis/kircher.html

Are the images accessible once you install the software?
I suppose it is no use unless you can read Latin.
posted by ぶらたん at 02:14 | Comment(0) | Other