ZCTextIndex splitter that works with Chinese, Japanese, and Korean text

CJKSplitter - Chinese, Japanese, Korean word splitter for ZCTextIndex

CJKSplitter is a ZCTextIndex splitter for CJK (Chinese, Japanese, Korean) text stored as Unicode. It uses a simple but workable "hack" instead of attempting real, dictionary-based word splitting. Compared to a dictionary-based word splitter, this produces a bigger index and more matches than necessary, but that is a small price to pay for the reduced complexity.
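
This page does not say what the "hack" actually is. One common dictionary-free technique that has the same trade-offs (bigger index, extra matches, no single-character search) is to index overlapping two-character tokens, or bigrams. The sketch below illustrates that idea only; it is an assumption, not CJKSplitter's code, and the function name `bigrams` is hypothetical. It is written for Python 3, where `str` is already Unicode.

```python
def bigrams(text):
    """Return the overlapping two-character tokens of a CJK string.

    Illustrative only: assumes `text` is a run of CJK characters with
    no spaces or punctuation. Not taken from the CJKSplitter source.
    """
    # A string of length n yields n - 1 tokens; length 0 or 1 yields none.
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    print(bigrams("中文分词"))  # four characters become three tokens
    print(bigrams("中"))        # a single character produces no token
```

Under this scheme a four-character string is stored as three tokens, some of which straddle real word boundaries, which is where the larger index and the surplus matches would come from. A one-character query produces no token at all, so it cannot match anything.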

Changes Summary

  • Version 0.2 ([email protected]) improves on version 0.1 in several ways: it uses Unicode internally (not UTF-8), replaces the configuration file with lookups in the unicodedata module to identify CJK characters and symbols, adds unit tests, and adds detailed English instructions for installation etc.
  • Version 0.1 ([email protected]): original version.
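
As a rough illustration of what a unicodedata-based lookup can look like, the sketch below classifies a character by the prefix of its official Unicode name. The helper name `is_cjk` and the choice of prefixes are assumptions made for this example; the actual checks in CJKSplitter 0.2 may differ. It targets Python 3.

```python
import unicodedata

# Name prefixes treated as CJK in this sketch (an assumption, not the
# list used by CJKSplitter itself).
CJK_NAME_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk(char):
    """Return True if the Unicode name of `char` starts with a CJK prefix."""
    # unicodedata.name() raises ValueError for unnamed code points unless
    # a default is supplied, so pass "" as the fallback.
    name = unicodedata.name(char, "")
    return name.startswith(CJK_NAME_PREFIXES)


if __name__ == "__main__":
    for ch in "中あ한a。":
        print(repr(ch), unicodedata.name(ch, "?"), is_cjk(ch))
```

A name-based test like this needs no configuration file, but it is coarse: ideographic punctuation such as "。" (IDEOGRAPHIC FULL STOP) falls outside the prefixes above, so symbols would need a separate rule, for example one based on `unicodedata.category`.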

Known Problems

  • Text must (well, should) be stored as Unicode.
  • Cannot search for single characters.
  • Could do a better job of identifying CJK characters.
  • May match more than is strictly necessary because of the algorithm used. (See the source code for details.)

Please join the zopeasia project on SourceForge to participate in development.

 Title              Type               Size   Modified     Status
 CHANGES            Document           2 K    2003-03-09   published
 CJKSplitter v0.2   Software Release          2003-03-09   published
 INSTALLATION       Document           2 K    2003-03-09   published