ZCTextIndex splitter that works with Chinese, Japanese, and Korean text

CJKSplitter - Chinese, Japanese, Korean word splitter for ZCTextIndex

CJKSplitter is a ZCTextIndex splitter for CJK (Chinese, Japanese, Korean) text stored as Unicode. Instead of attempting real dictionary-based word splitting, it uses a simple but workable "hack". Compared to a dictionary-based word splitter, this produces a bigger index and more matches than necessary, but that is a cheap price to pay for the reduced complexity.
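This page does not spell out the "hack" itself. One common dictionary-free approach, and the one assumed here purely for illustration, is to index every overlapping pair of adjacent CJK characters (bigrams) while leaving other text split on word boundaries. The sketch below is not taken from the CJKSplitter source; the character ranges and the function name `split` are assumptions.

```python
import re

# Assumed character ranges: kana, CJK Unified Ideographs, Hangul syllables.
CJK = "\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3"
# A token is either a run of CJK characters or a run of other word characters.
TOKEN = re.compile("[%s]+|[^\\W%s]+" % (CJK, CJK))
IS_CJK = re.compile("[%s]" % CJK)

def split(text):
    """Return index terms: whole words for non-CJK text, overlapping bigrams for CJK runs."""
    terms = []
    for run in TOKEN.findall(text):
        if IS_CJK.match(run):
            # A run of n characters yields n - 1 bigrams; a lone character yields none.
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            terms.append(run)
    return terms

print(split("Zope 支持中文搜索"))
```

Under this assumption a six-character CJK run produces five index terms, which is where the larger index comes from.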

Changes Summary

  • Version 0.2 (bjorn@exoweb.net) improves on the previous version in a number of ways: it uses Unicode internally (not UTF-8), replaces the configuration file with lookups of CJK characters and symbols via the unicodedata module, adds unit tests, and adds detailed English instructions for installation etc.
  • Version 0.1 (panjy@zopechina.ods.org): original version.
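As a rough idea of what a unicodedata-based lookup can look like, the sketch below classifies a character by the prefix of its official Unicode name. The prefix list and the helper name `is_cjk` are assumptions for illustration, not the rules used by CJKSplitter 0.2.

```python
import unicodedata

# Assumed name prefixes; the real product may use a different rule.
CJK_NAME_PREFIXES = ("CJK ", "HIRAGANA", "KATAKANA", "HANGUL")

def is_cjk(ch):
    """Return True if the Unicode name of ch starts with one of the assumed CJK prefixes."""
    # unicodedata.name() returns the default ("") for unnamed code points.
    name = unicodedata.name(ch, "")
    return name.startswith(CJK_NAME_PREFIXES)

for ch in "中あ한A。":
    print(ch, unicodedata.name(ch, "?"), is_cjk(ch))
```

A name-based test needs no hand-maintained configuration file of code points, but it is only as precise as the chosen prefixes.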

Known Problems

  • Text must (well, should) be stored as Unicode.
  • Cannot search single characters.
  • Could do a better job of identifying CJK characters.
  • May match more than is strictly necessary due to the algorithm used. (See source code for details.)
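If the splitter indexes overlapping character pairs (an assumption; the source code is the authority), two of the problems above follow directly. The hypothetical helpers below show a one-character query producing no terms at all, and a query for 京都 (Kyoto) matching text that only mentions 東京都 (Tokyo Metropolis).

```python
def bigrams(text):
    """Set of overlapping two-character terms in text (empty for a single character)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def matches(query, document):
    """True if the query yields at least one term and every query term occurs in the document."""
    terms = bigrams(query)
    return bool(terms) and terms <= bigrams(document)

print(matches("京", "東京都"))    # single character: no terms to look up
print(matches("京都", "東京都"))  # spurious hit: the pair spans a word boundary
```

A dictionary-based splitter would likely segment 東京都 as 東京 plus 都 and avoid the second hit, at the cost of maintaining a dictionary.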

Please join the zopeasia project on SourceForge to participate in the development.

Latest Release: 0.2
Last Updated: 2003-03-09 21:15:21
Author: ZopeOrgSite
Categories: Internationalization, SoftwareProduct, ZCatalog, catalog, i18n
Maturity: Stable

Available Releases

Version  Maturity  Platform  Released
0.2      Stable              2003-03-09 21:15:21
  CJKSplitter-0.2.tgz (4 K), platform: All