You are here: Home » Members » Bjorn Stabell » CJKSplitter v0.2

CJKSplitter v0.2

New Release of CJKSplitter - Chinese, Japanese, Korean word splitter for ZCTextIndex

CJKSplitter is a ZCTextIndex splitter for CJK (Chinese, Japanese, Korean) text stored as Unicode. It uses a simple but workable "hack" instead of attempting real dictionary-based word splitting. Compared to a dictionary-based word splitter, this produces a bigger index and more matches than necessary, but that is a cheap price to pay for the reduced complexity.
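The announcement does not say what the "hack" is. As an illustration only, one common dictionary-free technique is to index overlapping character pairs (bigrams). The sketch below assumes that technique and is not taken from the CJKSplitter source; the function name `overlapping_bigrams` is hypothetical. It shows why such an approach yields more index terms than true word splitting would.

```python
def overlapping_bigrams(text):
    """Return every pair of adjacent characters in text.

    Assumption: this is a generic bigram technique used for illustration,
    not necessarily what CJKSplitter itself does.
    """
    if len(text) < 2:
        # A single character (or empty string) has no pairs.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# A four-character string that a dictionary might split into two words
# produces three index terms here, one of which spans the word boundary.
terms = overlapping_bigrams("中文分词")
```

The extra term that spans a word boundary is the kind of overhead the description refers to: the index grows and some queries match text they strictly should not, in exchange for needing no dictionary.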

Version 0.2 improves on the previous release in several ways: it uses Unicode internally (not UTF-8), replaces the configuration file with lookups in the unicodedata module to identify CJK characters and symbols, adds unit tests, and includes detailed English installation instructions. It may even work for Korean and Japanese (untested).
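To give a feel for how the standard unicodedata module can stand in for a hand-maintained configuration file, here is a minimal sketch. The helper names `is_cjk_char` and `is_symbol_or_punctuation`, and the particular name prefixes checked, are assumptions made for this example and may differ from what CJKSplitter actually does.

```python
import unicodedata

# Assumed set of Unicode character-name prefixes treated as CJK here.
CJK_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk_char(ch):
    """Guess whether ch is a CJK character from its Unicode name."""
    # unicodedata.name() returns the default ("") for unnamed code points.
    name = unicodedata.name(ch, "")
    return name.startswith(CJK_NAME_PREFIXES)


def is_symbol_or_punctuation(ch):
    """True when the Unicode general category is punctuation (P*) or symbol (S*)."""
    return unicodedata.category(ch)[0] in ("P", "S")
```

Because the classification comes from the Unicode database shipped with Python, there is no character table to maintain by hand, which is presumably the benefit of dropping the configuration file.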