CJKSplitter - Chinese, Japanese, Korean word splitter for ZCTextIndex
CJKSplitter is a ZCTextIndex splitter for CJK (Chinese, Japanese, Korean) text
stored as Unicode. It uses a simple but workable "hack" instead of trying
to do real word splitting from dictionaries. Compared to a dictionary-based
word splitter, this results in a bigger index and more matches than necessary,
but that is a small price to pay for the reduced complexity.
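
This text does not spell out the "hack". A common dictionary-free approach
for CJK text is to index overlapping two-character terms (bigrams), which
would be consistent with the trade-offs described here. The sketch below
assumes that approach; the function name bigrams is hypothetical and is not
taken from the CJKSplitter source.

```python
def bigrams(text):
    """Return the overlapping two-character terms of a CJK string.

    Assumption: illustrates a generic bigram scheme, not necessarily
    the exact algorithm CJKSplitter implements.
    """
    # A string of n characters yields n - 1 terms; a single character
    # yields none.
    return [text[i:i + 2] for i in range(len(text) - 1)]


terms = bigrams("中文分词")
print(terms)
```

Under this scheme a four-character string produces three index terms, some
of which straddle real word boundaries. That is where the larger index and
the extra matches would come from.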
- Version 0.2
improves on the previous release in a number of ways:
it uses Unicode internally (not UTF-8), replaces
the configuration file with lookups in the
unicodedata module to identify CJK
characters and symbols, and adds unit tests and
detailed English installation instructions.
- Version 0.1
- Text must (well, should) be stored as Unicode.
- Cannot search for single characters.
- Could do a better job at identifying CJK characters.
- May match more than is strictly necessary due to algorithm used.
(See source code for details.)
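
To illustrate why identifying CJK characters through the unicodedata module
is imperfect, here is a minimal sketch that classifies a character by the
prefix of its Unicode name. The helper is_cjk and the list of prefixes are
assumptions made for this example; the checks CJKSplitter actually performs
may differ.

```python
import unicodedata

# Assumed name prefixes. A check for "CJK" alone would cover Han
# ideographs but miss Japanese kana and Korean hangul.
CJK_NAME_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def is_cjk(ch):
    """Guess whether a single character is CJK from its Unicode name."""
    # The second argument is the default returned for unnamed code points.
    name = unicodedata.name(ch, "")
    return name.startswith(CJK_NAME_PREFIXES)


for ch in "中あア한a":
    print(ch, unicodedata.name(ch, ""), is_cjk(ch))
```

A name-based test like this is cheap and needs no configuration file, but it
depends on choosing the right set of prefixes, so some characters can be
misclassified.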
Please join the zopeasia project on SourceForge to participate in the
development.
Internationalization, SoftwareProduct, ZCatalog, catalog, i18n