Proposal on ZCatalogs/TextIndexes

This document addresses some limitations of the text indexes in Zope and gives some proposals and goals for improving ZCatalogs. The primary target of this document is the text indexes of the ZCatalog component.

Current limitations of ZCatalogs

  • designed for English and US English only
  • limited support for other languages (e.g. European languages)
  • limited support for other character sets (Unicode, UTF-8, ISO-8859-X)
  • no NEAR search functionality
  • no scoring of search results based on, e.g., the location of the search words inside the document
  • no stemming of words
  • no support for custom file formats
  • no/poor support for searching in structured documents (XML)

Problems

The current implementation of the text indexes uses a (simple) splitter to split a document into words/tokens. The splitter uses algorithms that are suitable for the English language but not for other languages; usually every language must be treated differently. Another problem is the handling of documents with character sets other than ASCII. The splitter seems to have limited capabilities to handle single-byte character sets through the LOCALE system. Multi-byte character sets like UCS-2 or UTF-8 cannot be handled (will this limitation automatically go away when Zope moves to Python 2.X with Unicode support?).
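
The effect of an ASCII-centric splitter on non-English text can be illustrated with a small sketch. The naive_split function below is an illustration only, not the actual ZCatalog splitter:

  import re

  def naive_split(text):
      # Only [a-z0-9] count as word characters, as with an ASCII-only splitter.
      return re.findall(r"[a-z0-9]+", text.lower())

  print(naive_split("the quick brown fox"))
  # -> ['the', 'quick', 'brown', 'fox']

  print(naive_split("fa\xe7ade \xfcber na\xefve"))
  # Accented characters break the tokens apart:
  # -> ['fa', 'ade', 'ber', 'na', 've']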

Proposed architecture

To get rid of these limitations I propose a more flexible architecture. This is mainly based on the architecture of Oracle Intermedia.

Any document treated by a text index should pass through a pluggable pipeline of components. Each component must be configurable with specific settings. Some components of the pipeline are optional, others are mandatory (a sketch of such a pipeline follows the component list below).

Components of the pipeline:

  • Filter: converts the document into plain text by removing format-specific markup
  • Sectioner: document-type-specific component that extracts hierarchy information from the document. This is necessary for searching in structured documents such as XML, to keep track of which sections tokens appear in.
  • Lexer: performs a grammatical analysis of the text to find word boundaries and ends of sentences
  • Splitter: takes the lexer's analysis and creates a list of tokens
  • Stemmer: reduces each token to its grammatical base form
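
To make the proposed pipeline more concrete, here is a minimal Python sketch of how such pluggable components could be chained. All class names and the process()/run() methods are hypothetical; they do not correspond to an existing Zope or ZCatalog API, and the optional Sectioner is omitted for brevity:

  class Filter:
      """Strips format-specific markup and returns plain text."""
      def process(self, document):
          return document          # e.g. HTML/PDF/XML -> plain text

  class Lexer:
      """Finds word boundaries and sentence ends for a given language."""
      def __init__(self, language="en"):
          self.language = language
      def process(self, text):
          return text

  class Splitter:
      """Turns the lexer's analysis into a list of tokens."""
      def process(self, text):
          return text.split()

  class Stemmer:
      """Reduces each token to its base form (language specific)."""
      def __init__(self, language="en"):
          self.language = language
      def process(self, tokens):
          # toy rule: strip a trailing 's'; a real stemmer is language aware
          return [t[:-1] if t.endswith("s") else t for t in tokens]

  class Pipeline:
      """Chains the configured components in order."""
      def __init__(self, *components):
          self.components = components
      def run(self, document):
          data = document
          for component in self.components:
              data = component.process(data)
          return data

  pipeline = Pipeline(Filter(), Lexer("en"), Splitter(), Stemmer("en"))
  print(pipeline.run("Zope indexes documents"))
  # -> ['Zope', 'indexe', 'document']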

Component-specific parameters:

  • Filter: document type vs. automatic detection
  • Sectioner: type of sectioner (XML, SGML, ...); the type is document specific
  • Lexer: input language, character set
  • Splitter: input language, handling of punctuation
  • Stemmer: language, translation sets

The creation of a text index is bound to a number of components that are used for building the index and a set of preferences for each of these components.
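
Such a binding of an index to its components and their preferences might be expressed as a configuration like the following. The settings keys and the create_text_index() call are invented for illustration and are not part of ZCatalog:

  # Hypothetical configuration sketch -- neither the keys below nor
  # create_text_index() exist in ZCatalog; they only illustrate binding
  # an index to pipeline components and per-component preferences.
  index_settings = {
      "filter":    {"document_type": "auto"},          # or an explicit format
      "sectioner": {"type": "XML"},                    # optional, document specific
      "lexer":     {"language": "de", "charset": "iso-8859-1"},
      "splitter":  {"language": "de", "keep_punctuation": False},
      "stemmer":   {"language": "de"},
  }

  # catalog.create_text_index("SearchableText", **index_settings)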

Problems and risks

  • some components require language-specific linguistic knowledge
  • some components are specific to one language (e.g. the stemmer)
  • the indexing process will be slowed down by the increased number of components within the pipeline

Unresolved issues

to be written

Conclusion

The proposed framework expands both the functionality and the complexity of ZCatalogs. It provides an open architecture for third-party modules that supply specific functionality that DC cannot provide, and it significantly improves the chances of Zope being used in non-US environments. As a content management system, Zope must provide strong and flexible searching functionality.