README

FieldedTextIndex: A Zope plug-in index for ZCatalog

FieldedTextIndex is a derivative of ZCTextIndex, the built-in full-text indexer for Zope. As such, it has many of the same features such as relevance ranking, boolean queries, wildcards (globbing) and phrase matching.

Note:
Indexes made with version 0.1 of FieldedTextIndex cannot be used with version 0.2. You must recreate your indexes after upgrading to 0.2.

What problems does it solve?

In Zope sites it is common to have many different types of content objects whose data is stored in various attributes (or fields) of the object. Schema-driven content types are also becoming more common, making it easier to create myriad content types with different data fields in a single site.

It is also common for Zope sites to offer a full-text search for their content. This is often achieved by creating a method such as SearchableText which aggregates the content fields into a single source which can be fed to a text index. Although this works well for a simple search across all textual fields of the objects in the site, you cannot narrow the search to specific fields; the SearchableText index can only search all fields at once.

The obvious solution to this is to create new text indexes for each field you want to search individually. This creates three big issues:

  1. Every text index adds considerable indexing overhead, which naturally limits the number you can have.
  2. You must decide which fields are worth searching individually when you design your application; changing this later means a software change and a reindex.
  3. Due to limitations in the ZCatalog query API, it is difficult to perform searches across multiple text indexes, such as "Casey Duncan" in [first_name, last_name].

FieldedTextIndex solves these problems by extending the standard ZCTextIndex so that it can receive and index the textual data of an object's field attributes as a mapping of field names to field text. The index itself performs the aggregation of the fielded data and allows queries to be performed across all fields (like a standard text index) or any subset of the fields which have been encountered in the objects indexed.
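To make the idea concrete, here is a toy model in plain Python. It is not how FieldedTextIndex is implemented (the real index builds on ZCTextIndex and its lexicon); the ToyFieldedIndex class and its method names are hypothetical and exist only for illustration. It accepts a mapping of field names to field text for each document and answers single-word queries across all fields or a chosen subset:

```python
from collections import defaultdict


class ToyFieldedIndex:
    """Toy illustration only; not the FieldedTextIndex implementation."""

    def __init__(self):
        # word -> field name -> set of document ids containing the word
        self._postings = defaultdict(lambda: defaultdict(set))
        self._fields = set()

    def index_doc(self, docid, source):
        """Index a mapping of field name -> field text for one document."""
        for field, text in source.items():
            self._fields.add(field)
            for word in text.lower().split():
                self._postings[word][field].add(docid)

    def field_names(self):
        """Every field name encountered so far, sorted."""
        return sorted(self._fields)

    def search(self, word, fields=None):
        """Return ids of documents containing word, optionally by field."""
        by_field = self._postings.get(word.lower(), {})
        wanted = list(by_field) if fields is None else fields
        hits = set()
        for field in wanted:
            hits |= by_field.get(field, set())
        return hits


index = ToyFieldedIndex()
index.index_doc(1, {"first_name": "Casey", "last_name": "Duncan"})
index.index_doc(2, {"Title": "Notes for Casey", "Description": "Misc"})

print(index.search("casey"))                    # searches every field
print(index.search("casey", fields=["Title"]))  # searches Title only
print(index.field_names())
```

Note that the second document introduced two new fields without any change to the index configuration, which is the property the real index provides.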

Additionally, FieldedTextIndex can weight individual fields so that search terms found in those fields affect the result score differently. This allows you to make certain fields influence the relevance of results more or less than others.

Creating a FieldedTextIndex

FieldedTextIndexes require three pieces of information to construct:

  • An index id, which must be unique for the ZCatalog
  • A source name, which is the name of an attribute, method or script which returns the index source mapping from the content objects.
  • The id of the ZCTextIndex Lexicon which processes and stores the words that are indexed. This lexicon can be shared amongst several indexes (both FieldedTextIndexes and ZCTextIndexes) if desired.

Creating an index source

The source name of the index specifies the name of an attribute, method or script which returns a mapping of field name to field text. This mapping can be a dictionary or any dictionary-like object which supports the items() method. It can also be an iterable sequence of pairs, such as a list of two-tuples ("field name", "field text"). The order of the sequence is not important; however, each field name should occur only once.
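The two accepted shapes carry the same information. As a minimal sketch, the hypothetical helper below (normalize_source is not part of FieldedTextIndex) turns either shape into a plain dictionary. Raising an error on a repeated field name is a choice made for this sketch; this README does not say how the index itself treats duplicates:

```python
def normalize_source(source):
    """Return a plain dict for a mapping or a sequence of (name, text) pairs.

    Hypothetical helper for illustration. Raises ValueError when a field
    name occurs more than once in a sequence of pairs.
    """
    pairs = source.items() if hasattr(source, "items") else source
    fields = {}
    for name, text in pairs:
        if name in fields:
            raise ValueError("duplicate field name: %r" % (name,))
        fields[name] = text
    return fields


as_dict = normalize_source({"Title": "Plug-in indexes", "Creator": "caseman"})
as_pairs = normalize_source([("Creator", "caseman"),
                             ("Title", "Plug-in indexes")])
print(as_dict == as_pairs)  # order of the pairs does not matter
```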

An easy way to add an index source to an existing application or framework (such as the CMF) is to create a Python script with the same id as the source name of the index. When an object is indexed, it will be bound to the context variable in the Python script. The script can use the context to create a dictionary mapping the names to the values of each field. Here is a simple example for the default Dublin Core fields that basic CMF content objects implement:

      ## Script (Python) "dc_fields"
      ##title=Source for FieldedTextIndex for CMF DublinCore objects
      source = {}
      for field in ("Title", "Creator", "Subject", "Description", 
                    "Publisher", "Contributors", "Type"):
          source[field] = getattr(context, field)()
      return source

FieldedTextIndex is designed to work with whatever schema system you may be using. By creating a simple script that collects the desired fields and returns the requisite mapping, you can index those fields with a FieldedTextIndex.

The above script doesn't really take advantage of the full capabilities of the index, however, since every object has the same indexed fields. The real power of the index is in its ability to index an unlimited number of different fields of different objects which have arbitrary schemas. A more advanced script might introspect the schema to determine which fields should be indexed for an object, or allow the fields to be specified by the object directly. As new objects with different fields are encountered, these fields will automatically be added and become searchable. No changes to the catalog configuration are necessary.
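One possible shape for a source where the object specifies its own fields is sketched below. It assumes a hypothetical indexed_fields attribute on each content object that lists the field names to index; neither that attribute nor the fielded_source function comes from FieldedTextIndex or the CMF. It is written as a plain function so it can run outside Zope; in a Script (Python) the body would read from context instead of the obj parameter:

```python
def fielded_source(obj):
    """Build the field name -> field text mapping for one object.

    Assumes a hypothetical ``indexed_fields`` attribute naming the fields
    the object wants indexed; objects without it contribute no fields.
    """
    source = {}
    for name in getattr(obj, "indexed_fields", ()):
        value = getattr(obj, name, None)
        if callable(value):
            value = value()  # accessor methods such as Title()
        if value:
            source[name] = str(value)
    return source


class Recipe:
    indexed_fields = ("Title", "ingredients")
    ingredients = "flour water salt"

    def Title(self):
        return "Plain bread"


class Event:
    indexed_fields = ("Title", "location")
    location = "Main hall"

    def Title(self):
        return "Sprint"


print(fielded_source(Recipe()))
print(fielded_source(Event()))
```

The two classes expose different fields, yet both produce a mapping the same index could consume.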

Querying the index

To perform a search across all indexed fields, you can simply call the catalog passing the search string as the value for a keyword argument which matches the source name of the index. For example, to search all the fields of the index for dc_fields you can use:

      result = catalog(dc_fields="Some search string")

This makes it possible to use a FieldedTextIndex as a drop-in replacement for a ZCTextIndex. The query above returns the same results for both indexes (assuming they index the same data of course).

To perform a search limited to specific fields, use a dictionary as the argument value instead of a string. The dict should contain the keys query and fields. query contains the search string and fields contains a list of the field names to be searched:

      result = catalog(dc_fields={"query":"Some search string",
                                  "fields":["Title", "Description"]})

This would return only objects where the query terms occurred in the fields Title or Description.
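If queries are assembled in several places, a small helper can keep the two forms consistent. The sketch below uses a hypothetical function name, fielded_query; it returns the plain string when no fields are requested and the dictionary form otherwise, coercing fields to a list:

```python
def fielded_query(text, fields=None):
    """Return the value to pass as the index's keyword argument.

    Hypothetical helper: a plain string searches every field, while a
    dict with "query" and "fields" restricts the search.
    """
    if fields is None:
        return text
    return {"query": text, "fields": list(fields)}


print(fielded_query("Some search string"))
print(fielded_query("Some search string", ("Title", "Description")))

# Inside Zope the result would be passed straight to the catalog, e.g.:
#   result = catalog(dc_fields=fielded_query("Some search string",
#                                            ("Title", "Description")))
```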

Specifying field weights in queries (New in 0.2)

It is also possible to weight individual fields differently in a query so that hits on certain fields affect the relevance score more than others. In practical terms, this allows you to make search hits on particular fields push the corresponding objects higher in search results.

The field_weights key in the query dictionary is used to specify the weights to apply to each field. The value of field_weights is a dictionary whose keys are field names and whose values are the integer weights for those fields. The relevance scores of the intermediate query results for each field are multiplied by the weight before being combined with the results for other fields:

      result = catalog(dc_fields={"query":"Some search string",
                                  "field_weights":{"Title":3,
                                                   "Subject":2}})

This would return objects where the query is found in any field. Matches on Title have their score multiplied by 3. Subject matches are multiplied by 2.
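The arithmetic can be sketched in a few lines of Python. The per-field scores below are made-up numbers, and this README does not say how the weighted per-field results are combined, so summing them is an assumption of this sketch; combined_score is a hypothetical name:

```python
def combined_score(field_scores, field_weights):
    """Multiply each per-field score by its weight (default 1) and sum.

    Summation is an assumption made for this sketch.
    """
    return sum(score * field_weights.get(field, 1)
               for field, score in field_scores.items())


# Made-up per-field scores for a single document.
scores = {"Title": 10, "Subject": 4, "Description": 6}
weights = {"Title": 3, "Subject": 2}

print(combined_score(scores, weights))  # 10*3 + 4*2 + 6*1
```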

You can specify field_weights independently of fields. The value of field_weights does not affect the fields searched. If fields is not specified, then all fields are searched regardless of the value of field_weights. Fields not assigned a weight by field_weights are assigned a weight of one by default. If you specify weights for fields that do not appear in the fields list or are not the names of fields known to the index, they are ignored:

      result = catalog(dc_fields={"query":"Some search string",
                                  "fields":["Title", "Description"],
                                  "field_weights":{"Title":3,
                                                   "Subject":2}})

In this case, Title is searched with a weight of 3 and Description a weight of 1 (the default). Subject is not searched since it does not appear in fields.
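These rules can be restated as a short function. This is a sketch of the behaviour described above, not the index's own code; effective_weights and its known_fields argument are hypothetical names:

```python
def effective_weights(known_fields, fields=None, field_weights=None):
    """Return {field: weight} for the fields a query would search.

    known_fields: field names the index has encountered so far.
    Weights for fields that are not searched are simply dropped.
    """
    field_weights = field_weights or {}
    searched = list(known_fields) if fields is None else list(fields)
    return {name: field_weights.get(name, 1) for name in searched}


known = ["Title", "Creator", "Subject", "Description"]
weights = {"Title": 3, "Subject": 2}

# fields given: Subject's weight is ignored because Subject is not searched
print(effective_weights(known, ["Title", "Description"], weights))

# fields omitted: every known field is searched, unweighted ones default to 1
print(effective_weights(known, None, weights))
```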

You can also specify zero or negative weights if desired. Zero-weighted fields will be used to filter the results, but will not affect the score. Negatively weighted fields will reduce the score of results where terms occur in them. This can be used to tweak the order of results for common queries. If undesired content is appearing high in the results of a query, a negatively weighted field with anti-keywords matching the query could be used to move the content down.
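The effect on ordering can be illustrated with made-up numbers. In this sketch the document ids, the anti_keywords field name and the per-field scores are all hypothetical, and weighted scores are summed, which is an assumption rather than documented behaviour:

```python
def rank(results, field_weights):
    """Order document ids by weighted score, highest first.

    results maps a document id to made-up per-field scores. Weighted
    scores are summed, which is an assumption of this sketch.
    """
    def score(field_scores):
        return sum(s * field_weights.get(f, 1)
                   for f, s in field_scores.items())

    return sorted(results, key=lambda doc: score(results[doc]), reverse=True)


results = {
    "press-release": {"Title": 5, "anti_keywords": 4},
    "user-guide": {"Title": 4},
    "changelog": {"Creator": 3},
}

print(rank(results, {}))                                    # default weights
print(rank(results, {"anti_keywords": -1, "Creator": 0}))   # tweaked weights
```

With the tweaked weights the press release drops below the user guide, while the changelog stays in the result set with a score of zero because its only match is in a zero-weighted field.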

Specifying default weights

You can also specify weights to apply by default to all queries that do not specify a value for field_weights. To do this, go to the Indexes tab of the ZCatalog and click on the FieldedTextIndex. Use the Default Field Weights tab to set the defaults for the index. Weights are applied at query time, so you do not need to reindex for the weights to take effect.

Queries that specify their own value for field_weights override any defaults. Queries can pass an empty dictionary for field_weights to reset all field weights to one.
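The distinction between "no field_weights key" and "an empty field_weights dictionary" is easy to lose in application code that tests for truthiness. A minimal sketch of the precedence described above follows; resolve_weights and its default_weights argument are hypothetical names, and treating the query's value as a full replacement for the defaults (rather than a merge) is this sketch's reading of "override":

```python
def resolve_weights(query, default_weights):
    """Pick the field weights that apply to one query.

    query: the query dictionary passed to the catalog.
    default_weights: stand-in for the index's Default Field Weights.
    """
    if "field_weights" in query:
        # Present, even if empty: the query's value is used as given.
        return dict(query["field_weights"])
    return dict(default_weights)


defaults = {"Title": 3}

print(resolve_weights({"query": "zope"}, defaults))
print(resolve_weights({"query": "zope", "field_weights": {}}, defaults))
print(resolve_weights({"query": "zope",
                       "field_weights": {"Subject": 2}}, defaults))
```

An empty result means every searched field falls back to a weight of one.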

Creating a query form

Queries can also be generated directly from the web request, as with other indexes. A query string or post data can provide the query data structure by using Zope's record marshaling. Here is an example which lets you search any combination of Title, Description or Creator:

      <form action="search_results">
        <input name="SearchableFields.query:record" /><br />
        <div tal:repeat="name python:('Title', 'Description', 'Creator')">
          <input type="checkbox" name="SearchableFields.fields:record:list"
            tal:attributes="value name; id name;" />
          <label tal:attributes="for name" tal:content="name">Name</label>
        </div>
        <input type="submit" />
      </form>

Note that fields must always be a list, hence the :list at the end of the checkbox names. The search_results template can use a standard ZCatalog query, which simply calls the ZCatalog, passing it the web request, and formats the result set as desired.
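To see how the form field names turn into the query dictionary, here is a very small model of the marshaling step in plain Python. It is not Zope's code and handles only names shaped like "index.key:record" with an optional ":list" suffix; marshal_records is a hypothetical name, and Zope's real marshaling supports many more converters:

```python
from urllib.parse import parse_qsl, urlencode


def marshal_records(pairs):
    """Simplified model of Zope's :record and :list marshaling."""
    form = {}
    for name, value in pairs:
        base, *flags = name.split(":")
        if "record" not in flags:
            form[base] = value
            continue
        index_name, key = base.split(".", 1)
        record = form.setdefault(index_name, {})
        if "list" in flags:
            # Repeated values accumulate into a list, even a single one.
            record.setdefault(key, []).append(value)
        else:
            record[key] = value
    return form


# What a browser would submit with Title and Creator checked.
query_string = urlencode([
    ("SearchableFields.query:record", "plug-in index"),
    ("SearchableFields.fields:record:list", "Title"),
    ("SearchableFields.fields:record:list", "Creator"),
])

print(marshal_records(parse_qsl(query_string)))
```

The resulting structure has the same query and fields keys as the dictionary queries shown earlier, keyed by the index's source name.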

You can also determine the names of the fields that the index has encountered by using ZCatalog's uniqueValuesFor() method. Here is a variation of the form which creates a multi-select box populated with all of the searchable fields:

      <form action="search_results"
        tal:define="fields python:here.portal_catalog.uniqueValuesFor('SearchableFields')">
        <input name="SearchableFields.query:record" /><br />
        <select name="SearchableFields.fields:record:list" multiple="multiple">
          <option 
            tal:repeat="name fields"
            tal:attributes="value name" 
            tal:content="name">Name</option>
        </select><br />
        <input type="submit" />
      </form>

Conclusion

I hope you find this software useful. If you have a question, comment or feature request, or if you find a bug, please contact me at [email protected].

Copyright (c) 2003, Casey Duncan and Zope Corporation