Customizing the document processor

The document processor is driven by two tables. The first table, named paragraph_types, is a sequence of callable objects or method names for coloring paragraphs. If a table entry is a string, it is the name of a method of the document processor to be used. For each input paragraph, the objects in the table are called in order until one returns a value (anything other than None). The value returned replaces the original input paragraph in the output. If none of the objects in the paragraph_types table returns a value, a copy of the original paragraph is used. The new object returned by calling a paragraph type should implement the ReadOnlyDOM, StructuredTextColorizable, and StructuredTextSubparagraphContainer interfaces. See the Document.py source file for examples.

A paragraph type may also return a list or tuple of replacement paragraphs, which allows one paragraph to be split into several.
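The dispatch rule described above can be sketched in a few lines. This is a minimal illustration, not the processor's actual code: the names MiniProcessor, doc_header, split_on_bar, and color_paragraphs are hypothetical, and plain strings and dicts stand in for real paragraph objects.

```python
def split_on_bar(paragraph):
    # Hypothetical paragraph type: split "a | b" into two paragraphs.
    if ' | ' in paragraph:
        return paragraph.split(' | ')
    return None  # decline, so the next table entry gets a turn


class MiniProcessor:
    """Hypothetical stand-in for the paragraph-coloring loop."""

    # Entries may be method names (strings) or callable objects.
    paragraph_types = ['doc_header', split_on_bar]

    def doc_header(self, paragraph):
        # Claim only paragraphs that end with a colon.
        if paragraph.endswith(':'):
            return {'type': 'header', 'text': paragraph[:-1]}
        return None

    def color_paragraphs(self, paragraphs):
        result = []
        for paragraph in paragraphs:
            for ptype in self.paragraph_types:
                if isinstance(ptype, str):
                    ptype = getattr(self, ptype)  # resolve a method name
                new = ptype(paragraph)
                if new is not None:
                    break  # the first non-None value wins
            else:
                new = paragraph  # no entry matched: keep the original
            if isinstance(new, (list, tuple)):
                result.extend(new)  # one paragraph became several
            else:
                result.append(new)
        return result


processed = MiniProcessor().color_paragraphs(['Intro:', 'a | b', 'plain'])
```

Here the first paragraph is claimed by doc_header, the second is split in two by split_on_bar, and the third falls through unchanged.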

The second table, text_types, is a sequence of callable objects or method names for coloring text. The callable objects in this table are used in sequence to transform the input text into new text or objects. The callable objects are passed a string and return nothing (None) or a three-element tuple consisting of:

a replacement object,
a starting position, and
an ending position

The text from the starting position up to the ending position is (logically) replaced with the replacement object. The replacement object is typically an object that implements the ReadOnlyDOM and StructuredTextColorizable interfaces. The replacement object can also be a string or a list of strings or objects. Replacement is done from beginning to end, and text after the replacement ending position is passed back to the text type objects for processing.
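To make the three-element return value concrete, here is a minimal sketch of one text type and a loop that applies a text_types table. It only approximates the behavior described above: Strong, doc_strong, and color_text are hypothetical names, and a namedtuple stands in for a real replacement object.

```python
import re
from collections import namedtuple

# Hypothetical replacement object; a real one would implement the
# interfaces named above.
Strong = namedtuple('Strong', 'text')


def doc_strong(s, expr=re.compile(r'\*\*([^*\n]+)\*\*').search):
    # Hypothetical text type: turn **text** into a Strong object.
    m = expr(s)
    if m:
        start, end = m.span()
        return Strong(m.group(1)), start, end
    return None


def color_text(s, text_types):
    """Apply each text type in sequence, scanning strings left to right."""
    pieces = [s]
    for text_type in text_types:
        new_pieces = []
        for piece in pieces:
            # Only plain strings are scanned; earlier replacement
            # objects pass through untouched.
            while isinstance(piece, str):
                found = text_type(piece)
                if found is None:
                    break
                replacement, start, end = found
                if start:
                    new_pieces.append(piece[:start])  # text before the match
                new_pieces.append(replacement)
                piece = piece[end:]  # keep processing the remainder
            if piece != '':
                new_pieces.append(piece)
        pieces = new_pieces
    return pieces


colored = color_text('a **b** c **d**', [doc_strong])
```

The result interleaves the untouched strings with the replacement objects, in source order.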

To create a new StructuredText format based on the document processor, simply subclass the document processor's class and override the processing tables or the methods that the processing table references. The class of the document processor can be found in the DocumentClass module of the StructuredText package.

Example 1, Disabling use of single quotes for literal inline text

Many people don't like the ClassicStructuredTextRule that causes single-quoted strings to be translated to literal text (e.g. HTML code tags). We can disable this in two ways. First, we can modify the text_types table to remove this text type. The original text_types table in the DocumentClass class looks like:

      text_types = [
         'doc_href',
         'doc_strong',
         'doc_emphasize',
         'doc_literal',
         ]

We can create our own document processor class with a different table:

      import StructuredText, StructuredText.DocumentClass, re

      class myDocumentClass(StructuredText.DocumentClass.DocumentClass):

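          # Wrap in list() so the table can be iterated repeatedly
          # (on Python 3, filter() returns a one-shot iterator).
          text_types = list(filter(
              lambda t: t != 'doc_literal',
              StructuredText.DocumentClass.DocumentClass.text_types))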

      Document=myDocumentClass()

      src=open('mydata').read()        # get some source text
      basic=StructuredText.Basic(src)  # convert it to a basic document
      doc=Document(basic)              # convert it to a document-style
      html=StructuredText.HTML(doc)    # generate HTML

Note that we built the subclass table by filtering the base class table, so we still pick up new text types as they are added to the base class. Another approach is to replace the method that detects literal text with one that does nothing:

      class myDocumentClass(StructuredText.DocumentClass.DocumentClass):

          def doc_literal(self, s): pass

Example 2, Provide an alternate literal format

Rather than disabling literal text altogether, we could change how it is written by providing a function that implements a different rule. For example, we might want literal inline text to be opened with two backquotes and closed with two single quotes, as in:

       We can use expressions in the DTML var tag as 
       in ``<dtml-var "x+'.txt'">''

In this case, we simply override the method that recognizes literal text with one that implements this rule:

      class myDocumentClass(StructuredText.DocumentClass.DocumentClass):

          def doc_literal(
             self, s,
             expr=re.compile(
               # Raw strings keep regex escapes such as \s intact.
               r"(?:\s|^)``"           # open
               r"([^\n]+?)"            # contents
               r"''(?:\s|[,.;:!?]|$)"  # close
               ).search):

             r=expr(s)
             if r:
                start, end = r.span(1)
                return (
                    StructuredText.DocumentClass.StructuredTextLiteral(
                      s[start:end]),
                    start-2, end+2)
             else:
                return None
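The position arithmetic in this method can be checked on its own, without the StructuredText package. The following sketch reuses the same pattern on the sample sentence from above; the variable names are only for illustration.

```python
import re

expr = re.compile(
    r"(?:\s|^)``"           # open
    r"([^\n]+?)"            # contents
    r"''(?:\s|[,.;:!?]|$)"  # close
).search

content = '<dtml-var "x+\'.txt\'">'
s = "in ``" + content + "''"

r = expr(s)
start, end = r.span(1)            # span of the contents only
literal = s[start:end]            # text handed to StructuredTextLiteral
replaced = s[start - 2:end + 2]   # range reported back to the processor
```

Widening the span by two characters on each side covers the opening and closing quote pairs, but not the whitespace or punctuation around them, so that surrounding text stays in the output.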