Zope 2.6 Unicode Changes

Zope 2.6 includes better support for Unicode. This support is based on the patches by Toby Dickenson, previously distributed at http://www.zope.org/Members/htrd/wstring.

Changes to ZPublisher

ZPublisher now handles Unicode responses slightly differently from non-Unicode responses. If the response is not Unicode then it behaves exactly as before. However, if (and only if) the response is Unicode, ZPublisher applies the character encoding specified by the charset property in the Content-Type header. (This applies to all text/* content types.)

If the Content-Type header does not include a charset property (or if the header is blank, in which case ZPublisher guesses 'text/html') then the Unicode string is encoded as latin-1 using Python's 'replace' error policy, which replaces every non-latin-1 character with a question mark.
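The rule above can be sketched in a few lines of modern Python, where str plays the role of a Unicode string and bytes plays the role of a plain string. This is only an illustration of the behaviour described here, not ZPublisher's actual code: the function name encode_body and the regular expression are assumptions made for the example.

```python
import re

# Hypothetical pattern for pulling the charset out of a Content-Type value.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([\w.:-]+)"?', re.IGNORECASE)


def encode_body(body, content_type):
    """Model of the response-encoding rule (illustrative, not Zope code).

    The text above describes this rule for text/* content types only;
    other content types are not modelled here.
    """
    # A plain (byte) string is passed through exactly as before.
    if isinstance(body, bytes):
        return body
    # A blank header is treated as 'text/html', which carries no charset.
    if not content_type:
        content_type = "text/html"
    match = _CHARSET_RE.search(content_type)
    if match:
        # Unicode response with a declared charset: use that encoding.
        return body.encode(match.group(1))
    # No charset: latin-1, with unencodable characters turned into '?'.
    return body.encode("latin-1", "replace")
```

With this sketch, a Unicode body sent as 'text/html; charset=UTF-8' comes out UTF-8 encoded, while the same body sent as plain 'text/html' loses every character that latin-1 cannot represent.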

Changes to DTML

Unicode strings can be mixed freely with plain strings in DTML. DTML will return a Unicode string if any of its constituents are Unicode; otherwise it will return a plain string as before.

When Unicode strings are mixed with plain strings, the plain string is converted to Unicode on the assumption that it contains latin-1 characters. Note that this is different from what happens when you mix Unicode and plain strings in Python, where a UnicodeError exception is quite likely. DTML never raises a UnicodeError.
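The mixing rule can be modelled roughly as follows, again in modern Python with str standing in for Unicode strings and bytes for plain strings. The function join_pieces is a hypothetical name used only for this sketch; it is not part of DTML.

```python
def join_pieces(pieces):
    """Join rendered fragments the way the DTML rule above describes.

    Illustrative model only: 'pieces' is a list of str (Unicode) and
    bytes (plain string) fragments.
    """
    # No Unicode constituent: the result stays a plain string, as before.
    if not any(isinstance(piece, str) for piece in pieces):
        return b"".join(pieces)
    # At least one Unicode constituent: promote every plain string,
    # assuming it holds latin-1 characters, and return Unicode.
    return "".join(
        piece if isinstance(piece, str) else piece.decode("latin-1")
        for piece in pieces
    )
```

Every possible byte value is a valid latin-1 character, so the decoding step cannot fail. That is consistent with DTML never raising a UnicodeError, although the result will be wrong if the plain string was really in some other encoding.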

If you expect that your pages might include Unicode data, change your standard_html_header to something like the following example:

 
<html>
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
 <dtml-call "RESPONSE.setHeader('Content-Type','text/html; charset=UTF-8')">
 <title><dtml-var title_or_id></title>
 <dtml-var "u''">
 </head>
 <body>

Changes for Properties

Property pages and property sheets now include extra types ustring, utokens, utext, and ulines. These are Unicode equivalents of string, tokens, text, and lines.

Changes for Forms

ZPublisher has processing for field names of the form "name:type" (for example "age:int" or "address:string"). ZPublisher uses these extra tags to marshal the form values into the correct type.

This mechanism has been extended to include a specification of the character encoding used by the form submission. You need to know which encoding will be used by the browser and include an appropriate tag, such as "age:utf8:int" or "address:utf8:string". The tag parser insists that tags use only alphanumeric characters or an underscore, so you might need to use a short form of the encoding name (such as UTF8 rather than UTF-8).

Four extra type converters have been added, the Unicode equivalents of the existing string types: 'ustring', 'utokens', 'utext', and 'ulines'. If the field name does not include a character encoding tag, ZPublisher assumes the form was submitted in latin-1.
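To make the tag mechanism concrete, here is a rough model of how a field name such as "address:utf8:ustring" might be taken apart. It is a sketch under stated assumptions, not ZPublisher's implementation: marshal_field and CONVERTERS are hypothetical names, the converters are simplified stand-ins, and treating any tag that Python recognises as a codec name as the encoding tag is a choice made for this example.

```python
import codecs

# Simplified stand-ins for the real converters (assumptions for this sketch).
CONVERTERS = {
    "int": int,
    "ustring": lambda text: text,
    "utext": lambda text: text,
    "utokens": lambda text: text.split(),
    "ulines": lambda text: text.splitlines(),
}


def marshal_field(field_name, raw_value):
    """Split 'name:tag:tag' and convert the submitted bytes accordingly."""
    parts = field_name.split(":")
    name, tags = parts[0], parts[1:]
    encoding = "latin-1"  # default when no encoding tag is present
    converter = None
    for tag in tags:
        if tag in CONVERTERS:
            converter = CONVERTERS[tag]
            continue
        try:
            # Anything else must name a codec, e.g. 'utf8'.
            codecs.lookup(tag)
        except LookupError:
            raise ValueError("unknown tag: %r" % tag)
        encoding = tag
    text = raw_value.decode(encoding)
    return name, (converter(text) if converter else text)
```

For example, marshal_field("address:utf8:ustring", b"caf\xc3\xa9") yields the name "address" with a Unicode value, while a field named "name:ustring" has its bytes interpreted as latin-1.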

Character Encoding Used In Form Responses

As explained above, you need to know which character encoding will be used by the browser submitting responses to your forms, and include the name of that encoding in the name of your form controls.

The encoding used by a browser depends on the encoding used by the page containing the form, and the type of form.

  1. Forms submitted using GET, or using POST with "application/x-www-form-urlencoded" (the default)
    1. Page uses a Unicode encoding:
      Forms are submitted using UTF8, as required by RFC 2718 section 2.2.5.
    2. Page uses another regional 8-bit encoding:
      Forms are often submitted using the same encoding as the page. If you choose to use such an encoding then you should also verify how browsers behave.
  2. Forms submitted using "multipart/form-data":
    According to HTML 4.01 (section 17.13.4), browsers should state which character encoding they are using for each field in a Content-Type header; however, this is poorly supported. Current browsers appear to use the same encoding as the page containing the form.

You are right to think that this is harder than it really should be. A no-brainer policy is to use UTF8 for every page, in which case form responses are also always UTF8.

Changes to the ZMI (Zope Management Interface)

Previously the ZMI did not specify a character encoding for its pages, leaving it up to the individual browser to guess. From Zope 2.6 the default character encoding for the ZMI is latin-1. In future it may change to utf-8.

Product authors can override this default character encoding for their own Unicode-aware management pages by setting the XXXXXTBD REQUEST header before calling XXXXX, as shown in the following example. This technique is currently used by the Properties page to correctly display the value of Unicode properties.

EXAMPLE TBD

Pages That Do Not Expect Unicode

There are many DTML pages that are not currently Unicode-aware, including most of Zope's management interface. Many of these pages use their own choice of character encoding, with encoded character data stored in plain strings. These Unicode changes have been designed to allow these DTML pages to remain unchanged, provided a Unicode property is not used on the page.

Problem Areas

The following issues remain a problem.

  • It is currently possible to get a BTree-based index (such as ZCatalog) into a wedged state if it contains certain combinations of Unicode and plain strings. For now it is safest to avoid mixing these types in the same index.
  • These changes do not work well for sites which mix Unicode values with encoded character data stored in plain strings. This is a crazy thing to do deliberately, but it can happen by accident: for example, if input validation does not correctly exclude Unicode values from submitted forms, and the site then uses those Unicode values in a DTML page consisting of encoded character data.

Changes Since The Last Release

The support in Zope 2.6 is based on the patches previously distributed at http://www.zope.org/Members/htrd/wstring. The following changes have been made since the last release of that patch:

  1. PythonScript print has reverted to using the standard Python string mixing rules, not the more tolerant DTML rules.
  2. Exception objects are now less likely to raise a UnicodeError when rendered in a standard_error_message.