Unicode0.4Readme.txt

Zope Unicode support version 0.4

This modification to Zope provides support for Python 2.0 Unicode strings in ZPublisher, property pages, and property sheets.

Permission to use this software in any way is granted without fee, provided that the copyright notice above appears in all copies. This software is provided "as is" without any warranty.

Send comments to Toby Dickenson, [email protected]

Installation
------------

This patch was developed with Zope 2.2.0 and the Python 2.0 CVS as of 2000-09-04. It might work with later versions.

It does not work with Python 1.5.2, or with some earlier revisions of 2.0.

Zope uses some dirty tricks to ensure that it uses its own versions of cPickle and cStringIO, since the versions supplied with Python 1.5.2 are too old. However, the versions supplied with Python 2.0 are newer still, and you need to subvert Zope's dirty tricks. After building Zope, you need to manually delete any cPickle.so and cStringIO.so (or .pyd) files in all Zope subdirectories.
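The manual clean-up step can be scripted. The following is a minimal sketch, not part of the patch: the helper names find_stale_modules and delete_stale_modules are hypothetical, and it is written for a current Python interpreter rather than Python 2.0. It defaults to a dry run so that you can inspect the list before anything is removed.

```python
import os

# File names taken from the paragraph above (.so on Unix, .pyd on Windows).
STALE_NAMES = {"cPickle.so", "cStringIO.so", "cPickle.pyd", "cStringIO.pyd"}


def find_stale_modules(zope_root):
    """Return the sorted paths of bundled cPickle/cStringIO binaries under zope_root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(zope_root):
        for filename in filenames:
            if filename in STALE_NAMES:
                found.append(os.path.join(dirpath, filename))
    return sorted(found)


def delete_stale_modules(zope_root, dry_run=True):
    """List the stale binaries and, unless dry_run is true, delete them."""
    paths = find_stale_modules(zope_root)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Call delete_stale_modules with the path to your Zope directory first, check the returned list, then call it again with dry_run=False.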

Changes since 0.3
-----------------

Fixed a nasty memory leak. I'm confident it is now leak free.
Fixed numerous unicode incompatibilities in Zope's error reporting code that previously made it very hard to debug any unicode-related errors.

Changes for Content Managers
----------------------------

Property pages and property sheets now include extra types ustring, utokens, utext, and ulines. These are unicode equivalents of string, tokens, text, and lines.

Unicode strings can be mixed freely with plain strings in DTML. DTML will return a unicode string if any of its constituents are unicode, otherwise it will return a plain string as before.

When unicode strings are mixed with plain strings, the plain string is converted to unicode assuming that it contains characters in Zope's Default Character Encoding, discussed below.
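The mixing rule can be illustrated with a small sketch. This is not the patch's DTML code: it is written for a current Python interpreter, where bytes stands in for a plain string and str for a unicode string. The function name join_mixed is hypothetical, and latin-1 is assumed as the Default Character Encoding.

```python
# Assumed value of Zope's Default Character Encoding for this sketch.
DEFAULT_ENCODING = "latin-1"


def join_mixed(parts, default_encoding=DEFAULT_ENCODING):
    """Join rendered fragments the way the paragraphs above describe.

    If every fragment is a plain string (bytes), the result is a plain string.
    If any fragment is unicode (str), plain fragments are decoded with the
    default encoding and the result is unicode.
    """
    if not any(isinstance(part, str) for part in parts):
        return b"".join(parts)
    return "".join(
        part if isinstance(part, str) else part.decode(default_encoding)
        for part in parts
    )
```

For example, join_mixed([b"caf\xe9 ", "\u20ac"]) yields a unicode result, while join_mixed([b"a", b"b"]) stays a plain string.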

ZPublisher has been changed to handle a unicode response. If the response is not unicode then it behaves exactly as before. However, if it is unicode then it applies the character encoding specified by the charset property in the Content-Type header. (This applies to all text/* content-types.)

If you expect that your pages might include Unicode data, change your standard_html_header to something like:

Content-Type,text/html; charset=UTF-8)"> <dtml-var title_or_id>

The line is necessary to force the response into unicode. Without this there is a chance that ZPublisher's response would not be unicode, and the character encoding mechanism would not be triggered.

If the Content-Type header does not include a charset property (or if the header is blank, in which case ZPublisher guesses text/html) then the unicode string is encoded using Zope's Default Character Encoding.
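The encoding decision described above can be sketched as follows. This is an illustration for a current Python interpreter, not ZPublisher's actual code: the function name encode_response is hypothetical, latin-1 is assumed as the Default Character Encoding, and the standard library's email.message.Message is used only as a convenient parser for the header value. Falling back to the default encoding for non-text content types is also an assumption.

```python
from email.message import Message

# Assumed value of Zope's Default Character Encoding for this sketch.
DEFAULT_ENCODING = "latin-1"


def encode_response(body, content_type=""):
    """Return the bytes to send for a response body.

    A plain (bytes) body is passed through untouched, as before the patch.
    A unicode (str) body is encoded with the charset from the Content-Type
    header if it is a text/* type, else with the default encoding.
    """
    if isinstance(body, bytes):
        return body
    header = Message()
    # A blank header is treated as text/html, as described above.
    header["Content-Type"] = content_type or "text/html"
    charset = None
    if header.get_content_maintype() == "text":
        charset = header.get_content_charset()
    # Characters the target encoding cannot represent become question marks.
    return body.encode(charset or DEFAULT_ENCODING, "replace")
```

With a charset of UTF-8 every unicode character survives. Without a charset, characters outside latin-1 are replaced by a question mark.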

Changes for Forms
-----------------

ZPublisher has special processing for field names of the form "name:type" (for example "age:int", or "address:string"). ZPublisher uses these extra tags to marshal the form values into the correct type.

This mechanism has been extended to include a specification of the character encoding used by the form response. You need to know which encoding will be used by the browser and include an appropriate tag, for example "age:utf8:int" or "address:utf8:string". The tag parser insists that tags use only alphanumeric characters or an underscore, so you might need to use a short form of the encoding name (such as UTF8 rather than UTF-8).

Four extra type converters have been added: ustring, utokens, utext, and ulines, the unicode equivalents of the existing string types. If the field name does not include a character encoding tag, then the Default Character Encoding is assumed.
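A rough sketch of how such field names might be interpreted is shown below. It is not ZPublisher's parser: the names marshal_field and CONVERTERS are hypothetical, the converter bodies are simplified guesses, latin-1 is assumed as the Default Character Encoding, and the code targets a current Python interpreter (bytes for plain strings, str for unicode). Any tag that is not a known converter is assumed to name an encoding.

```python
import codecs

# Assumed value of Zope's Default Character Encoding for this sketch.
DEFAULT_ENCODING = "latin-1"

# Simplified stand-ins for a few type converters (illustrative only).
CONVERTERS = {
    "int": int,
    # Not unicode-aware: falls back to a plain string in the default encoding.
    "string": lambda text: text.encode(DEFAULT_ENCODING, "replace"),
    "ustring": lambda text: text,
    "utext": lambda text: text.replace("\r\n", "\n"),
    "ulines": lambda text: text.splitlines(),
    "utokens": lambda text: text.split(),
}


def marshal_field(field_name, raw_value):
    """Split "name:encoding:type" and convert the submitted bytes."""
    name, *tags = field_name.split(":")
    encoding = DEFAULT_ENCODING
    convert = None
    for tag in tags:
        if tag in CONVERTERS:
            convert = CONVERTERS[tag]
        else:
            codecs.lookup(tag)  # raises LookupError for an unknown encoding
            encoding = tag
    if convert is None:
        return name, raw_value  # untagged fields are left untouched
    return name, convert(raw_value.decode(encoding))
```

For example, marshal_field("age:utf8:int", b"42") returns the name "age" with the integer 42, and a "ulines" field returns a list of unicode lines.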

Character Encoding Used In Form Responses
-----------------------------------------

As explained above, you need to know which character encoding will be used by the browser to submit responses to your forms, and include the name of that encoding in the name of your form controls.

The encoding used by a browser depends on the encoding used by the page containing the form, and the type of form.

Forms submitted using GET, or using POST with "application/x-www-form-urlencoded" (the default):

1. Page uses a Unicode encoding
   Forms are submitted using UTF8, as required by RFC 2718 2.2.5
2. Page uses another regional or national 8-bit encoding
   Forms are often submitted using the same encoding as the page. If you choose to use such an encoding then you should also verify how browsers behave.

Forms submitted using "multipart/form-data":

According to HTML 4.01 (section 17.13.4) browsers should state which character encoding they are using for each field in a Content-Type header. However, I have never seen a browser actually do this.

The current browsers appear to use the same encoding as the page containing the form.

This all seems to be harder than it really should be. A no-brainer policy is to use UTF8 for every page, in which case form responses are also always UTF8.

Zope's Default Encoding
-----------------------

Zope allows you to mix plain strings and unicode strings. This will automatically do the right thing if the plain strings are using a latin-1 character encoding (or a subset of latin-1, such as ascii).

This default encoding is used when:

* unicode strings are mixed with plain strings in DTML
* the response is a unicode string, but the content-type does not include a charset
* a browser submits a form in unicode, but the parameter is marshalled to string, lines, tokens, or text (or any other marshalling type converter that is not unicode-aware)

This is less strict than basic Python, which will raise an exception when combining unicode strings with plain strings that contain characters outside the ascii range.

Extensions to the DTML namespace
--------------------------------

The DTML namespace (named _ in DTML expressions) now contains the following extra symbols, which are Python's new builtin functions of the same name:

* unicode
* unichr
* ustr (the name of this function has not yet been finalised)

Pages That Do Not Expect Unicode
--------------------------------

There are many DTML pages that are not currently unicode aware, including most of Zope's management interface. These changes have been designed to allow these DTML pages to remain unchanged if they never see unicode data, and to degrade gracefully if they should encounter unicode data accidentally.

The following issues should not be a problem:

* If a unicode property containing characters outside the latin-1 range is used on a page that is not unicode-aware, those characters will be replaced by a question mark. This currently allows standard Zope properties (such as title) to be unicode, without updating all pages in the management interface that use them.
* There may be problems with using unicode properties on a page that does not contain latin-1 data, but which also does not set an appropriate content-type header.
* The properties management tab only uses UTF8 if an existing property uses unicode. This means that the initial value for the first unicode property may only contain latin-1 characters. Of course, the property may be changed to use any unicode character immediately after creation.
* In some circumstances, Zope modifies the returned html to include a tag. This modification will only work with character encodings that are a superset of ascii (i.e. not UTF16).

The following problems remain unresolved:

* Python will throw an exception if non-ascii plain strings are compared to unicode strings. This will cause problems for ZCatalog if one index contains both non-ascii plain strings and unicode strings. A workaround for this problem is to provide an external method which returns that property in unicode, then index the external method. Note that I think ZCatalog is already on dangerous ground in this area: http://classic.zope.org:8080/Collector/1219/view
* XML-RPC does not support unicode. (Thanks to Martijn Pieters for pointing this out.)
* Python code that uses DTML may be broken when it returns a unicode string.
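The ZCatalog workaround mentioned above (an external method that always returns the property in unicode) might look roughly like the sketch below. It is an illustration only: the function name title_as_unicode and the choice of the title property are hypothetical, latin-1 is assumed as the Default Character Encoding, and the code targets a current Python interpreter, where bytes stands in for a plain string. It relies on the object providing getProperty, as Zope property managers do.

```python
# Assumed value of Zope's Default Character Encoding for this sketch.
DEFAULT_ENCODING = "latin-1"


def title_as_unicode(self):
    """External method: return the object's title property as unicode.

    Indexing this method instead of the raw property means the index only
    ever sees unicode values, so mixed comparisons cannot occur.
    """
    value = self.getProperty("title")
    if isinstance(value, bytes):
        # Plain string: convert using the default encoding.
        value = value.decode(DEFAULT_ENCODING)
    return value
```

The catalog index would then be defined on title_as_unicode rather than on title.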