Zope Unicode support version 0.3

This modification to Zope provides support for python 2.0 Unicode strings in ZPublisher, property pages, and property sheets.

Copyright (c) 1999, 2000 Toby Dickenson

Permission to use this software in any way is granted without fee, provided that the copyright notice above appears in all copies. This software is provided "as is" without any warranty.

Send comments to Toby Dickenson, [email protected]

Installation
------------

This patch was developed with Zope 2.2 beta and Python 2.0 alpha. It might work with later versions.

Zope currently needs an older version of a module in the python standard library. Keep an eye on http://classic.zope.org:8080/Collector/1413/view and http://sourceforge.net/bugs/?func=detailbug&bug_id=110911&group_id=5470

Some alpha versions of Python 2.0 set the default encoding based on locale. For this patch the default encoding must be ascii. (The final Python 2.0 is expected to enforce this constraint.) If your version sets the default encoding from the locale, modify Python's site.py using the patch in the Appendix. If you are not sure, try this:

    Python 2.0b1 (#0, Jul 10 2000, 12:36:06) [MSC 32 bit (Intel)] on win32
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

If you get an exception (AttributeError: getdefaultencoding) then you don't need to patch.

If you see ascii then you don't need to patch.

Anything else, and you do.
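The three cases above can be folded into a single check. This is only a sketch: the name needs_site_patch is hypothetical, and the function takes the sys module as a parameter so the decision logic can be tried against stand-in objects. On a modern Python the default encoding is fixed, so the check is only meaningful on the Python 2.0 pre-releases this patch targets.

```python
import sys


def needs_site_patch(sys_module=sys):
    """Return True if site.py should be patched (hypothetical helper)."""
    getter = getattr(sys_module, "getdefaultencoding", None)
    if getter is None:
        # Same situation as the AttributeError case: this Python has no
        # default-encoding machinery, so there is nothing to patch.
        return False
    # Only a default encoding of "ascii" is acceptable for this patch.
    return getter() != "ascii"
```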

Changes for Content Managers
----------------------------

Property pages and property sheets now include the extra types ustring, utokens, utext, and ulines. These are the unicode equivalents of string, tokens, text, and lines.

Unicode strings can be mixed freely with plain strings in DTML. DTML will return a unicode string if any of its constituents are unicode, otherwise it will return a plain string as before.

When unicode strings are mixed with plain strings, the plain string is converted to unicode assuming that it contains characters in Zope's Default Character Encoding, discussed below.
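The mixing rule can be sketched in a few lines. This is not the DTML implementation; it is an illustration written for a modern Python, where bytes plays the role of a plain string and str the role of a unicode string. The names join_parts and DEFAULT_ENCODING are made up for the example.

```python
DEFAULT_ENCODING = "latin-1"  # stands in for Zope's Default Character Encoding


def join_parts(parts, default_encoding=DEFAULT_ENCODING):
    """Concatenate rendered fragments following the rule described above."""
    if all(isinstance(part, bytes) for part in parts):
        # No unicode involved: the result stays a plain string, as before.
        return b"".join(parts)
    # At least one unicode fragment: promote every plain string first,
    # assuming it is encoded in the default encoding.
    return "".join(
        part.decode(default_encoding) if isinstance(part, bytes) else part
        for part in parts
    )
```

For example, join_parts([b"caf\xe9 ", "\u0394"]) gives a unicode result, while join_parts([b"a", b"b"]) stays a plain string.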

ZPublisher has been changed to handle a unicode response. If the response is not unicode then it behaves exactly as before. However, if it is unicode then ZPublisher applies the character encoding specified by the charset property in the Content-Type header. (This applies to all text/* content types.)

If you expect that your pages might include Unicode data, change your standard_html_header to something like:

Content-Type,text/html; charset=UTF-8)"> <dtml-var title_or_id>

The line is necessary to force the response into unicode. Without this there is a chance that ZPublisher's response would not be unicode, and the character encoding mechanism would not be triggered.

If the Content-Type header does not include a charset property (or if the header is blank, in which case ZPublisher guesses text/html) then the unicode string is encoded using Zope's Default Character Encoding.
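The response handling described in this section might look roughly like the following. It is a sketch under stated assumptions, not ZPublisher's code: charset_from_content_type and encode_response are hypothetical names, latin-1 stands in for the Default Character Encoding, and the "replace" error mode is chosen so that unmappable characters become question marks. Written for a modern Python, with bytes as the plain string and str as the unicode string.

```python
def charset_from_content_type(header):
    """Return the charset parameter of a Content-Type header, or None."""
    for param in header.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"') or None
    return None


def encode_response(body, content_type="", default_encoding="latin-1"):
    """Encode a unicode body; pass a plain-string body through untouched."""
    if isinstance(body, bytes):
        # Not unicode: behave exactly as before.
        return body
    charset = charset_from_content_type(content_type)
    # No charset (or no header at all): fall back to the default encoding.
    return body.encode(charset or default_encoding, "replace")
```

With this sketch, encode_response("caf\u00e9", "text/html; charset=UTF-8") yields the two-byte UTF8 form of the accented letter, while the same call without a charset yields a single latin-1 byte.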

Changes for Forms
-----------------

ZPublisher has special processing for field names of the form "name:type" (for example "age:int", or "address:string"). ZPublisher uses these extra tags to marshal the form values into the correct type.

This mechanism has been extended to include a specification of the character encoding used by the form response. You need to know which encoding will be used by the browser and include an appropriate tag, for example "age:utf8:int" or "address:utf8:string". The tag parser insists that tags use only alphanumeric characters or an underscore, so you might need to use a short form of the encoding name (such as UTF8 rather than UTF-8).

Four extra type converters have been added, the unicode equivalents of the existing string types: ustring, utokens, utext, and ulines. If the field name does not include a character encoding tag, then the Default Character Encoding is assumed.
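To make the tag mechanism concrete, here is a minimal sketch of how a field name such as "age:utf8:int" could be split and applied. It is not ZPublisher's parser. The names CONVERTERS, parse_field_name and marshal_field are hypothetical, the converters are simplified stand-ins, and defaulting to ustring when no type tag is given is an assumption made for the example. Any tag that is not a known converter is treated as an encoding name.

```python
import codecs

# Simplified stand-in converters; each receives already-decoded text.
CONVERTERS = {
    "int": int,
    "ustring": str,
    "utext": str,  # left as-is in this sketch
    "ulines": lambda text: text.splitlines(),
    "utokens": lambda text: text.split(),
}


def parse_field_name(field_name, default_encoding="latin-1"):
    """Split 'name:tag:tag' into (name, encoding, type_name)."""
    name, *tags = field_name.split(":")
    encoding, type_name = default_encoding, "ustring"
    for tag in tags:
        if tag in CONVERTERS:
            type_name = tag
        else:
            codecs.lookup(tag)  # raises LookupError for an unknown encoding
            encoding = tag
    return name, encoding, type_name


def marshal_field(field_name, raw_value):
    """Decode the submitted bytes, then apply the type converter."""
    name, encoding, type_name = parse_field_name(field_name)
    return name, CONVERTERS[type_name](raw_value.decode(encoding))
```

Python's codec lookup accepts short forms such as utf8 or UTF8, which is convenient given the restriction on tag characters.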

Character Encoding Used In Form Responses
-----------------------------------------

As explained above, you need to know which character encoding will be used by the browser to submit responses to your forms, and include the name of that encoding in the name of your form controls.

The encoding used by a browser depends on the encoding used by the page containing the form, and the type of form.

  1. Forms submitted using GET, or using POST with "application/x-www-form-urlencoded" (the default)
    1. Page uses an encoding of unicode

      Forms are submitted using UTF8, as required by RFC 2718 2.2.5

    2. Page uses a regional or national 8 bit encoding

      Forms are often submitted using the same encoding as the page. If you choose to use such an encoding then you should also verify how browsers behave.

  2. Forms submitted using "multipart/form-data"

    According to HTML 4.01 (section 17.13.4) browsers should state which character encoding they are using for each field in a Content-Type header, however I have never seen a browser actually do this.

    The current browsers appear to use the same encoding as the page containing the form.

This all seems to be harder than it really should be. A no-brainer policy is to use UTF8 for every page, in which case form responses are also always UTF8.
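A short illustration of why the tag has to match what the browser actually sends. The helper decode_form_value is hypothetical and the example is written for a modern Python. It decodes the same submitted bytes twice, once with the right encoding name and once with the wrong one.

```python
def decode_form_value(raw_value, tagged_encoding):
    """Decode submitted bytes using the encoding named in the field tag."""
    return raw_value.decode(tagged_encoding)


# A browser on a UTF8 page submits an e-acute as the two bytes C3 A9.
submitted = "\u00e9".encode("utf-8")

correct = decode_form_value(submitted, "utf8")     # tag matches the page
garbled = decode_form_value(submitted, "latin-1")  # tag does not match
```

Here correct is the single character the user typed, while garbled is two unrelated latin-1 characters. No error is raised in the second case, which is why the mismatch is easy to miss.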

Zope's Default Encoding
-----------------------

Zope allows you to mix plain strings and unicode strings. This will automatically do the right thing if the plain strings use the latin-1 character encoding (or a subset of latin-1, such as ascii).

This default encoding is used when:

  • unicode strings are mixed with plain strings in DTML
  • the response is a unicode string, but the content-type does not include a charset
  • a browser submits a form in unicode, but the parameter is marshalled to string, lines, tokens, or text (or any other marshalling type converter that is not unicode-aware)

This is less strict than basic Python, which will raise an exception when combining unicode strings with plain strings that contain characters outside the ascii range.
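The difference in strictness can be shown side by side. Both functions are hypothetical and written for a modern Python, where the promotion of a plain string (bytes) has to be spelled out explicitly. The first mimics basic Python's ascii-only promotion, the second the more lenient latin-1 promotion described here.

```python
def python_style_mix(plain, text):
    """Basic Python behaviour: promote the plain string as strict ascii."""
    return plain.decode("ascii") + text


def zope_style_mix(plain, text, default_encoding="latin-1"):
    """Patched behaviour: promote using Zope's default encoding."""
    return plain.decode(default_encoding) + text
```

Calling python_style_mix(b"caf\xe9", "!") raises UnicodeDecodeError, whereas zope_style_mix with the same arguments returns a unicode string.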

Extensions to the DTML namespace
--------------------------------

The DTML namespace (named _ in DTML expressions) now contains the following extra symbols, which are Python's new builtin functions of the same name:

  • unicode
  • unichr
  • ustr (the name of this function has not yet been finalised)

Pages That Do Not Expect Unicode
--------------------------------

There are many DTML pages that are not currently unicode aware, including most of Zope's management interface. These changes have been designed to allow these DTML pages to remain unchanged if they never see unicode data, and to degrade gracefully if they should encounter unicode data accidentally.

The following issues should not be a problem:

  • If a unicode property containing characters outside the latin-1 range is used on a page that is not unicode-aware, those characters will be replaced by a question mark. This currently allows standard Zope properties (such as title) to be unicode, without updating all pages in the management interface that use them.
  • There may be problems with using unicode properties on a page that does not contain latin-1 data, but which also does not set an appropriate content-type header.
  • The properties management tab only uses UTF8 if an existing property uses unicode. This means that the initial value for the first unicode property may only contain latin-1 characters. Of course, the property may be changed to use any unicode character immediately after creation.
  • In some circumstances, Zope modifies the returned html to include a tag. This modification will only work with character encodings that are a superset of ascii (i.e. not UTF16).
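Two of the points above are easy to demonstrate. The sketch below uses hypothetical helper names and a modern Python. The first function shows the question-mark degradation, the second is a simple probe-based check (not an exhaustive test) of whether ascii markup keeps the same bytes in a given encoding.

```python
def degrade_for_legacy_page(text, encoding="latin-1"):
    """Encode for a page that is not unicode-aware.

    Characters the encoding cannot represent become question marks.
    """
    return text.encode(encoding, "replace")


def is_ascii_superset(encoding, probe="<html>"):
    """True if ascii markup has identical bytes in this encoding."""
    return probe.encode(encoding) == probe.encode("ascii")
```

A euro sign passed through degrade_for_legacy_page comes out as a question mark. is_ascii_superset reports True for utf-8 and latin-1 but False for utf-16, which is why inserting ascii markup into an already-encoded UTF16 response would corrupt it.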

The following problems remain unresolved:

  • Python will throw an exception if non-ascii plain strings are compared to unicode strings. This will cause problems for ZCatalog if one index contains both non-ascii plain strings, and unicode strings. A workaround for this problem is to provide an external method which returns that property in unicode, then index the external method. Note that I think ZCatalog is already relying on dangerous ground in this area: http://classic.zope.org:8080/Collector/1219/view
  • The current version of xmlrpclib does not support unicode
  • Python code that uses DTML may be broken when it returns a unicode string.
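The ZCatalog workaround amounts to normalising every value to unicode before it reaches the index. A minimal sketch of such a normalising function follows. The name as_unicode is hypothetical, latin-1 is assumed as the encoding of the plain strings, and the code is written for a modern Python where mixing bytes and str in a sort fails with TypeError. An external method would wrap something similar around the property being indexed.

```python
def as_unicode(value, encoding="latin-1"):
    """Return value as a unicode string, decoding plain strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# A mix of non-ascii plain strings and unicode strings, as one index
# might accumulate over time.
mixed_titles = [b"caf\xe9", "\u0394elta", "apple"]

# Normalising first makes every pair of keys comparable.
index_keys = sorted(as_unicode(title) for title in mixed_titles)
```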

Appendix
--------

Appendix A: Disable locale-dependent character encoding

Index: site.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/site.py,v
retrieving revision 1.12
diff -c -r1.12 site.py
*** site.py 2000/06/28 14:48:01 1.12
--- site.py 2000/07/10 13:51:13
***************
*** 134,147 ****
      except LookupError:
          sys.setdefaultencoding('ascii')
  
! if 1:
      # Enable to support locale aware default string encodings.
      locale_aware_defaultencoding()
  elif 0:
      # Enable to switch off string to Unicode coercion and implicit
      # Unicode to string conversion.
      sys.setdefaultencoding('undefined')
! elif 0:
      # Enable to hard-code a site specific default string encoding.
      sys.setdefaultencoding('ascii')
  
--- 134,147 ----
      except LookupError:
          sys.setdefaultencoding('ascii')
  
! if 0:
      # Enable to support locale aware default string encodings.
      locale_aware_defaultencoding()
  elif 0:
      # Enable to switch off string to Unicode coercion and implicit
      # Unicode to string conversion.
      sys.setdefaultencoding('undefined')
! elif 1:
      # Enable to hard-code a site specific default string encoding.
      sys.setdefaultencoding('ascii')
  