Client Cache Tracing

An important question for ZEO users is: how large should the ZEO client cache be? ZEO 2 (as of ZEO 2.0b2) has a new feature that lets you collect a trace of cache activity and tools to analyze this trace, enabling you to make an informed decision about the cache size.

Don't confuse the ZEO client cache with the Zope object cache. The ZEO client cache is only used when an object is not in the Zope object cache; the ZEO client cache avoids roundtrips to the ZEO server.

Enabling Cache Tracing

To enable cache tracing, set the environment variable ZEO_CACHE_TRACE to the name of a file to which the ZEO client process can write. ZEO will append a hyphen and the storage name to the filename, to distinguish different storages. If the file doesn't exist, the ZEO will try to create it. If there are problems with the file, a log message is written to the standard Zope log file. To start or stop tracing, the ZEO client process (typically a Zope application server) must be restarted.

The trace file can grow pretty quickly; on a moderately loaded server, we observed it growing by 5 MB per hour. The file consists of binary records, each 24 bytes long; a detailed description of the record lay-out is given in stats.py. No sensitive data is logged.

Analyzing a Cache Trace

The stats.py command-line tool is the first-line tool to analyze a cache trace. Its default output consists of two parts: a one-line summary of essential statistics for each segment of 15 minutes, interspersed with lines indicating client restarts and "cache flip events" (more about those later), followed by a more detailed summary of overall statistics.

The most important statistic is probably the "hit rate", a percentage indicating how many requests to load an object could be satisfied from the cache. Hit rates around 70% are good. 90% is probably close to the theoretical maximum. If you see a hit rate under 60% you can probably improve the cache performance (and hence your Zope application server's performance) by increasing the ZEO cache size. This is normally configured using the cache_size keyword argument to the ClientStorage() constructor in your custom_zodb.py file. The default cache size is 20 MB.

The stats.py tool shows its command line syntax when invoked without arguments. The tracefile argument can be a gzipped file if it has a .gz extension. It will read from stdin (assuming uncompressed data) if the tracefile argument is '-'.

Simulating Different Cache Sizes

Based on a cache trace file, you can make a prediction of how well the cache might do with a different cache size. The simul.py tool runs an accurate simulation of the ZEO client cache implementation based upon the events read from a trace file. A new simulation is started each time the trace file records a client restart event; if a trace file contains more than one restart event, a separate line is printed for each simulation, and line with overall statistics is added at the end.

Example, assuming the trace file is in /tmp/cachetrace.log:

$ python simul.py -s 100 /tmp/cachetrace.log
  START TIME  DURATION    LOADS     HITS INVALS WRITES  FLIPS HITRATE
Sep  4 11:59     38:01    59833    40473    257     20      2  67.6%
$

This shows that with a 100 MB cache size, the cache hit rate is 67.6%. So let's try this again with a 200 MB cache size:

$ python simul.py -s 200 /tmp/cachetrace.log
  START TIME  DURATION    LOADS     HITS INVALS WRITES  FLIPS HITRATE
Sep  4 11:59     38:01    59833    40921    258     20      1  68.4%
$

This showed hardly any improvement. So let's try a 300 MB cache size:

$ python2.0 simul.py -s 300 /tmp/cachetrace.log
ZEOCacheSimulation, cache size 300,000,000 bytes
  START TIME  DURATION    LOADS     HITS INVALS WRITES  FLIPS HITRATE
Sep  4 11:59     38:01    59833    40921    258     20      0  68.4%
$ 

This shows that for this particular trace file, the maximum attainable hit rate is 68.4%. This is probably caused by the fact that nearly a third of the objects mentioned in the trace were loaded only once -- the cache only helps if an object is loaded more than once.

The simul.py tool also supports simulating different cache strategies. Since none of these are implemented, these are not further documented here.

Cache Flips

The cache uses two files, which are managed as follows:

  • Data are written to file 0 until file 0 exceeds limit/2 in size.
  • Data are written to file 1 until file 1 exceeds limit/2 in size.
  • File 0 is truncated to size 0 (or deleted and recreated).
  • Data are written to file 0 until file 0 exceeds limit/2 in size.
  • File 1 is truncated to size 0 (or deleted and recreated).
  • Data are written to file 1 until file 1 exceeds limit/2 in size.

and so on.

A switch from file 0 to file 1 is called a "cache flip". At all cache flips except the first, half of the cache contents is wiped out. This affects cache performance. How badly this impact is can be seen from the per-15-minutes summaries printed by stats.py. The -i option lets you choose a smaller summary interval which shows the impact more acutely.

The simul.py tool shows the number of cache flips in the FLIPS column. If you see more than one flip per hour the cache may be too small.