ZCatalog Sort Enhancements for Zope 2.6.1 and Beyond

ZCatalog Improvements for Zope 2.6.1 and beyond

I don't know what it is. Maybe we just came to accept the fact that the Catalog was as slow as it was and never questioned it. Maybe its that scary code at the bottom of Catalog.py. Maybe we just assumed it couldn't be made any better. Let's just say that we were all dead wrong, and that's a good thing!

Now I don't mean to imply that the Catalog is a slow search engine. In fact, most actual search times inside the Catalog, that is the time it takes to apply the indexes and intersect the results is actual extremely fast. For non-text indexes it is damn near instantaneous.

The problem lies in the sort phase. Few of you are content to present your query records in random order. Sorting is so ubiquitous that pretty much the only times we don't sort the result is when using a text index (it has a score that it sorts by) and when we know the query will only return one record. So it pretty much doesn't matter how fast you can find the results if sorting more than a few hundred of them takes a dog's age.

Ok, so lets take a set back for a minute to talk a little bit about optimization, or simply the art of making things freaking fast. There are many tips for optimizing, but here are some basic principles:

Doing less work is faster than doing more work.
Doing work earlier or later than now can be faster than doing it now.
Doing no work is fastest of all.

The last point is probably the most obvious, but ironically the most overlooked. In practice this is achieved by either short circuiting code at opportune moments, moving as much work as possible out of loops or (my favorite) simply deleting code altogether, possibly replacing it with something that gives the same result and does less work (the last part is sometimes, gratifyingly, optional).

The unfortunate thing about optimizing is that although general algorithms (the one size fits all variety) are rarely the fastest for any case, but are often much faster than "optimized" code in certain "worst" cases. What this means is that one general (but possibly slow) algorithm often must be replaced by several algorithms that are each fastest for a particular case and make certain assumptions about the task at hand. Assumptions that can be foiled turning normally fast code into a veritable tar pit for your application. So, with this in mind, another tip for would be optimizers is:

Test your "optimized" code in many different cases to flesh out pathological behavior as well as bugs.

Speeding Up the Full Sort

I have intended for some time to introduce some optimizations to the Catalog that could take cues from your application and get done faster based on that. For example, it is quite common for a web page to display a fixed number of results in batches, or a news box to display a small number of items. So it makes sense to be able to tell the catalog "I really only care about the top 10 results" and let the catalog only do the work it takes to find the top 10.

When I was called upon to improve the Catalog's performance I figured the above was a good place to start. Once I got into the guts of the Catalog though, I realized that even the general sort routine was doing far too much work. It was reconstructing a full list of the result set from scratch several times and instantiating one lazy result object per result record. The result was huge times and even larger memory consumption.

Lets look at some metrics. On a 700MHz machine with 512Mb of RAM, the 2.6.0 Catalog code can sort a result set of 30,000 records in a little under 5 seconds. Much less then the time it takes for your would be site visitors to click off to another site.

After eliminating the creation of those lazy objects and reducing the number of times it reconstructs lists of results, I was able to get the same sort to execute in about 1.1 seconds. Since it constructs far fewer objects and lists it uses a lot less memory too, It's not often you can save both of these commodities at the same time so I was quite pleased.

Limited Sorting

Happy at finding that big of a savings in the full sort, I returned to my original plan. In my participation with PythonLabs in developing ZCTextIndex I was exposed to a clever sorting algorithm called N-Best. It is much more efficient at finding the top items of an unordered list then any full sort is. Because web applications commonly only display the first few items on a page anyway, it is an ideal fit with ZCatalog.

Another great benefit of N-Best is that, unlike the full sort, only a small list of sorted results is constructed from the result set. This means another significant memory savings.

Remembering our discussion earlier, it bears pointing out where N-Best falls down. It is slowest when the list is already in order (not a problem here) and when N is larger then a small fraction of the total result size. In other words, N-Best is much slower than a full sort at finding the first 15,000 results of a 20,000 record result set. But since I know I can sort all 20,000 in less than a second, I can just select a full sort when I know N-Best is going to be slow.

So how much faster is this N-Best thing anyway? For the same 30,000 results above, the N-Best returns the top (or bottom) 20 result records in about 0.4 seconds, less than half the time it takes to do the newly optimized full sort (and no less than an order of magnitude better than when we started). It continues to perform well for larger limit values as well.

In order to set a limit for your queries, you can pass a new "sort-limit" value with your catalog query. This argument only affects querys with "sort-on" specified, it is ignored otherwise. It works with either a forward or reverse sort order. Here is an example:

      results = Catalog(Type='News', sort_on='date', 
                        sort_order='reverse', sort_limit=10)

This would return you the top 10 news items sorted descending by date.

The value you pass for the limit must be be an integer 1 or greater. In general this value could be the upper limit of the current batch you are displaying to a user. For instance, given a batch size of 20, pass a sort-limit value of 20 for the first page, 40 for the second page and so on. The sort code is designed so that if your sort-limit value is too big to be efficient (or is larger than the result set itself) it will fall back to the normal full sort. So you don't have to worry about specifying high sort-limit values.

An important point to realize is that sort-limit is only an optimization hint for the Catalog. The Catalog does not promise to always return a result set that size or smaller. In other words, you must still use the standard batching machinery to display your pages.

Ok, so here is a problem with using limits. It is common to tell the user how many results were found on the search results page. Using a sort limit defeats this because the length of the result set (measured with len()) is not necessarily the total actual number of results that were found. To solve this the catalog adds an attribute named "actual_result_count" to the result set. This attribute contains the actual number of results that the catalog found before the limit operation. To display this value in a page template you could use something like:

      <span tal:replace="results/actual_result_count">#</span> item(s) found.

Faster Still

All of the above optimizations were achieved without changing the API or data structures of the Catalog or indexes. This was pretty much a hard requirement for putting this into the 2.6.x maintenance stream. For Zope 2.7 I have no such restriction, so I tempted myself with the question, "Can we do better?" And the answer is "Hell yes!".

For example, a small change to the PlugInIndex api which eliminates a method call inside the sort loop and replaces it with a getitem buys about another 15% on the above numbers. This change is currently in what will become Zope 2.7, but its really only small potatoes compared to what is really possible.

For example, a change to the data stored in the indexes used for sorting results in a 50%+ savings to both the full sort and N-Best times. Which in case your counting brings us to about half a second for a full sort of 30,000 results and 0.16 seconds for N-Best limit sorts. Not bad considering the creation of the sort list is still done in a Python for loop!

So that begs the obvious question, what about rewriting it in C? A C version of this loop that does a bit less memory allocation only buys about another 0.05 seconds for the full sort in our test case. It seems our loop is fully optimized as even the Python version only takes a small fraction of the 0.5 second time. The thing using most of the time now is the Python list.sort method. We can't speed it up and keep the capability to sort by any type of object, so the only hope is to eliminate the call entirely (remember, this is my favorite part), but how?

Fastest of All

Part of the reason why the N-Best sort is so fast (even written in Python) is that it only sorts result records that are better than the ones it has already seen, and it only keeps the N best results in a list so sorting in a new value is extremely fast using an insertion sort. When the Python loop exits, the N-Best list is already in order so we don't need to sort it.

The full sort code is not so lucky. It creates a new list of results adding in the sort keys. Unfortunately this list is not in order by key when we are done making it, so we must call the Python sort method to sort it. As fast as that is, its still far slower than not having to do it at all.

In order to do better, we must explore a different strategy for sorting our list of results. Then, it came to me: the sort index already has a full list of results in sorted order by sort key! These sort indexes actually contain two indexes internally: A mapping from key to record id(s) and another mapping from record id to key. The latter is the reverse index, and it is what the current full sort uses to get the sort keys. It is ordered by record id, but the forward index is ordered by sort key already!

This results in a wonderful opportunity. We can turn our worst case (a full sort of the entire catalog) into a best case since we already have all the records in order inside the index! It also allows us to optimize the sorting of any result set containing a large fraction of all the records in the catalog. This also makes possible an alternative to N-Best for limit operations on large result sets.

A prototype of a sort limit algorithm using the forward index (replacing N-Best for large result sets) can find the top twenty records of our 30,000 record set in 0.0026 seconds. Not bad. It continues to outperform N-Best down to result set sizes that are around a quarter the size of the entire catalog.

The full sort of the entire catalog would be theoretically instantaneous since it can simply create a lazy result set around the index itself and return. For result sets somewhat smaller than that, we can create a slightly smarter wrapper that walks down the index finding the next result in order in the record set. It would mean the sorting would actually be done lazily and the application would only spend time sorting those records it actually retrieved. And the sort times would not increase no matter how big the catalog got. For small result sets the lazy sorting would be slow, so we would just fall back on our newly optimized full sort and N-best algorithms.

And it gets better. Lets say you have a really big catalog with millions of records. Using either the full sort or N-Best on the entire catalog means loading the entire reverse index from the database. This uses a lot of memory and takes a lot of time if the objects in the index must be fetched from disk. All of the above numbers assume everything is in memory, they would increase dramatically if the disk became involved. Both our new algorithms only access records in the index proportional to the records in the result set that the application actually uses. So it could be a huge win for scalability of sites making heavy use of ZCatalog. It also frees up space in the ZODB cache which is a huge benefit for the rest of the application.

And one more thing. The python list sort compares objects to each other at sort time. This can execute arbitrary code depending on the objects which makes its time highly variable. I was sorting long integers which are relatively fast to sort. Using the already sorted forward index eliminates the need to compare objects, making it no slower to sort complex objects with custom comparison methods than integers.

Looking Forward

Well, thats it. Thanks for hanging in there! I plan to incorporate the lazy sorting algorithms into the ZCatalog for Zope 2.7 soon. So you can look forward to an even faster catalog in the near future!

Happy sorting!

-Casey