Debug a spinning Zope

Last modified on 2003/09/12.

"Spinning" is when a request causes a running Zope to consume all available CPU indefinitely. This is usually caused by some kind of infinite loop or deadlock, and is painful to debug. Under Linux, at least, I've been able to use gdb to solve one spinning problem.

I've only tried this on a Mandrake 8.1 Linux installation, with a multi-threaded, zdaemoned Zope 2.5.1 running under Python 2.1.3. I have no experience debugging any other configuration this way.

Attach to Zope with the GNU Debugger
Don't know how to use gdb? Neither do I, but I was able to muddle through.
- Look in your "var/Z2.pid" file and get the second pid listed there.
- Run gdb with the name of your python executable. For example, with Python 2.1.3, I ran "gdb python2.1".
- At the "(gdb)" prompt, type "attach <pid>", where <pid> is the pid you found earlier.
- If all goes well, you will have to page through several screens' worth of "Reading symbols" spew. Hit return until it's done.
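The first steps above can be sketched in Python. This is only a rough helper, and it assumes that "var/Z2.pid" holds its pids separated by whitespace; the function names are made up for illustration and are not part of Zope or gdb.

```python
def second_pid(pid_file_text):
    """Return the second pid found in the text of a Z2.pid file.

    Assumption: the pids are separated by whitespace.
    """
    pids = pid_file_text.split()
    if len(pids) < 2:
        raise ValueError("expected at least two pids, got %r" % (pids,))
    return int(pids[1])


def gdb_commands(pid_file_text, python_exe="python2.1"):
    """Build the shell command and the gdb command from the steps above.

    The default executable name matches the Python 2.1.3 example in the text;
    pass your own interpreter name otherwise.
    """
    pid = second_pid(pid_file_text)
    return ["gdb %s" % python_exe, "attach %d" % pid]
```

For example, `gdb_commands(open("var/Z2.pid").read())` gives you the two lines to type, the first at the shell and the second at the "(gdb)" prompt.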
Find the spinning thread
- Type "info threads" at the "(gdb)" prompt.
- Unless your Zope is very busy, most of the threads should be in sigsuspend(), poll(), or select(). The troublemaker is usually the one that is somewhere else. Failing that, check "top" for the pid of the thread that's using all the CPU time, and look for "(LWP <pid>)" with that pid in the thread list.
- Supposing our culprit is listed as "4 Thread 2051 (LWP 8236) ...", we now switch to thread 4 with the command "thread 4".
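If the thread list is long, matching the pid from "top" against the "(LWP ...)" entries can be done mechanically. The sketch below assumes each line of pasted "info threads" output looks like the example above (thread number, the word "Thread", an id, then the LWP in parentheses); other gdb versions may print something different, and the function name is invented here.

```python
import re

# Matches lines such as "  4 Thread 2051 (LWP 8236)  0x... in select ()".
# A leading "*" marks gdb's currently selected thread and is skipped.
THREAD_LINE = re.compile(r"^\s*\*?\s*(\d+)\s+Thread\s+\S+\s+\(LWP\s+(\d+)\)")


def gdb_thread_for_lwp(info_threads_text, lwp):
    """Return the gdb thread number whose LWP equals `lwp`, or None."""
    for line in info_threads_text.splitlines():
        match = THREAD_LINE.match(line)
        if match and int(match.group(2)) == lwp:
            return int(match.group(1))
    return None
```

With the example line above, `gdb_thread_for_lwp(text, 8236)` returns 4, which is the number to give to the "thread" command.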
Get a traceback
Now for the fun part, thanks to a post by Barry Warsaw.
- Type the following at the prompt:
```
    call PyRun_SimpleString("import sys, traceback; sys.stderr=open('/tmp/tb','w',0); traceback.print_stack()")
```
- Look in "/tmp/tb" for a complete Python traceback of the current call stack of the thread.
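To see what the injected string does without attaching a debugger, here is a small standalone sketch. It is not the command from the post: it passes the file to traceback.print_stack() directly instead of replacing sys.stderr, because the unbuffered text-mode open() in the one-liner is a Python 2 idiom and raises ValueError on Python 3. The function names are illustrative only.

```python
import os
import tempfile
import traceback


def dump_stack(path):
    """Write the current Python call stack to `path`, outermost frame first."""
    with open(path, "w") as out:
        traceback.print_stack(file=out)


def inner(path):
    dump_stack(path)


def outer(path):
    inner(path)


def demo():
    """Call outer() and return the text that landed in the dump file."""
    fd, path = tempfile.mkstemp(suffix=".tb")
    os.close(fd)
    try:
        outer(path)
        with open(path) as dump:
            return dump.read()
    finally:
        os.remove(path)
```

The text returned by `demo()` lists the frames for outer, inner and dump_stack in that order, which is the same kind of listing you should find in "/tmp/tb" for the spinning thread.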
Figure out where the loop/deadlock is
I can't give step-by-step instructions on this one. Try repeating the "Get a traceback" step several times; you should see a pattern. In my case, the thread was always in the __read() method of an NMB connection, and I discovered that it was being called with no timeout value.
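Spotting the pattern can be helped along with a few lines of Python. The sketch below assumes you saved each traceback to its own file (the one-liner overwrites "/tmp/tb" every time, so you would need to copy it between runs) and that frames appear in the usual 'File "...", line N, in name' form. It skips frames from "<string>", which is the file name Python gives to code run through PyRun_SimpleString. All names here are made up for illustration.

```python
import re
from collections import Counter

# One frame header of a standard Python traceback listing.
FRAME = re.compile(r'^\s*File "([^"]+)", line (\d+), in (\S+)', re.MULTILINE)


def innermost_frame(traceback_text):
    """Return (file, function) of the deepest frame, or None if none found.

    Frames from "<string>" are ignored, since they belong to the injected
    one-liner rather than to the code being debugged.
    """
    frames = [f for f in FRAME.findall(traceback_text) if f[0] != "<string>"]
    if not frames:
        return None
    file_name, _line, func = frames[-1]
    return (file_name, func)


def hot_spots(traceback_texts):
    """Count how often each innermost frame occurs, most frequent first."""
    counts = Counter()
    for text in traceback_texts:
        frame = innermost_frame(text)
        if frame is not None:
            counts[frame] += 1
    return counts.most_common()
```

If the same (file, function) pair tops the list from `hot_spots()` across most of your samples, that is probably where the thread is stuck.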