Optimising Python on multi-processor machines

Last updated 3rd September 2002

Various snippets culled from the mailing list archives. For Solaris-specific advice see here.

In what follows it seems important to be clear about whether it is "processes" or "processors" that are being discussed.

-----------------------------------
On Tue, 2002-08-20 at 12:20, [email protected] wrote:
> I'll second support for use of Debian.  We use it in a lot of production
> tasks on a bunch of servers and for development workstations.  That said,
> with some of our new boxes, we are using Debian, but compiling RedHat
> enterprise kernels from source (easier than patching the stock kernel) to get
> better hardware monitoring and, more importantly, the O(1) scheduler, which
> does have some crude CPU affinity support.  I believe this is one area where
> Linux has been a bit behind some other Unix varieties.
>
> Though we haven't used it yet, as I understand it, CPU affinity is important
> to Python performance on SMP machines.   I expect that we will write or find
> a simple user-space utility utilizing the new system calls to bind a group
> of processes to a single CPU.  I think, in theory, this will allow us to
> successfully run two Zope instances on an inexpensive 2CPU machine, each
> instance bound to a respective CPU.

Sean,

You should look into the RedHat Advanced Server Kernel. The scheduler in
it is far superior to the standard, and includes CPU affinity. I'm not
sure how often Python does context switches, but the RHAS kernel shows a
4X improvement in that regard. :)

I understand some/many of those changes can also be had in the -AA
patches to the 2.4 kernel. It also appears that if you want to
use XFS (my personal, and professional recommendation) you would be
better off using the -AA series. I have yet to complete the XFS
w/RHAS2.1 document. It is not trivial, yet.

Of course, you can get this by downloading just the kernel RPM, but if
you are aiming at a fast, high-availability production (SMP-)server, you
would gain much benefit by using the full RHAS2.1, either by purchasing
it, or building it from source rpms, whichever fits your budget better.
:)

Bill

--
Bill Anderson, RHCE
Linux in Boise Club                  http://www.libc.org
-----------------------------------
Guido van Rossum <[email protected]> wrote:
> > I don't think you are missing anything. The Python GIL is a bit of
> > a show-stopper - I've been surprised that this isn't more widely
> > known. Hoping I'm wrong nonetheless ....
>
> I wonder if Guido has any comment on this?

I haven't seen the rest of the thread, so I don't know the context.

The GIL *is* widely known, and there's nothing that can be done about
it (without redesigning all of Python's runtime from scratch, anyway).

To use Python on multiple processors, the best thing to do is to
run multiple processes, rather than multiple threads.
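
Guido's multi-process advice can be sketched with the multiprocessing module (added in Python 2.6, years after this thread; in 2002 the same idea meant os.fork() or separate interpreter invocations). The numbers and function below are purely illustrative:

```python
# A minimal sketch of the multi-process approach: each worker process
# gets its own interpreter and its own GIL, so CPU-bound Python code
# can genuinely run on several processors at once.
import multiprocessing

def cpu_bound(n):
    # Pure-Python, CPU-bound work; threads would serialize on the GIL,
    # but separate processes do not share one.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(cpu_bound, [100000] * 4)
    print(results[0])
```

With threads, the four calls to cpu_bound would time-slice on one processor; with a Pool they can use all four.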
-----------------------------------
Dennis Allison <[email protected]> wrote:

I have been running Zope successfully on a SMP Athlon system having
ignored Matt's advice given below.  There do not appear to be correctness
issues in a SMP environment, only potential performance issues with
Python/Zope.  I wanted the multiple CPUs available for non-Python tasks
and was willing to accept the performance hit.

I do think that there is a need for a SMP friendly Python runtime.  The
cost/performance ratios for small SMPs make them very attractive
platforms.
-----------------------------------
Guido van Rossum <[email protected]> wrote:

> I do think that there is a need for a SMP friendly Python runtime.

Feel free to submit patches to Python.  This was tried before, making
many of the internal data structures thread-safe and adding
fine-grained locks where necessary.  The net effect was a 50%
slow-down on uniprocessor machines running Linux.  On Windows it was a
bit better (Windows has more efficient low-level locks than Linux) but
still a significant slowdown.

So whether there's a need or not, I believe we'll all have to cope.
The multi-process approach works well.  For certain specialized
applications, it also works to write an extension module in C that
releases the GIL around CPU intensive calculations (as long as those
calculations don't touch any Python objects).
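
The GIL-release pattern Guido describes can be seen from pure Python, because several of CPython's C-coded standard-library routines already do exactly this. hashlib, for instance, drops the GIL while hashing large buffers (a CPython implementation detail, not a language guarantee, and hashlib itself postdates this thread), so threads hashing big blocks can overlap on an SMP box:

```python
# Sketch of threads overlapping inside C code that releases the GIL.
# CPython's hashlib drops the GIL for buffers larger than ~2 KB, so
# the four hashes below can proceed in parallel on a multi-processor
# machine, while an equivalent pure-Python loop could not.
import hashlib
import threading

def hash_chunk(data, results, index):
    # The C-level SHA-256 loop runs with the GIL released.
    results[index] = hashlib.sha256(data).hexdigest()

data = b"x" * (1 << 20)  # 1 MiB per thread
results = [None] * 4
threads = [threading.Thread(target=hash_chunk, args=(data, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0][:12])
```

A C extension doing numeric work would use the same trick via Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS around the computation.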
-----------------------------------
Paul Winkler <[email protected]> wrote:

> Related to the topic, I received a 2P Athlon MP machine running RH7.3.
> So how would I go about getting Zope to bind to one processor (if
> possible) or at least get the performance on par with a normal 1P machine?

The quickest & easiest treatment would be sys.setcheckinterval(),
which you can set just by passing the -i flag to
Z2.py.  Edit your Zope start script so that it looks something like:

exec /usr/bin/python \
     $INST_HOME/z2.py \
     -i NNN \
    ...

where NNN is pystones / 50.
How do you get pystones for your machine?
Run the pystone.py script, which will be somewhere
like /usr/lib/python2.1/test/pystone.py
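
The rule of thumb above is a straight division, shown here as a small helper. (This is all Python-2-era machinery: test/pystone.py was removed in Python 3.8 and sys.setcheckinterval() in Python 3.9, so treat the arithmetic, not the APIs, as the point. 30000 is an illustrative pystone figure, not a measurement.)

```python
# The "-i NNN where NNN is pystones / 50" rule as a helper.  The
# pystone figure would come from running test/pystone.py on your
# own machine.
def check_interval_for(pystones_per_second):
    return int(pystones_per_second / 50)

print(check_interval_for(30000))  # -> 600
```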
-----------------------------------
Dennis Allison <[email protected]> wrote:

> Related to the topic, I received a 2P Athlon MP machine running RH7.3.
> So how would I go about getting Zope to bind to one processor (if
> possible) or at least get the performance on par with a normal 1P machine?
>
> I don't really have a choice regarding the machine. I did read the
> article whose link is provided below, however not being an OS person I
> didn't really understand half of it.
>
> TIA
> AM

That is precisely the configuration I run without problem.

I have not (yet) looked at the Python code, but I am reasonably sure my
intuition is correct.  (Matt or Guido -- correct me if I am wrong...)

First, safety is not an issue modulo thread safety in the uniprocessor
machine and the correctness of the SMP implementation. Multiple threads
allocated to different processors function correctly.  The problem is with
performance since the GIL serializes everything and blocks all processors,
not just the processor on which the thread is running.  This means that
the second processor does not contribute to the execution as it could, so
the effective CPU available is closer to 1.0 than 2.0.
-----------------------------------
Dieter Maurer <[email protected]> wrote:

 > > I do think that there is a need for a SMP friendly Python runtime.
 >
 > Feel free to submit patches to Python.  This was tried before, making
 > many of the internal data structures thread-safe and adding
 > fine-grained locks where necessary.  The net effect was a 50%
 > slow-down on uniprocessor machines running Linux.  On Windows it was a
 > bit better (Windows has more efficient low-level locks than Linux) but
 > still a significant slowdown.
 >
 > So whether there's a need or not, I believe we'll all have to cope.
 > The multi-process approach works well.  For certain specialized
 > applications, it also works to write an extension module in C that
 > releases the GIL around CPU intensive calculations (as long as those
 > calculations don't touch any Python objects).

I think there are two issues:

  It is well known that the GIL prevents a multi-threaded application
  from using the full potential of a multi-processor architecture.
  A multi-process architecture should be used instead.

  However, what seems to be new: even a multi-process architecture
  is far from optimal unless each process is explicitly bound
  to a single processor. Rumour has it that otherwise the process
  is (unnecessarily) moved to and fro between the different processors,
  which significantly reduces performance.
  Personally, I do not believe this out of hand. But, apparently,
  others have strong indications to this effect.
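
The explicit binding Dieter describes is easy to express on a modern Linux: os.sched_setaffinity() (Python 3.3+, Linux-only) pins the calling process to a set of CPUs. In 2002 this needed a patched kernel plus a user-space tool such as taskset; the sketch below is the present-day equivalent:

```python
# Pin the current process (pid 0 = ourselves) to CPU 0, so the
# scheduler stops migrating it between processors.  Linux-only.
import os

os.sched_setaffinity(0, {0})
print(sorted(os.sched_getaffinity(0)))  # -> [0]
```

Each Zope instance in the two-instance setup discussed above would run this (or be launched under taskset) with a different CPU number.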
-----------------------------------
"Matthew T. Kromer" <[email protected]> wrote:

>
>First, safety is not an issue modulo thread safety in the uniprocessor
>machine and the correctness of the SMP implementation. Multiple threads
>allocated to different processors function correctly.  The problem is with
>performance since the GIL serializes everything and blocks all processors,
>not just the processor on which the thread is running.  This means that
>the second processor does not contribute to the execution as it could, so
>the effective CPU available is closer to 1.0 than 2.0.
>
>

Well, in the worst case, it can actually give you performance UNDER 1X.  The latency of switching the GIL between CPUs comes right off your ability to do work in a quantum.  If you have a 1 gigahertz machine capable of doing 12,000 pystones of work, and it takes 50 milliseconds to switch the GIL (I don't know how long it takes; this is an example), you would lose 5% of your peak performance for *EACH* GIL switch.  Setting sys.setcheckinterval(240) will still yield the GIL 50 times a second.  If the GIL actually migrates only 10% of the times it is released, that would be 50 * .1 * 5% = 25% performance loss.  The cost to switch the GIL is going to vary, but will probably range between .1 and .9 time quanta (scheduler time intervals), and a typical time quantum is 5 to 10ms.
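
Matt's back-of-the-envelope figures can be checked directly (these are his illustrative example numbers, not measurements):

```python
# Worked version of the arithmetic above, using the example numbers.
yields_per_second = 50     # GIL released 50x/sec at setcheckinterval(240)
migration_fraction = 0.10  # GIL actually changes CPU on 10% of releases
switch_cost = 0.050        # 50 ms per cross-CPU switch (hypothetical)

# Each second there are 50 * 0.10 = 5 migrations, each burning 50 ms
# of that second: a 25% loss of peak throughput.
lost_fraction = yields_per_second * migration_fraction * switch_cost
print("%.0f%%" % (lost_fraction * 100))  # -> 25%
```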

The 'saving grace' of the linux scheduler is that when a thread gives up the GIL, it almost immediately gets it back again, rather than having another thread acquire it.  This is bad for average response time, but good for throughput -- it means the threads waiting on the GIL are woken up, but will fail to get the GIL and go back to sleep again.

However, I have directly observed a 30% penalty under MP constraints when the sys.setcheckinterval value was too low (and there was too much GIL thrashing).

Very little in Zope is capable of releasing the GIL and doing work independently; some of the database adapters can do that, but that usually does not represent a large number.  Curious side remark: when you have a LARGE number of threads, you usually do not have enough database threads!  The number of database threads is a default parameter to an initialization method, and is set to 7.  When you DO actually have lots of concurrent work occurring without GIL thrashing, you need to bump up the number of Zope database threads.  Sites that do a lot of XML-RPC or other high-latency I/O (network I/O needed to fulfill a request, not just to send back the response) usually need to bump up the number of database threads.  Otherwise, requests block waiting on a database thread in Zope, which is bad.
-----------------------------------