Consider these two CPU intensive functions written in Python

```python
def count_down_1(n):
    while n > 0:
        n -= 1

def count_down_2(n):
    while n > 0:
        n -= 1
```

Let’s do a small benchmark on sequential and parallel execution of these functions

```python
from threading import Thread
import time

COUNT = 10000000

# Sequential version
start = time.time()
count_down_1(COUNT)
count_down_2(COUNT)
print "Sequential version time: %s" % (time.time()-start)

# Threaded version
t1 = Thread(target=count_down_1, args=(COUNT,))
t2 = Thread(target=count_down_2, args=(COUNT,))
start = time.time()
t1.start(); t2.start()
t1.join(); t2.join()
print "Threaded version time: %s" % (time.time()-start)
```
Execution on my local 4-core machine produces the following result:

```
Sequential version time: 1.05200004578
Threaded version time: 2.15500020981
```
There are a few surprising things about this output. Though the sequential and threaded versions perform the same amount of work, the multi-threaded version is 2x slower than the sequential one!
Isn’t multi-threading supposed to parallelize work? What might have impacted the performance of the threaded version?
The answer lies in the Global Interpreter Lock (GIL).
What is the Global Interpreter Lock (GIL)?
Here is a quote from the wiki explaining what the GIL is:
Global Interpreter Lock is a mechanism used in computer language interpreters to synchronize the execution of threads so that only one thread can execute at a time. An interpreter which uses GIL will always allow exactly one thread to execute at a time, even if run on a Multi-core processor. Some popular interpreters that have GIL are CPython and Ruby MRI.
That partially explains why the threaded version performs no better than the single-threaded version. Python/CPython uses a GIL, which essentially makes execution single-threaded: though there are multiple threads and multiple CPUs, the interpreter really can’t take advantage of them.
Superficially it might seem pointless to write multi-threaded programs in languages that use a GIL when you know the threads aren’t executed concurrently, but that’s not the case. Though the GIL runs only one thread at a time, many language implementations perform long-running operations such as I/O outside the GIL. So you would still benefit from multi-threading if your program does a lot of background I/O, but the GIL becomes a bottleneck when most of the tasks are CPU intensive.
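To see the I/O side of this, here is a minimal sketch; the `fake_io` helper is hypothetical, standing in for a blocking network or disk call. `time.sleep`, like real blocking I/O in CPython, releases the GIL, so the waits overlap:

```python
import time
from threading import Thread

def fake_io(seconds):
    # time.sleep releases the GIL, just like blocking I/O does,
    # so the other threads can run while this one waits
    time.sleep(seconds)

start = time.time()
threads = [Thread(target=fake_io, args=(0.5,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print("4 x 0.5s waits took %.2fs" % elapsed)  # ~0.5s, not 2s
```

Run sequentially, the four waits would take about 2 seconds; with threads they complete in roughly 0.5 seconds, GIL or no GIL.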
But that doesn’t explain the 2x performance impact!
Ignoring the little time spent on context switching, the single-threaded and multi-threaded versions do the same amount of work, so it is quite natural to expect similar performance from both, not a 2x slowdown! So something else must be impacting the performance. It turns out that the GIL (again!) is the culprit here. To see why, we need to know a little more about the GIL and thread scheduling.
Thread Scheduling Mechanism
Python threads (and Ruby threads from 1.9 onwards) are native pthreads/Windows threads, and all thread scheduling is delegated to the OS. As a consequence, the lag between signaling a thread and that thread actually executing can be significant, since the wake-up routes through the OS scheduler. An aside to note here is that the OS tries to maximize core utilization by running as many threads as possible.
Thread Execution Flow
Behind the scenes (in CPython 2.7), the interpreter periodically checks the GIL state to ensure that a single CPU-intensive thread isn’t stealing all the CPU time. During the check, the thread holding the GIL releases it so that other ‘READY’-state threads can acquire it. Whenever the GIL is released, a signal is sent via the OS indicating that the GIL is free; the threads that receive the signal and are ready to run compete to acquire the GIL. Only one thread succeeds; all the remaining threads, having failed to acquire the GIL, wait until they receive another signal to compete again.
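The check frequency is tunable: CPython 2.x checks roughly every `sys.getcheckinterval()` bytecode instructions (default 100), while the newer GIL in 3.2+ uses a time-based interval exposed as `sys.getswitchinterval()` (default 5 ms). A small sketch using the Python 3 API:

```python
import sys

# Newer GIL (CPython 3.2+): the running thread is asked to drop
# the GIL every switch-interval seconds (default is 0.005)
print(sys.getswitchinterval())

# A larger interval means fewer GIL handoffs (less check overhead,
# but worse responsiveness for the waiting threads)
sys.setswitchinterval(0.01)
```
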
So what happens on a multi-core system? The GIL is released and a signal is sent to the OS. This signal-to-execution lag is significant, and in most cases the original thread that released the lock re-acquires it almost immediately. The other threads, woken up by the signal, now try to acquire the GIL but fail; these repeated wake-acquire-fail-wait cycles put an additional load on the system, causing the performance impact we measured.
To summarize, the performance impact we observed is mainly due to a difference in the ideologies of the two systems involved: Python tries to run only one thread at a time, whereas the OS tries to run as many threads as possible, utilizing all the available processors.
Why use a GIL in the first place?
Seeing all these performance bottlenecks on multi-core systems, you might ask why the GIL is there in the first place, when it limits the amount of parallelism achievable through threads. The primary reason is that the interpreter’s underlying memory management and its core mutable data structures, such as lists and maps, are not thread safe. Since they are not thread safe, the GIL ensures data safety by allowing only a single thread to run at any given point in time.
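As a small illustration of the safety the GIL buys (a sketch, not part of the original benchmark): `list.append` executes atomically under the GIL, so concurrent appends from several threads never corrupt the list or lose elements, even without an explicit lock.

```python
from threading import Thread

shared = []

def worker(n):
    for i in range(n):
        # append is atomic under the GIL: no lock needed, and no
        # appends are lost or interleaved mid-operation
        shared.append(i)

threads = [Thread(target=worker, args=(50000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(shared))  # 200000 -- every append landed
```

Note this only covers single operations; compound updates like `x += 1` still need a lock, because the GIL can be handed off between the read and the write.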
The single-threaded guarantee enforced by the GIL also eases the development of C extensions; in fact, many extensions depend on this behaviour of the GIL.
The way out
There are a few ways out of the GIL if you are after concurrency:
- Since each interpreter process has its own GIL, a common way to achieve concurrency is to run multiple processes.
- Implementations such as Jython and MacRuby do not use a GIL, so you might as well use them.
- Twisted, EventMachine and many other frameworks use non-blocking I/O, reactor patterns or actors to provide a form of concurrency.
The latest versions of Python (3.2+) ship an improved GIL implementation that fixes the issue highlighted here, though it still causes excessive thrashing and poor response times in a few scenarios. So yes, the GIL is a nuisance in multi-core environments, but it is here to stay.
Threads are very useful concurrency constructs and can offer good performance even with a GIL; just make sure you know the corner cases.
Despite the issues with the GIL, high concurrency is still doable: use frameworks or implementations that provide concurrency through other mechanisms.