multithreading - Why does this Python script run 4x slower on multiple cores than on a single core?

I'm trying to understand how CPython's GIL works and what the differences are between the GIL in CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:

    from __future__ import print_function

    import argparse
    import resource
    import sys
    import threading
    import time


    def countdown(n):
        # Pure CPU-bound loop; the thread holds the GIL while it runs.
        while n > 0:
            n -= 1


    def get_time():
        # Returns (wall clock, total CPU, user CPU, system CPU) in seconds.
        stats = resource.getrusage(resource.RUSAGE_SELF)
        total_cpu_time = stats.ru_utime + stats.ru_stime
        return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


    def get_time_diff(start_time, end_time):
        return tuple((end - start) for start, end in zip(start_time, end_time))


    def main(total_cycles, max_threads, no_headers=False):
        header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
                  ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
                   "par_r", "par_c", "par_u", "par_s"))
        row_format = ("%(threads)4d "
                      "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                      "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
        if not no_headers:
            print(header)
        for thread_count in range(1, max_threads + 1):
            # We don't care about a few lost cycles
            cycles = total_cycles // thread_count

            # Sequential: start each thread and wait for it before the next.
            threads = [threading.Thread(target=countdown, args=(cycles,))
                       for _ in range(thread_count)]

            start_time = get_time()
            for thread in threads:
                thread.start()
                thread.join()
            end_time = get_time()
            sequential = get_time_diff(start_time, end_time)

            # Parallel: start all threads first, then wait for all of them.
            threads = [threading.Thread(target=countdown, args=(cycles,))
                       for _ in range(thread_count)]
            start_time = get_time()
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
            end_time = get_time()
            parallel = get_time_diff(start_time, end_time)

            print(row_format % {"threads": thread_count,
                                "seq_r": sequential[0],
                                "seq_c": sequential[1],
                                "seq_u": sequential[2],
                                "seq_s": sequential[3],
                                "par_r": parallel[0],
                                "par_c": parallel[1],
                                "par_u": parallel[2],
                                "par_s": parallel[3]})


    if __name__ == "__main__":
        arg_parser = argparse.ArgumentParser()
        arg_parser.add_argument("max_threads", nargs="?",
                                type=int, default=5)
        arg_parser.add_argument("total_cycles", nargs="?",
                                type=int, default=50000000)
        arg_parser.add_argument("--no-headers",
                                action="store_true")
        args = arg_parser.parse_args()
        sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When running this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):

      #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
       1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
       2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
       3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
       4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
       5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now if I bind all threads to one core, the results are very different:

    taskset -c 0 python countdown.py

      #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
       1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
       2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
       3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
       4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
       5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15
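For reference, the pinning that `taskset` does from the shell can also be done from inside a script. The following is only a sketch: `os.sched_setaffinity` exists on Linux in Python 3.3 and later, so it would not work on the Python 2.7.6 used above, and the helper name `pin_to_first_allowed_core` is made up for this illustration.

```python
import os


def pin_to_first_allowed_core():
    """Restrict the caller to a single core, roughly like `taskset -c N`.

    Hypothetical helper. Call it from the main thread before starting any
    worker threads so that they inherit the mask. Returns the resulting
    affinity set, or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. Python 2, macOS, Windows
    # Pick a core we are actually allowed to run on instead of assuming 0.
    allowed = sorted(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {allowed[0]})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print(pin_to_first_allowed_core())
```

Using `taskset` has the advantage that it works for any interpreter version, which is presumably why it was used for the measurements above.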

So the question is: why is running this Python code on multiple cores 1.5x-2x slower by wall clock and 4x-5x slower by CPU clock than running it on a single core?
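The ratios can be read straight off the two tables. As a rough sketch, the snippet below divides the multi-core `par_r` and `par_c` columns by the single-core ones for each thread count; the numbers are copied from the tables above, and the variable and function names are only illustrative.

```python
# par_r (wall clock) and par_c (CPU time) columns for 1..5 threads,
# copied from the two result tables above.
multi_core = {"par_r": [1.46, 3.33, 3.08, 3.73, 3.74],
              "par_c": [1.46, 5.45, 6.42, 6.18, 6.26]}
single_core = {"par_r": [1.46, 1.69, 1.58, 2.02, 1.91],
               "par_c": [1.46, 1.68, 1.58, 2.02, 1.90]}


def slowdown(column):
    """Multi-core time divided by single-core time, per thread count."""
    return [m / s for m, s in zip(multi_core[column], single_core[column])]


if __name__ == "__main__":
    for threads, (wall, cpu) in enumerate(
            zip(slowdown("par_r"), slowdown("par_c")), start=1):
        print("%d threads: wall x%.2f, cpu x%.2f" % (threads, wall, cpu))
```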

Asking around and googling produced two hypotheses:

When running on multiple cores, a thread can be rescheduled to run on a different core, which means that its local cache gets invalidated, hence the slowdown.
The overhead of suspending a thread on one core and activating it on another core is larger than that of suspending and activating the thread on the same core.

Are there any other reasons? I would like to understand what's going on and to be able to back my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).
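One number that is cheap to collect alongside the timings is the context-switch count, which `resource.getrusage` already reports in its `ru_nvcsw` (voluntary) and `ru_nivcsw` (involuntary) fields. The sketch below is an assumption about how the benchmark could be extended, not part of the original script; `context_switches` and `measure` are illustrative names. Cache-miss counts are not exposed by `resource` and would need a hardware-counter tool instead.

```python
import resource
import threading


def countdown(n):
    while n > 0:
        n -= 1


def context_switches():
    """Return (voluntary, involuntary) context switches for this process."""
    stats = resource.getrusage(resource.RUSAGE_SELF)
    return stats.ru_nvcsw, stats.ru_nivcsw


def measure(thread_count, total_cycles=2000000):
    """Run countdown in parallel threads and return the switch-count deltas."""
    cycles = total_cycles // thread_count
    threads = [threading.Thread(target=countdown, args=(cycles,))
               for _ in range(thread_count)]
    before = context_switches()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    after = context_switches()
    return tuple(a - b for a, b in zip(after, before))


if __name__ == "__main__":
    for count in range(1, 5):
        voluntary, involuntary = measure(count)
        print("%d threads: voluntary=%d involuntary=%d"
              % (count, voluntary, involuntary))
```

Running this once normally and once under `taskset -c 0` would give a per-case comparison in the same spirit as the timing tables.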

It's due to GIL thrashing when multiple native threads are competing for the GIL. David Beazley's materials on this subject will tell you everything you want to know.

See the info here for a nice graphical representation of what is happening.

Python 3.2 introduced changes to the GIL to solve this problem, so you should see improved performance with 3.2 and later.
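One visible part of the 3.2 change is that GIL hand-off is driven by a time interval rather than a tick count, and that interval can be read and set through `sys.getswitchinterval()` and `sys.setswitchinterval()`. The sketch below shows how one might experiment with it; `with_switch_interval` is a made-up helper name, and whether a different interval helps this particular benchmark is something to measure, not something assumed here.

```python
import sys


def with_switch_interval(seconds, func, *args):
    """Call func(*args) with a temporary GIL switch interval (Python 3.2+).

    Hypothetical helper: the previous interval is restored afterwards,
    even if func raises.
    """
    old = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func(*args)
    finally:
        sys.setswitchinterval(old)


if __name__ == "__main__":
    print("current interval:", sys.getswitchinterval())
    print("inside helper:", with_switch_interval(0.01, sys.getswitchinterval))
    print("restored:", sys.getswitchinterval())
```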

It should be noted that the GIL is an implementation detail of CPython, the reference implementation of the language. Other implementations, such as Jython, do not have a GIL and do not suffer from this particular problem.

The rest of D. Beazley's info on the GIL will also be helpful to you.

To specifically answer your question about why performance is so much worse when multiple cores are involved, see slides 29-41 of the Inside the GIL presentation. It goes into a detailed discussion of multicore GIL contention as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread signaling overhead goes through the roof as you add cores. This is because the threads are now running simultaneously on different cores, which allows them to engage in a true GIL battle, as opposed to multiple threads sharing a single CPU. A good summary bullet from the above presentation is:

With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.
