multithreading - Why does this Python script run 4x slower on multiple cores than on a single core?
I'm trying to understand how CPython's GIL works and what the differences are between the GIL in CPython 2.7.x and CPython 3.4.x. I'm using this code for benchmarking:
    from __future__ import print_function

    import argparse
    import resource
    import sys
    import threading
    import time


    def countdown(n):
        # Pure CPU-bound busy loop
        while n > 0:
            n -= 1


    def get_time():
        # Returns (wall clock, total CPU, user CPU, kernel CPU) in seconds
        stats = resource.getrusage(resource.RUSAGE_SELF)
        total_cpu_time = stats.ru_utime + stats.ru_stime
        return time.time(), total_cpu_time, stats.ru_utime, stats.ru_stime


    def get_time_diff(start_time, end_time):
        return tuple((end - start) for start, end in zip(start_time, end_time))


    def main(total_cycles, max_threads, no_headers=False):
        header = ("%4s %8s %8s %8s %8s %8s %8s %8s %8s" %
                  ("#t", "seq_r", "seq_c", "seq_u", "seq_s",
                   "par_r", "par_c", "par_u", "par_s"))
        row_format = ("%(threads)4d "
                      "%(seq_r)8.2f %(seq_c)8.2f %(seq_u)8.2f %(seq_s)8.2f "
                      "%(par_r)8.2f %(par_c)8.2f %(par_u)8.2f %(par_s)8.2f")
        if not no_headers:
            print(header)
        for thread_count in range(1, max_threads + 1):
            # We don't care about a few lost cycles
            cycles = total_cycles // thread_count

            # Sequential: each thread is joined before the next one starts
            threads = [threading.Thread(target=countdown, args=(cycles,))
                       for _ in range(thread_count)]
            start_time = get_time()
            for thread in threads:
                thread.start()
                thread.join()
            end_time = get_time()
            sequential = get_time_diff(start_time, end_time)

            # Parallel: all threads are started, then all are joined
            threads = [threading.Thread(target=countdown, args=(cycles,))
                       for _ in range(thread_count)]
            start_time = get_time()
            for thread in threads:
                thread.start()
            for thread in threads:
                thread.join()
            end_time = get_time()
            parallel = get_time_diff(start_time, end_time)

            print(row_format % {"threads": thread_count,
                                "seq_r": sequential[0],
                                "seq_c": sequential[1],
                                "seq_u": sequential[2],
                                "seq_s": sequential[3],
                                "par_r": parallel[0],
                                "par_c": parallel[1],
                                "par_u": parallel[2],
                                "par_s": parallel[3]})


    if __name__ == "__main__":
        arg_parser = argparse.ArgumentParser()
        arg_parser.add_argument("max_threads", nargs="?",
                                type=int, default=5)
        arg_parser.add_argument("total_cycles", nargs="?",
                                type=int, default=50000000)
        arg_parser.add_argument("--no-headers",
                                action="store_true")
        args = arg_parser.parse_args()
        sys.exit(main(args.total_cycles, args.max_threads, args.no_headers))

When running this script on my quad-core i5-2500 machine under Ubuntu 14.04 with Python 2.7.6, I get the following results (_r stands for real time, _c for CPU time, _u for user mode, _s for kernel mode):
      #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
       1     1.47     1.47     1.47     0.00     1.46     1.46     1.46     0.00
       2     1.74     1.74     1.74     0.00     3.33     5.45     3.52     1.93
       3     1.87     1.90     1.90     0.00     3.08     6.42     3.77     2.65
       4     1.78     1.83     1.83     0.00     3.73     6.18     3.88     2.30
       5     1.73     1.79     1.79     0.00     3.74     6.26     3.87     2.39

Now if I bind all threads to one core, the results are very different:
    taskset -c 0 python countdown.py
      #t    seq_r    seq_c    seq_u    seq_s    par_r    par_c    par_u    par_s
       1     1.46     1.46     1.46     0.00     1.46     1.46     1.46     0.00
       2     1.74     1.74     1.73     0.00     1.69     1.68     1.68     0.00
       3     1.47     1.47     1.47     0.00     1.58     1.58     1.54     0.04
       4     1.74     1.74     1.74     0.00     2.02     2.02     1.87     0.15
       5     1.46     1.46     1.46     0.00     1.91     1.90     1.75     0.15

So the question is: why is running this Python code on multiple cores 1.5x-2x slower by wall clock and 4x-5x slower by CPU clock than running it on a single core?
Asking around and googling produced two hypotheses:
- When running on multiple cores, a thread can be rescheduled to run on a different core, which means that the local cache gets invalidated, hence the slowdown.
- The overhead of suspending a thread on one core and activating it on another core is larger than that of suspending and activating the thread on the same core.
Are there any other reasons? I would like to understand what's going on and to be able to back my understanding with numbers (meaning that if the slowdown is due to cache misses, I want to see and compare the numbers for both cases).
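One way to put numbers behind the scheduling side of these hypotheses is to record context-switch counts next to the CPU times. The sketch below is not part of the original benchmark. It assumes a Unix system where `resource.getrusage` fills in `ru_nvcsw` (voluntary) and `ru_nivcsw` (involuntary) context switches, and the helper names `get_switches` and `run_parallel` are made up for illustration. Cache misses are not visible this way; those would need a hardware-counter tool such as `perf`.

```python
import resource
import threading


def countdown(n):
    # Same CPU-bound busy loop as in the benchmark
    while n > 0:
        n -= 1


def get_switches():
    # (voluntary, involuntary) context switches of this process so far
    stats = resource.getrusage(resource.RUSAGE_SELF)
    return stats.ru_nvcsw, stats.ru_nivcsw


def run_parallel(thread_count, total_cycles):
    # Hypothetical helper: run the countdown in parallel threads and
    # return how many context switches happened while they ran.
    cycles = total_cycles // thread_count
    threads = [threading.Thread(target=countdown, args=(cycles,))
               for _ in range(thread_count)]
    before = get_switches()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    after = get_switches()
    return tuple(end - start for start, end in zip(before, after))


if __name__ == "__main__":
    for thread_count in range(1, 5):
        voluntary, involuntary = run_parallel(thread_count, 2000000)
        print("%d threads: %d voluntary, %d involuntary switches"
              % (thread_count, voluntary, involuntary))
```

Running this once normally and once under `taskset -c 0` gives two sets of counts that can be compared in the same way as the timing tables above.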
It's due to GIL thrashing when multiple native threads are competing for the GIL. David Beazley's materials on this subject will tell you everything you want to know.
See his material for a nice graphical representation of what is happening.
Python 3.2 introduced changes to the GIL that help solve this problem, so you should see improved performance with 3.2 and later.
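For reference, the setting that changed can be inspected from Python itself. This is a minimal sketch, assuming CPython: before 3.2 the interpreter considered switching threads every N bytecode "ticks" (`sys.getcheckinterval()`, 100 by default), whereas 3.2 and later use a time-based interval (`sys.getswitchinterval()`, 0.005 seconds by default). The helper name `describe_gil_interval` is illustrative only.

```python
import sys


def describe_gil_interval():
    # Hypothetical helper: report which thread-switch setting this
    # interpreter exposes, and its current value.
    if hasattr(sys, "getswitchinterval"):
        # CPython 3.2+: time-based switch interval, in seconds
        return "seconds", sys.getswitchinterval()
    # Older CPython (e.g. 2.7): tick-based check interval
    return "ticks", sys.getcheckinterval()


if __name__ == "__main__":
    unit, value = describe_gil_interval()
    print("Thread switch interval: %s %s" % (value, unit))
```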
It should also be noted that the GIL is an implementation detail of CPython, the reference implementation of the language. Other implementations such as Jython do not have a GIL and do not suffer from this particular problem.
The rest of D. Beazley's info on the GIL will also be helpful to you.
To specifically answer your question about why performance is so much worse when multiple cores are involved, see slides 29-41 of the Inside the GIL presentation. It goes into a detailed discussion of multicore GIL contention as opposed to multiple threads on a single core. Slide 32 specifically shows that the number of system calls due to thread signaling overhead goes through the roof as you add cores. This is because the threads are now running simultaneously on different cores, which allows them to engage in a true GIL battle, as opposed to multiple threads sharing a single CPU. A good summary bullet from the above presentation is:
With multiple cores, CPU-bound threads get scheduled simultaneously (on different cores) and then have a GIL battle.
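The single-core versus multi-core comparison can also be reproduced from inside one script instead of through `taskset`. The following is a rough sketch, assuming Linux and Python 3.3 or later (where `os.sched_setaffinity` and `time.process_time` are available), so it does not run on the Python 2.7.6 setup from the question. `timed_threads` is a hypothetical helper, and on an interpreter with the post-3.2 GIL the gap between the two runs may be smaller than in the tables above.

```python
import os
import threading
import time


def countdown(n):
    # Same CPU-bound busy loop as in the benchmark
    while n > 0:
        n -= 1


def timed_threads(thread_count, total_cycles, cpus=None):
    """Run countdown in parallel threads, optionally pinned to `cpus`.

    Returns (wall-clock seconds, process CPU seconds).
    """
    original = os.sched_getaffinity(0)
    if cpus is not None:
        # Threads created after this call inherit the restricted mask
        os.sched_setaffinity(0, cpus)
    try:
        cycles = total_cycles // thread_count
        threads = [threading.Thread(target=countdown, args=(cycles,))
                   for _ in range(thread_count)]
        wall_start = time.time()
        cpu_start = time.process_time()
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return time.time() - wall_start, time.process_time() - cpu_start
    finally:
        # Restore the affinity mask the caller started with
        os.sched_setaffinity(0, original)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    one_cpu = {min(allowed)}
    for label, cpus in (("all allowed cores", None), ("one core", one_cpu)):
        wall, cpu = timed_threads(4, 4000000, cpus)
        print("%-18s wall=%.2fs cpu=%.2fs" % (label, wall, cpu))
```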