I want to create a bunch of histograms from grouped data in a pandas DataFrame. Here's a link to a similar question. To generate toy data similar to what I'm working with, you can use the following code:
    from pandas import DataFrame
    import numpy as np

    x = ['a']*300 + ['b']*400 + ['c']*300
    y = np.random.randn(1000)
    df = DataFrame({'letter': x, 'n': y})

I want to put those histograms (read: the binned data) in a new DataFrame and save it for later processing. Here's the real kicker: my file is 6 GB, with 400k+ groups and just 2 columns.
I've thought about using a simple for loop to do the work:

    data = []
    for group in df['letter'].unique():
        # Histogram of column 'n' for one group, with fixed range and bin count
        data.append(np.histogram(df[df['letter'] == group]['n'],
                                 range=(-2000, 2000), bins=50, density=True)[0])
    df2 = DataFrame(data)

Note that the bins, range, and density keywords are all necessary for my purposes, so that the histograms are consistent and normalized across the rows of the new DataFrame df2 (the parameter values come from my real dataset, so they are overkill on the toy dataset). The for loop works great: on the toy dataset it generates a pandas DataFrame of 3 rows and 50 columns, as expected. On my real dataset, however, I've estimated the time to completion at around 9 days. Is there a better/faster way to do what I'm looking for?
P.S. I've thought about multiprocessing, but I think the overhead of creating processes and slicing the data would make it slower than running serially (I may be wrong and wouldn't mind being corrected on this one).
For the type of problem you describe here, I'd do the following: delegate the whole thing to multithreaded Cython/C++. It's a bit of work, but not impossible, and I'm not sure there's a viable alternative at the moment.
Here are the building blocks:
First, df.x.values and df.y.values are numpy arrays. This link shows how to get C pointers to such arrays.
Now that you have pointers, you can write a true multithreaded program using Cython's prange, foregoing Python from this point on (you're in C++ territory now). Say you have k threads scanning the 6 GB arrays, and thread i handles the groups whose keys have a hash equal to i modulo k.
For a C program (which is what your code is now), the GNU Scientific Library has a nice histogram module.
When the prange is done, you need to convert the C++ structures back to numpy arrays, and from there to a DataFrame. Wrap the whole thing up in Cython, and use it like a normal Python function.
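To make the partitioning idea concrete, here is a minimal pure-Python sketch. It is not the Cython/prange implementation described above. It only illustrates the scheme "worker i handles the groups whose key maps to i modulo k", using NumPy and ordinary Python threads as a stand-in. The function name grouped_histograms and the parameters value_range and k are made up for this illustration, and the integer code from pd.factorize is used in place of a real hash of the key. No speedup is claimed for it.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor


def grouped_histograms(keys, values, bins=50, value_range=(-2000.0, 2000.0), k=4):
    """Per-group density histograms; worker i handles groups with code % k == i.

    Hypothetical helper for illustration only. `keys` is a pandas Series of
    group labels, `values` a 1-D array of numbers of the same length.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = value_range
    width = (hi - lo) / bins

    # One integer code per row (0 .. n_groups-1); stands in for a hash of the key.
    codes, labels = pd.factorize(keys)
    n_groups = len(labels)

    # Like np.histogram, ignore values outside [lo, hi].
    inside = (values >= lo) & (values <= hi)
    # Fixed-width bin index; a value equal to `hi` is placed in the last bin.
    bin_idx = np.minimum(((values - lo) / width).astype(np.int64), bins - 1)

    counts = np.zeros((n_groups, bins), dtype=np.int64)

    def work(i):
        # Rows belonging to the groups assigned to worker i.
        mine = inside & (codes % k == i)
        flat = codes[mine] * bins + bin_idx[mine]
        partial = np.bincount(flat, minlength=n_groups * bins).reshape(n_groups, bins)
        # Each worker writes only the rows of its own groups, so rows never overlap.
        rows = np.arange(i, n_groups, k)
        counts[rows] = partial[rows]

    with ThreadPoolExecutor(max_workers=k) as pool:
        list(pool.map(work, range(k)))  # list() re-raises any worker exception

    # Normalize each row like density=True: counts / (in-range total * bin width).
    # Assumption: a group with no in-range values yields zeros rather than NaN.
    totals = counts.sum(axis=1, keepdims=True)
    density = counts / (np.maximum(totals, 1) * width)
    return pd.DataFrame(density, index=labels)
```

On the toy data from the question, grouped_histograms(df['letter'], df['n'].to_numpy()) should give a DataFrame of 3 rows and 50 columns, one row per letter. The per-row work here is a single pass over the arrays rather than one boolean filter per group, which is the part of the idea that carries over to a Cython version.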