python - Efficiently creating lots of Histograms from grouped data held in pandas dataframe


I want to create a bunch of histograms from grouped data in a pandas DataFrame. Here's a link to a similar question. To generate toy data similar to what I'm working with, you can use the following code:

    from pandas import DataFrame
    import numpy as np

    # Toy data: three groups of unequal size with standard-normal values.
    x = ['a']*300 + ['b']*400 + ['c']*300
    y = np.random.randn(1000)
    df = DataFrame({'letter': x, 'n': y})

I want to put those histograms (read: binned data) in a new DataFrame and save it for later processing. Here's the real kicker: my file is 6 GB, with 400k+ groups and just 2 columns.

I've thought about using a simple for loop to do the work:

    data = []
    for group in df['letter'].unique():
        # One normalized 50-bin histogram per group; [0] keeps the bin values only.
        data.append(np.histogram(df[df['letter'] == group]['n'], range=(-2000, 2000), bins=50, density=True)[0])
    df2 = DataFrame(data)

Note that the bins, range, and density keywords are necessary for my purposes, so that the histograms are consistent and normalized across the rows of the new DataFrame df2 (the parameter values come from my real dataset, so they are overkill on the toy dataset). The for loop works great: on the toy dataset it generates a pandas DataFrame of 3 rows and 50 columns, as expected. On my real dataset, however, I've estimated the time to completion at around 9 days. Is there a better/faster way to do what I'm looking for?
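To make the point about consistent bins concrete, here is a minimal sketch. The arrays `a` and `b` are made-up stand-ins for two groups and are not part of the real dataset. With `range` and `bins` fixed, `np.histogram` returns the same bin edges whatever data it is given, so the rows of `df2` line up column by column, and `density=True` scales each row so that it integrates to 1 over the range (values outside `range` are simply ignored).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(300)  # hypothetical group 'a'
b = rng.standard_normal(400)  # hypothetical group 'b', different size

# Same keyword arguments as in the loop above.
kw = dict(range=(-2000, 2000), bins=50, density=True)
hist_a, edges_a = np.histogram(a, **kw)
hist_b, edges_b = np.histogram(b, **kw)

# Edges depend only on `range` and `bins`, not on the data.
same_edges = np.array_equal(edges_a, edges_b)

# With density=True, sum(value * bin_width) is 1 for every row.
area_a = float((hist_a * np.diff(edges_a)).sum())
area_b = float((hist_b * np.diff(edges_b)).sum())
```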

P.S. I've thought about multiprocessing, but I think the overhead of creating processes and slicing the data would be slower than just running it serially (I may be wrong and wouldn't mind being corrected on this one).

For the type of problem you describe here, I'd do the following: delegate the whole thing to multithreaded Cython/C++. It's a bit of work, but not impossible, and I'm not sure there's a viable alternative at the moment.

Here are the building blocks:

  • First, df.x.values and df.y.values are numpy arrays. This link shows how to get C pointers to such arrays.

  • Now that you have pointers, you can write a true multithreaded program using Cython's prange, foregoing Python from this point on (you're now in C++ territory). Say you have k threads scanning your 6 GB arrays, and thread i handles the groups whose keys have a hash equal to i modulo k.

  • For a C program (which is what your code is now), the GNU Scientific Library has a nice histogram module.

  • When the prange is done, you need to convert the C++ structures back to numpy arrays, and from there back to a DataFrame. Wrap the whole thing up in Cython, and use it like a normal Python function.
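The partitioning scheme in the steps above can be sketched in plain Python. This is only an illustration of "worker i handles the groups whose hash is i modulo k", not the Cython/C++ implementation itself: it uses Python threads (which are constrained by the GIL, so no speedup is claimed), `np.histogram` in place of the GSL histogram module, and the integer code from `pd.factorize` as a stand-in for the hash. The names `grouped_histograms`, `work`, and `k` are hypothetical, and the column names come from the toy data in the question.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

BINS, RANGE = 50, (-2000, 2000)

def grouped_histograms(df, key="letter", val="n", k=4):
    # Integer code per row (0..n_groups-1); plays the role of the hash.
    codes, labels = pd.factorize(df[key])
    values = df[val].to_numpy()

    # Sort once so that each group becomes one contiguous slice.
    order = np.argsort(codes, kind="stable")
    codes, values = codes[order], values[order]
    group_ids = np.arange(len(labels))
    starts = np.searchsorted(codes, group_ids, side="left")
    stops = np.searchsorted(codes, group_ids, side="right")

    # One row per group; workers write to disjoint rows.
    out = np.empty((len(labels), BINS))

    def work(i):
        # Worker i handles the groups whose code is i modulo k.
        for g in range(i, len(labels), k):
            out[g] = np.histogram(values[starts[g]:stops[g]],
                                  bins=BINS, range=RANGE, density=True)[0]

    with ThreadPoolExecutor(max_workers=k) as pool:
        list(pool.map(work, range(k)))  # list() re-raises worker errors

    # Back to a DataFrame, indexed by group key.
    return pd.DataFrame(out, index=labels)
```

On the toy data from the question, `grouped_histograms(df, k=3)` returns a DataFrame of 3 rows and 50 columns indexed by letter. Sorting once and slicing avoids rescanning the whole column for every group, which is what makes the original loop so slow with 400k+ groups.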

