python - pandas merge dataframe and pivot creating new columns -


i've got 2 input dataframes

df1 (note, df have more columns of data)

   sample animal  time     sex 0       1        1    male 1       2        2    male 2       3      b   1  female 3       4      c   1    male 4       5      d   1  female 

and df2

             b    c sample                1       0.2  0.4  0.3 2       0.5  0.7  0.2 3       0.4  0.1  0.9 4       0.4  0.2  0.3 5       0.6  0.2  0.4 

and i'd combine them following:

        one_a  one_b  one_c  two_a  two_b  two_c     sex animal                                                           0.2    0.4    0.3    0.5    0.7    0.2    male b         0.4    0.1    0.9    nan    nan    nan  female c         0.4    0.2    0.3    nan    nan    nan    male d         0.6    0.2    0.4    nan    nan    nan  female 

this how i'm doing things:

df2.reset_index(inplace = true) df3 = pd.melt(df2, id_vars=['sample'], value_vars=list(cols)) df4 = pd.merge(df3, df1, on='sample') df4['moo'] = df4['group'] + '_' + df4['variable'] df5 = pd.pivot_table(df4, values='value', index='animal', columns='moo') df6 = df1.groupby('animal').agg('first') pd.concat([df5, df6], axis=1).drop('sample',1).drop('group',1) 

this works fine, potentially slow large datasets. i'm wondering if panda-pros see better (read faster, more efficient)? i'm new pandas , can imagine there shortcuts here don't know about.

a few steps here. key in order create columns one_a one_b .... two_c, need add time column sample index build multi-level index , unstack required form. then, groupby on animal index required aggregate , reduce number of nans. rest manipulations on format.

import pandas pd  # data # ============================== # set index df1 = df1.set_index('sample')  print(df1)         animal time     sex sample                     1            1    male 2            2    male 3           b  1  female 4           c  1    male 5           d  1  female  print(df2)                b    c sample                1       0.2  0.4  0.3 2       0.5  0.7  0.2 3       0.4  0.1  0.9 4       0.4  0.2  0.3 5       0.6  0.2  0.4    # processing # ============================= df = df1.join(df2)  df_temp = df.set_index(['animal', 'sex','time'], append=true).unstack()  print(df_temp)                                   b         c      time                  1  2  1  2  1  2 sample animal sex                                  1           male    0.2  nan  0.4  nan  0.3  nan 2           male    nan  0.5  nan  0.7  nan  0.2 3      b      female  0.4  nan  0.1  nan  0.9  nan 4      c      male    0.4  nan  0.2  nan  0.3  nan 5      d      female  0.6  nan  0.2  nan  0.4  nan  # rename columns if wish df_temp.columns = ['{}_{}'.format(x, y) x, y in zip(df_temp.columns.get_level_values(1), df_temp.columns.get_level_values(0))]  print(df_temp)                        one_a  two_a  one_b  two_b  one_c  two_c sample animal sex                                              1           male      0.2    nan    0.4    nan    0.3    nan 2           male      nan    0.5    nan    0.7    nan    0.2 3      b      female    0.4    nan    0.1    nan    0.9    nan 4      c      male      0.4    nan    0.2    nan    0.3    nan 5      d      female    0.6    nan    0.2    nan    0.4    nan   result = df_temp.reset_index('sex').groupby(level='animal').agg(max).sort_index(axis=1)  print(result)             sex  one_a  one_b  one_c  two_a  two_b  two_c animal                                                           male    0.2    0.4    0.3    0.5    0.7    0.2 b       female    0.4    0.1    0.9    nan    nan    nan c         male    0.4    0.2    0.3    nan    nan    nan d       female    0.6    0.2    0.4    nan    nan    nan 

Comments