i've got 2 input dataframes
df1 (note, df have more columns of data)
sample animal time sex 0 1 1 male 1 2 2 male 2 3 b 1 female 3 4 c 1 male 4 5 d 1 female and df2
b c sample 1 0.2 0.4 0.3 2 0.5 0.7 0.2 3 0.4 0.1 0.9 4 0.4 0.2 0.3 5 0.6 0.2 0.4 and i'd combine them following:
one_a one_b one_c two_a two_b two_c sex animal 0.2 0.4 0.3 0.5 0.7 0.2 male b 0.4 0.1 0.9 nan nan nan female c 0.4 0.2 0.3 nan nan nan male d 0.6 0.2 0.4 nan nan nan female this how i'm doing things:
df2.reset_index(inplace = true) df3 = pd.melt(df2, id_vars=['sample'], value_vars=list(cols)) df4 = pd.merge(df3, df1, on='sample') df4['moo'] = df4['group'] + '_' + df4['variable'] df5 = pd.pivot_table(df4, values='value', index='animal', columns='moo') df6 = df1.groupby('animal').agg('first') pd.concat([df5, df6], axis=1).drop('sample',1).drop('group',1) this works fine, potentially slow large datasets. i'm wondering if panda-pros see better (read faster, more efficient)? i'm new pandas , can imagine there shortcuts here don't know about.
a few steps here. key in order create columns one_a one_b .... two_c, need add time column sample index build multi-level index , unstack required form. then, groupby on animal index required aggregate , reduce number of nans. rest manipulations on format.
import pandas pd # data # ============================== # set index df1 = df1.set_index('sample') print(df1) animal time sex sample 1 1 male 2 2 male 3 b 1 female 4 c 1 male 5 d 1 female print(df2) b c sample 1 0.2 0.4 0.3 2 0.5 0.7 0.2 3 0.4 0.1 0.9 4 0.4 0.2 0.3 5 0.6 0.2 0.4 # processing # ============================= df = df1.join(df2) df_temp = df.set_index(['animal', 'sex','time'], append=true).unstack() print(df_temp) b c time 1 2 1 2 1 2 sample animal sex 1 male 0.2 nan 0.4 nan 0.3 nan 2 male nan 0.5 nan 0.7 nan 0.2 3 b female 0.4 nan 0.1 nan 0.9 nan 4 c male 0.4 nan 0.2 nan 0.3 nan 5 d female 0.6 nan 0.2 nan 0.4 nan # rename columns if wish df_temp.columns = ['{}_{}'.format(x, y) x, y in zip(df_temp.columns.get_level_values(1), df_temp.columns.get_level_values(0))] print(df_temp) one_a two_a one_b two_b one_c two_c sample animal sex 1 male 0.2 nan 0.4 nan 0.3 nan 2 male nan 0.5 nan 0.7 nan 0.2 3 b female 0.4 nan 0.1 nan 0.9 nan 4 c male 0.4 nan 0.2 nan 0.3 nan 5 d female 0.6 nan 0.2 nan 0.4 nan result = df_temp.reset_index('sex').groupby(level='animal').agg(max).sort_index(axis=1) print(result) sex one_a one_b one_c two_a two_b two_c animal male 0.2 0.4 0.3 0.5 0.7 0.2 b female 0.4 0.1 0.9 nan nan nan c male 0.4 0.2 0.3 nan nan nan d female 0.6 0.2 0.4 nan nan nan
Comments
Post a Comment