pandas - How to use the merge function to merge the common values in two DataFrames?
I have two DataFrames that I want to merge on the column "id".
df1:

    id  reputation
     1          10
     3           5
     4          40

df2:

    id  reputation
     1          10
     2           5
     3           5
     6          55
I want the output to be:

dfoutput:

    id  reputation
     1          10
     2           5
     3           5
     4          40
     6          55
I wish to keep the values from both DataFrames and merge duplicate values into one. I know I have to use the merge() function, but I don't know which arguments to pass.
You could concatenate the DataFrames, group by id, and aggregate by taking the first item in each group.
    In [62]: pd.concat([df1,df2]).groupby('id').first()
    Out[62]:
        reputation
    id
    1           10
    2            5
    3            5
    4           40
    6           55

    [5 rows x 1 columns]
Or, to preserve id as a column rather than as the index, use as_index=False:
    In [68]: pd.concat([df1,df2]).groupby('id', as_index=False).first()
    Out[68]:
       id  reputation
    0   1          10
    1   2           5
    2   3           5
    3   4          40
    4   6          55

    [5 rows x 2 columns]
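For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the concat-and-groupby approach. It assumes df1 and df2 hold exactly the sample values from the question; the variable names stacked, by_index, and as_column are only illustrative.

```python
import pandas as pd

# Sample frames rebuilt from the question.
df1 = pd.DataFrame({"id": [1, 3, 4], "reputation": [10, 5, 40]})
df2 = pd.DataFrame({"id": [1, 2, 3, 6], "reputation": [10, 5, 5, 55]})

# Stack the rows of both frames; df1's rows come first.
stacked = pd.concat([df1, df2])

# first() keeps the first non-null value per column within each id group.
by_index = stacked.groupby("id").first()                    # id becomes the index
as_column = stacked.groupby("id", as_index=False).first()   # id stays a column

print(as_column)
```

Because df1 is listed first in the concat, its value is the one kept whenever an id appears in both frames.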
KarlD. suggests an excellent idea: use combine_first:
    In [99]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
    Out[99]:
       id  reputation
    0   1          10
    1   2           5
    2   3           5
    3   4          40
    4   6          55

    [5 rows x 2 columns]
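A standalone version of the combine_first approach might look like the sketch below, again assuming the sample data from the question. The names left, right, and combined are illustrative. combine_first aligns the two frames on their index, keeps the caller's values where they are non-null, and fills the rest from the other frame.

```python
import pandas as pd

# Sample frames rebuilt from the question.
df1 = pd.DataFrame({"id": [1, 3, 4], "reputation": [10, 5, 40]})
df2 = pd.DataFrame({"id": [1, 2, 3, 6], "reputation": [10, 5, 5, 55]})

# Index both frames by id so that combine_first aligns rows on it.
left = df1.set_index("id")
right = df2.set_index("id")

# Values from `left` take precedence; ids found only in `right` are filled in.
combined = left.combine_first(right).reset_index()

# Depending on the pandas version, reputation may come back as float
# because of the intermediate alignment; cast it if integers are needed.
combined["reputation"] = combined["reputation"].astype(int)

print(combined)
```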
This solution appears to be faster for large DataFrames:
    import pandas as pd
    import numpy as np

    n = 10**6
    df1 = pd.DataFrame({'id': np.arange(n),
                        'reputation': np.random.randint(5, size=n)})
    df2 = pd.DataFrame({'id': np.arange(10, 10+n),
                        'reputation': np.random.randint(5, size=n)})
    In [95]: %timeit df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
    10 loops, best of 3: 174 ms per loop

    In [96]: %timeit pd.concat([df1,df2]).groupby('id', as_index=False).first()
    1 loops, best of 3: 221 ms per loop
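%timeit is an IPython magic, so the comparison cannot be pasted into a plain script as-is. The sketch below is one way to repeat it with the standard timeit module. It makes a few assumptions that are not in the answer: a smaller n to keep the run short, a seeded NumPy generator, and the helper names via_combine_first and via_groupby. It also checks that both approaches return the same rows. Timings will differ by machine and pandas version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

n = 10**5  # assumption: smaller than the 10**6 used in the answer
rng = np.random.default_rng(0)
df1 = pd.DataFrame({"id": np.arange(n), "reputation": rng.integers(5, size=n)})
df2 = pd.DataFrame({"id": np.arange(10, 10 + n), "reputation": rng.integers(5, size=n)})


def via_combine_first():
    return df1.set_index("id").combine_first(df2.set_index("id")).reset_index()


def via_groupby():
    return pd.concat([df1, df2]).groupby("id", as_index=False).first()


# Sanity check: both approaches keep df1's value for ids present in both frames.
a, b = via_combine_first(), via_groupby()
same = bool(
    (a["id"].to_numpy() == b["id"].to_numpy()).all()
    and (a["reputation"].to_numpy() == b["reputation"].to_numpy()).all()
)

# Best of 3 single runs, in seconds.
t_combine = min(timeit.repeat(via_combine_first, number=1, repeat=3))
t_groupby = min(timeit.repeat(via_groupby, number=1, repeat=3))

print(f"same result: {same}")
print(f"combine_first: {t_combine * 1000:.1f} ms")
print(f"concat + groupby: {t_groupby * 1000:.1f} ms")
```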