How to do outer merge of two dataframes for which column values are within a certain range?

Refresh

6 days ago

Views

42 time

4

This is followup question of this

I have two dataframes:

print df_1

  timestamp      A          B
0 2016-05-15     0.020228   0.026572
1 2016-05-15     0.057780   0.175499
2 2016-05-15     0.098808   0.620986
3 2016-05-17     0.158789   1.014819
4 2016-05-17     0.038129   2.384590
5 2018-05-17     0.011111   9.999999


print df_2

  start                end  event    
0 2016-05-14   2016-05-16   E1
1 2016-05-14   2016-05-16   E2
2 2016-05-17   2016-05-18   E3

I would like to merge df_1 and df_2 and get the event column in df_1 if the timestamp falls in between of start and end.

The problems, and the differences with this question, are

1) that events E1 and E2 have the same start and end.

2) Also in df_1 the 6th row does not fall inside any of the intervals.

In the end I would like to have both of the events and for the row that does not have any event have NA.

So I would like my resulting dataframe to be like this

  timestamp      A          B         event
0 2016-05-15     0.020228   0.026572  E1
1 2016-05-15     0.057780   0.175499  E1
2 2016-05-15     0.098808   0.620986  E1
3 2016-05-15     0.020228   0.026572  E2 
4 2016-05-15     0.057780   0.175499  E2
5 2016-05-15     0.098808   0.620986  E2
6 2016-05-17     0.158789   1.014819  E3
7 2016-05-17     0.038129   2.384590  E3
8 2018-05-17     0.011111   9.999999  NA

1 answers

0
import pandas as pd

df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                     'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                     'end':['2016-05-16','2016-05-16','2016-05-18'],
                     'event':['E1','E2','E3']})
df_1.timestamp = pd.to_datetime(df_1.timestamp, format='%Y-%m-%d')
df_2.start = pd.to_datetime(df_2.start, format='%Y-%m-%d')
df_2.end = pd.to_datetime(df_2.end, format='%Y-%m-%d')

# convert game_ref_dt to long format with all the dates in between, and do a left merge on date
df_2_2 = pd.melt(df_2, id_vars='event', value_name='timestamp')
df_2_2.timestamp = pd.to_datetime(df_2_2.timestamp)
df_2_2.set_index('timestamp', inplace=True)
df_2_2.drop('variable', axis=1, inplace=True)

df_2_3 = df_2_2.groupby('event').resample('D').ffill().reset_index(level=0, drop=True).reset_index()

df_2 = pd.merge(df_2, df_2_3)
df_2 = df_2.drop(columns=['start', 'end'])

df_1 = df_1.merge(df_2,on='timestamp',  how='left')

print(df_1)
   timestamp  A event
0 2016-05-15  1    E1
1 2016-05-15  1    E2
2 2016-05-15  1    E1
3 2016-05-15  1    E2
4 2016-05-15  1    E1
5 2016-05-15  1    E2
6 2016-05-17  1    E3
7 2016-05-17  1    E3
8 2018-05-17  1   NaN

Credits to this

Also this solution, but does not give the NA at the last row

  import pandas as pd

df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
                     'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
                     'end':['2016-05-16','2016-05-16','2016-05-18'],
                     'event':['E1','E2','E3']})   

df_try2 = pd.merge(df_1.assign(key=1), df_2.assign(key=1), on='key').query('timestamp >= start and timestamp <= end')    

print(df_try2)

   timestamp  A  key      start        end event
0  2016-05-15  1    1 2016-05-14 2016-05-16    E1
1  2016-05-15  1    1 2016-05-14 2016-05-16    E2
3  2016-05-15  1    1 2016-05-14 2016-05-16    E1
4  2016-05-15  1    1 2016-05-14 2016-05-16    E2
6  2016-05-15  1    1 2016-05-14 2016-05-16    E1
7  2016-05-15  1    1 2016-05-14 2016-05-16    E2
11 2016-05-17  1    1 2016-05-17 2016-05-18    E3
14 2016-05-17  1    1 2016-05-17 2016-05-18    E3