python – groupby rolling date window sum with duplicate dates
python – groupby rolling date window sum with duplicate dates
Based on comments to the question, it appears that OP already found a solution. However, this is an attempt at providing another way to resolve this, which is to solve the root cause of the error – duplicate date values.
To resolve it, we can add aggregation by date within the apply. In the snippet below, the Amount
values are aggregated using sum
, but it is possible that in some contexts another aggregation should be used, e.g. min
or max
. This is the relevant part:
.apply(
lambda g: (
g
.groupby(Date, as_index=False)
.agg({Amount: sum})
.rolling(28d, on=Date)
.sum()
)
)
And the full snippet below:
import pandas as pd
import datetime
df = pd.DataFrame(
{
ID: {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
Date: {
0: datetime.datetime(2019, 7, 1),
1: datetime.datetime(2019, 5, 1),
2: datetime.datetime(2019, 6, 25),
3: datetime.datetime(2019, 5, 27),
4: datetime.datetime(2019, 6, 29),
5: datetime.datetime(2019, 7, 18),
6: datetime.datetime(2019, 7, 18)
},
Amount: {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
}
)
amounts = (
df
.groupby([ID])
.apply(
lambda g: (
g
.groupby(Date, as_index=False)
.agg({Amount: sum})
.rolling(28d, on=Date)
.sum()
)
)
)
df[amount_4wk_rolling] = df[Date].map(amounts.set_index(Date)[Amount])
# this yields
# ID Date Amount amount_4wk_rolling
# 0 10001 2019-07-01 50 60.0
# 1 10001 2019-05-01 15 15.0
# 2 10001 2019-06-25 10 10.0
# 3 10001 2019-05-27 20 35.0
# 4 10002 2019-06-29 25 25.0
# 5 10002 2019-07-18 35 100.0
# 6 10002 2019-07-18 40 100.0
The problem is the first level index of amounts
:
>>> df
ID Date Amount
0 10001 2019-07-01 50
1 10001 2019-05-01 15
2 10001 2019-06-25 10
3 10001 2019-05-27 20
4 10002 2019-06-29 25
5 10002 2019-07-18 35 # <- dup date
6 10002 2019-07-18 40 # <- dup date
>>> amounts
Amount Date ID
ID
10001 1 15.0 2019-05-01 10001.0
3 35.0 2019-05-27 20002.0
2 10.0 2019-06-25 10001.0
0 60.0 2019-07-01 20002.0
10002 4 25.0 2019-06-29 10002.0
5 60.0 2019-07-18 20004.0
6 100.0 2019-07-18 30006.0
If you map amounts
on Date
columns to merge your data on df
, you got your error because Pandas doesnt know which values it should use for 2019-07-18. If you look carefully the second level of index of amounts
is the index of your original dataframe.
So if you drop the first level index set by groupby
, you can use direct assignment:
df[amount_4wk_rolling] = amounts.droplevel(0)[Amount]
print(df)
# Output:
ID Date Amount amount_4wk_rolling
0 10001 2019-07-01 50 60.0
1 10001 2019-05-01 15 15.0
2 10001 2019-06-25 10 10.0
3 10001 2019-05-27 20 35.0
4 10002 2019-06-29 25 25.0
5 10002 2019-07-18 35 60.0
6 10002 2019-07-18 40 100.0