python – groupby rolling date window sum with duplicate dates

Based on the comments on the question, it appears that the OP already found a solution. However, this is an attempt at providing another way to resolve it, one that addresses the root cause of the error: duplicate date values.

To resolve it, we can add an aggregation by date inside the apply. In the snippet below the Amount values are aggregated with sum, but in some contexts another aggregation, e.g. min or max, may be more appropriate. This is the relevant part:

    .apply(
        lambda g: (
            g
            .groupby('Date', as_index=False)
            .agg({'Amount': 'sum'})
            .rolling('28d', on='Date')
            .sum()
        )
    )
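
To see what this step does to the problematic group, here is the per-date aggregation in isolation on the rows for ID 10002, which contain the duplicate date (a standalone sketch that uses only data from the question):

import pandas as pd

# The rows for ID 10002, including the duplicated 2019-07-18 date.
g = pd.DataFrame({
    'Date': pd.to_datetime(['2019-06-29', '2019-07-18', '2019-07-18']),
    'Amount': [25, 35, 40],
})

# Collapsing duplicates first yields a single rolling value per date, so mapping
# the result back onto the original dataframe by Date is unambiguous.
print(g.groupby('Date', as_index=False).agg({'Amount': 'sum'}))
#         Date  Amount
# 0 2019-06-29      25
# 1 2019-07-18      75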

And here is the full snippet:

import pandas as pd
import datetime

df = pd.DataFrame(
    {
        'ID': {0: 10001, 1: 10001, 2: 10001, 3: 10001, 4: 10002, 5: 10002, 6: 10002},
        'Date': {
            0: datetime.datetime(2019, 7, 1),
            1: datetime.datetime(2019, 5, 1),
            2: datetime.datetime(2019, 6, 25),
            3: datetime.datetime(2019, 5, 27),
            4: datetime.datetime(2019, 6, 29),
            5: datetime.datetime(2019, 7, 18),
            6: datetime.datetime(2019, 7, 18),
        },
        'Amount': {0: 50, 1: 15, 2: 10, 3: 20, 4: 25, 5: 35, 6: 40},
    }
)

amounts = (
    df
    .groupby(['ID'])
    .apply(
        lambda g: (
            g
            .groupby('Date', as_index=False)
            .agg({'Amount': 'sum'})
            .rolling('28d', on='Date')
            .sum()
        )
    )
)

df['amount_4wk_rolling'] = df['Date'].map(amounts.set_index('Date')['Amount'])

# this yields
#       ID       Date  Amount  amount_4wk_rolling
# 0  10001 2019-07-01      50                60.0
# 1  10001 2019-05-01      15                15.0
# 2  10001 2019-06-25      10                10.0
# 3  10001 2019-05-27      20                35.0
# 4  10002 2019-06-29      25                25.0
# 5  10002 2019-07-18      35               100.0
# 6  10002 2019-07-18      40               100.0
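
Mapping by Date alone works here because no date appears under more than one ID. If that cannot be guaranteed in your data, one alternative to the map line above is to bring the group key back out of the index and merge on both ID and Date (a sketch that reuses df and amounts as defined in the snippet above):

rolled = amounts.reset_index(level=0)  # the group key 'ID' becomes a column again
df = df.merge(
    rolled.rename(columns={'Amount': 'amount_4wk_rolling'}),
    on=['ID', 'Date'],
    how='left',
)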

Going back to the original setup: the problem behind the error is the first-level index of amounts. Note that the amounts shown below is what the question's original rolling code produces, without the per-date aggregation above, so the duplicate 2019-07-18 rows are still present and the rolling .sum() has added up every numeric column, including ID:

>>> df
      ID       Date  Amount
0  10001 2019-07-01      50
1  10001 2019-05-01      15
2  10001 2019-06-25      10
3  10001 2019-05-27      20
4  10002 2019-06-29      25
5  10002 2019-07-18      35  # <- dup date
6  10002 2019-07-18      40  # <- dup date

>>> amounts
         Amount       Date       ID
ID                                 
10001 1    15.0 2019-05-01  10001.0
      3    35.0 2019-05-27  20002.0
      2    10.0 2019-06-25  10001.0
      0    60.0 2019-07-01  20002.0
10002 4    25.0 2019-06-29  10002.0
      5    60.0 2019-07-18  20004.0
      6   100.0 2019-07-18  30006.0

If you map amounts onto df's Date column to merge the data, you get that error because pandas does not know which value it should use for 2019-07-18. If you look carefully, the second level of the index of amounts is the index of your original dataframe.
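
That ambiguity is easy to reproduce in isolation: Series.map looks values up through the index of the lookup Series, and that index must be unique, which the duplicated 2019-07-18 entry breaks (a minimal sketch using the ID 10002 numbers from the table above):

import pandas as pd

# amounts.set_index('Date')['Amount'] for ID 10002: 2019-07-18 appears twice.
lookup = pd.Series(
    [25.0, 60.0, 100.0],
    index=pd.to_datetime(['2019-06-29', '2019-07-18', '2019-07-18']),
)

try:
    pd.Series(pd.to_datetime(['2019-07-18'])).map(lookup)
except Exception as exc:
    # pandas refuses to look values up through a non-unique index; the exact
    # exception type and message depend on the pandas version.
    print(type(exc).__name__, exc)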

So if you drop the first-level index added by the groupby, the rows align with df's index and you can use direct assignment:

df['amount_4wk_rolling'] = amounts.droplevel(0)['Amount']
print(df)

# Output:
      ID       Date  Amount  amount_4wk_rolling
0  10001 2019-07-01      50                60.0
1  10001 2019-05-01      15                15.0
2  10001 2019-06-25      10                10.0
3  10001 2019-05-27      20                35.0
4  10002 2019-06-29      25                25.0
5  10002 2019-07-18      35                60.0
6  10002 2019-07-18      40               100.0
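
If you want to be defensive about that alignment, for example after changing how amounts is built, a quick optional check before assigning is that the remaining index labels are exactly df's index:

aligned = amounts.droplevel(0)
# Pandas aligns on the index during assignment, so these labels must be exactly
# df's index (order does not matter).
assert aligned.index.sort_values().equals(df.index.sort_values())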
