Question

Groupby multiple columns and extract top rows based on non-grouped column value

I am trying to solve a problem some what very similar to: https://platform.stratascratch.com/coding/10362-top-monthly-sellers?code_type=2

here is my data frame:

         product  seller  market   total_sales

         books      s1      de     10     
         books      s2      jp     20
         books      s3      in     30
         books      s1      de     25     
         books      s5      in     15
         books      s1      us     12
         books      s2      uk     10
      clothing      s1      de     11
      clothing      s2      de     18
      clothing      s1      uk     55
      clothing      s3      in     31
      clothing      s1      de     14
      clothing      s2      de     10
      clothing      s1      de     35
  electronics      s1       us     18
  electronics      s1       de     12
  electronics      s2       in     16
  electronics      s3       uk     24
  electronics      s1       us     37      
  electronics      s4       jp     27
  electronics      s3       uk     26
  electronics      s1       us     15      

Expected output is:

product      seller   market   total_sales

books        s1         de        35
books        s3         in        30
books        s2         jp        20
Clothing     s1         de        60
Clothing     s1         uk        55
Clothing     s3         in        31
electronics  s1         us        70
electronics  s3         uk        50
electronics  s4         jp        27

I want to get top three product category sales in the world.

I was able to aggregate total sales basing on product,seller & group them. After grouping I am able to sort it by total_sales per group. But, I was not able to get only the top-3 total_sale rows per ['product'] group. Cany anyone help on this.

import pandas as pd

df = df.groupby(['product', 'seller','market'], as_index = False).agg({'total_sales':sum})
df = df.sort_values(['product','total_sales'],ascending=False).groupby(['product', 'seller','market',],as_index = False).aggregate(lambda x: ','.join(map(str, x)))
print(df)
 3  43  3
1 Jan 1970

Solution

 4

You can do it like this:

(dfs:= df.groupby(['product', 'seller', 'market'], as_index=False)['total_sales'].sum())\
    .loc[dfs.groupby('product')['total_sales'].rank(ascending=False)<4]\
    .sort_values(['product','total_sales'], ascending=[True, False])

Output:

        product seller market  total_sales
0         books     s1     de           35
4         books     s3     in           30
2         books     s2     jp           20
6      clothing     s1     de           60
7      clothing     s1     uk           55
9      clothing     s3     in           31
11  electronics     s1     us           70
13  electronics     s3     uk           50
14  electronics     s4     jp           27

Details:

  • Use walrus operator to assign groupby product, seller and market dataframe to a variable dfs.

  • Filter that dataframe by dfs groupby product and rank, then boolean index dfs for only to top three rank

  • Lastly, sort the filtered dataframe on product and total_sales descending.

2024-07-10
Scott Boston