Showing posts with label group by. Show all posts
Showing posts with label group by. Show all posts

18 June 2022

pandas - how to group by multiple columns and get sums of multiple other columns

https://stackoverflow.com/questions/46431243/pandas-dataframe-groupby-how-to-get-sum-of-multiple-columns


Question:

I have a pandas dataframe which looks like this:

index col1   col2   col3   col4   col5
0     a      c      1      2      f 
1     a      c      1      2      f
2     a      d      1      2      f
3     b      d      1      2      g
4     b      e      1      2      g
5     b      e      1      2      g

I want to group by col1 and col2 and get the sum() of col3 and col4. col5 can be dropped since the data can not be aggregated.

Here is what the output should look like. I am interested in having both col3 and col4 in the resulting dataframe. It doesn't really matter if col1 and col2 are part of the index or not.

index col1   col2   col3   col4   
0     a      c      2      4          
1     a      d      1      2      
2     b      d      1      2      
3     b      e      2      4 


Solution:

df.groupby(['col1','col2']).agg(
     sum_col3 = ('col3','sum'),
     sum_col4 = ('col4','sum'),
     ).reset_index()



02 May 2022



Input CSV File:

country, population
Usa, 1273
Usa, 4343
Usa, 1240
Uk, 7879
Uk, 3224
Uk, 4342
Tr, 6565
Tr, 7889
Tr, 1980

========================
import csv

index = {}
with open('/tmp/data.csv') as f:
    cr = csv.reader(f)
    next(cr) # skip header row
    for row in cr:
        index.setdefault(row[0], []).append(int(row[1]))

print("['country', 'avgPop']")
for c, v in index.items():
    print("['{}', '{}']".format(c, int(sum(v) / len(v))))


Result:

['country', 'avgPop']
['Usa', '2285']
['Uk', '5148']
['Tr', '5478']