Groupby

Groupby allows you to group together rows based on a column and perform collective functions on them, which means that a function that takes in many values and spits out a single value.

For understanding this we will create a data frame first.


import numpy as np 
import pandas as pd 
from numpy.random import randn
dic = {'Company':['Myntra','Myntra','Flipkart','Flipkart','Amazon','Amazon'],'Person':['Aks','Sam','Charlie','Vein','Bruce','Kent'],'Sales':[123,234,454,654,905,643]}
dataframe = pd.DataFrame(dic)
print(dataframe)
'''
OUTPUT->
    Company   Person  Sales
0    Myntra      Aks    123
1    Myntra      Sam    234
2  Flipkart  Charlie    454
3  Flipkart     Vein    654
4    Amazon    Bruce    905
5    Amazon     Kent    643
'''

for instance, let's Groupby the company names and find the sales mean.

by_comp = dataframe.groupby('Company')
print(by_comp.mean())
'''
OUTPUT->
          Sales
Company
Amazon    774.0
Flipkart  554.0
Myntra    178.5
'''

Since it returns a data frame, we can do all sort of functions that we already saw in the previous articles.
Like if you want a bunch of information, we can use the describe function.

print(by_comp.describe())
'''
OUTPUT->
         count   mean         std    min     25%    50%     75%    max
Company
Amazon     2.0  774.0  185.261977  643.0  708.50  774.0  839.50  905.0
Flipkart   2.0  554.0  141.421356  454.0  504.00  554.0  604.00  654.0
Myntra     2.0  178.5   78.488853  123.0  150.75  178.5  206.25  234.0
'''

Merging, Joining and Concatenation

First, we need to create a few data frame to perform the tasks.

import numpy as np 
import pandas as pd 
from numpy.random import randn
dataframe1 = pd.DataFrame({'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']},index=[0,1,2,3])
dataframe2 = pd.DataFrame({'A':['A4','A5','A6','A7'],'B':['B4','B5','B6','B7'],'C':['C4','C5','C6','C7'],'D':['D4','D5','D6','D7']},index=[4,5,6,7])
dataframe3 = pd.DataFrame({'A':['A8','A9','A10','A11'],'B':['B8','B9','B10','B11'],'C':['C8','C9','C10','C11'],'D':['D9','D10','D11','D12']},index=[8,9,10,11])
print(dataframe1)
print(dataframe2)
print(dataframe3)

'''
OUTPUT->

dataframe1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

dataframe2
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

dataframe3
      A    B    C    D
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
'''

Concatenation

Concatenation -> basically glues together DataFrames. Keep in mind that the dimensions should match along with the axis you are glueing to. Here we make use of the command pd.concat

print(pd.concat([dataframe1,dataframe2,dataframe3]))

'''
OUTPUT->
      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11
'''

Merging

Merging - This function allows us to merge the DataFrames together using similar logic as merging in SQL. For example.

import numpy as np 
import pandas as pd 
from numpy.random import randn
dataframe1 = pd.DataFrame({'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3'],'C':['C0','C1','C2','C3'],'Key':['Key1','Key2','Key3','Key4']},index=[0,1,2,3])
dataframe2 = pd.DataFrame({'A':['A4','A5','A6','A7'],'B':['B4','B5','B6','B7'],'C':['C4','C5','C6','C7'],'Key':['Key1','Key2','Key3','Key4']},index=[4,5,6,7])
dataframe3 = pd.DataFrame({'A':['A8','A9','A10','A11'],'B':['B8','B9','B10','B11'],'C':['C8','C9','C10','C11'],'Key':['Key1','Key2','Key3','Key4']},index=[8,9,10,11])
print(dataframe1)
print(dataframe2)
print(dataframe3)


'''
    A   B   C   Key
0  A0  B0  C0  Key1
1  A1  B1  C1  Key2
2  A2  B2  C2  Key3
3  A3  B3  C3  Key4

    A   B   C   Key
4  A4  B4  C4  Key1
5  A5  B5  C5  Key2
6  A6  B6  C6  Key3
7  A7  B7  C7  Key4

      A    B    C   Key
8    A8   B8   C8  Key1
9    A9   B9   C9  Key2
10  A10  B10  C10  Key3
11  A11  B11  C11  Key4
'''

Now let's merge using the pd.merge().

print(pd.merge(dataframe1,dataframe2,how='inner',on='Key'))
'''
OUTPUT->
  A_x B_x C_x   Key A_y B_y C_y
0  A0  B0  C0  Key1  A4  B4  C4
1  A1  B1  C1  Key2  A5  B5  C5
2  A2  B2  C2  Key3  A6  B6  C6
3  A3  B3  C3  Key4  A7  B7  C7
'''

Joining

Joining -> Joining is a convenient method for combining the column of two potential different-indexed DataFrames into a single result DataFrame. This you can think of as the same as merging instead the index serves as the key.

import numpy as np 
import pandas as pd 
from numpy.random import randn
dataframe1 = pd.DataFrame({'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']},index=['Key1','Key2','Key3','Key4'])
dataframe2 = pd.DataFrame({'C':['C4','C5','C6','C7'],'D':['D4','D5','D6','D7']},index=['Key1','Key2','Key3','Key4'])
#dataframe3 = pd.DataFrame({'A':['A8','A9','A10','A11'],'B':['B8','B9','B10','B11'],'C':['C8','C9','C10','C11']},index=['Key1','Key2','Key3','Key4'])
print(dataframe1)
print(dataframe2)
print(dataframe1.join(dataframe2))


'''
       A   B
Key1  A0  B0
Key2  A1  B1
Key3  A2  B2
Key4  A3  B3

       C   D
Key1  C4  D4
Key2  C5  D5
Key3  C6  D6
Key4  C7  D7

OUTPUT-> 
       A   B   C   D
Key1  A0  B0  C4  D4
Key2  A1  B1  C5  D5
Key3  A2  B2  C6  D6
Key4  A3  B3  C7  D7
'''

Thank-you!

I am glad you made it to the end of this article. I hope you got to learn something, if so please leave a Like which will encourage me for my upcoming write-ups.

My GitHub Repos

Connect with me on Linkedin

Start your own blogs

#3 -> Merging and Joining in Pandas