- Before we start I would like you to go through the below article which will help you get started with Pandas.
Groupby
Groupby allows you to group together rows based on a column and perform collective functions on them, which means that a function that takes in many values and spits out a single value.
- For understanding this we will create a data frame first.
import numpy as np
import pandas as pd
from numpy.random import randn
dic = {'Company':['Myntra','Myntra','Flipkart','Flipkart','Amazon','Amazon'],'Person':['Aks','Sam','Charlie','Vein','Bruce','Kent'],'Sales':[123,234,454,654,905,643]}
dataframe = pd.DataFrame(dic)
print(dataframe)
'''
OUTPUT->
Company Person Sales
0 Myntra Aks 123
1 Myntra Sam 234
2 Flipkart Charlie 454
3 Flipkart Vein 654
4 Amazon Bruce 905
5 Amazon Kent 643
'''
- for instance, let's Groupby the company names and find the sales mean.
by_comp = dataframe.groupby('Company')
print(by_comp.mean())
'''
OUTPUT->
Sales
Company
Amazon 774.0
Flipkart 554.0
Myntra 178.5
'''
- Since it returns a data frame, we can do all sort of functions that we already saw in the previous articles.
- Like if you want a bunch of information, we can use the describe function.
print(by_comp.describe())
'''
OUTPUT->
count mean std min 25% 50% 75% max
Company
Amazon 2.0 774.0 185.261977 643.0 708.50 774.0 839.50 905.0
Flipkart 2.0 554.0 141.421356 454.0 504.00 554.0 604.00 654.0
Myntra 2.0 178.5 78.488853 123.0 150.75 178.5 206.25 234.0
'''
Merging, Joining and Concatenation
- First, we need to create a few data frame to perform the tasks.
import numpy as np
import pandas as pd
from numpy.random import randn
dataframe1 = pd.DataFrame({'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3'],'C':['C0','C1','C2','C3'],'D':['D0','D1','D2','D3']},index=[0,1,2,3])
dataframe2 = pd.DataFrame({'A':['A4','A5','A6','A7'],'B':['B4','B5','B6','B7'],'C':['C4','C5','C6','C7'],'D':['D4','D5','D6','D7']},index=[4,5,6,7])
dataframe3 = pd.DataFrame({'A':['A8','A9','A10','A11'],'B':['B8','B9','B10','B11'],'C':['C8','C9','C10','C11'],'D':['D9','D10','D11','D12']},index=[8,9,10,11])
print(dataframe1)
print(dataframe2)
print(dataframe3)
'''
OUTPUT->
dataframe1
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
dataframe2
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
dataframe3
A B C D
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
'''
Concatenation
- Concatenation -> basically glues together DataFrames. Keep in mind that the dimensions should match along with the axis you are glueing to. Here we make use of the command pd.concat
print(pd.concat([dataframe1,dataframe2,dataframe3]))
'''
OUTPUT->
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11
'''
Merging
- Merging - This function allows us to merge the DataFrames together using similar logic as merging in SQL. For example.
import numpy as np
import pandas as pd
from numpy.random import randn
dataframe1 = pd.DataFrame({'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3'],'C':['C0','C1','C2','C3'],'Key':['Key1','Key2','Key3','Key4']},index=[0,1,2,3])
dataframe2 = pd.DataFrame({'A':['A4','A5','A6','A7'],'B':['B4','B5','B6','B7'],'C':['C4','C5','C6','C7'],'Key':['Key1','Key2','Key3','Key4']},index=[4,5,6,7])
dataframe3 = pd.DataFrame({'A':['A8','A9','A10','A11'],'B':['B8','B9','B10','B11'],'C':['C8','C9','C10','C11'],'Key':['Key1','Key2','Key3','Key4']},index=[8,9,10,11])
print(dataframe1)
print(dataframe2)
print(dataframe3)
'''
A B C Key
0 A0 B0 C0 Key1
1 A1 B1 C1 Key2
2 A2 B2 C2 Key3
3 A3 B3 C3 Key4
A B C Key
4 A4 B4 C4 Key1
5 A5 B5 C5 Key2
6 A6 B6 C6 Key3
7 A7 B7 C7 Key4
A B C Key
8 A8 B8 C8 Key1
9 A9 B9 C9 Key2
10 A10 B10 C10 Key3
11 A11 B11 C11 Key4
'''
- Now let's merge using the pd.merge().
print(pd.merge(dataframe1,dataframe2,how='inner',on='Key'))
'''
OUTPUT->
A_x B_x C_x Key A_y B_y C_y
0 A0 B0 C0 Key1 A4 B4 C4
1 A1 B1 C1 Key2 A5 B5 C5
2 A2 B2 C2 Key3 A6 B6 C6
3 A3 B3 C3 Key4 A7 B7 C7
'''
Joining
- Joining -> Joining is a convenient method for combining the column of two potential different-indexed DataFrames into a single result DataFrame. This you can think of as the same as merging instead the index serves as the key.
import numpy as np
import pandas as pd
from numpy.random import randn
dataframe1 = pd.DataFrame({'A':['A0','A1','A2','A3'],'B':['B0','B1','B2','B3']},index=['Key1','Key2','Key3','Key4'])
dataframe2 = pd.DataFrame({'C':['C4','C5','C6','C7'],'D':['D4','D5','D6','D7']},index=['Key1','Key2','Key3','Key4'])
#dataframe3 = pd.DataFrame({'A':['A8','A9','A10','A11'],'B':['B8','B9','B10','B11'],'C':['C8','C9','C10','C11']},index=['Key1','Key2','Key3','Key4'])
print(dataframe1)
print(dataframe2)
print(dataframe1.join(dataframe2))
'''
A B
Key1 A0 B0
Key2 A1 B1
Key3 A2 B2
Key4 A3 B3
C D
Key1 C4 D4
Key2 C5 D5
Key3 C6 D6
Key4 C7 D7
OUTPUT->
A B C D
Key1 A0 B0 C4 D4
Key2 A1 B1 C5 D5
Key3 A2 B2 C6 D6
Key4 A3 B3 C7 D7
'''
Thank-you!
I am glad you made it to the end of this article. I hope you got to learn something, if so please leave a Like which will encourage me for my upcoming write-ups.
- My GitHub Repos
- Connect with me on Linkedin
- Start your own blogs