#2 -> More on Data Frames in Pandas
- Before we start, I would like you to go through the article below, which will help you get started with Pandas.
Let's first create a data frame. We are using the same one as in the above article; in case you want to recreate it, here is the code.
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
dataframe = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
print(dataframe)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
Conditional Selection
- Conditional selection in Pandas is quite similar to what we had in NumPy. If you check for elements greater than 0, you get back a data frame of Booleans.
print(dataframe > 0)
W X Y Z
A True True True True
B True False False True
C False True True False
D True False False True
E True True True True
- Now, to print the elements corresponding to these Booleans, pass the condition back into the data frame. Elements where the condition is False show up as NaN.
print(dataframe[dataframe>0])
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 NaN NaN 0.605965
C NaN 0.740122 0.528813 NaN
D 0.188695 NaN NaN 0.955057
E 0.190794 1.978757 2.605967 0.683509
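To see the masking on something small enough to check by eye, here is a minimal sketch. The frame `df` and its values are made up for illustration and are not the random data used above. It also shows that `df.where(mask)` gives the same result as `df[mask]`.

```python
import pandas as pd

# Small hand-written frame (illustrative values, not the article's data).
df = pd.DataFrame(
    {"W": [2.0, -1.0, 0.5], "X": [-0.3, 0.7, 1.2]},
    index=["A", "B", "C"],
)

mask = df > 0                   # data frame of Booleans, same shape as df
masked = df[mask]               # cells where mask is False become NaN
same_as_where = df.where(mask)  # equivalent spelling of the same operation

print(masked)
print(masked.equals(same_as_where))
print(int(masked.isna().sum().sum()))  # how many cells were masked out
```

The shape of the frame never changes here; only the values that fail the condition are replaced.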
- If you don't want the NaN values, pass a condition on a particular column instead of on the whole data frame. This keeps only the rows where the condition is True.
print(dataframe[dataframe['W']>0])
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
- What if you want just the X column, for the rows where the W column is greater than 0?
print(dataframe[dataframe['W']>0]['X'])
A 0.628133
B -0.319318
D -0.758872
E 1.978757
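The chained form above filters the rows first and then picks the column. As a hedged alternative, `.loc` can take the row condition and the column label together. This is only a sketch, and the small frame `df` is invented for illustration.

```python
import pandas as pd

# Illustrative frame, not the random data from the article.
df = pd.DataFrame(
    {"W": [2.0, -1.0, 0.5], "X": [-0.3, 0.7, 1.2]},
    index=["A", "B", "C"],
)

chained = df[df["W"] > 0]["X"]     # two steps: filter rows, then pick column
single = df.loc[df["W"] > 0, "X"]  # one step: rows and column together

print(single)
print(chained.equals(single))
```

Both give the same Series for reading. The single `.loc` call is usually the safer habit if you later want to assign to the selection.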
Multiple conditions
- Most of the time we need several conditions to hold at once. Suppose you want the rows where the W column is greater than 0 and the Y column is greater than 1. You could do something like this:
print(dataframe[(dataframe['W'] > 0)&(dataframe['Y']>1)])
W X Y Z
E 0.190794 1.978757 2.605967 0.683509
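- When you combine conditions you can't use Python's and keyword, because it tries to reduce each whole column to a single True or False and raises an error. Hence we use the & sign. If you want the or operator, use the pipe | sign. Wrap each condition in parentheses, since & and | bind more tightly than comparisons such as >.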
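A small sketch of the three cases, using a made-up frame `df` (the values are assumptions chosen so the result is easy to predict): `&` for and, `|` for or, and the `ValueError` you get from the plain `and` keyword.

```python
import pandas as pd

# Illustrative frame, not the random data from the article.
df = pd.DataFrame(
    {"W": [2.0, -1.0, 0.5], "Y": [0.9, 1.5, 2.6]},
    index=["A", "B", "C"],
)

both = df[(df["W"] > 0) & (df["Y"] > 1)]    # rows where both conditions hold
either = df[(df["W"] > 0) | (df["Y"] > 1)]  # rows where at least one holds

try:
    df[(df["W"] > 0) and (df["Y"] > 1)]     # plain `and` on two Series
    raised = False
except ValueError:
    raised = True                           # truth value of a Series is ambiguous

print(list(both.index))
print(list(either.index))
print(raised)
```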
Resetting the Index
To reset the index to the default integer index, we can use dataframe.reset_index(inplace=True). The old index labels are kept as a new column. Remember that without inplace=True the change is not permanent: reset_index returns a new data frame, and if you print the original again you will get the old data frame.
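Here is a minimal sketch of that behaviour on a tiny made-up frame `df`. It shows that calling `reset_index()` without `inplace=True` leaves the original untouched, and that `drop=True` discards the old labels instead of keeping them as a column.

```python
import pandas as pd

# Illustrative frame with a labelled index.
df = pd.DataFrame({"W": [1.0, 2.0, 3.0]}, index=["A", "B", "C"])

returned = df.reset_index()          # new frame; df itself is unchanged
print(list(df.index))                # original labels are still there
print(list(returned.columns))        # old labels moved into an 'index' column

dropped = df.reset_index(drop=True)  # throw the old labels away instead
print(list(dropped.columns))
```

Assigning the result back (`df = df.reset_index()`) is a common alternative to `inplace=True`.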
Now I will create a new Index list and try to set it to the data frame.
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
dataframe = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
new_index = 'MP AP UP TN WB'.split()
print(new_index)
#OUTPUT -> ['MP', 'AP', 'UP', 'TN', 'WB']
- Now let's set this list as the index of the data frame.
dataframe['States'] = new_index
dataframe.set_index('States',inplace=True)
print(dataframe)
- The output will be.
W X Y Z
States
MP 2.706850 0.628133 0.907969 0.503826
AP 0.651118 -0.319318 -0.848077 0.605965
UP -2.018168 0.740122 0.528813 -0.589001
TN 0.188695 -0.758872 -0.933237 0.955057
WB 0.190794 1.978757 2.605967 0.683509
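Notice that the old A to E labels are gone: set_index replaces the existing index rather than keeping it. A small sketch of the same steps on a made-up frame `df` (values are illustrative), this time without `inplace=True` so both versions can be compared:

```python
import pandas as pd

# Illustrative frame; 'States' is added as an ordinary column first.
df = pd.DataFrame({"W": [1.0, 2.0, 3.0]}, index=["A", "B", "C"])
df["States"] = ["MP", "AP", "UP"]

by_state = df.set_index("States")    # returns a new frame; df keeps A, B, C

print(by_state.loc["AP", "W"])       # look a row up by its new label
print("States" in by_state.columns)  # the column has moved into the index
print(list(by_state.index))
print(list(df.index))
```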
Multi-Level Indexing
- To create a multi-level index, we will use a special function available in Pandas, pd.MultiIndex.from_tuples.
import numpy as np
import pandas as pd
from numpy.random import randn #needed for randn() further down
outside = (' G1 G1 G1 G2 G2 G2').split()
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside)) #Pair each outside label with an inside label (a list of tuples)
print(hier_index)
#OUTPUT -> [('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]
hier_index = pd.MultiIndex.from_tuples(hier_index) #Main function to create a multi-level index
print(hier_index)
'''
Output will be ->
MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
)
'''
#Now create a dataframe
dataframe = pd.DataFrame(randn(6,2),hier_index,['A','B'])
print(dataframe)
'''
A B
G1 1 1.913472 0.444590
2 -1.013842 -1.064901
3 -1.102353 0.255780
G2 1 -1.300105 -0.552788
2 0.704361 0.850760
3 0.199433 0.864586
'''
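When every outside label is combined with every inside label, as it is here, the same index can also be built without writing out the pairs. This is a hedged sketch using `pd.MultiIndex.from_product`; the variable names are only for illustration.

```python
import pandas as pd

# The pairs as built in the article, written out compactly.
tuples = list(zip(["G1"] * 3 + ["G2"] * 3, [1, 2, 3, 1, 2, 3]))
from_tuples = pd.MultiIndex.from_tuples(tuples)

# Every outside label crossed with every inside label.
from_product = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]])

print(from_tuples.equals(from_product))
```

from_tuples remains the more general choice, since it also handles groups with different numbers of inner labels.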
- Now, to select items from the data frame, we can use the following code.
#Re-creating the data frame draws new random numbers, so the values differ from the table above
dataframe = pd.DataFrame(randn(6,2),hier_index,['A','B'])
print(dataframe.loc['G1'].loc[1])
#The idea here is that you first select on the outside index with loc, then keep calling loc for the inside index
'''
OUTPUT is ->
A 1.090103
B -1.771748
'''
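Chaining loc works, but both levels can also be given in one lookup as a tuple. A minimal sketch, using a frame filled with the numbers 0 to 11 (an assumption made so the result is predictable, unlike the random values above):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["G1", "G2"], [1, 2, 3]])
# Rows hold [0, 1], [2, 3], ... so each lookup is easy to verify.
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=["A", "B"])

chained = df.loc["G1"].loc[1]          # outside level first, then inside level
direct = df.loc[("G1", 1)]             # both levels in a single lookup
single_value = df.loc[("G2", 3), "B"]  # one cell: row tuple plus column label

print(direct)
print(single_value)
```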
- To name the outside and inside index levels, we can set the following attribute.
dataframe.index.names = ['Groups','Nums']
print(dataframe)
'''
OUTPUT ->
A B
Groups Nums
G1 1 -1.456893 0.642521
2 -0.266246 -0.880315
3 -2.137056 0.451063
G2 1 0.362317 -1.317669
2 1.165863 -0.823856
3 -1.090601 0.420701
'''
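The level names can also be supplied when the index is created, through the `names` argument of `from_tuples`. A short sketch with made-up data:

```python
import pandas as pd

tuples = [("G1", 1), ("G1", 2), ("G2", 1), ("G2", 2)]
# Name the levels up front instead of assigning index.names afterwards.
index = pd.MultiIndex.from_tuples(tuples, names=["Groups", "Nums"])
df = pd.DataFrame({"A": [10, 20, 30, 40]}, index=index)

print(list(df.index.names))
print(df.index.get_level_values("Groups").tolist())  # read one level by name
```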
- Another way of grabbing a group is the cross-section method, xs. Let's see how to do that.
print(dataframe.xs('G1'))
'''
OUTPUT ->
A B
Nums
1 -1.035331 2.059875
2 0.731371 2.915778
3 1.312425 -0.988248
'''
- With .loc it is a bit tricky to get the 1st sub-row from both G1 and G2, so we make use of .xs() with the level argument.
print(dataframe.xs(1,level='Nums'))
'''
OUTPUT ->
A B
Groups
G1 0.064494 -0.254242
G2 0.664111 0.722540
'''
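For comparison, here is a sketch of what the .loc version looks like, next to xs. The frame is filled with the numbers 0 to 11 (an assumption, so the rows are easy to check). Note that xs drops the selected level by default, while `drop_level=False` and the .loc form keep both levels.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["G1", "G2"], [1, 2, 3]], names=["Groups", "Nums"]
)
df = pd.DataFrame(np.arange(12).reshape(6, 2), index=index, columns=["A", "B"])

via_xs = df.xs(1, level="Nums")                  # 'Nums' level is dropped
kept = df.xs(1, level="Nums", drop_level=False)  # both levels are kept
via_loc = df.loc[(slice(None), 1), :]            # same rows using .loc

print(via_xs)
print(via_loc)
```

The `slice(None)` in the .loc call means "every label on the outside level", which is why xs reads more simply for this kind of selection.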
Thank you!
I am glad you made it to the end of this article. I hope you learned something. If so, please leave a Like, which will encourage me for my upcoming write-ups.