#4 -> Operations in Pandas

  • Before we start, I would like you to go through the earlier articles in this series, which will help you get started with Pandas.

Unique Operation

  • First, let's create a data frame.
import numpy as np 
import pandas as pd 
dataframe = pd.DataFrame({'col1':[1,2,3,4],
'col2':[444,555,666,444],
'col3':['abc','def','ghi','xyz']})
print(dataframe)

'''
OUTPUT->
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
'''
  • Now let's find the unique values in a data frame. To get the unique values in a particular column, we can make use of .unique().
print(dataframe['col2'].unique())

'''
OUTPUT ->
[444 555 666]
'''
  • If you just want the number of unique elements, use .nunique() instead of .unique().
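  • A minimal sketch of .nunique(), rebuilding the same data frame so it runs on its own. The variable name n_unique is only illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({'col1': [1, 2, 3, 4],
                          'col2': [444, 555, 666, 444],
                          'col3': ['abc', 'def', 'ghi', 'xyz']})

# .nunique() counts the distinct values instead of listing them
n_unique = dataframe['col2'].nunique()
print(n_unique)

'''
OUTPUT ->
3
'''
```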

  • If you want to find how many times each unique value occurs in that column, we can do it as below.

print(dataframe['col2'].value_counts())

'''
OUTPUT ->
444    2
666    1
555    1
'''
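  • If you would rather see shares than raw counts, .value_counts() also accepts normalize=True. This is a small sketch on the same data. The name proportions is just an illustrative choice, and the exact printed layout depends on your Pandas version, so it is not shown here.

```python
import pandas as pd

dataframe = pd.DataFrame({'col1': [1, 2, 3, 4],
                          'col2': [444, 555, 666, 444],
                          'col3': ['abc', 'def', 'ghi', 'xyz']})

# normalize=True returns the fraction of rows holding each value
proportions = dataframe['col2'].value_counts(normalize=True)
print(proportions)

# 444 appears in 2 of the 4 rows, so its share is 0.5;
# 555 and 666 appear once each, so their share is 0.25
```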

Apply Operation

  • Pandas has many built-in functions, but what if you want to write your own function and apply it to the data? Pandas lets you do that with .apply().
import numpy as np 
import pandas as pd 

def times2(x):
    return x*2

dataframe = pd.DataFrame({'col1':[1,2,3,4],
'col2':[444,555,666,444],
'col3':['abc','def','ghi','xyz']})
print(dataframe['col2'].apply(times2))

'''
OUTPUT ->
0     888
1    1110
2    1332
3     888
'''
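  • The function passed to .apply() does not have to be one you defined with def. As a sketch, here is the same idea with a built-in function (len) and with a lambda. The names lengths and doubled are only illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({'col1': [1, 2, 3, 4],
                          'col2': [444, 555, 666, 444],
                          'col3': ['abc', 'def', 'ghi', 'xyz']})

# Apply a built-in function: length of every string in col3
lengths = dataframe['col3'].apply(len)
print(lengths.tolist())

# Apply a lambda: same effect as times2, written inline
doubled = dataframe['col1'].apply(lambda x: x * 2)
print(doubled.tolist())

'''
OUTPUT ->
[3, 3, 3, 3]
[2, 4, 6, 8]
'''
```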

Sorting and Ordering

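  • In order to sort by a column, we can make use of the function .sort_values(). Called on a single column it returns that column sorted; called on the data frame with a column name it reorders the whole rows.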
print(dataframe['col2'].sort_values())
print(dataframe.sort_values('col2'))

'''
OUTPUT ->
0    444
3    444
1    555
2    666


   col1  col2 col3
0     1   444  abc
3     4   444  xyz
1     2   555  def
2     3   666  ghi
'''
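  • Two common variations, sketched on the same data: sorting in descending order with ascending=False, and sorting by more than one column by passing lists. The names desc and multi are only illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame({'col1': [1, 2, 3, 4],
                          'col2': [444, 555, 666, 444],
                          'col3': ['abc', 'def', 'ghi', 'xyz']})

# Largest col2 first
desc = dataframe.sort_values('col2', ascending=False)
print(desc['col2'].tolist())

# Sort by col2 ascending, then break ties with col1 descending
multi = dataframe.sort_values(['col2', 'col1'], ascending=[True, False])
print(multi.index.tolist())

'''
OUTPUT ->
[666, 555, 444, 444]
[3, 0, 1, 2]
'''
```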

Missing Data

  • A very useful way to find out if you have any null values in the data frame is the function .isnull(). It returns a data frame of the same shape holding booleans: True where a value is missing and False otherwise.

  • A lot of the time, when you use Pandas to read in data that has missing points, Pandas will automatically fill them with a null value (NaN). We can drop those values or replace them with any value we want; let's see how we can do that.

  • First, we go ahead and create a data frame using a dictionary.

import numpy as np 
import pandas as pd 
dic = {'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]}
dataframe = pd.DataFrame(dic)
print(dataframe)

'''
OUTPUT ->
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
'''
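  • With this data frame in hand, the .isnull() check mentioned above can be tried out. The sketch below also chains .sum() to count the missing values per column, which works because True counts as 1. The names mask and missing_per_column are only illustrative.

```python
import numpy as np
import pandas as pd

dic = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
dataframe = pd.DataFrame(dic)

# Boolean data frame: True wherever a value is missing
mask = dataframe.isnull()
print(mask)

# Summing the booleans gives the number of missing values per column
missing_per_column = mask.sum()
print(missing_per_column.to_dict())

# Column A has 1 missing value, B has 2, and C has none
```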
  • A lot of the time you might just want to drop the missing values from the data frame. For this we make use of dataframe.dropna(). By default it drops any row that has a null value; if we want to drop the columns that have a null value instead, we can pass dataframe.dropna(axis=1).
  • We can also set a threshold. When we give thresh a number, a row is kept only if it has at least that many non-null values.
print(dataframe.dropna(thresh=2))
'''
OUTPUT->
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
'''
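  • For comparison with thresh, here is a small sketch of what dropna() does by default and with axis=1 on the same data. The names rows_dropped and cols_dropped are only illustrative.

```python
import numpy as np
import pandas as pd

dic = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
dataframe = pd.DataFrame(dic)

# Default: drop every row that contains at least one NaN
rows_dropped = dataframe.dropna()
print(rows_dropped.index.tolist())

# axis=1: drop every column that contains at least one NaN
cols_dropped = dataframe.dropna(axis=1)
print(cols_dropped.columns.tolist())

'''
OUTPUT ->
[0]
['C']
'''
```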
  • What if you want to fill in a different value instead of the null? Then we can make use of dataframe.fillna(value='NON').
print(dataframe.fillna(value='NON'))
'''
OUTPUT->
     A    B  C
0  1.0  5.0  1
1  2.0  NON  2
2  NON  NON  3
'''
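  • What if we want to fill the gaps with a mean? Here we use the mean of column A (1.5), which is filled into every missing cell, including those in column B.
print(dataframe.fillna(value=dataframe['A'].mean()))
'''
OUTPUT->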
     A    B  C
0  1.0  5.0  1
1  2.0  1.5  2
2  1.5  1.5  3
'''
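  • Filling column B with the mean of column A is often not what you want. A possible alternative, shown here as a sketch, is to pass dataframe.mean() so that each column is filled with its own mean. The name filled is only illustrative.

```python
import numpy as np
import pandas as pd

dic = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
dataframe = pd.DataFrame(dic)

# dataframe.mean() gives one mean per column (A -> 1.5, B -> 5.0, C -> 2.0);
# fillna matches them to the columns by name
filled = dataframe.fillna(value=dataframe.mean())
print(filled['A'].tolist())
print(filled['B'].tolist())

'''
OUTPUT ->
[1.0, 2.0, 1.5]
[5.0, 5.0, 5.0]
'''
```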

Pivot Table

  • This one is similar to what we have in Excel. Don't worry if you are not familiar with it; we will now look at an example to see how it works. Let's first start with creating a new data frame.
import numpy as np 
import pandas as pd 
foo_bar = ('foo foo foo bar bar bar').split()
one_two = ('one one two two one one ').split()
x = ('x y z x y z').split()
data= {'A':foo_bar, 'B':one_two,'C':x , 'D': [1,3,2,5,4,1]}
dataframe = pd.DataFrame(data)
print(dataframe)

'''
OUTPUT ->

     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  z  2
3  bar  two  x  5
4  bar  one  y  4
5  bar  one  z  1
'''
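  • Let's now create a pivot table. The values in D are aggregated (the mean is the default) for every combination of A and B, which form a multi-level row index, while the values of C become the columns. Combinations that have no data show up as NaN.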
print(dataframe.pivot_table(values='D',index=['A','B'],columns=['C']))

'''

OUTPUT-> 
C          x    y    z
A   B
bar one  NaN  4.0  1.0
    two  5.0  NaN  NaN
foo one  1.0  3.0  NaN
    two  NaN  NaN  2.0
'''
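  • In the table above every cell holds at most one value, so the aggregation is not visible. As a sketch, pivoting on A and B only and passing aggfunc='sum' shows several rows being combined into one cell. The name table is only illustrative.

```python
import pandas as pd

data = {'A': 'foo foo foo bar bar bar'.split(),
        'B': 'one one two two one one'.split(),
        'C': 'x y z x y z'.split(),
        'D': [1, 3, 2, 5, 4, 1]}
dataframe = pd.DataFrame(data)

# Rows sharing the same (A, B) pair are summed into a single cell
table = dataframe.pivot_table(values='D', index='A', columns='B', aggfunc='sum')
print(table)

# foo/one = 1 + 3 = 4, foo/two = 2
# bar/one = 4 + 1 = 5, bar/two = 5
```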

Thank-you!

I am glad you made it to the end of this article. I hope you got to learn something. If so, please leave a Like, which will encourage me for my upcoming write-ups.