#1 -> Introduction to Pandas

What is Pandas?

  • Pandas is an open-source library built on top of NumPy. Its main advantage is that it allows fast data analysis, data cleaning, and data preparation.
  • It is often described as Python's version of Excel, and it can work with data from a wide variety of sources.
  • It also has many built-in visualization tools.
  • To install Pandas:
  • If you use the Anaconda distribution, run the command -> conda install pandas
  • If you installed Python some other way, run the command -> pip install pandas
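  • A quick way to check that the installation worked is to import the library and print its version. This is only a minimal sketch; the version string you see depends on what your package manager installed.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the exact version will vary by machine.
installed_version = pd.__version__
print(installed_version)
```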

Series

A Series is very similar to a NumPy array. The key difference is that a Series can have axis labels, which means its elements can be accessed by label as well as by position. A few examples make this clearer.

Creating a Series

  • Let's create a few Python objects that we can build a Series from.
import numpy as np 
import pandas as pd 
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
dic = {'a':10,'b':20,'c':30}
# Now we have a set of separate Python objects
# Next, let's see how to create a Series from them
  • Now we will see how to create a Series.
import numpy as np 
import pandas as pd 
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
dic = {'a':10,'b':20,'c':30}
# Now we have a set of separate Python objects
# Create a Series from the list; Pandas adds a default integer index
print(pd.Series(data = my_data))
  • The output will be:
0    10
1    20
2    30
dtype: int64
  • Now we can change the index to whatever we want it to be. In the code below, the index is labelled, so we can reach each value using its label.
print(pd.Series(data = my_data,index = labels))

#OUTPUT will be 
#a    10
#b    20
#c    30
#dtype: int64
  • There are a few more ways to create a Series: we can pass a NumPy array, or we can pass a dictionary. With a dictionary, Pandas uses the keys as the index labels and assigns them to the corresponding values.
print(pd.Series(arr,labels))
print(pd.Series(dic))
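  • To see that the three routes lead to the same result, here is a small sketch that builds the Series from a list, from a NumPy array, and from a dictionary, then compares them. The variable names from_list, from_array and from_dict are only illustrative.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
my_data = [10, 20, 30]

from_list = pd.Series(data=my_data, index=labels)    # list plus explicit labels
from_array = pd.Series(np.array(my_data), labels)    # NumPy array plus explicit labels
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})   # keys become the index labels

# Element-wise comparison works here because all three share the same labels.
same_as_array = bool((from_list == from_array).all())
same_as_dict = bool((from_list == from_dict).all())
print(same_as_array, same_as_dict)
print(list(from_dict.index))
```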

Grabbing information from a Series

  • For this, I will create two different Series. Looking up a value works much like getting a value from a dictionary or an array.
import numpy as np 
import pandas as pd 
series1 = pd.Series([1,2,3,4],['India','USA','China','Russia'])
series2 = pd.Series([1,2,5,4],['India','USA','Japan','Russia'])
print(series1['India'])
print(series2['Japan'])
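  • The dictionary-like and array-like behaviour can be sketched a little further. The example below reuses series1 and shows a label lookup, a positional lookup with iloc, a membership test, and get with a fallback value. The variable names are only illustrative.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3, 4], ['India', 'USA', 'China', 'Russia'])

by_label = series1['China']            # label lookup, like a dictionary key
by_position = series1.iloc[2]          # positional lookup, like an array index
has_japan = 'Japan' in series1         # membership is tested against the index labels
fallback = series1.get('Japan', 0)     # returns the default when the label is missing

print(by_label, by_position, has_japan, fallback)  # 3 3 False 0
```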

Data Frames

  • A DataFrame is a 2-dimensional labelled data structure with columns of potentially different types. You can think of it like a spreadsheet or an SQL table. Let's build one and see how the output looks.
import numpy as np 
import pandas as pd 
from numpy.random import randn
np.random.seed(101)
dataframe = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
print(dataframe)
  • The output will look like this:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
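  • The random DataFrame above holds only floats, so here is a small sketch of the "columns of potentially different types" idea. The table and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column with its own dtype.
people = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [31, 25, 47],
        "height_m": [1.62, 1.80, 1.75],
    },
    index=["r1", "r2", "r3"],
)

print(people.dtypes)   # one dtype per column: text, integer, float
print(people.shape)    # (3, 3)
```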

How to select data from a DataFrame

  • To select a single column, we can write print(dataframe['W']). The output is itself a Series.
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
  • If you want more than one column, pass a list of column names, like print(dataframe[['W','Z']]). The result is a DataFrame.
          W         Z
A  2.706850  0.503826
B  0.651118  0.605965
C -2.018168 -0.589001
D  0.188695  0.955057
E  0.190794  0.683509
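  • The difference between the two selection styles can be checked directly. In this sketch, single brackets give back a Series, while a list of column names gives back a DataFrame, even when the list has just one name.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
dataframe = pd.DataFrame(np.random.randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

single = dataframe['W']              # one column name -> Series
one_col_frame = dataframe[['W']]     # list with one name -> DataFrame with one column
two_cols = dataframe[['W', 'Z']]     # list with two names -> DataFrame with two columns

print(type(single).__name__)         # Series
print(type(one_col_frame).__name__)  # DataFrame
print(two_cols.shape)                # (5, 2)
```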
  • To add a new column, assign values to a column name that does not exist yet:
import numpy as np 
import pandas as pd 
from numpy.random import randn
np.random.seed(101)
dataframe = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
dataframe['new'] = dataframe['W'] + dataframe['Y']
print(dataframe)
  • The output will be:
          W         X         Y         Z       new
A  2.706850  0.628133  0.907969  0.503826  3.614819
B  0.651118 -0.319318 -0.848077  0.605965 -0.196959
C -2.018168  0.740122  0.528813 -0.589001 -1.489355
D  0.188695 -0.758872 -0.933237  0.955057 -0.744542
E  0.190794  1.978757  2.605967  0.683509  2.796762
  • We can delete the column with the command -> dataframe.drop('new',axis=1,inplace=True). Here inplace=True makes the change apply to the original DataFrame; without it, drop returns a new DataFrame and leaves the original unchanged.

  • Rows are axis=0 and columns are axis=1; this convention comes from NumPy. In the shape tuple (dataframe.shape), the number of rows sits at index 0 and the number of columns at index 1.
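  • Here is a short sketch of how axis and inplace behave together, using shape to track the effect of each call. The name without_row is only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
dataframe = pd.DataFrame(np.random.randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])
dataframe['new'] = dataframe['W'] + dataframe['Y']
print(dataframe.shape)        # (5, 5): 5 rows at index 0, 5 columns at index 1

# Without inplace, drop returns a new DataFrame and the original is untouched.
without_row = dataframe.drop('E', axis=0)
print(without_row.shape)      # (4, 5)
print(dataframe.shape)        # (5, 5)

# With inplace=True, the original DataFrame itself loses the column.
dataframe.drop('new', axis=1, inplace=True)
print(dataframe.shape)        # (5, 4)
```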

  • To select rows of a DataFrame, we can use two methods: loc, which selects by label, and iloc, which selects by integer position.

print(dataframe.loc['A'])
# OR
print(dataframe.iloc[0])
# iloc selects a row by its integer position instead of its label

# OR select a particular element by row label and column label
print(dataframe.loc['B','Y'])
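  • As a rough check that loc and iloc reach the same data, the sketch below compares a row fetched both ways, and a single element fetched both ways. The last lines pass lists to loc to pull out a small sub-table, which is an extension of the same idea rather than something covered above.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
dataframe = pd.DataFrame(np.random.randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

row_by_label = dataframe.loc['A']       # row 'A' by label
row_by_position = dataframe.iloc[0]     # the same row by integer position
print(row_by_label.equals(row_by_position))   # True

# Row 'B' is at position 1 and column 'Y' is at position 2.
same_element = dataframe.loc['B', 'Y'] == dataframe.iloc[1, 2]
print(same_element)                     # True

# Lists of labels select several rows and columns at once.
subset = dataframe.loc[['A', 'B'], ['W', 'Y']]
print(subset.shape)                     # (2, 2)
```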

Thank-you!

I am glad you made it to the end of this article. I hope you learned something. If so, please leave a Like, which will encourage me in my upcoming write-ups.