Pandas – What is a DataFrame Explained With Examples

A pandas DataFrame is a two-dimensional, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns. In this article, we’ll explain how to create a DataFrame from dictionaries, lists, and custom indexes, how to use the fillna() and dropna() methods, how to iterate over rows and columns, and finally how to use some common functions, with examples.

DataFrame Features

Columns can be of different types.
The size of a DataFrame is mutable.
The axes (rows and columns) are labeled.
Arithmetic operations can be performed on rows and columns.
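
The features listed above can be seen in a short sketch. The column names and values here (including the Discount and NetFee columns) are illustrative assumptions, not data from a real dataset.

```python
import pandas as pd

# Columns of different types: strings, integers, and floats side by side.
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
    "Discount": [0.10, 0.15, 0.20],
})
print(df.dtypes)

# Size is mutable: add a column, then remove a row.
df["Duration"] = ["30days", "40days", "35days"]
df = df.drop(index=2)
print(df.shape)

# Labeled axes: row labels and column labels.
print(df.index.tolist())
print(df.columns.tolist())

# Arithmetic on columns is applied element-wise.
df["NetFee"] = df["Fee"] * (1 - df["Discount"])
print(df[["Courses", "NetFee"]])
```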

DataFrame Methods

The following are some of the most used pandas DataFrame methods and attributes. Names without parentheses are attributes and are accessed without calling them.

FUNCTION    DESCRIPTION
index       Attribute. Returns the index (row labels) of the DataFrame.
axes        Attribute. Returns a list representing the axes of the DataFrame.
insert()    Inserts a column into a DataFrame at a specified position.
add()       Returns the addition of DataFrame and other, element-wise (binary operator add).
sub()       Returns the subtraction of DataFrame and other, element-wise (binary operator sub).
mul()       Returns the multiplication of DataFrame and other, element-wise (binary operator mul).
div()       Returns the floating division of DataFrame and other, element-wise (binary operator truediv).
dtypes      Attribute. Returns a Series with the data type of each column.
unique()    Series method. Extracts the unique values of a single column, for example df['col'].unique().
loc[]       Retrieves rows (and optionally columns) based on index labels.
drop()      Deletes rows or columns from a DataFrame.
pop()       Removes a column from a DataFrame and returns it as a Series.
columns     Attribute. Holds the column labels; assign to it to change the column names.
dropna()    Drops rows or columns that contain null values, in different ways.
fillna()    Replaces NaN values with a value of your choice.
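
As a rough illustration of a few entries from the list above, the following sketch uses a small made-up DataFrame. The column names A, B, and C are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# insert(): add a column named "C" at position 1.
df.insert(1, "C", [7, 8, 9])

# add() / mul(): element-wise arithmetic with a scalar.
plus_one = df.add(1)
doubled = df.mul(2)

# dtypes is an attribute, so it is accessed without parentheses.
print(df.dtypes)

# unique() is a Series method, so call it on a single column.
print(df["A"].unique())

# pop(): remove a column and get it back as a Series.
c_col = df.pop("C")

# drop(): returns a new DataFrame without row label 0.
without_first = df.drop(index=0)
print(without_first)
```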

Create a DataFrame Using a Dictionary of Ndarrays/Lists

To create a pandas DataFrame from a dictionary of ndarrays or lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.


import pandas as pd
# Each key becomes a column; every list has the same length (7).
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
}
df = pd.DataFrame(technologies)
print(df)

Yields below output


   Courses   Fee Duration
0    Spark  20000    30day
1  PySpark  25000   40days
2   Hadoop  26000   35days
3   Python  22000   40days
4   pandas  24000   60days
5   Oracle  21000   50days
6     Java  22000   55days
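
The length and index rules for a dictionary of arrays can be checked with a small sketch. The row labels r1 to r3 are hypothetical, and the exact wording of the error message may vary between pandas versions.

```python
import numpy as np
import pandas as pd

data = {
    "Courses": np.array(["Spark", "PySpark", "Hadoop"]),
    "Fee": np.array([20000, 25000, 26000]),
}

# No index passed: pandas uses a default index 0..n-1.
df_default = pd.DataFrame(data)
print(df_default.index)

# Index passed: its length must match the arrays (3 here).
df_labeled = pd.DataFrame(data, index=["r1", "r2", "r3"])
print(df_labeled)

# Arrays of unequal length are rejected with a ValueError.
try:
    pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [20000]})
except ValueError as err:
    print("ValueError:", err)
```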

Using a List of Dictionaries

A pandas DataFrame can be created by loading a dataset from existing storage, such as an Excel file, a CSV file, or a SQL database. It can also be created from lists, dictionaries, a list of dictionaries, and so on. In a list of dictionaries, each dictionary becomes one row and its keys become the column labels.


import pandas as pd
# Each dictionary is one row; its keys become the column labels.
data = [{'a': 5, 'b': 9, 'c': 8}, {'a': 6, 'b': 8, 'c': 4}]
df = pd.DataFrame(data)
print(df)

Yields below output


   a  b  c
0  5  9  8
1  6  8  4
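
When the dictionaries in the list do not all share the same keys, pandas fills the gaps with NaN. A minimal sketch, using made-up keys a, b, and c:

```python
import pandas as pd

rows = [
    {"a": 5, "b": 9, "c": 8},
    {"a": 6, "b": 8},  # this row has no "c" key
]
df = pd.DataFrame(rows)

# The missing "c" value in the second row shows up as NaN.
print(df)
```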

Using 2D List


import pandas as pd
data = [['William', 28], ['mia', 25], ['juli', 21], ['messi',30]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)

Yields below output


      Name  Age
0  William   28
1      mia   25
2     juli   21
3    messi   30

Using Indexes

You can supply your own row labels by passing them through the index argument.


import pandas as pd
data = {'Name':['William', 'Mia', 'messi', 'juli'], 'marks':[98, 96, 94, 90]}
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

Yields below output


          Name  marks
rank1  William     98
rank2      Mia     96
rank3    messi     94
rank4     juli     90

Locate Named Indexes

Pass a named index label to the .loc[] indexer to return the specified row. For instance, df.loc['rank3'] retrieves the row labeled rank3 as a Series, which prints with the footer Name: rank3, dtype: object.


import pandas as pd
data = {'Name':['William', 'Mia', 'messi', 'juli'], 'marks':[98, 96, 94, 90]}
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4'])
print(df.loc['rank3'])

Yields below output


Name     messi
marks       94
Name: rank3, dtype: object
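
Beyond a single label, .loc[] also accepts a list of labels, a row and column pair, or a label slice. The sketch below reuses the same data; the variable name top_two is only illustrative.

```python
import pandas as pd

data = {'Name': ['William', 'Mia', 'messi', 'juli'], 'marks': [98, 96, 94, 90]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

# A list of labels returns a DataFrame with those rows.
top_two = df.loc[['rank1', 'rank2']]
print(top_two)

# A row label plus a column label returns a single value.
print(df.loc['rank3', 'marks'])

# A label slice includes both endpoints.
print(df.loc['rank2':'rank3'])
```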

Using zip() Function


import pandas as pd
name = ['William', 'Mia', 'messi', 'juli']  
age = [30, 20, 28, 32]  
list_of_tuples = list(zip(name, age)) 
df = pd.DataFrame(list_of_tuples, columns = ['Name', 'Age'])  
print(df)

Yields below output


      Name  Age
0  William   30
1      Mia   20
2    messi   28
3     juli   32

fillna() Method

To fill null values in a dataset, use the fillna() method, which replaces NA/NaN values with a value you specify. The example below replaces every missing value with the string "None".


import pandas as pd
import numpy as np
dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.6, 5.8, 6.2, np.nan, np.nan],
    "Job": [True, np.nan, np.nan, np.nan, np.nan]
}
data = pd.DataFrame(dataset)
print(data.fillna("None"))

Yields below output


      Name   Age Height   Job
0    Messi  33.0    5.6  True
1  Ronaldo  32.0    5.8  None
2  Alisson  None    6.2  None
3  Mohamed  30.0   None  None
4     None  None   None  None
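
Filling every column with the same string is rarely what you want for numeric data. As one possible alternative, fillna() also accepts a dictionary that maps each column to its own replacement value. The replacement values chosen here (the column mean, "Unknown", and 0.0) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Name": ["Messi", "Ronaldo", np.nan],
    "Age": [33, np.nan, 31],
    "Height": [5.6, 5.8, np.nan],
})

# A dict maps each column to its own replacement value.
filled = data.fillna({
    "Name": "Unknown",
    "Age": data["Age"].mean(),
    "Height": 0.0,
})
print(filled)
```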

dropna() Method

To drop null values from a DataFrame, use the dropna() method. It can drop rows or columns that contain null values, and its parameters control which ones are removed.


import pandas as pd
import numpy as np
dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.9, 5.8, 6.2, np.nan, np.nan],
    "Designation": ['football player', np.nan, 'fp', np.nan,'true']
}
data = pd.DataFrame(dataset)
print(data)

Yields below output


      Name   Age  Height      Designation
0    Messi  33.0     5.9  football player
1  Ronaldo  32.0     5.8              NaN
2  Alisson   NaN     6.2               fp
3  Mohamed  30.0     NaN              NaN
4      NaN   NaN     NaN             true

Now we drop the rows that contain at least one null value.


import pandas as pd
import numpy as np
dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.9, 5.8, 6.2, np.nan, np.nan],
    "Designation": ['football player', np.nan, 'fp', np.nan,'true']
}
data = pd.DataFrame(dataset)
print(data.dropna())

Yields below output


    Name   Age  Height      Designation
0  Messi  33.0     5.9  football player
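
The "different ways" of dropping nulls are controlled by the how, subset, thresh, and axis parameters. The sketch below uses a reduced version of the dataset (without the Designation column) so that the last row is entirely null; that reduction is an assumption made for the example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Name": ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age": [33, 32, np.nan, 30, np.nan],
    "Height": [5.9, 5.8, 6.2, np.nan, np.nan],
})

# how="all": drop a row only when every value in it is null.
all_null_dropped = data.dropna(how="all")
print(all_null_dropped)

# subset: look for nulls only in the listed columns.
name_age_present = data.dropna(subset=["Name", "Age"])
print(name_age_present)

# thresh: keep rows with at least this many non-null values.
at_least_three = data.dropna(thresh=3)
print(at_least_three)

# axis=1: drop columns (instead of rows) that contain nulls.
complete_columns = data.dropna(axis=1)
print(complete_columns)
```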

Pandas DataFrame Iterating over rows and columns

Sometimes you need to process every value in a DataFrame. Writing a separate statement to access each value is cumbersome, so pandas supports iterating over rows and columns. Let’s see this with some examples.

To iterate over rows: df.iterrows()

To iterate over columns: df.items() (the older iteritems() was removed in pandas 2.0)


import pandas as pd
technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
              })
df = pd.DataFrame(technologies)
print(df)

Yields below output


   Courses   Fee Duration
0    Spark  20000    30day
1  PySpark  25000   40days
2   Hadoop  26000   35days
3   Python  22000   40days
4   pandas  24000   60days
5   Oracle  21000   50days
6     Java  22000   55days

Iterating over rows

To iterate over rows, we can use iterrows() or itertuples(). The iterrows() function yields each row as a pair of the row's index label and a Series holding the row's values.


import pandas as pd
technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
            })
df = pd.DataFrame(technologies)
for x, y in df.iterrows():
    print(x, y)
    print()

Yields below output.


0 Courses     Spark
Fee        20000
Duration    30day
Name: 0, dtype: object

1 Courses     PySpark
Fee          25000
Duration     40days
Name: 1, dtype: object

2 Courses     Hadoop
Fee         26000
Duration    35days
Name: 2, dtype: object

3 Courses     Python
Fee         22000
Duration    40days
Name: 3, dtype: object

4 Courses     pandas
Fee         24000
Duration    60days
Name: 4, dtype: object

5 Courses     Oracle
Fee         21000
Duration    50days
Name: 5, dtype: object

6 Courses       Java
Fee         22000
Duration    55days
Name: 6, dtype: object
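
For comparison, here is a minimal sketch of itertuples(), which yields each row as a namedtuple instead of a Series. It uses a shortened version of the courses data, and the variable name total_fee is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [20000, 25000, 26000],
})

# Each row arrives as a namedtuple: Index plus one field per column.
for row in df.itertuples():
    print(row.Index, row.Courses, row.Fee)

# index=False leaves the row label out of the tuple.
total_fee = sum(row.Fee for row in df.itertuples(index=False))
print(total_fee)
```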

Iterating over columns

To iterate over columns, create a list of the DataFrame's column labels and loop through that list to pull out each column. The example below retrieves the fifth element (row label 4) of each column.


import pandas as pd
technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
          })
df = pd.DataFrame(technologies)
columns = list(df)  # list(df) returns the column labels
for i in columns:
    # Row label 4 is the fifth row of each column.
    print(df[i][4])

Yields below output


pandas
24000
60days
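
Another way to walk over columns, sketched here on a shortened version of the courses data, is df.items(), which yields each column label together with the column as a Series. The variable name labels is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [20000, 25000, 26000],
})

# items() yields (column label, column Series) pairs.
for label, column in df.items():
    print(label, column.tolist())

# Collect just the labels, in column order.
labels = [label for label, _ in df.items()]
print(labels)
```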

