Pandas – What is a DataFrame Explained With Examples

Pandas DataFrame is a two-dimensional, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: the data, the rows, and the columns. In this article, we'll explain how to create a pandas DataFrame from dictionaries, lists, and custom indexes, how to use the fillna() and dropna() methods, how to iterate over rows and columns, and finally how to use some common functions with examples.

1. DataFrame Features

  • Columns can potentially be of different types.
  • Pandas DataFrame size is mutable.
  • A DataFrame has labeled axes (rows and columns).
  • You can perform arithmetic operations on rows and columns of a DataFrame (see the sketch below).
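
As a quick illustration of the last point, here is a minimal sketch of column-level and scalar arithmetic (the DataFrame used here is made up just for this example):


# Arithmetic operations on DataFrame columns (illustrative sketch)
import pandas as pd

df = pd.DataFrame({'Fee': [20000, 25000, 26000], 'Discount': [1000, 2000, 1500]})

# Element-wise arithmetic between two columns
df['NetFee'] = df['Fee'] - df['Discount']

# Arithmetic with a scalar applies to every numeric value
print(df[['Fee', 'NetFee']] * 2)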

2. DataFrame Methods

Following are the most used pandas DataFrame methods and attributes.

FUNCTION      DESCRIPTION
index         Returns the index (row labels) of the DataFrame.
axes          Returns a list representing the axes (rows and columns) of the DataFrame.
insert()      Inserts a column into a DataFrame at a specified position.
add()         Returns the addition of DataFrame and other, element-wise (binary operator add).
sub()         Returns the subtraction of DataFrame and other, element-wise (binary operator sub).
mul()         Returns the multiplication of DataFrame and other, element-wise (binary operator mul).
div()         Returns the floating division of DataFrame and other, element-wise (binary operator truediv).
dtypes        Returns a Series with the data type of each column.
unique()      Returns the unique values of a column (Series).
loc[]         Retrieves rows (and columns) by index label.
drop()        Deletes rows or columns from a DataFrame by label.
pop()         Removes a column from the DataFrame and returns it as a Series.
columns       Attribute used to access or rename the column labels.
dropna()      Drops rows/columns with null values in different ways.
fillna()      Replaces NaN values with a value of your own.
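
To make the table more concrete, here is a minimal sketch that exercises a few of these on a small made-up DataFrame (not one of the examples used later in this article):


# Trying a few DataFrame methods and attributes (illustrative sketch)
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

print(df.index)      # RangeIndex(start=0, stop=3, step=1)
print(df.dtypes)     # data type of each column

df.insert(1, 'C', [100, 200, 300])   # insert column 'C' at position 1
print(df.add(1))                     # element-wise addition with a scalar
print(df.pop('C'))                   # remove column 'C' and return it as a Series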

3. Create a DataFrame Using a Dictionary of Ndarrays/Lists

To create a pandas DataFrame from a dictionary of ndarrays/lists, all the ndarrays/lists must be of the same length. If an index is passed, the length of the index should be equal to the length of the arrays. If no index is passed, the default index will be range(n), where n is the array length.


# Create DataFrame
import pandas as pd
technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
         })
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
  Courses   Fee Duration
0    Spark  20000    30day
1  PySpark  25000   40days
2   Hadoop  26000   35days
3   Python  22000   40days
4   pandas  24000   60days
5   Oracle  21000   50days
6     Java  22000   55days

4. Using List of Dictionaries

You can create a pandas DataFrame in different ways: by loading datasets from existing storage such as an Excel file, a CSV file, or a SQL database, or from Python objects such as lists, dictionaries, and a list of dictionaries.


# Using List of Dictionaries
import pandas as pd
data = [{'a': 5, 'b': 9, 'c': 8}, {'a': 6, 'b': 8, 'c': 4}]
df = pd.DataFrame(data)
print(df)

Yields below output.


# Output:
   a  b  c
0  5  9  8
1  6  8  4
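
The section above also mentioned loading data from existing storage. As a minimal sketch of that (the file name courses.csv is a hypothetical placeholder for your own data), a CSV file can be read into a DataFrame like this:


# Reading a DataFrame from a CSV file (hypothetical file name)
import pandas as pd

# 'courses.csv' is a placeholder; point this at your own CSV file.
# pd.read_excel() and pd.read_sql() work similarly for Excel files and SQL databases.
df_csv = pd.read_csv("courses.csv")
print(df_csv.head())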

5. Using 2D List

You can also create a DataFrame from a two-dimensional list.


# Using 2D List
import pandas as pd
data = [['William', 28], ['mia', 25], ['juli', 21], ['messi',30]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
print(df)

Yields below output.


# Output:
     Name  Age
0  William   28
1      mia   25
2     juli   21
3    messi   30

6. Using Indexes

You can label the rows yourself by passing the index argument when creating the DataFrame.


# Using Indexes
import pandas as pd
data = {'Name':['William', 'Mia', 'messi', 'juli'], 'marks':[98, 96, 94, 90]}
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

Yields below output.


# Output:
          Name  marks
rank1  William     98
rank2      Mia     96
rank3    messi     94
rank4     juli     90

7. Locate Named Indexes

Use a named index with the df.loc[] attribute to return the specified row. For instance, df.loc['rank3'] retrieves the row labeled rank3 as a Series.


# Locate Named Indexes
import pandas as pd
data = {'Name':['William', 'Mia', 'messi', 'juli'], 'marks':[98, 96, 94, 90]}
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4'])
print(df.loc['rank3'])

Yields below output.


# Output:
Name     messi
marks       94
Name: rank3, dtype: object

8. Using zip() Function

You can also create a DataFrame by zipping two or more lists into a list of tuples and passing the result to the DataFrame constructor.

# Using zip() Function
import pandas as pd
name = ['William', 'Mia', 'messi', 'juli']  
age = [30, 20, 28, 32]  
list_of_tuples = list(zip(name, age)) 
df = pd.DataFrame(list_of_tuples, columns = ['Name', 'Age'])  
print(df)

Yields below output.


# Output:
      Name  Age
0  William   30
1      Mia   20
2    messi   28
3     juli   32

9. fillna() Method

In order to fill null values in a dataset, use the fillna() method. It lets you replace NA/NaN values with a value of your own or fill them using a specified method.


# fillna() Method
import pandas as pd
import numpy as np
dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.6, 5.8, 6.2, np.nan, np.nan],
    "Job": [True, np.nan, np.nan, np.nan, np.nan]
}
data = pd.DataFrame(dataset)
print(data.fillna("None"))

Yields below output.


# Output:
      Name   Age Height   Job
0    Messi  33.0    5.6  True
1  Ronaldo  32.0    5.8  None
2  Alisson  None    6.2  None
3  Mohamed  30.0   None  None
4     None  None   None  None
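
fillna() is not limited to a single replacement value. Here is a sketch of a couple of other common patterns, reusing a trimmed-down version of the same dataset:


# Other fillna() patterns (illustrative sketch)
import pandas as pd
import numpy as np

dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.6, 5.8, 6.2, np.nan, np.nan]
}
data = pd.DataFrame(dataset)

# Fill each column with its own replacement value
print(data.fillna({"Name": "Unknown", "Age": data["Age"].mean(), "Height": 0}))

# Forward-fill: propagate the last valid value down each column
print(data.ffill())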

10. dropna() Method

In order to drop null values from a DataFrame, we use the dropna() method. This method drops rows or columns with null values in different ways. First, let's look at a DataFrame that contains null values.


# dropna() Method
import pandas as pd
import numpy as np
dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.9, 5.8, 6.2, np.nan, np.nan],
    "Designation": ['football player', np.nan, 'fp', np.nan,'true']
}
data = pd.DataFrame(dataset)
print(data)

Yields below output.


# Output:
      Name   Age  Height      Designation
0    Messi  33.0     5.9  football player
1  Ronaldo  32.0     5.8              NaN
2  Alisson   NaN     6.2               fp
3  Mohamed  30.0     NaN              NaN
4      NaN   NaN     NaN             true

Now we drop the rows that have at least one null value.


import pandas as pd
import numpy as np
dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.9, 5.8, 6.2, np.nan, np.nan],
    "Designation": ['football player', np.nan, np.nan, np.nan,'true']
}
data = pd.DataFrame(dataset)
print(data.dropna())

Yields below output.


# Output:
    Name   Age  Height      Designation
0  Messi  33.0     5.9  football player
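
dropna() also supports dropping in other ways, for example by column, only when every value is missing, or based on specific columns. A minimal sketch using the same dataset:


# Other dropna() options (illustrative sketch)
import pandas as pd
import numpy as np

dataset = {
    "Name" : ["Messi", "Ronaldo", "Alisson", "Mohamed", np.nan],
    "Age" : [33, 32, np.nan, 30, np.nan],
    "Height": [5.9, 5.8, 6.2, np.nan, np.nan],
    "Designation": ['football player', np.nan, np.nan, np.nan, 'true']
}
data = pd.DataFrame(dataset)

print(data.dropna(axis='columns'))    # drop columns containing any null value
print(data.dropna(how='all'))         # drop only rows where every value is null
print(data.dropna(subset=['Name']))   # drop rows missing a value in 'Name'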

11. Pandas DataFrame – Iterating Over Rows and Columns

Sometimes you need to process all the data values of a DataFrame, and writing separate statements to access each individual value makes the process cumbersome. Pandas DataFrame supports iterating over rows and columns; let's see these with some examples.

To iterate over rows: df.iterrows()

To iterate over columns: df.items() (formerly iteritems(), which was removed in pandas 2.0)


# Pandas DataFrame Iterating over rows and columns
import pandas as pd
technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
              })
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses   Fee Duration
0    Spark  20000    30day
1  PySpark  25000   40days
2   Hadoop  26000   35days
3   Python  22000   40days
4   pandas  24000   60days
5   Oracle  21000   50days
6     Java  22000   55days

11.1 Iterating over rows

In order to iterate over rows, we can use the iterrows() and itertuples() methods (items() iterates over columns instead). We can apply iterrows() to get each row as an (index, Series) pair.


# Iterating over rows
import pandas as pd
technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
            })
df = pd.DataFrame(technologies)
for x, y in df.iterrows():
    print(x, y)
    print()

Yields below output.


# Output:
0 Courses     Spark
Fee        20000
Duration    30day
Name: 0, dtype: object

1 Courses     PySpark
Fee          25000
Duration     40days
Name: 1, dtype: object

2 Courses     Hadoop
Fee         26000
Duration    35days
Name: 2, dtype: object

3 Courses     Python
Fee         22000
Duration    40days
Name: 3, dtype: object

4 Courses     pandas
Fee         24000
Duration    60days
Name: 4, dtype: object

5 Courses     Oracle
Fee         21000
Duration    50days
Name: 5, dtype: object

6 Courses       Java
Fee         22000
Duration    55days
Name: 6, dtype: object
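
iterrows() returns each row as an (index, Series) pair. If you only need the values, itertuples() is usually faster and returns each row as a named tuple. A short sketch with a shortened version of the same DataFrame:


# Iterating over rows with itertuples() (illustrative sketch)
import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :[20000,25000,26000],
    'Duration':['30day', '40days', '35days']
}
df = pd.DataFrame(technologies)

# Each row is a named tuple; fields are accessed as attributes
for row in df.itertuples():
    print(row.Index, row.Courses, row.Fee, row.Duration)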

11.2 Iterating over columns

In order to iterate over columns, we create a list of the DataFrame column names and iterate through that list to pull out each column. In this example, we retrieve the fifth element (position 4) of each column.


# Iterating over columns
import pandas as pd
technologies = ({
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day', '40days' ,'35days', '40days', '60days', '50days', '55days']
          })
df = pd.DataFrame(technologies)
columns = list(df)
for i in columns:
    print (df[i][4])

Yields below output.


# Output:
pandas
24000
60days
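
You can also iterate over columns directly with items(), which yields each column as a (label, Series) pair instead of indexing by position. A minimal sketch with a shortened version of the same DataFrame:


# Iterating over columns with items() (illustrative sketch)
import pandas as pd

technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :[20000,25000,26000],
    'Duration':['30day', '40days', '35days']
}
df = pd.DataFrame(technologies)

# items() yields (column name, Series) pairs
for label, column in df.items():
    print(label, list(column))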

Reference

https://www.w3schools.com/python/pandas/pandas_dataframes.asp

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium