Pandas DataFrame Tutorial with Examples


1. Pandas DataFrame Tutorial Introduction

This is a beginner’s guide to the Python pandas DataFrame, where you will learn what a pandas DataFrame is, its features and advantages, and how to use DataFrames with sample examples.

Every sample example explained in this pandas tutorial has been tested in our development environment and is available for reference.

All pandas DataFrame examples provided in this tutorial are basic and easy to practice for beginners who want to learn pandas and advance their careers in Data Science, Analytics, and Machine Learning.

Note: If you can’t find the pandas DataFrame example you are looking for on this tutorial page, use the Search option from the menu bar; there are hundreds of pandas tutorials on this website to learn from.

2. What is Pandas DataFrame?

A pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: the data, the rows, and the columns. pandas is built on the NumPy library and is written in Python, Cython, and C.
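To make the three components concrete, here is a minimal sketch (the course names and fees are illustrative):


# A minimal example showing the three components: data, row labels, and column labels
import pandas as pd

df = pd.DataFrame(
    data=[["Spark", 20000], ["pandas", 25000]],  # data
    index=["r1", "r2"],                          # row labels
    columns=["Courses", "Fee"],                  # column labels
)
print(df)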

3. DataFrame Features

  • Supports named rows (index labels) as well as named columns.
  • Supports heterogeneous collections of data.
  • Has labeled axes (rows and columns).
  • Can perform arithmetic operations on rows and columns.
  • Supports reading flat files such as CSV, Excel, and JSON, as well as SQL tables.
  • Handles missing data.

4. Pandas DataFrame vs PySpark DataFrame

In very simple terms, pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a Machine Learning application with a large dataset, PySpark is the better choice, as it can perform operations many times (up to 100x) faster than pandas.

PySpark is also widely used in the Data Science and Machine Learning community, as many popular data science libraries, including NumPy and TensorFlow, are written in Python. PySpark is also valued for its efficient processing of large datasets and has been used by many organizations, such as Walmart, Trivago, Sanofi, and Runtastic.

PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes) or even on a single node.
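As a minimal sketch (assuming PySpark is installed, e.g. via pip install pyspark), an existing pandas DataFrame can be handed to Spark for distributed processing:


# A minimal sketch, assuming PySpark is installed (pip install pyspark)
import pandas as pd
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster you would point master at the cluster URL
spark = SparkSession.builder.appName("PandasVsPySpark").getOrCreate()

# Convert a pandas DataFrame into a distributed PySpark DataFrame
pdf = pd.DataFrame({"Courses": ["Spark", "pandas"], "Fee": [20000, 25000]})
sdf = spark.createDataFrame(pdf)
sdf.show()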

Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.

Spark was originally written in Scala; later, due to its industry adoption, its PySpark API was released for Python using Py4J. Py4J is a Java library integrated within PySpark that allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed, along with Python and Apache Spark.

Additionally, for development you can use the Anaconda distribution (widely used in the Machine Learning community), which comes with many useful tools such as the Spyder IDE and Jupyter Notebook to run PySpark applications.

You can learn PySpark from the following tutorials, and also read more about pandas vs PySpark differences with examples.

4.1 How to Decide Between Pandas vs PySpark

Below are a few considerations when choosing PySpark over pandas.

  • Your data is huge and grows significantly over the years, and you want to improve processing time.
  • You need fault tolerance.
  • You need ANSI SQL compatibility.
  • You have a choice of languages (Spark supports Python, Scala, Java & R).
  • You want machine-learning capability.
  • You would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
  • You want to stream data and process it in real time.

5. Installing Pandas

In this section of the pandas DataFrame tutorial, let’s see how to install and upgrade pandas. In order to run pandas, you should have Python installed first. You can install Python either by downloading it directly from python.org or by using the Anaconda distribution. Depending on your need, follow the links below to install Python, Anaconda, and Jupyter Notebook to run the pandas examples. I would recommend installing Anaconda with Jupyter if you intend to learn pandas for data science, analytics & machine learning.

Once you have either Python or Anaconda set up, you can install pandas on top of it in a few simple steps.

5.1 Install pandas using Python pip Command

pip (the Python package manager) is used to install third-party packages from PyPI. Using pip, you can install, uninstall, upgrade, or downgrade any Python library that is part of the Python Package Index.

Since the pandas package is available in PyPI (the Python Package Index), we can use pip to install the latest version of pandas on Windows.


# Install pandas using pip
pip install pandas

# Or use pip3 if pip on your system points to Python 2
pip3 install pandas

If your pip is not up to date, upgrade pip to the latest version first.


5.2 Install Pandas using Anaconda conda Command

The Anaconda distribution comes with the conda tool, which is used to install, upgrade, or downgrade most Python and other packages.


# Install pandas using conda
conda install pandas

6. Upgrade Pandas to Latest or Specific Version

In order to upgrade pandas to the latest or a specific version, you can use either the pip install command or, if you are using the Anaconda distribution, conda. Before you upgrade, check the currently installed version of pandas (for example, with pip show pandas or pd.__version__).


Below are the statements to upgrade pandas. Depending on how you want to update, use either the pip or conda statements.


# Using pip to upgrade pandas
pip install --upgrade pandas

# Alternatively you can also try
python -m pip install --upgrade pandas

# Upgrade pandas to specific version
pip install pandas==specific-higher-version

# Use conda update to upgrade to the latest version
conda update pandas

# Install a specific version with conda (note the single '=')
conda install pandas=0.14.0

If you use pip3 to upgrade, the output confirms the upgraded version.


7. Run Pandas Hello World Example

7.1 Run Pandas From Command Line

If you installed Anaconda, open the Anaconda command line; otherwise, open the Python shell/command prompt. Enter the following lines to get the version of pandas. To learn more, follow the links on the left-hand side of this pandas tutorial.


>>> import pandas as pd
>>> pd.__version__
'1.3.2'
>>>

7.2 Run Pandas From Jupyter

Go to Anaconda Navigator -> Environments -> your environment (I have created pandas-tutorial) -> select Open With Jupyter Notebook


This opens up Jupyter Notebook in the default browser.


Now select New -> Python 3, enter the lines below, and select Run.

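A sketch of what to run in the notebook cell (the same version check from section 7.1):


# Run in a Jupyter Notebook cell to verify pandas is available
import pandas as pd
pd.__version__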

7.3 Run Pandas from IDE

You can also run pandas from any Python IDE, such as Spyder, PyCharm, etc.

8. Create Pandas DataFrame Tutorial

In this section of the tutorial, I will explain different ways to create pandas DataFrame with examples.

8.1 Create using Constructor

One of the easiest ways to create a DataFrame is by using its constructor.


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days"], 
                 ["pandas",20000, "40days"], 
               ]
df=pd.DataFrame(technologies)
print(df)

Since we have not given labels for the columns and index, the DataFrame by default assigns incremental sequence numbers as labels to both rows and columns; these are called the Index.


# Output
        0      1       2
0   Spark  20000  30days
1  pandas  20000  40days

Column names with sequence numbers are hard to read, as it is difficult to identify what data each column holds; hence, it is always best practice to provide column names that identify the data. Use the columns param and the index param to provide column and row labels respectively to the DataFrame.


# Add Column & Row Labels to the DataFrame
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df=pd.DataFrame(technologies,columns=column_names,index=row_label)
print(df)

Yields below output.


# Output
  Courses    Fee Duration
a   Spark  20000   30days
b  pandas  20000   40days

By default, pandas infers the data types from the data and assigns them to the DataFrame. df.dtypes returns the data type of each column.


Courses     object
Fee          int64
Duration    object
dtype: object

You can also assign custom data types to columns.


# Set custom types to DataFrame
types={'Courses': str,'Fee':float,'Duration':str}
df=df.astype(types)

Another commonly used way to create a DataFrame is from a dictionary.


# Create DataFrame from Dictionary
technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :[20000,25000,26000],
    'Duration':['30day','40days','35days'],
    'Discount':[1000,2300,1500]
              }
df = pd.DataFrame(technologies)
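Printing this DataFrame yields the following output:


# Output
   Courses    Fee Duration  Discount
0    Spark  20000    30day      1000
1  PySpark  25000   40days      2300
2   Hadoop  26000   35days      1500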

8.2 Create DataFrame From CSV File

In real-world scenarios, we often need to read TEXT, CSV, and JSON files to create a DataFrame. In pandas, creating a DataFrame from a CSV file is pretty simple.


# Create DataFrame from CSV file
df = pd.read_csv('data_file.csv')

9. Pandas DataFrame Tutorial – Basic Operations

In order to explain some basic operations with pandas DataFrame let’s create it with some data.


# Create DataFrame with None/Null to work with examples
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas",None,"Spark","Python"],
    'Fee' :[22000,25000,23000,24000,np.nan,25000,25000,22000],
    'Duration':['30day','50days','55days','40days','60days','35day','','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
row_labels=['r0','r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies, index=row_labels)
print(df)

Note that our data contains np.nan, None, and empty string values. Also note that every column in a DataFrame is internally represented as a pandas Series.
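You can verify this by checking the type of a single column:


# Each column of a DataFrame is internally a pandas Series
print(type(df['Courses']))  # <class 'pandas.core.series.Series'>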

9.1 DataFrame properties

DataFrame has several properties; in this pandas DataFrame tutorial, I will cover the most used properties with examples.

Each entry below shows the method or property, its result for the df created above, and a description.

  • df.shape
    Result: (8, 4)
    Returns the shape of the DataFrame (number of rows, number of columns) as a tuple.

  • df.size
    Result: 32
    Returns the number of cells (rows * columns).

  • df.empty
    Result: False
    Returns a boolean; True when the DataFrame is empty.

  • df.columns
    Result: Index(['Courses', 'Fee', 'Duration', 'Discount'], dtype='object')
    Returns all column names as an Index object.

  • df.columns.values
    Result: ['Courses' 'Fee' 'Duration' 'Discount']
    Returns the column names as a NumPy array.

  • df.index
    Result: Index(['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'], dtype='object')
    Returns the index (row labels) of the DataFrame.

  • df.index.values
    Result: ['r0' 'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'r7']
    Returns the index as a NumPy array.

  • df.dtypes
    Result:
    Courses      object
    Fee         float64
    Duration     object
    Discount      int64
    dtype: object
    Returns the data type of each column.

  • df['Fee'] and df[['Fee','Duration']]
    Result (for df['Fee']):
    r0    22000.0
    r1    25000.0
    r2    23000.0
    r3    24000.0
    r4        NaN
    r5    25000.0
    r6    25000.0
    r7    22000.0
    Name: Fee, dtype: float64
    Selects a single column by name; pass a list of names to select multiple columns.

  • df2=df[df['Fee'] == 22000]
    Result:
       Courses      Fee Duration  Discount
    r0   Spark  22000.0    30day      1000
    r7  Python  22000.0   50days      1600
    Filters DataFrame rows by a condition.

  • df2=df[6:]
    Result:
       Courses      Fee Duration  Discount
    r6   Spark  25000.0               1400
    r7  Python  22000.0   50days      1600
    Selects rows by position, from position 6 onwards.

  • df['Duration'].iloc[3] or df["Duration"].values[3]
    Result: '40days'
    Gets a single cell value (row x column) of the DataFrame.

  • df['Fee'] = df['Fee'] - 500
    Result (df['Fee']):
    r0    21500.0
    r1    24500.0
    r2    22500.0
    r3    23500.0
    r4        NaN
    r5    24500.0
    r6    24500.0
    r7    21500.0
    Updates the 'Fee' column by subtracting 500 from every value.

  • df['new_column'] = ''
    Adds a new column with empty values.

10. Pandas Tutorial to Manipulate DataFrame

10.1 Describe DataFrame

describe() – The describe() function calculates the count, mean, std, min, max, and percentiles for each numeric column of a pandas DataFrame.


# Describe DataFrame for all numeric columns
df.describe()

# Output
                Fee     Discount
count      7.000000     8.000000
mean   23714.285714  1537.500000
std     1380.131119   570.557372
min    22000.000000  1000.000000
25%    22500.000000  1150.000000
50%    24000.000000  1350.000000
75%    25000.000000  1775.000000
max    25000.000000  2500.000000

10.2 Filter Rows from DataFrame

query()/loc[]/apply() – These are used to query a pandas DataFrame. You can also chain operators while filtering pandas rows.


# Using DataFrame.query()
df2 = df.query("Courses == 'Spark'")    # add inplace=True to update df itself
df2 = df.query("Courses != 'Spark'")
df2 = df.query("Courses in ('Spark','PySpark')")
# Backticks quote column names containing spaces (assumes a 'Courses Fee' column exists)
df2 = df.query("`Courses Fee` >= 23000 and `Courses Fee` <= 24000")

# Using DataFrame.loc[] ('value' and 'values' are placeholders for your own inputs)
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]

# Using apply() with a boolean mask (the lambda receives each column)
df.apply(lambda col: col[df['Courses'].isin(['Spark','PySpark'])])

# Other ways to filter 
df[df["Courses"] == 'Spark'] 
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]

10.3 Insert Rows & Columns to DataFrame

insert()/assign() – Adds a new column to the pandas DataFrame

By using assign() & insert() methods you can add one or multiple columns to the pandas DataFrame.


df = pd.DataFrame(technologies, index=row_labels)

# Adds new column 'TutorsAssigned' to DataFrame
tutors = ['William', 'Henry', 'Michael', 'John', 
          'Messi', 'Ramana','Kumar','Vasu']
df2 = df.assign(TutorsAssigned=tutors)

# Add new column from existing column
df2=df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)

# Other way to add a column
df["TutorsAssigned"] = tutors

# Add new column at the beginning
df.insert(0,'TutorsAssigned', tutors )

10.4 Rename DataFrame Columns

rename() – Renames pandas DataFrame columns

The pandas DataFrame.rename() method is used to change/replace column names (single or multiple columns), rename by index, or rename all columns of the DataFrame.


df = pd.DataFrame(technologies, index=row_labels)

# Assign new header by setting new column names (must match the number of columns)
df.columns=['A','B','C','D']

# Change column name by index. This changes the 3rd column
df.columns.values[2] = "C"

# Rename column names using the rename() method
df2 = df.rename({'Courses': 'A', 'Fee': 'B'}, axis=1)
df2 = df.rename({'Courses': 'A', 'Fee': 'B'}, axis='columns')
df2 = df.rename(columns={'Courses': 'A', 'Fee': 'B'})

# Rename columns inplace (modifies the DataFrame itself)
df.rename(columns={'Courses': 'A', 'Fee': 'B'}, inplace=True)

# Rename using a lambda function (strips the first character from each name)
df.rename(columns=lambda x: x[1:], inplace=True)

# Rename with error handling. When column 'x' is not present, errors="raise" throws a KeyError
df2 = df.rename(columns={'x': 'X'}, errors="raise")

10.5 Drop DataFrame Rows and Columns

drop() – The drop() method is used to drop rows and columns.

Below are some examples. To understand better, go through drop rows from pandas DataFrame with examples; dropping rows isn’t complete without learning how to drop rows by condition.


df = pd.DataFrame(technologies, index=row_labels)

# Drop rows by labels
df1 = df.drop(['r1','r2'])

# Delete Rows by position
df1=df.drop(df.index[[1,3]])

# Delete Rows by Index Range
df1=df.drop(df.index[2:])

# When you have a default (numeric) index for rows
df1 = df.drop(0)
df1 = df.drop([0, 3])
df1 = df.drop(range(0,2))

# Drop rows by condition (keeps only rows where Discount >= 1500)
df1 = df.loc[df["Discount"] >= 1500]

# DataFrame slicing
df2=df[4:]     # Returns rows from position 4 onwards
df2=df[1:-1]   # Removes the first and last rows
df2=df[2:4]    # Returns rows at positions 2 and 3

Now let’s see how to drop columns from a pandas DataFrame with examples. In order to drop columns, you have to use either axis=1 or the columns param of the drop() method.


df = pd.DataFrame(technologies, index=row_labels)

# Delete Column by Name
df2=df.drop(["Fee"], axis = 1)

# Drop by using labels & axis
df2=df.drop(labels=["Fee"], axis = 1)

# Drop by using columns
df2=df.drop(columns=["Fee"])

# Drop column by index
df2=df.drop(df.columns[[1]], axis = 1)

# Other ways to drop columns
df.drop(df.loc[:, 'Courses':'Fee'].columns, axis=1, inplace=True)
df.drop(df.iloc[:, 1:2].columns, axis=1, inplace=True)

If you want to drop duplicate rows from a pandas DataFrame, use DataFrame.drop_duplicates(), as sketched below.
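A short sketch of drop_duplicates():


# Drop duplicate rows, keeping the first occurrence
df2 = df.drop_duplicates()

# Consider only selected columns when identifying duplicates
df2 = df.drop_duplicates(subset=["Courses"], keep="first")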

11. Pandas Join, Merge, Concat to Combine DataFrames

In this section of the Python pandas tutorial, I will cover how to combine DataFrames using the join(), merge(), and concat() methods. All these methods support the join types below and work similarly to SQL joins.

Join types and the methods that support them:

  • inner – join(), merge(), and concat() – performs an inner join on pandas DataFrames
  • left – join() and merge() – performs a left join on pandas DataFrames
  • right – join() and merge() – performs a right join on pandas DataFrames
  • outer – join(), merge(), and concat() – performs an outer join on pandas DataFrames
  • cross – merge() – performs a cross join on pandas DataFrames

pandas.merge() and DataFrame.merge() work the same way and merge two or more DataFrames. When joining on columns, they ignore indexes; when joining on the index, the resulting DataFrame contains the indexes from the sources. When no params are used, the join happens on all common columns by default.


# Quick Examples of pandas merge DataFrames
# pandas.merge()
df3=pd.merge(df1,df2)

# DataFrame.merge()
df3=df1.merge(df2)

# Merge by column
df3=pd.merge(df1,df2, on='Courses')

# Merge using explicit left/right key columns (the names may differ between DataFrames)
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')

# Merge by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)

# Merge by multiple columns ('Col1'/'col2' are placeholder column names)
df3 = pd.merge(df3, df1, how='left', left_on=['Col1','col2'], right_on=['col1','col2'])

# Merge by left join
df3=pd.merge(df1,df2, on='Courses', how='left')

# Merge by right join
df3=pd.merge(df1,df2, on='Courses', how='right')

# Merge by outer join
df3=pd.merge(df1,df2, on='Courses', how='outer')

Alternatively, use join() for joining on the index. The pandas.DataFrame.join() method is an efficient way to join two pandas DataFrames on the row index.


# pandas default join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")

# pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')

# pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')

# pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')

# pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')

Similarly, pandas also supports concatenating two pandas DataFrames using the concat() method, sketched below.
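A minimal sketch of concat():


# Stack two DataFrames vertically (rows); ignore_index re-numbers the result
df3 = pd.concat([df1, df2], ignore_index=True)

# Combine two DataFrames side by side (columns), aligning on the index
df3 = pd.concat([df1, df2], axis=1)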

12. Iterate over Rows to perform an operation

A pandas DataFrame provides the methods iterrows() and itertuples() to iterate over rows. iterrows() returns (index, Series) pairs, where index is the row’s index and the Series holds the row’s data, while itertuples() returns an iterator of named tuples, one per row. itertuples() is faster than iterrows() and preserves data types.


df = pd.DataFrame(technologies, index=row_labels)

# Iterate all rows using DataFrame.iterrows()
for index, row in df.iterrows():
    print (index,row["Fee"], row["Courses"])

# Iterate all rows using DataFrame.itertuples()
for row in df.itertuples(index = True):
    print (getattr(row,'Index'),getattr(row, "Fee"), getattr(row, "Courses"))

# Using DataFrame.index
for idx in df.index:
     print(df['Fee'][idx], df['Courses'][idx])

13. Working with Null, np.NaN & Empty Values

It is very important to handle missing data in pandas before you perform any analytics or run machine learning algorithms. Below are a few commonly used techniques for handling None/NaN values in a pandas DataFrame.
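A short sketch of commonly used methods, using the df with missing values created in section 9:


# Count missing values (NaN/None) per column
print(df.isnull().sum())

# Drop rows that contain any missing value
df2 = df.dropna()

# Replace all missing values with a constant
df2 = df.fillna(0)

# Replace missing 'Fee' values with the column mean
df['Fee'] = df['Fee'].fillna(df['Fee'].mean())

# Note: empty strings ('') are not treated as missing by default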

14. Column Manipulations

One of the most used ways to manipulate columns is by applying the pandas apply() function to DataFrame columns. If you are familiar with lambda expressions, you can also use a lambda expression with apply().

If you are not familiar with lambda: a lambda function in Python is a small anonymous function that can take any number of arguments and evaluate an expression. Lambda expressions are used to construct anonymous functions; you can create one using the lambda keyword.
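A short sketch of apply() with a lambda, using the dictionary-based DataFrame from section 8.1:


# Apply a lambda to a single column (subtract 500 from every fee)
df['Fee'] = df['Fee'].apply(lambda x: x - 500)

# Apply a named function to a column
def to_upper(s):
    return s.upper()

df['Courses'] = df['Courses'].apply(to_upper)

# Apply a lambda across rows (axis=1) to derive a new column
df['Fee_With_Discount'] = df.apply(lambda row: row['Fee'] - row['Discount'], axis=1)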

DataFrame also provides several methods to manipulate data on columns.

15. Pandas Read & Write Excel

Use the pandas DataFrame.to_excel() function to write a DataFrame to an Excel sheet with the .xlsx extension, and use the pandas.read_excel() function to read an Excel sheet with the .xlsx extension into a pandas DataFrame (both rely on an Excel engine such as openpyxl being installed).

Read Excel sheet example


# Read Excel file
df = pd.read_excel('c:/apps/courses_schedule.xlsx')
print(df)

Write DataFrame to Excel sheet


# Write DataFrame to Excel file
df.to_excel('Courses.xlsx')


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium.