1. Pandas DataFrame Tutorial Introduction
This is a beginner’s guide to the Python pandas DataFrame, where you will learn what a pandas DataFrame is, its features and advantages, and how to use it, with sample examples.
Every sample example explained in this pandas tutorial is tested in our development environment and is available for reference.
All pandas DataFrame examples provided in this tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn pandas and advance their career in Data Science, analytics and Machine Learning.
Note: If you can’t find the pandas DataFrame example you are looking for on this tutorial page, I recommend using the Search option in the menu bar to find your tutorial and sample example code; there are hundreds of pandas tutorials on this website you can learn from.
2. What is Pandas DataFrame?
A pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: the data, the rows, and the columns. pandas is built on top of the NumPy library and is written in Python, Cython, and C.
3. DataFrame Features
- DataFrames support named columns, and you can also provide labels (names) to rows.
- Supports heterogeneous collections of data.
- Labeled axes (rows and columns).
- Can perform arithmetic operations on rows and columns.
- Supports reading flat files like CSV, Excel, and JSON, as well as reading SQL tables.
- Handles missing data.
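A minimal sketch illustrating the features above in one small DataFrame: labeled rows and columns, heterogeneous column types, and missing-data handling (the course data here is made up for illustration).

```python
import pandas as pd
import numpy as np

# Labeled rows and columns, mixed column types, and missing data in one frame
df = pd.DataFrame(
    {"Course": ["Spark", "pandas"], "Fee": [20000, np.nan]},
    index=["a", "b"],  # custom row labels
)
print(df.index.tolist())       # named rows: ['a', 'b']
print(df.dtypes.tolist())      # heterogeneous types: object and float64
print(df["Fee"].isna().sum())  # missing data is tracked: 1
```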
4. Pandas DataFrame vs PySpark DataFrame
In very simple words, pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a Machine Learning application with larger datasets, PySpark is the better choice, as it can perform operations many times (up to 100x) faster than pandas.
PySpark is also widely used in the Data Science and Machine Learning community, since many popular data science libraries, including NumPy and TensorFlow, are written in Python. PySpark is also favored for its efficient processing of large datasets. It has been used by many organizations, such as Walmart, Trivago, Sanofi, Runtastic, and many more.
PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark we can run applications parallelly on the distributed cluster (multiple nodes) or even on a single node.
Apache Spark is an analytical processing engine for large scale powerful distributed data processing and machine learning applications.
Spark was originally written in Scala; later, due to its industry adoption, its Python API, PySpark, was released using Py4J. Py4J is a Java library integrated within PySpark that allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.
Additionally, for development you can use the Anaconda distribution (widely used in the Machine Learning community), which comes with many useful tools like the Spyder IDE and Jupyter Notebook to run PySpark applications.
You can learn PySpark from the following tutorials. And also read more on pandas vs PySpark differences with Examples.
- What is PySpark
- PySpark RDD Tutorial
- PySpark DataFrame Tutorial
- How to Convert pandas DataFrame to PySpark DataFrame
- How to convert PySpark DataFrame to Pandas
4.1 How to Decide Between Pandas vs PySpark
Below are a few considerations when choosing PySpark over pandas.
- If your data is huge and grows significantly over the years, and you want to improve processing time.
- If you want fault tolerance.
- If you need ANSI SQL compatibility.
- Language choice (Spark supports Python, Scala, Java & R).
- When you want Machine Learning capability.
- If you would like to read from Parquet, Avro, Hive, Cassandra, Snowflake, etc.
- If you want to stream data and process it in real time.
5. Installing Pandas
In this section of the pandas DataFrame tutorial, let’s see how to install and upgrade pandas. To run pandas, you should have Python installed first. You can install Python either by downloading it directly from python.org or by using the Anaconda distribution. Depending on your need, follow the links below to install Python, Anaconda, and Jupyter Notebook to run the pandas examples. I recommend installing Anaconda with Jupyter if you intend to learn pandas for data science, analytics & machine learning.
- Step-by-Step Instructions to Install Anaconda & Pandas
- Run pandas from Anaconda & Jupyter Notebook
- Install Python & Run pandas from Windows
Once you have either Python or Anaconda setup, you can install pandas on top of Python or Anaconda in simple steps.
5.1 Install pandas using Python pip Command
pip (the Python package manager) is used to install third-party packages from PyPI. Using pip you can install, uninstall, upgrade, or downgrade any Python library that is part of the Python Package Index.
Since the pandas package is available in PyPI (the Python Package Index), we can use it to install the latest version of pandas on Windows.
# Install pandas using pip
pip install pandas
(or)
pip3 install pandas
If your pip is not up to date, upgrade pip to the latest version before installing pandas.
5.2 Install Pandas using Anaconda conda Command
The Anaconda distribution comes with the conda tool, which is used to install, upgrade, or downgrade most Python and other packages.
# Install pandas using conda
conda install pandas
6. Upgrade Pandas to Latest or Specific Version
To upgrade pandas to the latest or a specific version, use either the pip install command, or conda install if you are using the Anaconda distribution. Before you start the upgrade, it is worth checking which version of pandas is currently installed.
Below are the statements to upgrade pandas. Depending on how you installed it, use either the pip or conda statements.
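One quick way to check the currently installed version is to query pandas from Python itself (you can also run `pip show pandas` or `conda list pandas` from the command line):

```python
import pandas as pd

# Print the installed pandas version, e.g. '1.3.2'
print(pd.__version__)
```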
# Using pip to upgrade pandas
pip install --upgrade pandas
# Alternatively you can also try
python -m pip install --upgrade pandas
# Upgrade pandas to specific version
pip install pandas==specific-higher-version
# Use conda update
conda update pandas
# Upgrade/downgrade to a specific version (note: conda uses a single =)
conda install pandas=0.14.0
7. Run Pandas Hello World Example
7.1 Run Pandas From Command Line
If you installed Anaconda, open the Anaconda command line; otherwise, open the Python shell/command prompt. Enter the following lines to get the version of pandas. To learn more, follow the links from the left-hand side of the pandas tutorial.
>>> import pandas as pd
>>> pd.__version__
'1.3.2'
>>>
7.2 Run Pandas From Jupyter
Go to Anaconda Navigator -> Environments -> your environment (I have created pandas-tutorial) -> select Open With Jupyter Notebook.
This opens Jupyter Notebook in the default browser.
Now select New -> Python 3 (or your installed Python version), enter the version-check lines from the previous section, and select Run.
7.3 Run Pandas from IDE
You can also run pandas from any Python IDE, like Spyder, PyCharm, etc.
8. Create Pandas DataFrame Tutorial
In this section of the tutorial, I will explain different ways to create pandas DataFrame with examples.
8.1 Create using Constructor
One of the easiest ways to create a DataFrame is by using its constructor.
# Create pandas DataFrame from List
import pandas as pd
technologies = [["Spark", 20000, "30days"],
                ["pandas", 20000, "40days"]]
df=pd.DataFrame(technologies)
print(df)
Since we have not given labels to the columns and index, the DataFrame by default assigns incremental sequence numbers as labels to both rows and columns; these labels are called the Index.
# Output
0 1 2
0 Spark 20000 30days
1 pandas 20000 40days
Column names with sequence numbers don’t make much sense, as it’s hard to tell what data each column holds; hence, it is best practice to provide column names that describe the data. Use the columns and index params to provide column and row labels respectively to the DataFrame.
# Add Column & Row Labels to the DataFrame
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df=pd.DataFrame(technologies,columns=column_names,index=row_label)
print(df)
Yields below output.
# Output
Courses Fee Duration
a Spark 20000 30days
b pandas 20000 40days
By default, pandas infers the data types from the data and assigns them to the DataFrame. df.dtypes returns the data type of each column.
Courses object
Fee int64
Duration object
dtype: object
You can also assign custom data types to columns.
# Set custom types to DataFrame
types={'Courses': str,'Fee':float,'Duration':str}
df=df.astype(types)
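To verify the conversion took effect, check df.dtypes again. A self-contained sketch based on the DataFrame built above:

```python
import pandas as pd

technologies = [["Spark", 20000, "30days"],
                ["pandas", 20000, "40days"]]
df = pd.DataFrame(technologies, columns=["Courses", "Fee", "Duration"],
                  index=["a", "b"])

# Fee starts as int64; cast it to float (Courses/Duration stay object)
df = df.astype({"Courses": str, "Fee": float, "Duration": str})
print(df.dtypes)  # Fee is now float64
```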
Another common way to create a DataFrame is from a dictionary.
# Create DataFrame from Dictionary
technologies = {
'Courses':["Spark","PySpark","Hadoop"],
'Fee' :[20000,25000,26000],
'Duration':['30day','40days','35days'],
'Discount':[1000,2300,1500]
}
df = pd.DataFrame(technologies)
8.2 Create DataFrame From CSV File
In real-world scenarios, we are often required to read TEXT, CSV, and JSON files to create a DataFrame. In pandas, creating a DataFrame from a CSV file is pretty simple.
# Create DataFrame from CSV file
df = pd.read_csv('data_file.csv')
9. Pandas DataFrame Tutorial – Basic Operations
In order to explain some basic operations with pandas DataFrame let’s create it with some data.
# Create DataFrame with None/Null to work with examples
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas",None,"Spark","Python"],
    'Fee' :[22000,25000,23000,24000,np.nan,25000,25000,22000],
    'Duration':['30day','50days','55days','40days','60days','35day','','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
row_labels=['r0','r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies, index=row_labels)
print(df)
Note that our data contains np.nan, None, and empty string values. Also note that every column in a DataFrame is internally represented as a pandas Series.
9.1 DataFrame properties
A DataFrame has several properties; in this section of the pandas DataFrame tutorial I will cover the most used ones with examples.
Method/Property | Result | Description |
---|---|---|
df.shape | (8, 4) | Returns the shape of the DataFrame (number of rows, number of columns) as a tuple. |
df.size | 32 | Returns the number of cells (rows * columns). |
df.empty | False | Returns a boolean; True when the DataFrame is empty. |
df.columns | Index(['Courses', 'Fee', 'Duration', 'Discount'], dtype='object') | Returns all column names as an Index. |
df.columns.values | ['Courses' 'Fee' 'Duration' 'Discount'] | Returns the column names as a NumPy array. |
df.index | Index(['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'], dtype='object') | Returns the row Index of the DataFrame. |
df.index.values | ['r0' 'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'r7'] | Returns the row index as a NumPy array. |
df.dtypes | Courses object, Fee float64, Duration object, Discount int64 | Returns the data type of each column. |
df['Fee'], df[['Fee','Duration']] | r0 22000.0, r1 25000.0, r2 23000.0, r3 24000.0, r4 NaN, r5 25000.0, r6 25000.0, r7 22000.0 (Name: Fee, dtype: float64) | Select a single column by name (returns a Series); pass a list of names to select multiple columns (returns a DataFrame). |
df2=df[df['Fee'] == 22000] | r0 Spark 22000.0 30day 1000, r7 Python 22000.0 50days 1600 | Filter DataFrame rows by a condition. |
df2=df[6:] | r6 Spark 25000.0 '' 1400, r7 Python 22000.0 50days 1600 | Select rows by position, from position 6 onward. |
df['Duration'][3], df["Duration"].values[3] | 40days | Get a single cell value (row x column) of the DataFrame. |
df['Fee'] = df['Fee'] - 500 | r0 21500.0, r1 24500.0, r2 22500.0, r3 23500.0, r4 NaN, r5 24500.0, r6 24500.0, r7 21500.0 | Update a DataFrame column (subtracts 500 from 'Fee'). |
df['new_column'] = '' | | Add a new column with empty values. |
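The first few properties in the table can be verified with a short runnable sketch on the same data:

```python
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", None, "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, np.nan, 25000, 25000, 22000],
    'Duration': ['30day', '50days', '55days', '40days', '60days', '35day', '', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
}
df = pd.DataFrame(technologies, index=['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'])

print(df.shape)          # (8, 4) -> rows and columns
print(df.size)           # 32 -> number of cells
print(df.empty)          # False
print(list(df.columns))  # ['Courses', 'Fee', 'Duration', 'Discount']
```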
10. Pandas Tutorial to Manipulate DataFrame
10.1 Describe DataFrame
describe() – calculates the count, mean, std, min, max, and percentiles of each numeric column of the pandas DataFrame.
# Describe DataFrame for all numeric columns
df.describe()
# Output
Fee Discount
count 7.000000 8.000000
mean 23714.285714 1537.500000
std 1380.131119 570.557372
min 22000.000000 1000.000000
25% 22500.000000 1150.000000
50% 24000.000000 1350.000000
75% 25000.000000 1775.000000
max 25000.000000 2500.000000
10.2 Filter Rows from DataFrame
query() / loc[] / apply() – These are used to query a pandas DataFrame. You can also chain operators while filtering pandas rows.
- pandas.DataFrame.filter() – To filter rows by index and columns by name.
- pandas.DataFrame.loc[] – To select rows by index label and columns by name.
- pandas.DataFrame.iloc[] – To select rows by index and column by position.
- pandas.DataFrame.apply() – To custom select using lambda function.
# Using DataFrame.query()
df.query("Courses == 'Spark'",inplace=True)
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("Fee >= 23000 and Fee <= 24000")  # use backticks for names with spaces, e.g. `Courses Fee`
# Using DataFrame.loc[]
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
# Using apply()
df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])
# Other ways to filter
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
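One caveat worth noting when using str.contains() on this data: because the Courses column contains a None, str.contains() returns NaN for that row, and boolean indexing then raises an error unless you pass na=False to treat missing values as non-matches. A self-contained sketch:

```python
import pandas as pd
import numpy as np

technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", None, "Spark", "Python"],
    'Fee': [22000, 25000, 23000, 24000, np.nan, 25000, 25000, 22000],
    'Duration': ['30day', '50days', '55days', '40days', '60days', '35day', '', '50days'],
    'Discount': [1000, 2300, 1000, 1200, 2500, 1300, 1400, 1600]
}
df = pd.DataFrame(technologies, index=['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'])

# Exact match: the None row simply compares as False
spark_rows = df[df['Courses'] == 'Spark']
print(spark_rows.index.tolist())  # ['r0', 'r6']

# Substring match: na=False treats the missing value as a non-match
contains = df[df['Courses'].str.contains("Spark", na=False)]
print(contains.index.tolist())    # ['r0', 'r1', 'r6'] (PySpark matches too)
```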
10.3 Insert Rows & Columns to DataFrame
insert() / assign() – Add a new column to the pandas DataFrame. By using the assign() and insert() methods you can add one or multiple columns to the pandas DataFrame.
df = pd.DataFrame(technologies, index=row_labels)
# Adds new column 'TutorsAssigned' to DataFrame
tutors = ['William', 'Henry', 'Michael', 'John',
'Messi', 'Ramana','Kumar','Vasu']
df2 = df.assign(TutorsAssigned=tutors)
# Add new column from existing column
df2=df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)
# Other way to add a column
df["TutorsAssigned"] = tutors
# Add new column at the beginning
df.insert(0,'TutorsAssigned', tutors )
10.4 Rename DataFrame Columns
rename() – Renames pandas DataFrame columns. The pandas DataFrame.rename() method is used to change single or multiple column names, rename by index, or rename all columns of the DataFrame.
df = pd.DataFrame(technologies, index=row_labels)
# Assign new header by setting new column names.
df.columns=['A','B','C','D']

# Change column name by index. This changes the 3rd column
df.columns.values[2] = "C"
# Rename Column Names using rename() method
df2 = df.rename({'a': 'A', 'b': 'B'}, axis=1)
df2 = df.rename({'a': 'A', 'b': 'B'}, axis='columns')
df2 = df.rename(columns={'a': 'A', 'b': 'B'})
# Rename columns inplace (self DataFrame)
df.rename(columns={'a': 'A', 'b': 'B'}, inplace = True)
# Rename using lambda function
df.rename(columns=lambda x: x[1:], inplace=True)
# Rename with error. When 'x' is not present, it throws an error.
df.rename(columns = {'x':'X'}, errors = "raise")
10.5 Drop DataFrame Rows and Columns
drop() – the drop method is used to drop rows and columns.
Below are some examples. To understand better, go through drop rows from pandas DataFrame with examples; dropping rows isn’t complete without learning how to drop rows by condition.
df = pd.DataFrame(technologies, index=row_labels)
# Drop rows by labels
df1 = df.drop(['r1','r2'])
# Delete Rows by position
df1=df.drop(df.index[[1,3]])
# Delete Rows by Index Range
df1=df.drop(df.index[2:])
# When rows have the default numeric indexes
df1 = df.drop(0)
df1 = df.drop([0, 3])
df1 = df.drop(range(0,2))
# Keep only rows matching a condition (drops all others)
df1 = df.loc[df["Discount"] >= 1500]
# DataFrame slicing
df2=df[4:] # Returns rows from 4th row
df2=df[1:-1] # Removes first and last row
df2=df[2:4] # Returns rows at positions 2 and 3
Now let’s see how to drop columns from a pandas DataFrame with examples. In order to drop columns, pass either axis=1 or the columns param to the drop() method.
df = pd.DataFrame(technologies, index=row_labels)
# Delete Column by Name
df2=df.drop(["Fee"], axis = 1)
# Drop by using labels & axis
df2=df.drop(labels=["Fee"], axis = 1)
# Drop by using columns
df2=df.drop(columns=["Fee"])
# Drop column by index
df2=df.drop(df.columns[[1]], axis = 1)
# Other ways to drop columns
df.drop(df.loc[:, 'Courses':'Fee'].columns, axis=1, inplace=True)
df.drop(df.iloc[:, 1:2], axis=1, inplace=True)
If you wanted to drop duplicate rows from pandas DataFrame use DataFrame.drop_duplicates()
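A minimal sketch of drop_duplicates() (the sample data here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Courses': ['Spark', 'Spark', 'pandas'],
                   'Fee': [20000, 20000, 25000]})

# Drop rows that are duplicates across all columns (keeps the first occurrence)
df2 = df.drop_duplicates()
print(len(df2))  # 2

# Or deduplicate based only on a subset of columns
df3 = df.drop_duplicates(subset=['Courses'])
print(len(df3))  # 2
```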
11. Pandas Join, Merge, Concat to Combine DataFrames
In this section of the Python pandas tutorial, I will cover how to combine DataFrames using the join(), merge(), and concat() methods. All these methods support the join types below and work similarly to SQL joins.
Join Types | Supported By | Description |
---|---|---|
inner | join(), merge() and concat() | Performs Inner Join on pandas DataFrames |
left | join(), merge() | Performs Left Join on pandas DataFrames |
right | join(), merge() | Performs Right Join on pandas DataFrames |
outer | join(), merge() and concat() | Performs Outer Join on pandas DataFrames |
cross | merge() | Performs Cross Join on pandas DataFrames |
pandas.merge() and DataFrame.merge() work the same way and merge two or more DataFrames. When joining on columns, indexes are ignored. When joining on the index, the resulting DataFrame keeps the indexes from the source DataFrames. When no params are given, the join happens on all common columns by default.
# Quick Examples of pandas merge DataFrames
# pandas.merge()
df3=pd.merge(df1,df2)
# DataFrame.merge()
df3=df1.merge(df2)
# Merge by column
df3=pd.merge(df1,df2, on='Courses')
# Merge on different column names using left_on/right_on
df3=pd.merge(df1,df2, left_on='Courses', right_on='Courses')
# Merge by Index
df3 = pd.merge(df1,df2,left_index=True,right_index=True)
# Merge by multiple columns
df3 = pd.merge(df3, df1, how='left', left_on=['Col1','col2'], right_on = ['col1','col2'])
# Merge by left join
df3=pd.merge(df1,df2, on='Courses', how='left')
# Merge by right join
df3=pd.merge(df1,df2, on='Courses', how='right')
# Merge by outer join
df3=pd.merge(df1,df2, on='Courses', how='outer')
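Since df1 and df2 are not defined in the quick examples above, here is a self-contained sketch (the course data is made up) showing how inner and left merges behave:

```python
import pandas as pd

df1 = pd.DataFrame({'Courses': ['Spark', 'PySpark', 'Hadoop'],
                    'Fee': [20000, 25000, 26000]})
df2 = pd.DataFrame({'Courses': ['Spark', 'Hadoop', 'Python'],
                    'Discount': [1000, 1500, 800]})

# Inner merge keeps only courses present in both DataFrames
inner = pd.merge(df1, df2, on='Courses')
print(inner['Courses'].tolist())  # ['Spark', 'Hadoop']

# Left merge keeps all df1 rows; unmatched rows get NaN for Discount
left = pd.merge(df1, df2, on='Courses', how='left')
print(len(left))  # 3
```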
Alternatively, use join() for joining on the index. The pandas.DataFrame.join() method is an efficient way to join two pandas DataFrames on the row index.
# pandas default join
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
# pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
# pandas Right join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='right')
# pandas outer join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='outer')
# pandas join on columns
df3=df1.set_index('Courses').join(df2.set_index('Courses'), how='inner')
Similarly, pandas also supports concatenating two or more pandas DataFrames using the concat() method.
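A short sketch of concat() (sample data made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'Courses': ['Spark'], 'Fee': [20000]})
df2 = pd.DataFrame({'Courses': ['pandas'], 'Fee': [25000]})

# Stack rows; ignore_index=True rebuilds a fresh 0..n-1 index
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3['Courses'].tolist())  # ['Spark', 'pandas']
```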
12. Iterate over Rows to perform an operation
A pandas DataFrame provides the methods iterrows() and itertuples() to iterate over rows. iterrows() returns (index, Series) pairs, where index is the row label and the Series holds the contents of that row; itertuples() returns each row of the DataFrame as a named tuple. itertuples() is faster than iterrows() and preserves data types.
df = pd.DataFrame(technologies, index=row_labels)
# Iterate all rows using DataFrame.iterrows()
for index, row in df.iterrows():
    print(index, row["Fee"], row["Courses"])

# Iterate all rows using DataFrame.itertuples()
for row in df.itertuples(index=True):
    print(getattr(row, 'Index'), getattr(row, "Fee"), getattr(row, "Courses"))

# Using DataFrame.index
for idx in df.index:
    print(df['Fee'][idx], df['Courses'][idx])
13. Working with Null, np.NaN & Empty Values
Below are some of the articles I have written on handling None/NaN values in a pandas DataFrame. It is very important to handle missing data in pandas before you perform any analytics or run machine learning algorithms.
- How to replace None & NaN values with Blank or Empty String in pandas DataFrame
- How to replace None & NaN values with zero (0) in pandas DataFrame
- Check If any Value is NaN in pandas DataFrame
- Drop Rows with NaN Values in pandas DataFrame
- Drop Columns with NaN Values in pandas DataFrame
- Drop Infinite Values From DataFrame
- How to Add an Empty Column to a pandas DataFrame
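As a taste of what those articles cover, here is a minimal sketch using fillna() and dropna() (the sample data is made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Courses': ['Spark', None], 'Fee': [20000, np.nan]})

# Replace None/NaN with a fixed value
filled = df.fillna(0)
print(filled)

# Or replace with an empty string
print(df.fillna(''))

# Drop any row containing a missing value
print(len(df.dropna()))  # 1
```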
14. Column Manipulations
One of the most common ways to manipulate columns is by applying the pandas apply() function to DataFrame columns. If you are familiar with lambda expressions, you can also use a lambda with apply().
If you are not familiar with lambdas: a lambda function in Python is a small anonymous function that can take any number of arguments and evaluates a single expression. You create one using the lambda keyword.
DataFrame also provides several methods to manipulate data on columns.
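For example, a sketch of column manipulation with apply() and lambdas (the sample data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Fee': [20000, 25000], 'Discount': [1000, 2300]})

# Apply a lambda to a single column
df['Fee'] = df['Fee'].apply(lambda x: x - 500)
print(df['Fee'].tolist())  # [19500, 24500]

# Derive a new column from existing ones, row by row (axis=1)
df['Net'] = df.apply(lambda row: row['Fee'] - row['Discount'], axis=1)
print(df['Net'].tolist())  # [18500, 22200]
```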
15. Pandas Read & Write Excel
Use the pandas DataFrame.to_excel() function to write a DataFrame to an Excel sheet with the .xlsx extension, and the pandas.read_excel() function to read an Excel sheet into a pandas DataFrame.
Read excel sheet Example
# Read Excel file
df = pd.read_excel('c:/apps/courses_schedule.xlsx')
print(df)
Write DataFrame to excel sheet
# Write DataFrame to Excel file
df.to_excel('Courses.xlsx')