pandas DataFrame Tutorial | Beginners Guide

1. pandas DataFrame Tutorial Introduction

This is a beginner's guide to the Python pandas DataFrame, where you will learn what a pandas DataFrame is, its features and advantages, and how to use DataFrames with sample examples.

Every sample example explained in this tutorial has been tested in our development environment and is available for reference.

All pandas DataFrame examples provided in this tutorial are basic, simple, and easy to practice for beginners who are enthusiastic about learning pandas and advancing their careers in Data Science, analytics, and Machine Learning.

Note: If you can't find the pandas DataFrame example you are looking for on this tutorial page, I would recommend using the Search option from the menu bar to find your tutorial and sample example code; there are hundreds of pandas tutorials on this website you can learn from.

2. What is pandas DataFrame?

pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: the data, the rows, and the columns.

3. DataFrame Features

  • DataFrames support named rows & columns (you can also provide labels for rows).
  • Supports heterogeneous collections of data.
  • DataFrame has labeled axes (rows and columns).
  • Can perform arithmetic operations on rows and columns.
  • Supports reading flat files like CSV, Excel, and JSON, as well as SQL tables.
  • Handles missing data (see the short sketch after this list).
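
A quick sketch showing a few of these features together: labeled rows and columns, heterogeneous column types, and a missing value.


# A small DataFrame with labeled axes, mixed types, and a missing value
import pandas as pd
import numpy as np

df = pd.DataFrame({'Courses': ["Spark", "pandas"],
                   'Fee': [20000, np.nan]},
                  index=['a', 'b'])
print(df)

# Output
#   Courses      Fee
# a   Spark  20000.0
# b  pandas      NaN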

4. Pandas DataFrame vs PySpark DataFrame

In very simple words, pandas runs operations on a single machine whereas PySpark runs on multiple machines. If you are working on a Machine Learning application dealing with larger datasets, PySpark is the better choice, as it can process operations many times (up to 100x) faster than pandas.

PySpark is also widely used in the Data Science and Machine Learning community, as there are many widely used data science libraries written in Python, including NumPy and TensorFlow. PySpark is also used for its efficient processing of large datasets. It has been adopted by many organizations, such as Walmart, Trivago, Sanofi, Runtastic, and many more.

PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes) or even on a single node.

Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.

Spark was originally written in Scala, and later, due to its industry adoption, its Python API, PySpark, was released using Py4J. Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.

Additionally, for development you can use the Anaconda distribution (widely used in the Machine Learning community), which comes with a lot of useful tools like the Spyder IDE and Jupyter Notebook to run PySpark applications.

You can learn PySpark from the following tutorials, and also read more on pandas vs PySpark differences with examples.

4.1 How to Decide Between Pandas vs PySpark

Below are a few considerations for choosing PySpark over pandas; a short sketch of how the two interoperate follows the list.

  • If your data is huge and grows significantly over the years, and you want to improve processing time.
  • If you want fault tolerance.
  • If you need ANSI SQL compatibility.
  • Language choice (Spark supports Python, Scala, Java & R).
  • When you want machine-learning capability.
  • If you would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
  • If you want to stream data and process it in real time.
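
If you start with pandas and later need Spark's scale, the two interoperate well. Below is a minimal sketch of converting between them; it assumes PySpark is installed and uses a local Spark session (the app name is arbitrary).


# Convert between pandas and PySpark DataFrames
from pyspark.sql import SparkSession
import pandas as pd

# Create a local Spark session
spark = SparkSession.builder.master("local[*]").appName("pandas-example").getOrCreate()

pandas_df = pd.DataFrame({'Courses': ["Spark", "PySpark"], 'Fee': [20000, 25000]})

# Distribute a pandas DataFrame as a PySpark DataFrame
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()

# Bring it back to pandas; this collects all rows to the driver,
# so use it with care on large datasets
pandas_again = spark_df.toPandas()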

5. Installing pandas

In this section of the pandas DataFrame tutorial, let's see how to install & upgrade pandas. In order to run pandas, you should have Python installed first. You can install Python either by downloading it directly from python.org or by using the Anaconda distribution. Depending on your need, follow the below links to install Python, Anaconda, and Jupyter Notebook to run the pandas examples. I would recommend installing Anaconda with Jupyter as a good choice if you intend to learn pandas for data science, analytics & machine learning.

Once you have either Python or Anaconda setup, you can install pandas on top of Python or Anaconda in simple steps.

5.1 Install pandas using Python pip Command

pip (the Python package manager) is used to install third-party packages from PyPI. Using pip, you can install/uninstall/upgrade/downgrade any Python library that is part of the Python Package Index.

Since the pandas package is available in PyPI (the Python Package Index), we can use pip to install the latest version of pandas on Windows.


# Install pandas using pip
pip install pandas

# or
pip3 install pandas

This installs the latest version of pandas. If your pip is not up to date, first upgrade pip to the latest version and run the install again.


5.2 Install pandas using Anaconda conda Command

The Anaconda distribution comes with the conda tool, which is used to install/upgrade/downgrade most Python and other packages.


# Install pandas using conda
conda install pandas

6. Upgrade pandas to Latest or Specific Version

In order to upgrade pandas to the latest or a specific version, you can use either the pip install command or the conda install command if you are using the Anaconda distribution. Before you start the upgrade, run one of the following commands to find the current version of pandas installed.

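
# Check the currently installed pandas version
pip show pandas

# Or from Python
python -c "import pandas as pd; print(pd.__version__)"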

Below are the statements to upgrade pandas. Depending on how you want to update, use either the pip or conda statements.


# Using pip to upgrade pandas
pip install --upgrade pandas

# Alternatively you can also try
python -m pip install --upgrade pandas

# Upgrade pandas to specific version
pip install pandas==specific-higher-version

# Use conda update
conda update pandas

# Install a specific version with conda (note the single '=')
conda install pandas=0.14.0


7. Run pandas Hello World Example

7.1 Run pandas From Command Line

If you installed Anaconda, open the Anaconda command line; otherwise, open the Python shell/command prompt. Enter the following lines to get the version of pandas. To learn more, follow the links from the left-hand side of the pandas tutorial.


>>> import pandas as pd
>>> pd.__version__
'1.3.2'
>>>

7.2 Run pandas From Jupyter

Go to Anaconda Navigator -> Environments -> your environment (I have created pandas-tutorial) -> select Open With Jupyter Notebook


This opens up Jupyter Notebook in the default browser.


Now select New -> Python 3 (or your Python version), enter the below lines, and select Run.

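For example, the same version check from section 7.1 works in a notebook cell:


# Run in a Jupyter Notebook cell
import pandas as pd
pd.__version__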

7.3 Run pandas from IDE

You can also run pandas from any Python IDE, such as Spyder, PyCharm, etc.

8. Create pandas DataFrame

8.1 Create DataFrame Using Constructor

One of the easiest ways to create a DataFrame is by using its constructor.


# Create pandas DataFrame from List
import pandas as pd
technologies = [ ["Spark",20000, "30days"], 
                 ["pandas",20000, "40days"], 
               ]
df=pd.DataFrame(technologies)
print(df)

Since we are not giving labels to columns and rows (index), the DataFrame by default assigns incremental sequence numbers as labels to both rows and columns.


        0      1       2
0   Spark  20000  30days
1  pandas  20000  40days

Column names with sequence numbers don't make sense, as it's hard to identify what data each column holds; hence, it is always best practice to provide column names that identify the data. Use the columns param and index param to provide column & row labels respectively to the DataFrame.


# Add Column & Row Labels to the DataFrame
column_names=["Courses","Fee","Duration"]
row_label=["a","b"]
df=pd.DataFrame(technologies,columns=column_names,index=row_label)
print(df)

Yields below output.


  Courses    Fee Duration
a   Spark  20000   30days
b  pandas  20000   40days

By default, pandas infers the data types from the data and assigns them to the DataFrame. df.dtypes returns the data type of each column.


Courses     object
Fee          int64
Duration    object
dtype: object

You can also assign custom data types to columns.


# set custom types to DataFrame
types={'Courses': str,'Fee':float,'Duration':str}
df=df.astype(types)
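
Running df.dtypes again confirms the cast; Fee is now float64 while the string columns remain object:


# df.dtypes after the cast
Courses      object
Fee         float64
Duration     object
dtype: object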

Another commonly used way to create a DataFrame is from a dictionary.


# Create DataFrame from Dictionary
technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :[20000,25000,26000],
    'Duration':['30day','40days','35days'],
    'Discount':[1000,2300,1500]
              }
df = pd.DataFrame(technologies)

8.2 Create DataFrame From CSV File

In real-time projects, we are often required to read TEXT, CSV, and JSON files to create a DataFrame. In pandas, creating a DataFrame from a CSV file is pretty simple.


# Create DataFrame from CSV file
df = pd.read_csv('data_file.csv')
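
read_csv() takes many optional parameters that control parsing; a few commonly used ones are shown below (the file name is just an example):


# Commonly used read_csv options
df = pd.read_csv('data_file.csv',
                 sep=',',        # field delimiter (default is ',')
                 header=0,       # row number to use for column names
                 index_col=0,    # column to use as row labels
                 nrows=100)      # read only the first 100 rows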

9. Basic Operations of pandas DataFrame

In order to explain some basic operations with pandas DataFrame, let's create one with some data.


# Create DataFrame with None/Null to work with examples
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas",None,"Spark","Python"],
    'Fee' :[22000,25000,23000,24000,np.nan,25000,25000,22000],
    'Duration':['30day','50days','55days','40days','60days','35day','','50days'],
    'Discount':[1000,2300,1000,1200,2500,1300,1400,1600]
}
row_labels=['r0','r1','r2','r3','r4','r5','r6','r7']
df = pd.DataFrame(technologies, index=row_labels)
print(df)

Note that our data contains np.nan, None, and empty string values.

9.1 DataFrame properties

DataFrame has several properties; in this pandas DataFrame tutorial, I will cover the most used ones with examples.

Each entry below shows the method/property, the result it produces on the DataFrame created above, and a description.

  • df.shape – (8, 4) – Returns the shape of the DataFrame (number of rows, number of columns) as a tuple.
  • df.size – 32 – Returns the number of cells (rows * columns).
  • df.empty – False – Returns a boolean; True when the DataFrame is empty.
  • df.columns – Index(['Courses', 'Fee', 'Duration', 'Discount'], dtype='object') – Returns all column names as an Index.
  • df.columns.values – ['Courses' 'Fee' 'Duration' 'Discount'] – Returns the column names from the header as an array.
  • df.index – Index(['r0', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'], dtype='object') – Returns all row labels as an Index.
  • df.index.values – ['r0' 'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'r7'] – Returns the row labels as an array.
  • df.dtypes – Courses object, Fee float64, Duration object, Discount int64 – Returns the data type of each column (Fee is float64 because it contains NaN).
  • df2=df['Fee'] or df2=df[['Fee']] – the 'Fee' column (r0 22000.0, r1 25000.0, r2 23000.0, r3 24000.0, r4 NaN, r5 25000.0, r6 25000.0, r7 22000.0) – Selects columns by name; single brackets return a Series, double brackets return a DataFrame and can also select multiple columns.
  • df2=df[df['Fee'] == 22000] – rows r0 (Spark) and r7 (Python) – Filters DataFrame rows by a condition.
  • df2=df[6:] – rows r6 and r7 – Selects rows by position, from the 6th index onwards.
  • df['Duration'][3] or df["Duration"].values[3] – '40days' – Gets a cell value (row x column) of the DataFrame.
  • df['Fee'] = df['Fee'] - 500 – r0 21500.0, r1 24500.0, r2 22500.0, r3 23500.0, r4 NaN, r5 24500.0, r6 24500.0, r7 21500.0 – Updates a DataFrame column; subtracts 500 from the 'Fee' column.

9.2 Select loc[], iloc[], at[], iat[]
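
loc[] selects rows and columns by their labels, iloc[] selects them by integer position, and at[]/iat[] access a single cell by label or by position. Below is a minimal sketch using the DataFrame created above.


df = pd.DataFrame(technologies, index=row_labels)

# Select by labels with loc[]
print(df.loc['r1'])                              # single row as a Series
print(df.loc[['r1','r2']])                       # multiple rows as a DataFrame
print(df.loc['r1':'r3', 'Courses':'Duration'])   # label slices are inclusive

# Select by position with iloc[]
print(df.iloc[1])                                # second row
print(df.iloc[0:3, 0:2])                         # first 3 rows, first 2 columns

# Access a single cell
print(df.at['r1','Fee'])                         # by labels
print(df.iat[1, 1])                              # by positions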

9.3 Manipulate DataFrame Methods

describe() – The describe() function calculates the count, mean, std, min, max, and percentiles of each numeric column of the pandas DataFrame.


# Describe DataFrame for all numeric columns
df.describe()

# Output
                Fee     Discount
count      7.000000     8.000000
mean   23714.285714  1537.500000
std     1380.131119   570.557372
min    22000.000000  1000.000000
25%    22500.000000  1150.000000
50%    24000.000000  1350.000000
75%    25000.000000  1775.000000
max    25000.000000  2500.000000

query()/apply()/loc[] – These are used to filter rows from a pandas DataFrame. You can also chain operators while filtering pandas rows.


# Using DataFrame.query()
df.query("Courses == 'Spark'", inplace=True)   # inplace=True filters df itself
df.query("Courses != 'Spark'")
df.query("Courses in ('Spark','PySpark')")
df.query("Fee >= 23000 and Fee <= 24000")

# Using DataFrame.loc[]
# (`value` and `values` below are placeholders for a scalar and a list)
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000)]

# Using apply() with a boolean mask
df.apply(lambda row: row[df['Courses'].isin(['Spark','PySpark'])])

# Other ways to filter
# (na=False excludes the None value in 'Courses' from string matching)
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark", na=False)]
df[df['Courses'].str.lower().str.contains("spark", na=False)]
df[df['Courses'].str.startswith("P", na=False)]

insert()/assign() – Adds a new column to the pandas DataFrame

By using assign() & insert() methods you can add one or multiple columns to the pandas DataFrame.


df = pd.DataFrame(technologies, index=row_labels)

# Adds new column 'TutorsAssigned' to DataFrame
tutors = ['William', 'Henry', 'Michael', 'John',
          'Messi', 'Ramana', 'Kumar', 'Vasu']
df2 = df.assign(TutorsAssigned=tutors)

# Add a new column derived from existing columns
df2 = df.assign(Discount_Percent=lambda x: x.Fee * x.Discount / 100)

# Another way to add a column
df["TutorsAssigned"] = tutors

# Add a new column at the beginning
# (insert() errors if the column already exists, so recreate df first)
df = pd.DataFrame(technologies, index=row_labels)
df.insert(0, 'TutorsAssigned', tutors)

rename() – Rename DataFrame columns

The pandas DataFrame.rename() method is used to change/replace column names: a single column, multiple columns, by index, or all columns of the DataFrame.


df = pd.DataFrame(technologies, index=row_labels)

# Assign a new header by setting new column names
# (the list length must match the number of columns)
df.columns = ['a', 'b', 'c', 'd']

# Change a column name by index. This changes the 3rd column
df.columns.values[2] = "c2"

# Rename column names using the rename() method
df2 = df.rename({'a': 'A', 'b': 'B'}, axis=1)
df2 = df.rename({'a': 'A', 'b': 'B'}, axis='columns')
df2 = df.rename(columns={'a': 'A', 'b': 'B'})

# Rename columns in place (modifies df itself)
df.rename(columns={'a': 'A', 'b': 'B'}, inplace=True)

# Rename using a lambda function (here, strip the first character)
df.rename(columns=lambda x: x[1:], inplace=True)

# Rename with errors="raise". When 'x' is not present, it throws a KeyError
df.rename(columns={'x': 'X'}, errors="raise")

drop() – The drop() method is used to remove rows and columns.

Below are some examples. To understand better, go through drop rows from pandas DataFrame with examples.


df = pd.DataFrame(technologies, index=row_labels)

# Drop rows by labels
df1 = df.drop(['r1','r2'])

# Delete rows by position
df1 = df.drop(df.index[[1,3]])

# Delete rows by index range
df1 = df.drop(df.index[2:])

# When rows have the default numeric index (RangeIndex)
df1 = df.drop(0)
df1 = df.drop([0, 3])
df1 = df.drop(range(0,2))

# Drop rows by condition (keeps rows where Discount >= 1500)
df1 = df.loc[df["Discount"] >= 1500]

# DataFrame slicing
df2 = df[4:]     # Returns rows from position 4 onwards
df2 = df[1:-1]   # Removes the first and last row
df2 = df[2:4]    # Returns rows at positions 2 and 3

Now let's see how to drop single & multiple columns with examples. In order to drop columns, you have to use either axis=1 or the columns param with the drop() method.


df = pd.DataFrame(technologies, index=row_labels)

# Delete column by name
df2 = df.drop(["Fee"], axis=1)

# Drop by using labels & axis
df2 = df.drop(labels=["Fee"], axis=1)

# Drop by using the columns param
df2 = df.drop(columns=["Fee"])

# Drop column by index
df2 = df.drop(df.columns[[1]], axis=1)

# Other ways to drop columns
df.drop(df.loc[:, 'Courses':'Fee'].columns, axis=1, inplace=True)
df.drop(df.iloc[:, 1:2], axis=1, inplace=True)  # iterating a DataFrame yields its column names

10. Iterate over Rows to perform an operation

pandas DataFrame provides the methods iterrows() and itertuples() to iterate over rows. iterrows() returns (index, Series) pairs, where index is the index of the row and the Series holds the data or content of the row, while itertuples() returns all DataFrame elements as an iterator containing a tuple for each row. itertuples() is faster than iterrows() and preserves data types.


df = pd.DataFrame(technologies, index=row_labels)

# Iterate all rows using DataFrame.iterrows()
for index, row in df.iterrows():
    print(index, row["Fee"], row["Courses"])

# Iterate all rows using DataFrame.itertuples()
for row in df.itertuples(index=True):
    print(getattr(row, 'Index'), getattr(row, "Fee"), getattr(row, "Courses"))

# Using DataFrame.index
for idx in df.index:
    print(df['Fee'][idx], df['Courses'][idx])

11. Working with Null, np.NaN & Empty Values

Below are some of the articles I have covered on handling None/NaN values in a pandas DataFrame; a minimal sketch of the most common operations follows.
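
The common operations are detecting missing values with isnull(), removing them with dropna(), and replacing them with fillna(). A minimal sketch using the DataFrame created above:


df = pd.DataFrame(technologies, index=row_labels)

# Count missing values per column (None and np.nan both count as missing)
print(df.isnull().sum())

# Drop rows that contain any missing value
df2 = df.dropna()

# Replace missing values with defaults
# Note: empty strings are not treated as missing by default
df2 = df.fillna({'Courses': 'Unknown', 'Fee': 0})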
