Python Pandas Tutorials For Beginners

1. Pandas Introduction

This beginner’s guide to Python pandas explains what a DataFrame is, its features and advantages, and how to use it, with sample examples.

2. What is Python Pandas?

Pandas is one of the most popular open-source libraries for the Python programming language and is widely used for data science, data analysis, and machine learning applications. It is built on top of another popular package named NumPy, which provides scientific computing in Python and supports multi-dimensional arrays. Pandas was originally developed by Wes McKinney; check his GitHub for other projects he is working on.

The following are the main data structures supported by pandas.

pandas Series
pandas DataFrame
pandas Index

2.1 What is Pandas Series

In simple words, a pandas Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, None, Python objects, etc.). The axis labels are collectively referred to as the index. A later section of this pandas tutorial covers more on the Series with examples.

2.2 What is Pandas DataFrame

A pandas DataFrame is a 2-dimensional labeled data structure with rows and columns, where the columns can be of different types (integers, strings, floats, None, Python objects, etc.). You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. A later section of this pandas tutorial covers more on DataFrame with examples.
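
The "dict of Series objects" view can be illustrated with a minimal sketch. The sample values and the labels r1 and r2 are made up for this illustration only.

```python
import pandas as pd

# Two Series sharing the same row labels (hypothetical sample data)
courses = pd.Series(["pandas", "PySpark"], index=["r1", "r2"])
fees = pd.Series([20000, 25000], index=["r1", "r2"])

# Each dict key becomes a column name; the Series index supplies the row labels
df = pd.DataFrame({"Courses": courses, "Fees": fees})
print(df)

# Selecting one column gives back a Series
print(type(df["Fees"]))

# Row labels and column labels are both stored as Index objects
print(df.index)
print(df.columns)
```

Here the row labels and the column names are both pandas Index objects, which is where the third structure in the list above comes in.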

3. Pandas Advantages

4. Pandas vs PySpark

In very simple words, pandas runs operations on a single machine, whereas PySpark can run them across multiple machines. If you are working on a machine learning application that deals with larger datasets, PySpark is usually the better fit, because distributing the work can make some operations many times (up to 100x in some cases) faster than pandas.

PySpark is also widely used in the data science and machine learning community, since many widely used data science libraries are written in Python, including NumPy and TensorFlow. PySpark is also used for its efficient processing of large datasets. It has been adopted by many organizations like Walmart, Trivago, Sanofi, Runtastic, and many more.

PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes) or even on a single node.

Apache Spark is an analytical processing engine for large-scale distributed data processing and machine learning applications.

Spark was originally written in Scala; later, due to its industry adoption, its Python API, PySpark, was released using Py4J. Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects. Hence, to run PySpark you also need Java installed along with Python and Apache Spark.

Additionally, for development you can use the Anaconda distribution (widely used in the machine learning community), which comes with many useful tools like Spyder IDE and Jupyter Notebook to run PySpark applications.

To learn PySpark, see the PySpark tutorials, and read more on pandas vs PySpark differences with examples.

4.1 How to Decide Between Pandas vs PySpark

Below are a few considerations when choosing PySpark over Pandas.

If your data is huge and grows significantly over the years, and you want to improve your processing time.
If you want fault tolerance.
If you need ANSI SQL compatibility.
If you want a choice of language (Spark supports Python, Scala, Java & R).
If you want machine learning capability.
If you would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
If you want to stream data and process it in real time.

5. Installing Pandas

In this section of the Python pandas tutorial, let’s see how to install and upgrade pandas. To run pandas, you should have Python installed first. You can install Python either by downloading it directly from the official Python website or by using the Anaconda distribution. Depending on your needs, install Python, Anaconda, and Jupyter Notebook to run the pandas examples. I recommend installing Anaconda with Jupyter if you intend to learn pandas for data science, analytics, and machine learning.

Once you have either Python or Anaconda set up, you can install pandas on top of it in a few simple steps.

5.1 Install Pandas using Python pip Command

pip (the Python package manager) is used to install third-party packages from PyPI. Using pip, you can install, uninstall, upgrade, or downgrade any Python library that is part of the Python Package Index.

Since the pandas package is available on PyPI (Python Package Index), you can use pip to install the latest version of pandas, including on Windows.


# Install pandas using pip
pip install pandas
# (or)
pip3 install pandas

If your pip is not up to date, upgrade pip to the latest version first.

5.2 Install Pandas using Anaconda conda Command

The Anaconda distribution comes with the conda tool, which is used to install, upgrade, or downgrade most Python and other packages.


# Install pandas using conda
conda install pandas

6. Upgrade Pandas to Latest or Specific Version

To upgrade pandas to the latest or a specific version, you can use either the pip install command or, if you are using the Anaconda distribution, conda. Before you upgrade, check which version of pandas is currently installed, for example with pd.__version__ as shown in section 7.1.
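
A minimal sketch of that version check, run from Python, is shown below. The variable name installed_version is only illustrative.

```python
import pandas as pd

# Read the version string of the pandas package that is currently installed
installed_version = pd.__version__
print(installed_version)
```

From a terminal, pip show pandas reports the same version along with other package details.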

Below are statements to upgrade pandas. Depending on how you want to update, use either the pip or the conda statements.


# Using pip to upgrade pandas
pip install --upgrade pandas

# Alternatively you can also try
python -m pip install --upgrade pandas

# Upgrade pandas to specific version
pip install pandas==specific-higher-version

# Use conda update
conda update pandas

# Upgrade to a specific version (conda update does not accept a version spec, so use conda install)
conda install pandas==0.14.0


7. Run Pandas Hello World Example

7.1 Run Pandas From Command Line

If you installed Anaconda, open the Anaconda command line; otherwise, open the Python shell or command prompt. Then enter the following lines to get the version of pandas.


>>> import pandas as pd
>>> pd.__version__
'1.3.2'
>>>

7.2 Run Pandas From Jupyter

Go to Anaconda Navigator -> Environments -> your environment (I have created pandas-tutorial) -> select Open With Jupyter Notebook

This opens up Jupyter Notebook in the default browser.

Now select New -> PythonX, enter pandas code in a cell (for example, the version check from section 7.1), and select Run.

7.3 Run Pandas from IDE

You can also run pandas from any Python IDE, such as Spyder, PyCharm, etc.

8. Pandas Series Introduction

A pandas Series is a one-dimensional array that can accommodate diverse data types, including integers, strings, floats, Python objects, and more. Using the pd.Series() constructor, we can convert lists, tuples, NumPy arrays, and dictionaries into Series objects. Within a pandas Series, the row labels are referred to as the index. It’s important to note that a Series consists of a single column and cannot hold multiple columns.

8.1 pandas.Series() Constructor

Below is the syntax of the pandas Series constructor, which is used to create a Series object.


# Pandas Series Constructor Syntax
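pandas.Series(data, index=index, dtype=dtype, copy=copy)

data: The data for the Series, such as an ndarray, list, dict, or constant (scalar) value.
index: The index labels. They must be hashable and the same length as data, but they do not have to be unique. Defaults to 0, 1, ..., n-1 (like np.arange(n)) if no index is passed.
dtype: The data type of the Series. If it is not given, it is inferred from data.
copy: Whether to copy the input data.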
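
The parameters above can be tried out with a short sketch. The values and the labels a, b, c are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# data, index and dtype passed explicitly (hypothetical labels)
s = pd.Series(data=values, index=["a", "b", "c"], dtype="float64")
print(s)

# Without an index, pandas assigns the default labels 0, 1, 2
default_index = pd.Series(values)
print(default_index.index)

# copy=True makes the Series independent of the original array,
# so a later change to the array is not visible in the Series
copied = pd.Series(values, copy=True)
values[0] = 99
print(copied.iloc[0])
```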

8.2 Create Pandas Series

A pandas Series can be created in multiple ways: from an array, a list, a dict, or an existing DataFrame.

8.2.1 Creating Series from NumPy Array


# Create Series from array
import pandas as pd
import numpy as np
data = np.array(['python','php','java'])
series = pd.Series(data)
print(series)

8.2.2 Creating Series from Dict


# Create Series from dict
data = {'Courses': "pandas", 'Fees': 20000, 'Duration': "30days"}
s2 = pd.Series(data)
print(s2)
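
When a Series is built from a dict, the dict keys become the index labels. The following sketch repeats the same data to show label-based lookup.

```python
import pandas as pd

data = {'Courses': "pandas", 'Fees': 20000, 'Duration': "30days"}
s2 = pd.Series(data)

# The dict keys are now the index labels
print(s2.index.tolist())

# Look up a value by its label
print(s2['Fees'])
```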

8.2.3 Creating Series from List


# Create Series from list with custom index labels
data = ['python','php','java']
s2 = pd.Series(data, index=['r1', 'r2','r3'])
print(s2)
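
Creating a Series from an existing DataFrame is the remaining case mentioned in section 8.2. A minimal sketch, assuming a small made-up DataFrame, could look like this:

```python
import pandas as pd

# Hypothetical DataFrame used only for this illustration
df = pd.DataFrame(
    {"Courses": ["python", "php", "java"], "Fees": [20000, 15000, 18000]},
    index=["r1", "r2", "r3"],
)

# Selecting a single column returns a Series that keeps the row labels
fees = df["Fees"]
print(fees)

# Selecting a single row with loc also returns a Series, labeled by column name
row = df.loc["r2"]
print(row)
```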

Refer to pandas Series Tutorial For Beginners with Examples.

9. Pandas DataFrame

I have a dedicated tutorial for Python pandas DataFrame; hence, in this section I will only briefly explain what a DataFrame is. A DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: the data, the rows, and the columns.

9.1 DataFrame Features

DataFrames support named rows and columns (you can also provide names for rows).
Pandas DataFrame size is mutable.
Supports heterogeneous collections of data.
DataFrame has labeled axes (rows and columns).
Can perform arithmetic operations on rows and columns.
Supports reading flat files like CSV, Excel, and JSON, and also reading SQL tables.
Handling of missing data.
Handling of missing data.
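
A few of the features listed above (labeled rows and columns, mutable size, arithmetic, and missing data handling) can be seen in a short sketch. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in the Fees column
df = pd.DataFrame(
    {"Courses": ["pandas", "PySpark", "Python"],
     "Fees": [20000.0, np.nan, 15000.0]},
    index=["r1", "r2", "r3"],
)

# Size is mutable: add a new column to the existing DataFrame
df["Discount"] = [1000, 2000, 500]

# Arithmetic between columns is applied row by row
df["Net"] = df["Fees"] - df["Discount"]

# Missing data: count the gaps, then fill them with a chosen value
missing = df["Fees"].isna().sum()
filled = df["Fees"].fillna(0)

print(df)
print(missing)
print(filled)
```

Note that the missing fee in row r2 carries over into the Net column as a missing value, because arithmetic with NaN yields NaN.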

10. Pandas Series vs DataFrame?

Here is a comparison between pandas Series and DataFrames.

Feature	Series	DataFrame
Dimensionality	One-dimensional	Two-dimensional
Structure	Labeled array	Labeled data structure with rows and columns
Components	Consists of data and index	Consists of data, row index, and column index
Data Types	Homogeneous (same data type)	Heterogeneous (different data types per column)
Creation	From lists, arrays, dictionaries, or scalars	From dictionaries, arrays, lists, or other DataFrames
Operations	Supports operations like indexing, slicing, arithmetic operations	Supports operations like merging, joining, grouping, reshaping
Use Cases	Useful for representing a single column of data or simple data structures	Suitable for tabular data with multiple columns and rows
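
The dimensionality and data type rows of the comparison can be checked with a small sketch, using made-up values:

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="Fees")
df = pd.DataFrame({"Courses": ["a", "b", "c"], "Fees": [1, 2, 3]})

# A Series has one dimension, a DataFrame has two
print(s.ndim, s.shape)
print(df.ndim, df.shape)

# A Series has a single dtype; a DataFrame has one dtype per column
print(s.dtype)
print(df.dtypes)
```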

Refer to pandas DataFrame Tutorial For Beginners with Examples

Happy Learning !!