Python pandas Tutorial | Introduction with Examples
1. pandas Tutorial Introduction
This is a beginner’s guide to the pandas DataFrame. You will learn what a pandas DataFrame is, its features and advantages, and how to use it, with sample examples.
Every example explained in this tutorial is tested in our development environment and is available for reference.
All pandas DataFrame examples provided in this tutorial are basic, simple, and easy to practice for beginners who are enthusiastic to learn pandas and advance their career in Data Science, analytics and Machine Learning.
Note: If you can’t find the pandas DataFrame examples you are looking for on this tutorial page, I recommend using the Search option in the menu bar to find your tutorial and sample code. There are hundreds of pandas tutorials on this website you can learn from.
2. What is pandas?
pandas is one of the most popular open-source libraries for the Python programming language and is widely used for data science, data analysis, and machine learning applications. It is built on top of another popular package named NumPy, which provides scientific computing in Python and supports multi-dimensional arrays. pandas was originally developed by Wes McKinney; check his GitHub for other projects he is working on.
pandas provides two main data structures:
- pandas Series
- pandas DataFrame
2.1 What is pandas Series
In simple words, a pandas Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, None, Python objects, etc.). The axis labels are collectively referred to as the index. The later section of this pandas tutorial covers more on Series with examples.
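As a minimal sketch of the description above, the snippet below builds a Series with an explicit index and looks up a value by its label. The names and values are made up for illustration.

```python
import pandas as pd

# A one-dimensional labeled array; the labels and values are illustrative
scores = pd.Series([85, 92.5, 78], index=["alice", "bob", "carol"])

print(scores)                  # values shown next to their labels
print(scores.index.tolist())   # the axis labels, i.e. the index
print(scores["bob"])           # label-based lookup
```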
2.2 What is pandas DataFrame
DataFrame is a two-dimensional labeled data structure with rows and columns, where the columns can hold different types (integers, strings, floats, None, Python objects, etc.). You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. The later section of this pandas tutorial covers more on DataFrame with examples.
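To make the "dict of Series" picture concrete, here is a small sketch that builds a DataFrame from a dictionary of columns. The column names and values are hypothetical.

```python
import pandas as pd

# Each dictionary key becomes a column; each column has its own dtype
df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "age": [30, 25, 41],
    "height_m": [1.65, 1.80, 1.72],
})

print(df)
print(df.dtypes)   # one dtype per column
print(df["age"])   # selecting a single column returns a Series
```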
3. pandas Advantages
4. Pandas vs PySpark
In very simple words, pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application that deals with larger datasets, PySpark is a better fit, since it can process operations many times (up to 100x) faster than pandas.
PySpark is also widely used in the data science and machine learning community, since many widely used data science libraries are written in Python, including NumPy and TensorFlow. It is also used for its efficient processing of large datasets. PySpark has been used by many organizations like Walmart, Trivago, Sanofi, Runtastic, and many more.
PySpark is a Spark library written in Python to run Python applications using Apache Spark capabilities. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes) or even on a single node.
Apache Spark is an analytical processing engine for large-scale distributed data processing and machine learning applications.
Spark is written in Scala; later, due to its industry adoption, the PySpark API was released for Python using Py4J.
Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects. Hence, to run PySpark you also need Java installed along with Python and Apache Spark.
Additionally, for development, you can use the Anaconda distribution (widely used in the machine learning community), which comes with many useful tools like Spyder IDE and Jupyter Notebook to run PySpark applications.
You can learn PySpark from the following tutorials. Also read more on pandas vs PySpark differences with examples.
- What is PySpark
- PySpark RDD Tutorial
- PySpark DataFrame Tutorial
- How to Convert pandas DataFrame to PySpark DataFrame
- How to convert PySpark DataFrame to Pandas
4.1 How to Decide Between Pandas vs PySpark
Below are a few considerations for when to choose PySpark over pandas.
- If your data is huge and grows significantly over the years, and you want to improve your processing time.
- If you want fault tolerance.
- If you need ANSI SQL compatibility.
- If you need a choice of language (Spark supports Python, Scala, Java & R).
- When you want machine learning capability.
- If you would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
- If you want to stream the data and process it in real time.
5. Installing pandas
In this section of the pandas DataFrame tutorial, let’s see how to install and upgrade pandas. In order to run pandas, you should have Python installed first. You can install Python either by downloading it directly from the Python website or by using the Anaconda distribution. Depending on your need, follow the links below to install Python, Anaconda, and Jupyter Notebook to run pandas examples. I would recommend installing Anaconda with Jupyter if you intend to learn pandas for data science, analytics & machine learning.
- Step-by-Step Instruction of Install Anaconda & Pandas
- Run pandas from Anaconda & Jupyter Notebook
- Install Python & Run pandas from Windows
Once you have either Python or Anaconda set up, you can install pandas on top of it in a few simple steps.
5.1 Install pandas using Python pip Command
pip (Python package manager) is used to install third-party packages from PyPI. Using pip you can install, uninstall, upgrade, or downgrade any Python library that is part of the Python Package Index.
Since the pandas package is available in PyPI (Python Package Index), we can use pip to install the latest version of pandas on Windows.
# Install pandas using pip
pip install pandas
# (or)
pip3 install pandas
This should give you output as below. If your pip is not up to date, then upgrade pip to the latest version.
5.2 Install pandas using Anaconda conda Command
The Anaconda distribution comes with the conda tool, which is used to install, upgrade, or downgrade most Python and other packages.
# Install pandas using conda
conda install pandas
6. Upgrade pandas to Latest or Specific Version
In order to upgrade pandas to the latest or a specific version, you can use either the pip install command or conda install if you are using the Anaconda distribution. Before you start to upgrade, use the following command to know the current version of pandas installed.
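One way to check the installed version, sketched here from inside Python, is to print the library's `__version__` attribute. From a terminal, `pip show pandas` reports the same information.

```python
import pandas as pd

# Print the version of pandas that this interpreter imports
print(pd.__version__)
```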
Below are statements to upgrade pandas. Depending on how you want to update, use either pip or conda.
# Using pip to upgrade pandas
pip install --upgrade pandas

# Alternatively you can also try
python -m pip install --upgrade pandas

# Upgrade pandas to specific version
pip install pandas==specific-higher-version

# Use conda update
conda update pandas

# Upgrade to specific version (conda update does not accept a version spec, so use conda install)
conda install pandas=0.14.0
If you use pip3 to upgrade, you should see something like below.
7. Run pandas Hello World Example
7.1 Run pandas From Command Line
If you installed Anaconda, open the Anaconda command line; otherwise, open the Python shell or command prompt. Enter the following lines to get the version of pandas. To learn more, follow the links on the left-hand side of the pandas tutorial.
>>> import pandas as pd
>>> pd.__version__
'1.3.2'
>>>
7.2 Run pandas From Jupyter
Go to Anaconda Navigator -> Environments -> your environment (I have created pandas-tutorial) -> select Open With Jupyter Notebook
This opens up Jupyter Notebook in the default browser.
Now select New -> PythonX and enter the below lines and select Run.
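The lines for the notebook cell are not shown here, so the following is only a sketch of what a "hello world" cell could look like: it prints the pandas version and a tiny DataFrame. The column name and values are illustrative, not taken from the original tutorial.

```python
import pandas as pd

# A tiny "hello world" DataFrame; the column name and values are illustrative
df = pd.DataFrame({"greeting": ["Hello", "World"]})

print(pd.__version__)  # confirms that pandas is importable in this kernel
print(df)
```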
7.3 Run pandas from IDE
You can also run pandas from any Python IDE, such as Spyder, PyCharm, etc.
8. pandas Series Introduction
Refer to pandas Series Tutorial For Beginners with Examples.
9. pandas DataFrame
A pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). A pandas DataFrame consists of three principal components: data, rows, and columns.
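The three components can be inspected directly. The sketch below uses hypothetical row and column labels and reads the row labels, the column labels, and the underlying data back out of a small DataFrame.

```python
import pandas as pd

# Row labels, column labels, and values are made up for illustration
df = pd.DataFrame(
    [[1, "a"], [2, "b"]],
    index=["r1", "r2"],
    columns=["num", "letter"],
)

print(df.index)       # rows: the row labels
print(df.columns)     # columns: the column labels
print(df.to_numpy())  # data: the underlying values as a NumPy array
```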
9.1 pandas Features
- DataFrames support named rows & columns (you can also provide names for rows).
- pandas DataFrame size is mutable.
- Supports heterogeneous collections of data.
- DataFrame has labeled axes (rows and columns).
- Can perform arithmetic operations on rows and columns.
- Supports reading flat files like CSV, Excel, and JSON, and also reads SQL tables.
- Handling of missing data.
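A few of the features listed above can be seen together in one short sketch. It reads CSV text from an in-memory buffer (standing in for a file on disk), detects a missing value, fills it, and does column arithmetic. The data is hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV content, read from an in-memory buffer instead of a file
csv_text = "item,price,qty\npen,1.5,10\nbook,12.0,\nbag,30.0,2\n"
df = pd.read_csv(io.StringIO(csv_text), index_col="item")

# Missing data: the empty qty for "book" is read as NaN
print(df["qty"].isna())

# Fill the missing value, then multiply two columns element-wise
df["qty"] = df["qty"].fillna(0)
df["total"] = df["price"] * df["qty"]
print(df)
```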
Happy Learning !!