Pandas set_index() – Set Index to DataFrame

  • Post last modified: March 27, 2024

pandas.DataFrame.set_index() sets the index of a pandas DataFrame. With set_index() you can use an existing DataFrame column, a Series, or multiple columns as the index; a plain list of labels can be assigned directly to DataFrame.index. Use pandas.DataFrame.reset_index() to reset the index to the default numeric values.

What is pandas Index?

An index is a set of labels that identifies rows or columns in a DataFrame or Series. Both rows and columns have one: the row labels are simply called the index, while the column index usually consists of column names (labels).
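To make the two kinds of labels concrete, here is a minimal sketch. The frame name `demo` and its values are made up for this illustration and are not part of the article's sample data.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration
demo = pd.DataFrame({'Courses': ['Spark', 'PySpark'], 'Fee': [20000, 25000]})

# Row labels: a default RangeIndex starting at 0
print(demo.index)

# Column labels: also stored as an Index object
print(demo.columns)
```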

pandas.DataFrame.set_index() Key Points

  • An index can be set while creating a pandas DataFrame; use the set_index() method to set the index on an existing DataFrame.
  • You can also set the index from a list (by assigning it to DataFrame.index), a Series, or one or more DataFrame columns. Passing several columns gives the DataFrame multiple index levels.
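The key points above can be sketched as follows. The variable names (`data`, `labels`, and the three result frames) are assumptions chosen for this illustration.

```python
import pandas as pd

data = {'Courses': ['Spark', 'PySpark', 'Hadoop'], 'Fee': [20000, 25000, 26000]}

# Index supplied at construction time
df_at_creation = pd.DataFrame(data, index=['r1', 'r2', 'r3'])

# Index taken from an existing column afterwards
df_from_column = pd.DataFrame(data).set_index('Courses')

# Index taken from a separate Series of the same length;
# the Series name becomes the index name
labels = pd.Series(['a', 'b', 'c'], name='label')
df_from_series = pd.DataFrame(data).set_index(labels)

print(df_at_creation.index.tolist())
print(df_from_column.index.tolist())
print(df_from_series.index.tolist())
```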

1. Quick Examples of pandas Set Index

Below are quick examples and usage of pandas.DataFrame.set_index() method.


# Quick examples; df is the sample DataFrame created in section 2 below.

# Set list to index
index_labels=['r1','r2','r3']
df.index = index_labels

# Set single column as index
df2 = df.set_index('Courses')

# Append index
df2 = df.set_index('Courses', append=True)

# Set multiple columns as Index
df2 = df.set_index(['Courses','Duration'])

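# Set date time as index
# Start_Date mixes two date layouts, so pandas 2.0+ needs format='mixed' to parse it
df2 = df.set_index(pd.DatetimeIndex(pd.to_datetime(df['Start_Date'], format='mixed')))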

2. pandas.DataFrame.set_index() Syntax

Below is the syntax of the set_index() method.


# Pandas DataFrame set_index() syntax
DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)

This method takes the parameters below and returns a new DataFrame with the index set. If you use inplace=True, it returns None and sets the index on the existing DataFrame object.

  • keys – A single column name as a string, a list of column names, or an array (Series, Index, or NumPy array) of the same length as the DataFrame.
  • drop – Deletes the columns used as the new index. Defaults to True.
  • append – Appends the new index to the existing index instead of replacing it. Defaults to False.
  • inplace – Modifies the existing DataFrame object in place. Defaults to False.
  • verify_integrity – Checks the new index for duplicates. Defaults to False. Setting it to True can slow the method down.
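As a small sketch of two of these parameters, the example below uses a made-up frame (`df_dup`, with a deliberately duplicated course name) to show what verify_integrity=True and inplace=True do.

```python
import pandas as pd

# Hypothetical frame with a duplicated value in the column used as the index
df_dup = pd.DataFrame({'Courses': ['Spark', 'Spark'], 'Fee': [20000, 25000]})

# verify_integrity=True raises ValueError when the new index has duplicates
try:
    df_dup.set_index('Courses', verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc
print(duplicate_error)

# inplace=True modifies df_dup itself and returns None
result = df_dup.set_index('Courses', inplace=True)
print(result)
print(df_dup.index.name)
```

Without verify_integrity, the duplicated labels are accepted silently, which is why the second call succeeds.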

Let’s create a pandas DataFrame, run the above examples, and validate results.


# Create DataFrame
import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :[20000,25000,26000],
    'Duration':['30day','40days','35days'],
    'Discount':[1000,np.nan,1200],
    'Start_Date' : ['2021-02-04 05:30:00','01-09-2021 06:30:00',
                    '2021-03-06 07:30:00']
              }

df = pd.DataFrame(technologies)
print(df)

# Output:
#   Courses    Fee Duration  Discount           Start_Date
# 0    Spark  20000    30day    1000.0  2021-02-04 05:30:00
# 1  PySpark  25000   40days       NaN  01-09-2021 06:30:00
# 2   Hadoop  26000   35days    1200.0  2021-03-06 07:30:00

3. pandas Set Index Example

Since we did not provide an index when creating the DataFrame above, pandas assigns a default incremental sequence of numbers (starting at 0) as row labels. You can change the index by assigning a list of values to the DataFrame.index attribute.


# Set list to index
index_labels=['r1','r2','r3']
df.index = index_labels
print(df)

# Output:
#    Courses    Fee Duration  Discount           Start_Date
# r1    Spark  20000    30day    1000.0  2021-02-04 05:30:00
# r2  PySpark  25000   40days       NaN  01-09-2021 06:30:00
# r3   Hadoop  26000   35days    1200.0  2021-03-06 07:30:00

If you want, you can also give the index a name using rename_axis().
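A minimal sketch of naming the index is shown here. The index name 'row_id' and the frame `df_named` are hypothetical and chosen only for this example.

```python
import pandas as pd

df_named = pd.DataFrame({'Fee': [20000, 25000, 26000]}, index=['r1', 'r2', 'r3'])

# rename_axis returns a new DataFrame whose index carries the given name
df_named = df_named.rename_axis('row_id')
print(df_named.index.name)
```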

4. Setting Single Column as Index by using set_index()

Sometimes you need to set one of the existing DataFrame columns as the index; you can do this with the set_index() method. After setting the index, pandas drops that column from the DataFrame. To retain it, use the drop=False parameter.


# Set single column as index
df2 = df.set_index('Courses')
print(df2)

# Output:
#           Fee Duration  Discount           Start_Date
# Courses                                               
# Spark    20000    30day    1000.0  2021-02-04 05:30:00
# PySpark  25000   40days       NaN  01-09-2021 06:30:00
# Hadoop   26000   35days    1200.0  2021-03-06 07:30:00
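The drop=False behaviour mentioned above can be sketched on a small stand-in frame (`df_demo` is an assumption for this example, not the article's df).

```python
import pandas as pd

df_demo = pd.DataFrame({'Courses': ['Spark', 'PySpark'], 'Fee': [20000, 25000]})

# Default (drop=True): 'Courses' moves into the index and leaves the columns
dropped = df_demo.set_index('Courses')
print(dropped.columns.tolist())

# drop=False: 'Courses' becomes the index and also stays as a column
kept = df_demo.set_index('Courses', drop=False)
print(kept.columns.tolist())
```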

Note that setting the index replaces the existing index in the DataFrame. If you want to retain the existing index and append the new one to it, use append=True.


# Append index
df2 = df.set_index('Courses', append=True)
print(df2)

# Output:
#              Fee Duration  Discount           Start_Date
#   Courses                                               
# r1 Spark    20000    30day    1000.0  2021-02-04 05:30:00
# r2 PySpark  25000   40days       NaN  01-09-2021 06:30:00
# r3 Hadoop   26000   35days    1200.0  2021-03-06 07:30:00

5. pandas Set Index with Multiple Columns

You can also set multiple columns as the index in pandas. To do so, pass all the column names in a list to the DataFrame.set_index() method.


# Set multiple columns as Index
df2 = df.set_index(['Courses','Duration'])
print(df2)

# Output:
#                    Fee  Discount           Start_Date
# Courses Duration                                      
# Spark   30day     20000    1000.0  2021-02-04 05:30:00
# PySpark 40days    25000       NaN  01-09-2021 06:30:00
# Hadoop  35days    26000    1200.0  2021-03-06 07:30:00
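Setting several columns produces a MultiIndex. The sketch below, on a reduced stand-in frame (`df_demo` is an assumption for this example), shows how such an index might be inspected and used for a lookup.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Courses': ['Spark', 'PySpark', 'Hadoop'],
    'Duration': ['30day', '40days', '35days'],
    'Fee': [20000, 25000, 26000],
})
multi = df_demo.set_index(['Courses', 'Duration'])

# The result carries a two-level MultiIndex named after the columns
print(type(multi.index).__name__)
print(list(multi.index.names))

# Look up one value by supplying a label for each level as a tuple
fee = multi.loc[('Spark', '30day'), 'Fee']
print(fee)
```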

6. pandas Set Index to datetime

When you work with dates and times and want to filter on them, it is good practice to set the datetime column as the index. Before you do this, make sure your date column is in datetime format, for example by converting it with pandas.to_datetime(). Then use pandas.DatetimeIndex() to convert the datetime values to an index.


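# Set date time as index
# Start_Date mixes two date layouts, so pandas 2.0+ needs format='mixed' to parse it
df2 = df.set_index(pd.DatetimeIndex(pd.to_datetime(df['Start_Date'], format='mixed')))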
print(df2)

# Output:
#                     Courses    Fee Duration  Discount           Start_Date
# Start_Date                                                                 
# 2021-02-04 05:30:00    Spark  20000    30day    1000.0  2021-02-04 05:30:00
# 2021-01-09 06:30:00  PySpark  25000   40days       NaN  01-09-2021 06:30:00
# 2021-03-06 07:30:00   Hadoop  26000   35days    1200.0  2021-03-06 07:30:00

Running print(df2.info()) produces the output below; the trailing None is the return value of info().


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2021-02-04 05:30:00 to 2021-03-06 07:30:00
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Courses     3 non-null      object 
 1   Fee         3 non-null      int64  
 2   Duration    3 non-null      object 
 3   Discount    2 non-null      float64
 4   Start_Date  3 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 144.0+ bytes
None
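Once a DatetimeIndex is in place, the filtering mentioned earlier becomes straightforward. The following is a minimal sketch on a stand-in frame (`df_demo`, with all dates written in one consistent layout, is an assumption for this example).

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Courses': ['Spark', 'PySpark', 'Hadoop'],
    'Start_Date': ['2021-02-04 05:30:00', '2021-01-09 06:30:00',
                   '2021-03-06 07:30:00'],
})

# Convert to datetime, use it as the index, and sort chronologically
by_date = df_demo.set_index(pd.to_datetime(df_demo['Start_Date'])).sort_index()

# Partial string indexing: every row whose timestamp falls in February 2021
feb = by_date.loc['2021-02']
print(feb['Courses'].tolist())

# Label-based slice between two dates (both ends inclusive)
jan_to_feb = by_date.loc['2021-01-01':'2021-02-28']
print(jan_to_feb['Courses'].tolist())
```

Sorting the index first keeps date-range slices predictable, since slicing an unsorted DatetimeIndex can raise an error or behave unexpectedly.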

If you want to convert the index back to a column, use DataFrame.reset_index(). There are also several other ways to set indices.
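As a short sketch of going the other way, the example below uses a stand-in frame (`df_demo` is an assumption for this example) to show reset_index() with and without drop=True.

```python
import pandas as pd

df_demo = pd.DataFrame({'Courses': ['Spark', 'PySpark'], 'Fee': [20000, 25000]})
indexed = df_demo.set_index('Courses')

# reset_index moves 'Courses' back into a regular column
# and restores the default 0..n-1 row labels
restored = indexed.reset_index()
print(restored.columns.tolist())

# drop=True discards the old index instead of keeping it as a column
discarded = indexed.reset_index(drop=True)
print(discarded.columns.tolist())
```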

7. Complete Example of pandas Set Index


import pandas as pd
import numpy as np
technologies = {
    'Courses':["Spark","PySpark","Hadoop"],
    'Fee' :[20000,25000,26000],
    'Duration':['30day','40days','35days'],
    'Discount':[1000,np.nan,1200],
    'Start_Date' : ['2021-02-04 05:01:21','01-09-2021 06:03:41',
                    '2021-03-06 07:06:21']
              }

df = pd.DataFrame(technologies)
print(df)

# Set list to index
index_labels=['r1','r2','r3']
df.index = index_labels
print(df)

# Set single column as index
df2 = df.set_index('Courses')
print(df2)

# Append index
df2 = df.set_index('Courses', append=True)
print(df2)

# Set multiple columns as Index
df2 = df.set_index(['Courses','Duration'])
print(df2)

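# Set date time as index
# Start_Date mixes two date layouts, so pandas 2.0+ needs format='mixed' to parse it
df2 = df.set_index(pd.DatetimeIndex(pd.to_datetime(df['Start_Date'], format='mixed')))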
print(df2)
print(df2.info())

8. Conclusion

In this article, you have learned the pandas.DataFrame.set_index() syntax and usage, with examples such as setting a list or a DataFrame column as the index. You have also learned how to set multiple columns and a datetime column as the index of a DataFrame.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.