• Post author:
  • Post category:Pandas
  • Post last modified:March 27, 2024
  • Reading time:19 mins read
You are currently viewing Pandas Read Excel with Examples

pandas.read_excel() function is used to read Excel sheets with the extension xlsx into pandas DataFrame. By reading a single sheet it returns a pandas DataFrame object, but by reading two sheets it returns a Dict of DataFrame.

pandas Read Excel KeyPoints

  • This supports reading files with extensions xls, xlsx, xlsm, xlsb, odf, ods, and odt 
  • Can load Excel files stored in a local filesystem or from a URL.
  • For URL, it supports http, ftp, s3, and file.
  • Also supports reading from a single sheet or a list of sheets.
  • When reading two sheets, it returns a Dict of DataFrame.

Table of contents –

I have an Excel file with two sheets named Technologies and Schedule, I will be using this to demonstrate how to read into pandas DataFrame.

Notice that in our Excel file, the top row contains the header of the table which can be used as column names on DataFrame.

1. pandas Read Excel Sheet

Use the pandas.read_excel() function to read the Excel sheet into pandas DataFrame, by default it loads the first sheet from the Excel file and parses the first row as a DataFrame column name. Excel file has an extension .xlsx. This function also supports several extensions xls, xlsx, xlsm, xlsb, odf, ods, and odt.

Following are some of the features supported by read_excel() with optional parameters.

  • Reading Excel files from URL, S3, and from local file ad supports several extensions.
  • Ignoring the column names and provides an option to set column names.
  • Setting column as Index
  • Considering multiple values as NaN
  • Decimal points to use for numbers
  • Data types for each column
  • Skipping rows and columns

I will cover how to use some of these optional params with examples, let’s see how to read an Excel sheet & create a DataFrame without any params.


import pandas as pd
# Read Excel file
df = pd.read_excel('c:/apps/courses_schedule.xlsx')
print(df)

# Outputs
#   Courses    Fee  Duration  Discount
#0    Spark  25000   50 Days      2000
#1   Pandas  20000   35 Days      1000
#2     Java  15000       NaN       800
#3   Python  15000   30 Days       500
#4      PHP  18000   30 Days       800

Related: pandas Write to Excel Sheet

2. Read by Ignoring Column Names

By default, it considers the first row from Excel as a header and uses it as a DataFrame column name. In case you want to consider the first row from Excel as a data record use header=None param and use names param to specify the column names. Not specifying names results in column names with numerical numbers.


# Read excel by considering first row as data
columns = ['courses','course_fee','course_duration','course_discount']
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   header=None, names = columns)
print(df2)

# Outputs        
#0  courses  course_fee  Duration  Discount
#1    Spark  25000   50 Days      2000
#2   Pandas  20000   35 Days      1000
#3     Java  15000       NaN       800
#4   Python  15000   30 Days       500
#5      PHP  18000   30 Days       800

3. Set Column from Excel as Index

If you notice, the DataFrame was created with the default index, if you wanted to set the column name as index use index_col param. This param takes values {int, list of int, default None}. If a list is passed with header positions, it creates a MultiIndex.

By default, it is set to None meaning no column is set as an index.


# Read excel by setting column as index
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   index_col=0)
print(df2)

# Outputs
#            Fee Duration  Discount
# 
# Spark    25000  50 Days      2000
# Pandas   20000  35 Days      1000
# Java     15000      NaN       800
# Python   15000  30 Days       500
# PHP      18000  30 Days       800

4. Read Excel by Sheet Name

As I said in the above section by default pandas read the first sheet from the Excel file and provide a sheet_name param to read a specific sheet by name. This param takes {str, int, list, or None} as values. This is also used to load a sheet by position.

By default, it is set to 0 meaning load the first sheet.


# Read specific excel sheet
df = pd.read_excel('records.xlsx', sheet_name='Sheet1')
print(df)

5. Read Two Sheets

sheet_name param also takes a list of sheet names as values that can be used to read two sheets into Pandas DataFrame. Not that while reading two sheets it returns a Dict of DataFrame. The key in Dict is a sheet name and the value would be DataFrame.

Use None to load all sheets from Excel and return a Dict of Dictionary.


# Read Multiple sheets
dict_df = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   sheet_name=['Technologies','Schedule'])

# Get DataFrame from Dict
technologies_df = dict_df .get('Technologies')
schedule_df = dict_df.get('Schedule')

# Print DataFrame's
print(technologies_df)
print(schedule_df)

I will leave this to you to execute and validate the output.

6. Skip Columns From Excel Sheet

Sometimes while reading an Excel sheet into pandas DataFrame you may need to skip columns, you can do this by using usecols param. This takes values {int, str, list-like, or callable default None}. To specify the list of column names or positions use a list of strings or a list of int.

By default, it is set to None meaning load all columns


# Read excel by skipping columns
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   usecols=['Courses', 'Duration'])
print(df2)
# Outputs
#  Courses Duration
#0   Spark  50 Days
#1  Pandas  35 Days
#2    Java      NaN
#3  Python  30 Days
#4     PHP  30 Days

Alternatively, you can also write it by column position.


# Skip columns with list of values
df = pd.read_excel('records.xlsx', usecols=[0,2])
print(df)

Also supports a range of columns as values. For example, value ‘B:D’ means parsing B, C, and D columns.


# Skip columns by range
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   usecols='B:D')
print(df2)

     Fee Duration  Discount
0  25000  50 Days      2000
1  20000  35 Days      1000
2  15000      NaN       800
3  15000  30 Days       500
4  18000  30 Days       800

7. Skip Rows from the Excel Sheet

Use skiprows param to skip rows from the Excel file, this param takes values {list-like, int, or callable, optional}. With this, you can skip the first few rows, selected rows, and range of rows. The below example skips the first 3 rows and considers the 4th row from Excel as the header.


# Read excel file by skipping rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=2)
print(df2)

   Pandas  20000  35 Days  1000
0    Java  15000      NaN   800
1  Python  15000  30 Days   500
2     PHP  18000  30 Days   800

Use header=None to consider the 4th row as data. you can also use a list of rows to skip.


# Using skiprows to skip rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=[1,3])
print(df2)

  Courses    Fee Duration  Discount
0  Pandas  20000  35 Days      1000
1  Python  15000  30 Days       500
2     PHP  18000  30 Days       800

By using a lambda expression.


# Using skiprows with lambda
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=lambda x: x in [1,3])
print(df2)

8. Other Important Params

  • dtype – Dict with column name and type.
  • nrows – How many rows to parse?
  • na_values – Additional strings to recognize as NA/NaN. 
  • keep_default_na – Whether or not to include the default NaN values when parsing the data. 
  • na_filter – Filters missing values.
  • parse_dates – Specify the column index you wanted to parse as dates
  • thousands – Thousands separator for parsing string columns to numeric.
  • skipfooter – Specify how to rows you want to skip from the footer.
  • mangle_dupe_cols – Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, 

For complete parameters and description, refer to pandas documentation.

Frequently Asked Questions (FAQ) on Pandas Read Excel with Examples

How do I read an Excel file using Pandas?

Use the pandas.read_excel() function to read an Excel file. Example, df = pd.read_excel('file.xlsx')

How can I specify a particular sheet to read from an Excel file?

You can pass sheet_name parameter along with 'file.xlsx' into pandas.read_excel() function to read the particular sheet from an Excel file. Example, df = pd.read_excel('file.xlsx', sheet_name='Sheet1')

What if I want to read multiple sheets into a dictionary of DataFrames?

You can set sheet_name to None or use sheet_name=None of pandas.read_excel() function to read the multiple sheets into a dictionary of DataFrame, for example, dfs = pd.read_excel(‘file.xlsx’, sheet_name=None)

How can I read only a specific range of cells from an Excel file?

You can use the skiprows and nrows parameters of pandas.read_excel() function to read only a specific range of cells from an Excel file. For example, df = pd.read_excel('file.xlsx', skiprows=2, nrows=5)

How can I read Excel files from the web directly?

You can provide a URL instead of a file path to read Excel files from the web directly. For example, url = 'https://example.com/file.xlsx'
df = pd.read_excel(url)

Conclusion

In this article, you have learned how to read an Excel sheet and convert it into DataFrame by ignoring headers, skipping rows, skipping columns, specifying column names, and many more.

Happy Learning !!

Related Articles

References

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, He has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen journey in the field of data engineering has been a continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with the data as he come across. Follow Naveen @ LinkedIn and Medium

This Post Has 2 Comments

  1. NNK

    Thanks, Sudhakar for pointing out. I have corrected it now.

  2. Sudhakar

    In 5.

    schedule_df = df.get(‘Schedule’)

    The above line should be

    schedule_df = dict_df.get(‘Schedule’)

Comments are closed.