pandas.read_excel() function is used to read Excel sheets with the extension xlsx into pandas DataFrame. By reading a single sheet it returns a pandas DataFrame object, but by reading two sheets it returns a Dict of DataFrame.
pandas Read Excel KeyPoints
- This supports reading files with extensions xls, xlsx, xlsm, xlsb, odf, ods, and odt
- Can load Excel files stored in a local filesystem or from a URL.
- For URL, it supports http, ftp, s3, and file.
- Also supports reading from a single sheet or a list of sheets.
- When reading two sheets, it returns a Dict of DataFrame.
Table of contents –
- Read Excel Sheet into DataFrame
- Read by Ignoring Column Names
- Set Column from Excel as Index
- Read Excel by Sheet Name
- Read Two Sheets
- Skip Columns From Excel
- Skip Rows From Excel
- Other Important Params
I have an Excel file with two sheets named Technologies
and Schedule
, I will be using this to demonstrate how to read into pandas DataFrame.
Notice that in our Excel file, the top row contains the header of the table which can be used as column names on DataFrame.
1. pandas Read Excel Sheet
Use the pandas.read_excel() function to read the Excel sheet into pandas DataFrame, by default it loads the first sheet from the Excel file and parses the first row as a DataFrame column name. Excel file has an extension .xlsx. This function also supports several extensions xls, xlsx, xlsm, xlsb, odf, ods, and odt.
Following are some of the features supported by read_excel() with optional parameters.
- Reading Excel files from URL, S3, and from local file ad supports several extensions.
- Ignoring the column names and provides an option to set column names.
- Setting column as Index
- Considering multiple values as NaN
- Decimal points to use for numbers
- Data types for each column
- Skipping rows and columns
I will cover how to use some of these optional params with examples, let’s see how to read an Excel sheet & create a DataFrame without any params.
import pandas as pd
# Read Excel file
df = pd.read_excel('c:/apps/courses_schedule.xlsx')
print(df)
# Outputs
# Courses Fee Duration Discount
#0 Spark 25000 50 Days 2000
#1 Pandas 20000 35 Days 1000
#2 Java 15000 NaN 800
#3 Python 15000 30 Days 500
#4 PHP 18000 30 Days 800
Related: pandas Write to Excel Sheet
2. Read by Ignoring Column Names
By default, it considers the first row from Excel as a header and uses it as a DataFrame column name. In case you want to consider the first row from Excel as a data record use header=None param and use names param to specify the column names. Not specifying names results in column names with numerical numbers.
# Read excel by considering first row as data
columns = ['courses','course_fee','course_duration','course_discount']
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx',
header=None, names = columns)
print(df2)
# Outputs
#0 courses course_fee Duration Discount
#1 Spark 25000 50 Days 2000
#2 Pandas 20000 35 Days 1000
#3 Java 15000 NaN 800
#4 Python 15000 30 Days 500
#5 PHP 18000 30 Days 800
3. Set Column from Excel as Index
If you notice, the DataFrame was created with the default index, if you wanted to set the column name as index use index_col
param. This param takes values {int, list of int, default None}. If a list is passed with header positions, it creates a MultiIndex.
By default, it is set to None
meaning no column is set as an index.
# Read excel by setting column as index
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx',
index_col=0)
print(df2)
# Outputs
# Fee Duration Discount
#
# Spark 25000 50 Days 2000
# Pandas 20000 35 Days 1000
# Java 15000 NaN 800
# Python 15000 30 Days 500
# PHP 18000 30 Days 800
4. Read Excel by Sheet Name
As I said in the above section by default pandas read the first sheet from the Excel file and provide a sheet_name
param to read a specific sheet by name. This param takes {str, int, list, or None} as values. This is also used to load a sheet by position.
By default, it is set to 0 meaning load the first sheet.
# Read specific excel sheet
df = pd.read_excel('records.xlsx', sheet_name='Sheet1')
print(df)
5. Read Two Sheets
sheet_name
param also takes a list of sheet names as values that can be used to read two sheets into Pandas DataFrame. Not that while reading two sheets it returns a Dict of DataFrame. The key in Dict is a sheet name and the value would be DataFrame.
Use None
to load all sheets from Excel and return a Dict of Dictionary.
# Read Multiple sheets
dict_df = pd.read_excel('c:/apps/courses_schedule.xlsx',
sheet_name=['Technologies','Schedule'])
# Get DataFrame from Dict
technologies_df = dict_df .get('Technologies')
schedule_df = dict_df.get('Schedule')
# Print DataFrame's
print(technologies_df)
print(schedule_df)
I will leave this to you to execute and validate the output.
6. Skip Columns From Excel Sheet
Sometimes while reading an Excel sheet into pandas DataFrame you may need to skip columns, you can do this by using usecols
param. This takes values {int, str, list-like, or callable default None}. To specify the list of column names or positions use a list of strings or a list of int.
By default, it is set to None meaning load all columns
# Read excel by skipping columns
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx',
usecols=['Courses', 'Duration'])
print(df2)
# Outputs
# Courses Duration
#0 Spark 50 Days
#1 Pandas 35 Days
#2 Java NaN
#3 Python 30 Days
#4 PHP 30 Days
Alternatively, you can also write it by column position.
# Skip columns with list of values
df = pd.read_excel('records.xlsx', usecols=[0,2])
print(df)
Also supports a range of columns as values. For example, value ‘B:D’ means parsing B, C, and D columns.
# Skip columns by range
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx',
usecols='B:D')
print(df2)
Fee Duration Discount
0 25000 50 Days 2000
1 20000 35 Days 1000
2 15000 NaN 800
3 15000 30 Days 500
4 18000 30 Days 800
7. Skip Rows from the Excel Sheet
Use skiprows
param to skip rows from the Excel file, this param takes values {list-like, int, or callable, optional}. With this, you can skip the first few rows, selected rows, and range of rows. The below example skips the first 3 rows and considers the 4th row from Excel as the header.
# Read excel file by skipping rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx',
skiprows=2)
print(df2)
Pandas 20000 35 Days 1000
0 Java 15000 NaN 800
1 Python 15000 30 Days 500
2 PHP 18000 30 Days 800
Use header=None
to consider the 4th row as data. you can also use a list of rows to skip.
# Using skiprows to skip rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx',
skiprows=[1,3])
print(df2)
Courses Fee Duration Discount
0 Pandas 20000 35 Days 1000
1 Python 15000 30 Days 500
2 PHP 18000 30 Days 800
By using a lambda expression.
# Using skiprows with lambda
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx',
skiprows=lambda x: x in [1,3])
print(df2)
8. Other Important Params
- dtype – Dict with column name and type.
- nrows – How many rows to parse?
- na_values – Additional strings to recognize as NA/NaN.
- keep_default_na – Whether or not to include the default NaN values when parsing the data.
- na_filter – Filters missing values.
- parse_dates – Specify the column index you wanted to parse as dates
- thousands – Thousands separator for parsing string columns to numeric.
- skipfooter – Specify how to rows you want to skip from the footer.
- mangle_dupe_cols – Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’,
For complete parameters and description, refer to pandas documentation.
Frequently Asked Questions (FAQ) on Pandas Read Excel with Examples
Use the pandas.read_excel()
function to read an Excel file. Example, df = pd.read_excel('file.xlsx')
You can pass sheet_name
parameter along with 'file.xlsx'
into pandas.read_excel()
function to read the particular sheet from an Excel file. Example, df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
You can set sheet_name
to None
or use sheet_name=None
of pandas.read_excel()
function to read the multiple sheets into a dictionary of DataFrame, for example, dfs = pd.read_excel(‘file.xlsx’, sheet_name=None)
You can use the skiprows
and nrows
parameters of pandas.read_excel()
function to read only a specific range of cells from an Excel file. For example, df = pd.read_excel('file.xlsx', skiprows=2, nrows=5)
You can provide a URL instead of a file path to read Excel files from the web directly. For example, url = 'https://example.com/file.xlsx'
df = pd.read_excel(url)
Conclusion
In this article, you have learned how to read an Excel sheet and convert it into DataFrame by ignoring headers, skipping rows, skipping columns, specifying column names, and many more.
Happy Learning !!
Related Articles
- pandas ExcelWriter Usage with Examples
- pandas write CSV file
- Pandas Read SQL Query or Table with Examples
- Pandas Read TSV with Examples
- Pandas Read Text with Examples
- Pandas read_csv() with Examples
- Pandas Read JSON File with Examples
- How to Read CSV from String in Pandas
References
- https://docs.microsoft.com/en-us/deployoffice/compat/office-file-format-reference
- https://en.wikipedia.org/wiki/List_of_Microsoft_Office_filename_extensions
Thanks, Sudhakar for pointing out. I have corrected it now.
In 5.
schedule_df = df.get(‘Schedule’)
The above line should be
schedule_df = dict_df.get(‘Schedule’)