Pandas Read Excel with Examples

pandas.read_excel() function is used to read excel sheet with extension xlsx into pandas DataFrame. By reading a single sheet it returns a pandas DataFrame object, but reading two sheets it returns a Dict of DataFrame.

pandas Read Excel Key Points

  • This supports to read files with extension xls, xlsx, xlsm, xlsb, odf, ods and odt 
  • Can load excel files stored in a local filesystem or from an URL.
  • For URL, it supports http, ftp, s3, and file.
  • Also supports reading from a single sheet or a list of sheets.
  • When reading a two sheets, it returns a Dict of DataFrame.

Table of contents –

I have an excel file with two sheets named Technologies and Schedule, I will be using this to demonstrate how to read into pandas DataFrame.

Notice that on our excel file the top row contains the header of the table which can be used as column names on DataFrame.

1. pandas Read Excel Sheet

Use pandas.read_excel() function to read excel sheet into pandas DataFrame, by default it loads the first sheet from the excel file and parses the first row as a DataFrame column name. Excel file has an extension .xlsx. This function also supports several extensions xls, xlsx, xlsm, xlsb, odf, ods and odt .

Following are some of the features supported by read_excel() with optional param.

  • Reading excel file from URL, S3, and from local file ad supports several extensions.
  • Ignoreing the column names and provides an option to set column names.
  • Setting column as Index
  • Considering multiple values as NaN
  • Decimal points to use for numbers
  • Data types for each column
  • Skipping rows and columns

I will cover how to use some of these optional params with examples, first let’s see how to read an excel sheet & create a DataFrame without any params.


import pandas as pd
# Read Excel file
df = pd.read_excel('c:/apps/courses_schedule.xlsx')
print(df)

# Outputs
#   Courses    Fee  Duration  Discount
#0    Spark  25000   50 Days      2000
#1   Pandas  20000   35 Days      1000
#2     Java  15000       NaN       800
#3   Python  15000   30 Days       500
#4      PHP  18000   30 Days       800

Related: pandas Write to Excel Sheet

2. Read by Ignoring Column Names

By default, it considers the first row from excel as a header and used it as DataFrame column names. In case you wanted to consider the first row from excel as a data record use header=None param and use names param to specify the column names. Not specifying names result in column names with numerical numbers.


# Read excel by considering first row as data
columns = ['courses','course_fee','course_duration','course_discount']
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   header=None, names = columns)
print(df2)

# Outputs        
#0  courses  course_fee  Duration  Discount
#1    Spark  25000   50 Days      2000
#2   Pandas  20000   35 Days      1000
#3     Java  15000       NaN       800
#4   Python  15000   30 Days       500
#5      PHP  18000   30 Days       800

3. Set Column from Excel as Index

If you notice, the DataFrame was created with the default index, if you wanted to set the column name as index use index_col param. This param takes values {int, list of int, default None}. If a list is passed with header positions, it creates a MultiIndex.

By default, it is set to None meaning not column is set as an index.


# Read excel by setting column as index
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   index_col=0)
print(df2)

# Outputs
#           Fee Duration  Discount
#Courses                          
#Spark    25000  50 Days      2000
#Pandas   20000  35 Days      1000
#Java     15000      NaN       800
#Python   15000  30 Days       500
#PHP      18000  30 Days       800

4. Read Excel by Sheet Name

As I said in the above section by default pandas read the first sheet from the excel file and provide a sheet_name param to read a specific sheet by name. This param takes {str, int, list, or None} as values. This is also used to load a sheet by position.

By default, it is set to 0 meaning load the first sheet.


# Read specific excel sheet
df = pd.read_excel('records.xlsx', sheet_name='Sheet1')
print(df)

5. Read Two Sheets

sheet_name param also takes a list of sheet names as values that can be used to read two sheets into pandas DataFrame. Not that while reading two sheets it returns a Dict of DataFrame. The key in Dict is a sheet name and the value would be DataFrame.

Use None to load all sheets from excel and returns a Dict of Dictionary.


# Read Multiple sheets
dict_df = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   sheet_name=['Technologies','Schedule'])

# Get DataFrame from Dict
technologies_df = dict_df .get('Technologies')
schedule_df = df.get('Schedule')

# Print DataFrame's
print(technologies_df)
print(schedule_df)

I will leave this to you to execute and validate the output.

6. Skip Columns From Excel Sheet

Sometimes while reading an excel sheet into pandas DataFrame you may need to skip columns, you can do this by using usecols param. This takes values {int, str, list-like, or callable default None}. To specify the list of column names or positions use a list of strings or a list of int.

By default it is set to None meaning load all columns


# Read excel by skipping columns
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   usecols=['Courses', 'Duration'])
print(df2)
# Outputs
#  Courses Duration
#0   Spark  50 Days
#1  Pandas  35 Days
#2    Java      NaN
#3  Python  30 Days
#4     PHP  30 Days

Alternatively, you can also write it by column position.


# Skip columns with list of values
df = pd.read_excel('records.xlsx', usecols=[0,2])
print(df)

Also supports a range of columns as value. For example, value ‘B:D’ means parsing B, C, and D columns.


# Skip columns by range
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   usecols='B:D')
print(df2)

     Fee Duration  Discount
0  25000  50 Days      2000
1  20000  35 Days      1000
2  15000      NaN       800
3  15000  30 Days       500
4  18000  30 Days       800

7. Skip Rows from Excel Sheet

Use skiprows param to skip rows from the excel file, this param takes values {list-like, int, or callable, optional}. With this, you can skip the first few rows, selected rows, and range of rows. The below example skips the first 3 rows and considers the 4th row from excel as the header.


# Read excel file by skipping rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=2)
print(df2)

   Pandas  20000  35 Days  1000
0    Java  15000      NaN   800
1  Python  15000  30 Days   500
2     PHP  18000  30 Days   800

Use header=None to consider the 4th row as data. you can also use a list of rows to skip.


# Using skiprows to skip rows
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=[1,3])
print(df2)

  Courses    Fee Duration  Discount
0  Pandas  20000  35 Days      1000
1  Python  15000  30 Days       500
2     PHP  18000  30 Days       800

By using a lambda expression.


# Using skiprows with lambda
df2 = pd.read_excel('c:/apps/courses_schedule.xlsx', 
                   skiprows=lambda x: x in [1,3])
print(df2)

8. Other Important Params

  • dtype – Dict with column name an type.
  • nrows – How many rows to parse.
  • na_values – Additional strings to recognize as NA/NaN. 
  • keep_default_na – Whether or not to include the default NaN values when parsing the data. 
  • na_filter – Filters missing values.
  • parse_dates – Specify the column index you wanted to parse as dates
  • thousands – Thousands separator for parsing string columns to numeric.
  • skipfooter – Specify how to rows you wanted to skip from the footer.
  • mangle_dupe_cols – Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, 

For complete params and description, refer to pandas documentation.

Conclusion

In this article, you have learned how to read an Excel sheet and covert it into DataFrame by ignoring header, skipping rows, skipping columns, specifying column names, and many more.

Happy Learning !!

You May Also Like

References

NNK

SparkByExamples.com is a Big Data and Spark examples community page, all examples are simple and easy to understand and well tested in our development environment Read more ..

Leave a Reply

You are currently viewing Pandas Read Excel with Examples