Pandas read_csv() with Examples

Use the pandas read_csv() function to read a CSV (comma-separated) file into a pandas DataFrame; it also supports options for reading any delimited file. In this pandas article, I will explain how to read a CSV file with or without a header, skip rows, load only selected columns, set columns as the index, and more, with examples.

CSV files are plain text files used to store 2-dimensional data in a simple human-readable format. It is a format commonly used in industry to exchange large batch files between organizations. In some cases, these files are also used to store metadata.

Related: pandas Write to CSV File

1. read_csv() Syntax

Following is the syntax of the read_csv() function.

# Syntax of read_csv()
pandas.read_csv(filepath_or_buffer, sep=NoDefault.no_default, delimiter=None, header='infer', names=NoDefault.no_default, index_col=None, usecols=None, squeeze=None, prefix=NoDefault.no_default, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)

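As you can see above, it takes several optional parameters to support reading CSV files with different options. When you are dealing with huge files, some of these params help you load the CSV file faster. Note that this signature matches older pandas 1.x releases; some of these parameters (for example squeeze, prefix, mangle_dupe_cols, error_bad_lines, and warn_bad_lines) are no longer available in pandas 2.0 and later. In this article, I will explain the usage of some of these options with examples.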
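As one illustration of handling large files, the sketch below reads a file in chunks using the chunksize and usecols params from the signature above. It assumes a small inline CSV (csv_text, a stand-in with the same shape as the sample file used later in this article) instead of a real file on disk.

```python
import io
import pandas as pd

# Inline stand-in for a real CSV file (assumed contents, for illustration only)
csv_text = """Courses,Fee,Duration,Discount
Spark,25000,50 Days,2000
Pandas,20000,35 Days,1000
Java,15000,,800
Python,15000,30 Days,500
PHP,18000,30 Days,800
"""

# With chunksize, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once
chunk_sizes = []
total_fee = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=['Fee'], chunksize=2):
    chunk_sizes.append(len(chunk))
    total_fee += int(chunk['Fee'].sum())

print(chunk_sizes, total_fee)
# [2, 2, 1] 93000
```

Loading only the columns you need and processing them chunk by chunk keeps memory usage bounded by the chunk size rather than the file size.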

2. pandas Read CSV into DataFrame

To read a comma-delimited CSV file use pandas.read_csv(), and to read a tab-delimited (\t) file use read_table(). Besides these, you can also read files that use a pipe or any other custom separator.

Figure: Comma-delimited CSV file

I will use the data above to demonstrate reading a CSV file; you can find the data file on GitHub.

# Import pandas
import pandas as pd

# Read CSV file into DataFrame
df = pd.read_csv('courses.csv')

# Output:
#  Courses    Fee Duration  Discount
# 0   Spark  25000  50 Days      2000
# 1  Pandas  20000  35 Days      1000
# 2    Java  15000      NaN       800
# 3  Python  15000  30 Days       500
# 4     PHP  18000  30 Days       800

By default, it reads the first row of the CSV as column names (header) and creates an incremental numeric index starting from zero.

Use sep or delimiter to specify the column separator. By default, it uses a comma.
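The following is a minimal sketch of reading pipe-delimited and tab-delimited data. The inline strings pipe_text and tab_text are made-up stand-ins for real files, used only so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited data standing in for a real file
pipe_text = (
    "Courses|Fee|Duration|Discount\n"
    "Spark|25000|50 Days|2000\n"
    "Pandas|20000|35 Days|1000\n"
)

# Custom separator: pass it through sep
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep='|')

# Same data, tab-delimited
tab_text = pipe_text.replace('|', '\t')

# read_table() uses a tab as its default separator
df_tab = pd.read_table(io.StringIO(tab_text))

# Equivalent call with read_csv()
df_tab_csv = pd.read_csv(io.StringIO(tab_text), sep='\t')

print(df_pipe.equals(df_tab), df_tab.equals(df_tab_csv))
# True True
```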

3. Set Column as Index

You can set a column as the index using the index_col param. This param accepts an int, a str, a sequence of int/str, or False; it is optional and defaults to None.

# Set column as Index
df = pd.read_csv('courses.csv', index_col='Courses')

# Output:
#           Fee Duration  Discount
# Courses                          
# Spark    25000  50 Days      2000
# Pandas   20000  35 Days      1000
# Java     15000      NaN       800
# Python   15000  30 Days       500
# PHP      18000  30 Days       800

Alternatively, you can specify the column by its index/position instead of its name. When given a list of values, it creates a MultiIndex.
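Both variants can be sketched as follows, assuming an inline copy of the sample data (csv_text) in place of the courses.csv file.

```python
import io
import pandas as pd

# Inline stand-in for courses.csv (assumed contents)
csv_text = """Courses,Fee,Duration,Discount
Spark,25000,50 Days,2000
Pandas,20000,35 Days,1000
Java,15000,,800
Python,15000,30 Days,500
PHP,18000,30 Days,800
"""

# Select the index column by position (0 = first column)
df_pos = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_pos.index.name)
# Courses

# A list of columns produces a MultiIndex
df_multi = pd.read_csv(io.StringIO(csv_text), index_col=['Courses', 'Fee'])
print(list(df_multi.index.names))
# ['Courses', 'Fee']
```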

4. Skip Rows

Sometimes you may need to skip the first few rows or the footer rows; use the skiprows and skipfooter params respectively.

# Skip first few rows
df = pd.read_csv('courses.csv', header=None, skiprows=2)

# Output:
#        0      1        2     3
# 0  Pandas  20000  35 Days  1000
# 1    Java  15000      NaN   800
# 2  Python  15000  30 Days   500
# 3     PHP  18000  30 Days   800

The skiprows param also takes a list of row numbers to skip.
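Here is a short sketch of both options, assuming an inline copy of the sample data (csv_text) instead of the courses.csv file. Row numbers in the skiprows list are 0-based line numbers in the file, so 0 is the header line. The engine='python' argument is passed because the default C engine does not support skipfooter.

```python
import io
import pandas as pd

# Inline stand-in for courses.csv (assumed contents)
csv_text = """Courses,Fee,Duration,Discount
Spark,25000,50 Days,2000
Pandas,20000,35 Days,1000
Java,15000,,800
Python,15000,30 Days,500
PHP,18000,30 Days,800
"""

# Skip specific lines: line 1 (Spark) and line 3 (Java); the header (line 0) is kept
df_list = pd.read_csv(io.StringIO(csv_text), skiprows=[1, 3])
print(df_list['Courses'].tolist())
# ['Pandas', 'Python', 'PHP']

# Skip the last two lines of the file
df_footer = pd.read_csv(io.StringIO(csv_text), skipfooter=2, engine='python')
print(df_footer['Courses'].tolist())
# ['Spark', 'Pandas', 'Java']
```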

5. Read CSV by Ignoring Column Names

By default, it treats the first row of the CSV file as a header and uses it for the DataFrame column names. If you want to treat the first row as a data record, use the header=None param and use the names param to specify the column names. Not specifying names results in numeric column names. In the example below, skiprows=1 drops the file's original header row so that it is not loaded as a data record.

# Ignore header and assign new columns
columns = ['courses','course_fee','course_duration','course_discount']
df = pd.read_csv('courses.csv', header=None, names=columns, skiprows=1)

# Output:
#  courses  course_fee course_duration  course_discount
# 0   Spark       25000         50 Days             2000
# 1  Pandas       20000         35 Days             1000
# 2    Java       15000             NaN              800
# 3  Python       15000         30 Days              500
# 4     PHP       18000         30 Days              800

6. Load only Selected Columns

Using the usecols param, you can select which columns to load from the CSV file. It takes a list of column names (strings) or a list of column positions (ints).

# Load only selected columns
df = pd.read_csv('courses.csv', usecols=['Courses', 'Fee', 'Discount'])

# Output:
#  Courses    Fee  Discount
# 0   Spark  25000      2000
# 1  Pandas  20000      1000
# 2    Java  15000       800
# 3  Python  15000       500
# 4     PHP  18000       800
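The same selection can be made by column position. The sketch below assumes an inline copy of the sample data (csv_text) in place of the courses.csv file.

```python
import io
import pandas as pd

# Inline stand-in for courses.csv (assumed contents)
csv_text = """Courses,Fee,Duration,Discount
Spark,25000,50 Days,2000
Pandas,20000,35 Days,1000
Java,15000,,800
Python,15000,30 Days,500
PHP,18000,30 Days,800
"""

# Select columns by position: 0 = Courses, 1 = Fee, 3 = Discount
df_cols = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1, 3])
print(list(df_cols.columns))
# ['Courses', 'Fee', 'Discount']

# The order of the usecols list does not change the column order of the result
df_order = pd.read_csv(io.StringIO(csv_text), usecols=[3, 0])
print(list(df_order.columns))
# ['Courses', 'Discount']
```

If you need the columns in a specific order, reorder them after loading, for example with df_order[['Discount', 'Courses']].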

7. Set DataTypes to Columns

By default, read_csv() infers the data type that best fits each column. For example, Fee and Discount are assigned int64, while Courses and Duration are loaded as object (string) columns.

Let’s load the Courses column as string and the Fee column as float.

# Set column data types
df = pd.read_csv('courses.csv', dtype={'Courses':'string','Fee':'float'})
print(df.dtypes)

# Output:
# Courses      string
# Fee         float64
# Duration     object
# Discount      int64
# dtype: object

8. Other Params of pandas read_csv()

  • nrows – Number of rows to read.
  • true_values – Values to consider as True.
  • false_values – Values to consider as False.
  • mangle_dupe_cols – Duplicate columns are named 'X', 'X.1', … 'X.N' rather than 'X' … 'X'.
  • converters – Dict of functions for converting values in specific columns.
  • skipinitialspace – Skips spaces after the separator (similar to a left trim of each value).
  • na_values – Additional values to consider as NaN/NA.
  • keep_default_na – Whether to include the default NaN values when parsing the data.
  • na_filter – Detect missing values. Set this to False to improve performance when the file has no missing values.
  • skip_blank_lines – Skip empty lines without data.
  • parse_dates – Specify which columns to parse as dates.
  • thousands – Thousands separator.
  • decimal – Character for the decimal point.
  • lineterminator – Line separator.
  • quotechar – Character used to quote values, so that a value can contain the delimiter.

Besides these, there are many more optional params; refer to the pandas documentation for details.
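Several of the params listed above can be combined in a single call. The sketch below is only an illustration: the inline data (extra_text), its Start and Active columns, and the 'missing' marker are made up for this example and are not part of the courses.csv file.

```python
import io
import pandas as pd

# Hypothetical data: quoted fees with thousands separators, dates,
# yes/no flags, and a custom marker for missing values
extra_text = """Courses,Fee,Start,Active,Discount
Spark,"25,000",2021-10-01,yes,2000
Pandas,"20,000",2021-11-15,no,missing
Java,"15,000",2022-01-10,yes,800
"""

df_extra = pd.read_csv(
    io.StringIO(extra_text),
    nrows=2,                            # read only the first two data rows
    thousands=',',                      # parse "25,000" as 25000
    parse_dates=['Start'],              # parse the Start column as dates
    true_values=['yes'],                # values treated as True
    false_values=['no'],                # values treated as False
    na_values=['missing'],              # extra value treated as NaN
    converters={'Courses': str.upper},  # function applied to each Courses value
)

print(df_extra['Courses'].tolist())  # ['SPARK', 'PANDAS']
print(df_extra['Fee'].tolist())      # [25000, 20000]
print(df_extra['Active'].tolist())   # [True, False]
```

The fees are quoted in the data so that the comma inside each value is not treated as a column separator; this is the job of quotechar, which defaults to a double quote.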

9. Conclusion

In this article, you have learned what a CSV file is and how to load it into a pandas DataFrame. You have also learned how to skip rows, select columns, ignore the header, and more, with examples.


Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.
