Pandas Read Multiple CSV Files into DataFrame

  • Post category: Pandas
  • Post last modified: December 6, 2023

Sometimes you may need to read or import multiple CSV files from a folder or from a list of files and convert them into a Pandas DataFrame. You can do this by reading each CSV file into its own DataFrame and then appending or concatenating the DataFrames to create a single DataFrame with data from all files.

Here, I will use read_csv() to read the CSV files and the concat() function to concatenate the resulting DataFrames into one big DataFrame.

1. Read Multiple CSV Files from the List

When you want to read multiple CSV files that exist in different folders, first create a list of strings with the absolute paths, then use it as shown below to load all the CSV files and create one big Pandas DataFrame.


# Read CSV files from a list of paths
import pandas as pd

df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv', 'd3.csv']))

Note that by default, concat() performs an append operation: it stacks each DataFrame below the previous one and returns a single DataFrame, similar to a SQL UNION ALL.
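A small sketch of this append behavior with in-memory DataFrames (d1 and d2 here are stand-ins for the contents of the CSV files):

```python
import pandas as pd

# Two DataFrames standing in for the contents of d1.csv and d2.csv
d1 = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
d2 = pd.DataFrame({"id": [3, 4], "value": ["c", "d"]})

# Default concat keeps each frame's original index (0, 1, 0, 1)
stacked = pd.concat([d1, d2])
print(list(stacked.index))  # [0, 1, 0, 1]

# Pass ignore_index=True to get a clean 0..n-1 index instead
clean = pd.concat([d1, d2], ignore_index=True)
print(list(clean.index))    # [0, 1, 2, 3]
```

Because the default keeps the original row labels, positional lookups like df.loc[0] can return multiple rows after concatenation, which is why ignore_index=True is used in the later examples.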

2. Read Multiple CSV Files from a Folder

Unfortunately, read_csv() doesn’t support reading multiple CSV files from a folder into a DataFrame. A future pandas version might add such support, but until then we have to use workarounds: read each CSV file from the folder separately and merge the results into one DataFrame.


# Import libraries
import glob
import pandas as pd

# Get the list of CSV files in a folder
path = '/apps/data_csv_files'
csv_files = glob.glob(path + "/*.csv")

# Read each CSV file into a DataFrame
# (a generator expression, so the files are read lazily)
df_list = (pd.read_csv(file) for file in csv_files)

# Concatenate all DataFrames
big_df = pd.concat(df_list, ignore_index=True)
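Here is a self-contained version of the same steps that you can run end to end; it writes two small CSV files to a temporary folder first (the file names and columns are made up for illustration):

```python
import glob
import os
import tempfile

import pandas as pd

# Create a temporary folder with two sample CSV files
tmp = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "val": [10, 20]}).to_csv(
    os.path.join(tmp, "d1.csv"), index=False)
pd.DataFrame({"id": [3, 4], "val": [30, 40]}).to_csv(
    os.path.join(tmp, "d2.csv"), index=False)

# Same pattern as above: glob the folder, read each file, concatenate
csv_files = sorted(glob.glob(tmp + "/*.csv"))
df_list = (pd.read_csv(f) for f in csv_files)
big_df = pd.concat(df_list, ignore_index=True)

print(big_df.shape)  # (4, 2)
```

Note the sorted() call: glob() does not guarantee file order, so sorting keeps the row order reproducible across runs.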

An alternative approach uses the map() function:


# Approach using map() function.
df = pd.concat(map(pd.read_csv, glob.glob(path + "/*.csv")))

If you want to pass optional parameters to read_csv(), define a small wrapper function and map that instead.


# Wrap read_csv() to pass extra parameters
def readcsv(path):
    return pd.read_csv(path, header=None)

df = pd.concat(map(readcsv, csv_files))
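If you only need to bind a few keyword arguments, functools.partial achieves the same result without defining a wrapper. This is an alternative sketch, not from the original article; the StringIO buffers stand in for file paths purely for illustration:

```python
from functools import partial
from io import StringIO

import pandas as pd

# partial binds header=None to read_csv, like the readcsv() wrapper above
read_headerless = partial(pd.read_csv, header=None)

# In-memory buffers standing in for headerless CSV files
files = [StringIO("1,a\n2,b"), StringIO("3,c\n4,d")]
df = pd.concat(map(read_headerless, files), ignore_index=True)
print(df.shape)  # (4, 2)
```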

3. Using Dask DataFrames

Dask DataFrames implement a subset of the pandas DataFrame API. If all the data fits into memory, you can call df.compute() to convert the Dask DataFrame into a regular pandas DataFrame.

The Dask library can read a DataFrame from multiple files in a single call. Before you use it, first install it using pip or any other approach.


# Using the Dask library
import dask.dataframe as dd

df = dd.read_csv(path + "/*.csv")
# df is a lazy Dask DataFrame; call df.compute() to get a pandas DataFrame

Frequently Asked Questions on Pandas Read Multiple CSV Files into DataFrame

How can I read multiple CSV files into a single DataFrame in Pandas?

To read multiple CSV files into a single DataFrame, you can use a loop or list comprehension along with the pd.concat() function. For example, df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

How can I handle different column names or orders in the CSV files?

You can specify a common set of columns or handle the columns dynamically. For example:

common_columns = ['column1', 'column2', 'column3']
dfs = [pd.read_csv(f)[common_columns] for f in files]
df = pd.concat(dfs, ignore_index=True)
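A quick sketch showing that selecting a common column list also normalizes a differing column order across files (the column names and data here are made up; StringIO buffers stand in for the files):

```python
from io import StringIO

import pandas as pd

common_columns = ["column1", "column2"]

# Two "files" with the same columns in a different order
f1 = StringIO("column1,column2\n1,a\n2,b")
f2 = StringIO("column2,column1\nc,3\nd,4")

# Indexing with the common column list reorders f2's columns to match
dfs = [pd.read_csv(f)[common_columns] for f in [f1, f2]]
df = pd.concat(dfs, ignore_index=True)
print(list(df["column1"]))  # [1, 2, 3, 4]
```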

What if the CSV files have different delimiters or separators?

Use the sep parameter in pd.read_csv() to specify the delimiter. For example, df = pd.concat((pd.read_csv(f, sep=';') for f in files), ignore_index=True)

How can I read only specific columns from the CSV files?

You can use the usecols parameter in pd.read_csv() to specify the columns you want to include. For example, df = pd.concat((pd.read_csv(f, usecols=['column1', 'column2']) for f in files), ignore_index=True)

Conclusion

In this article, you have learned multiple ways of reading CSV files from a folder and creating one big DataFrame. Since the read_csv() function doesn’t support reading multiple files at once, you have to load each CSV into a separate DataFrame and combine them using the concat() function.

Happy Learning !!

Naveen (NNK)

Naveen (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen’s journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them. Follow Naveen @ LinkedIn and Medium
