Pandas Read Multiple CSV Files into DataFrame

Sometimes you need to read multiple CSV files, either from a folder or from a list of file paths, into a single pandas DataFrame. You can do this by reading each CSV file into its own DataFrame and then concatenating those DataFrames into one that holds the data from all files.

Here, I will use read_csv() to read the CSV files and the concat() function to concatenate the DataFrames into one big DataFrame.

1. Read Multiple CSV Files from List

When you want to read multiple CSV files that live in different folders, first create a list of path strings (absolute paths are the safest choice) and pass it as shown below to load all the CSV files into one big pandas DataFrame.


# Read CSV files from a list of paths
import pandas as pd

df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv', 'd3.csv']))

Note that by default concat() performs an append operation: it stacks each DataFrame below the previous one and returns a single DataFrame. Duplicate rows are kept, so the result is similar to SQL UNION ALL.
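
To make the append behavior concrete, here is a minimal sketch that uses in-memory CSV text (via io.StringIO) in place of real files; the column names and values are made up for illustration. It suggests that concat() keeps duplicate rows and each frame's original index labels unless ignore_index=True is passed.

```python
import io

import pandas as pd

# Two small in-memory "files" (hypothetical data) with the same columns
csv_a = io.StringIO("id,name\n1,a\n2,b\n")
csv_b = io.StringIO("id,name\n2,b\n3,c\n")

frames = [pd.read_csv(csv_a), pd.read_csv(csv_b)]

# Default: rows are stacked and each frame keeps its own index labels
stacked = pd.concat(frames)

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat(frames, ignore_index=True)

# Duplicate rows are kept by concat(); drop them explicitly if needed
deduped = renumbered.drop_duplicates().reset_index(drop=True)
```

If the files can overlap, an explicit drop_duplicates() call like the one above is one way to get the deduplicating behavior that concat() does not provide on its own.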

2. Read Multiple CSV Files from a Folder

Unfortunately, read_csv() doesn't support reading multiple CSV files from a folder into a DataFrame. A future pandas version might add this, but until then we have to use a workaround: collect the file paths, read each file, and combine the results into one DataFrame.


# Import libraries
import glob
import pandas as pd

# Get CSV files list from a folder
path = '/apps/data_csv_files'
csv_files = glob.glob(path + "/*.csv")

# Read each CSV file into a DataFrame
# This is a generator expression: each file is read lazily when concat() consumes it
df_list = (pd.read_csv(file) for file in csv_files)

# Concatenate all DataFrames
big_df = pd.concat(df_list, ignore_index=True)

An alternative approach uses the map() function.


df = pd.concat(map(pd.read_csv, glob.glob(path + "/*.csv")))
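
One detail worth keeping in mind: glob.glob() returns paths in arbitrary order, so the row order of the combined DataFrame can differ between machines. The sketch below sorts the paths before reading them. It builds a throwaway folder with tempfile so it can run anywhere; the file names and values are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway folder with two small CSV files (hypothetical names and data)
tmp_dir = tempfile.mkdtemp()
for name, value in [("b.csv", 2), ("a.csv", 1)]:
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write(f"value\n{value}\n")

# Sort the paths so the files are always read in the same order
csv_paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Read and combine, renumbering the rows of the result
ordered_df = pd.concat(map(pd.read_csv, csv_paths), ignore_index=True)
```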

If you want to pass optional parameters to read_csv(), wrap the call in your own function and map that function over the file paths.


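# By using a wrapper function
def readcsv(filepath):
    # header=None tells pandas the files have no header row
    return pd.read_csv(filepath, header=None)

# csv_files is the list of paths built with glob.glob() above
df = pd.concat(map(readcsv, csv_files))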
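
As an alternative to writing a wrapper function, functools.partial from the standard library can fix the optional parameters up front. This is only a sketch of that variation, not part of the approach above; it uses in-memory text instead of real files, and the data is made up.

```python
import functools
import io

import pandas as pd

# In-memory stand-ins for header-less CSV files (hypothetical data)
sources = [io.StringIO("1,a\n2,b\n"), io.StringIO("3,c\n")]

# partial() fixes header=None so map() can call the reader with one argument
read_no_header = functools.partial(pd.read_csv, header=None)

no_header_df = pd.concat(map(read_no_header, sources), ignore_index=True)
```

Because header=None is used, pandas labels the columns with integers (0, 1, ...) instead of taking names from the first row.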

3. Using Dask DataFrames

Dask DataFrames implement a subset of the pandas DataFrame API. If all the data fits into memory, you can call df.compute() to convert a Dask DataFrame into a pandas DataFrame.

The Dask library can read a DataFrame from multiple files in a single call. Before you use Dask, you need to install it, for example with pip.


# Using the Dask library
import dask.dataframe as dd

# Returns a Dask DataFrame; call df.compute() to get a pandas DataFrame
df = dd.read_csv(path + "/*.csv")

Conclusion

In this article, you have learned multiple ways of reading CSV files from a folder and creating one big DataFrame. Since the read_csv() function doesn't support reading multiple files at once, you have to load each CSV into a separate DataFrame and combine them using the concat() function.

Happy Learning !!


NNK
