Sometimes you may need to read or import multiple CSV files from a folder or from a list of files and convert them into pandas DataFrame. You can do this by reading each CSV file into DataFrame and appending or concatenating the DataFrames to create a single DataFrame with data from all files.
Here, I will use read_csv() to read CSV files and concat() function to concatenate DataFrams together to create one big DataFrame.
1. Read Multiple CSV Files from List
When you wanted to read multiple CSV files that exist in different folders, first create a list of strings with absolute paths and use it as shown below to load all CSV files and create one big pandas DataFrame.
# Read CSV files from List df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))
Note that by default concat() method performs append operation meaning, it appends each DataFrame at the end of another DataFrame and creates a single DataFrame. Similar to SQL union.
2. Read Multiple CSV Files from a Folder
Unfortunately, read_csv() doesn’t support reading multiple CSV files from a folder into DataFrame, maybe in future pandas versions, it might support it, until then we have to use workarounds to read multiple CSV files from a folder and merge them into DataFrame.
# Import libraries import glob import pandas as pd # Get CSV files list from a folder path = '/apps/data_csv_files csv_files = glob.glob(path + "/*.csv") # Read each CSV file into DataFrame # This creates a list of dataframes df_list = (pd.read_csv(file) for file in csv_files) # Concatenate all DataFrames big_df = pd.concat(df_list, ignore_index=True)
An alternate approach using map() function.
df = pd.concat(map(pd.read_csv, glob.glob(path + "/*.csv")))
In case you want to use optional params of read_csv() function use it by defining function.
# By using function def readcsv(args): return pd.read_csv(args, header=None) df = pd.concat(map(readcsv, filepaths))
3. Using Dask DataFrames
The Dask Dataframes implement a subset of the Pandas dataframe API. If all the data fits into memory, you can call
df.compute() to convert the DataFrame into a Pandas DataFrame.
The Dask library can be used to read a data frame from multiple files. Before you use Dask library, first you need to install it using pip command or any other approach.
# Using data library import dask.dataframe as dd df = dd.read_csv(path + "/*.csv")
In this article, you have learned multiple ways of reading CSV files from a folder and creating one big DataFrame. Since read_csv() function doesn’t support reading you have to use loading each CSV into a separate DataFrame and combining them using concat() function.
Happy Learning !!
- How to Read Excel Multiple Sheets in Pandas
- Pandas Drop Index Column Explained
- Pandas Groupby Aggregate Explained
- Convert Pandas DataFrame to Series
- Pandas Read Text with Examples
- Pandas Read JSON File with Examples
- Pandas ExcelWriter Explained with Examples
- Pandas Read Excel with Examples
- Pandas Set Column as Index in DataFrame