Like any other data structure, a pandas DataFrame can be iterated over row by row, giving access to the columns/elements of each row. DataFrame provides the methods iterrows() and itertuples() to iterate over each row.
Key Points –
- Iterating over rows is generally slow in pandas because each row is processed in Python rather than in optimized array code. Vectorized operations are usually preferred for better performance.
- iterrows() lets you iterate over DataFrame rows as (index, Series) pairs. Each row is returned as a pandas Series.
- With iterrows(), the data type of elements might change because each row is returned as a single Series, which holds one dtype for all of its values.
- itertuples() provides a faster way to iterate over rows, returning a named tuple with the values of each row, which is more efficient than iterrows().
- For row-wise operations, apply() with axis=1 is often more convenient than an explicit loop, but it still calls the function once per row, so it is not truly vectorized.
Related: 10 Ways to Select Pandas Rows based on DataFrame Column Values
Using DataFrame.iterrows() to Iterate Over Rows
Pandas DataFrame.iterrows() is used to iterate over DataFrame rows. It yields (index, Series) pairs, where index is the index label of the row and the Series holds the data of that row. To get a value from the Series, use the column name, for example row["Fee"]. To learn more about Series, see How to use Series with Examples.
First, let’s create a DataFrame.
import pandas as pd
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Oracle", "Java"],
    'Fee': [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '50days', '55days']
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
This creates a DataFrame with 3 columns: Courses, Fee, and Duration.
The example below iterates over all rows in a DataFrame using iterrows().
# Iterate all rows
# Using DataFrame.iterrows()
print("After iterating all rows:\n")
for index, row in df.iterrows():
    print(index, row["Fee"], row["Courses"], row["Duration"])
Let’s see what a row looks like by printing it.
# Row contains the column name and data
row = next(df.iterrows())[1]
print("Data For First Row :")
print(row)
Yields below output.
# Output:
Data For First Row :
Courses Spark
Fee 20000
Duration 30day
Name: 0, dtype: object
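Note that iterrows() does not preserve data types across a row. Because each row is returned as a single Series, its values are coerced to a common dtype (object in the output above), so the dtype of the Series does not tell you the dtype of the original columns. To check the type of an individual value, use type(row["Fee"]). If you need to preserve the column data types while iterating, use DataFrame.itertuples().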
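As a small sketch of the dtype behavior described above, the snippet below uses a hypothetical two-column DataFrame (the column names qty and price are made up for illustration) and compares what iterrows() and itertuples() return for the same row.

```python
import pandas as pd

# Hypothetical data: one integer column and one float column
df_types = pd.DataFrame({"qty": [1, 2], "price": [1.5, 2.5]})

# iterrows() packs each row into one Series, so the integer is upcast to float
_, row = next(df_types.iterrows())
print(row["qty"], row.dtype)

# The column itself still has an integer dtype
print(df_types["qty"].dtype)

# itertuples() keeps the values as they are stored in each column
tup = next(df_types.itertuples(index=False))
print(tup.qty, tup.price)
```

In the Series from iterrows() the quantity comes back as a float, while the named tuple from itertuples() still carries it as an integer.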
Note: The pandas documentation states: “You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.”
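To illustrate this note, here is a minimal sketch that assumes a small DataFrame shaped like the one in this article. Assigning to the row inside the loop changes only the temporary Series. Writing through the DataFrame itself (here with a whole-column assignment into a made-up Discounted column) is what actually takes effect.

```python
import pandas as pd

df_fees = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
})

# Writing to the row from iterrows() does not update the DataFrame here,
# because the row is a copy built from columns of mixed types
for index, row in df_fees.iterrows():
    row["Fee"] = row["Fee"] - 1000

print(df_fees["Fee"].tolist())  # the original fees are unchanged

# Assign through the DataFrame instead; "Discounted" is a hypothetical column
df_fees["Discounted"] = df_fees["Fee"] - 1000
print(df_fees)
```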
Using DataFrame.itertuples() to Iterate Over Rows
Pandas DataFrame.itertuples() returns an iterator that yields a named tuple for each row of the DataFrame. itertuples() is faster than iterrows() and preserves the data types of the values.
Below is the syntax of itertuples().
# Syntax DataFrame.itertuples()
DataFrame.itertuples(index=True, name='Pandas')
- index – Defaults to True. Returns the DataFrame index as the first element of the tuple. Setting it to False omits the index.
- name – Defaults to 'Pandas'. You can provide a custom name for the returned named tuple.
The example below loops through all rows and gets the value of each column by using getattr().
# Iterate all rows
# Using DataFrame.itertuples()
for row in df.itertuples(index=True):
    print(getattr(row, 'Index'), getattr(row, "Fee"), getattr(row, "Courses"))
Yields below output.
# Output:
0 20000 Spark
1 25000 PySpark
2 26000 Hadoop
3 22000 Python
4 24000 Pandas
5 21000 Oracle
6 22000 Java
Let’s provide a custom name for the tuple.
# Display one row from iterator
row = next(df.itertuples(index = True,name='Tution'))
print(row)
Yields below output.
# Output:
Tution(Index=0, Courses='Spark', Fee=20000, Duration='30day')
If you set the index parameter to False, the index is not included as the first element of the tuple.
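The following sketch shows a few variations on the calls above, assuming the same kind of DataFrame: reading fields as attributes instead of through getattr(), dropping the index with index=False, and passing name=None to get plain tuples. The last part uses a made-up column name containing a space to show that such names are replaced by positional names.

```python
import pandas as pd

df_courses = pd.DataFrame({
    "Courses": ["Spark", "PySpark"],
    "Fee": [20000, 25000],
})

# Named tuple fields can be read as attributes
for row in df_courses.itertuples():
    print(row.Index, row.Fee, row.Courses)

# index=False leaves the index out of each tuple
first = next(df_courses.itertuples(index=False))
print(first)

# name=None returns regular tuples instead of named tuples
plain = list(df_courses.itertuples(index=False, name=None))
print(plain)

# A column name that is not a valid identifier gets a positional name
df_renamed = df_courses.rename(columns={"Fee": "Course Fee"})
renamed = next(df_renamed.itertuples())
print(renamed._fields)
```

Plain tuples are a reasonable choice when the field names are not needed, since no named tuple type has to be built.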
DataFrame.apply() to Iterate
You can also use the apply() method of the DataFrame to loop through the rows with a lambda function by passing axis=1. For more details, refer to DataFrame.apply().
# Syntax of DataFrame.apply()
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
Example:
# Another alternate approach by using DataFrame.apply()
print(df.apply(lambda row: str(row["Fee"]) + " " + str(row["Courses"]), axis = 1))
Yields below output.
# Output:
0 20000 Spark
1 25000 PySpark
2 26000 Hadoop
3 22000 Python
4 24000 Pandas
5 21000 Oracle
6 22000 Java
dtype: object
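For this particular example a loop over rows is not required at all. As a sketch, assuming the same Fee and Courses columns, the same strings can be built with column-wise operations:

```python
import pandas as pd

df_courses = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
})

# Row-wise version, as in the example above
row_wise = df_courses.apply(
    lambda row: str(row["Fee"]) + " " + str(row["Courses"]), axis=1
)

# Column-wise version: convert Fee to strings, then concatenate whole columns
column_wise = df_courses["Fee"].astype(str) + " " + df_courses["Courses"]

print(column_wise)
```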
Iterating using for & DataFrame.index
You can also loop through rows with a for loop over DataFrame.index. df['Fee'][0] returns the value of the column Fee for the row labeled 0, which is the first row here.
# Using DataFrame.index
for idx in df.index:
    print(df['Fee'][idx], df['Courses'][idx])
Yields below output.
# Output:
20000 Spark
25000 PySpark
26000 Hadoop
22000 Python
24000 Pandas
21000 Oracle
22000 Java
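df['Fee'][idx] first selects the column and then the row, in two steps. A common alternative is the DataFrame.at accessor, which looks up a single value by row label and column label in one step. A minimal sketch, assuming the same columns:

```python
import pandas as pd

df_courses = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
})

# Collect (Fee, Courses) pairs using label-based scalar access
pairs = []
for idx in df_courses.index:
    pairs.append((df_courses.at[idx, "Fee"], df_courses.at[idx, "Courses"]))

print(pairs)
```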
Using for & DataFrame.loc
# Another alternate approach
# By using DataFrame.loc[]
# Note: range(len(df)) matches the row labels only for the default integer index
for i in range(len(df)):
    print(df.loc[i, "Fee"], df.loc[i, "Courses"])
Yields the same output as above.
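The loop above works because the DataFrame uses the default integer index, so the positions 0, 1, 2, ... are also the row labels. The sketch below uses a hypothetical DataFrame indexed by made-up string labels to show that .loc then needs those labels rather than positions.

```python
import pandas as pd

df_labeled = pd.DataFrame(
    {"Courses": ["Spark", "PySpark"], "Fee": [20000, 25000]},
    index=["c1", "c2"],  # hypothetical string labels
)

# .loc is label-based, so the position 0 is not a valid key here
try:
    df_labeled.loc[0, "Fee"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("Lookup by position failed:", lookup_failed)

# Iterating over the index labels works for any kind of index
for label in df_labeled.index:
    print(label, df_labeled.loc[label, "Fee"], df_labeled.loc[label, "Courses"])
```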
Using For & DataFrame.iloc
# Another alternate approach
# By using DataFrame.iloc[] with integer positions
for i in range(len(df)):
    print(df.iloc[i, 0], df.iloc[i, 2])
Yields below output.
# Output:
Spark 30day
PySpark 40days
Hadoop 35days
Python 40days
Pandas 60days
Oracle 50days
Java 55days
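Hard-coded positions such as 0 and 2 break if the column order changes. One option, sketched here under the assumption that the column names are known, is to look up the positions with Index.get_loc() first:

```python
import pandas as pd

df_courses = pd.DataFrame({
    "Courses": ["Spark", "PySpark"],
    "Fee": [20000, 25000],
    "Duration": ["30day", "40days"],
})

# Resolve column positions from their names once, before the loop
course_pos = df_courses.columns.get_loc("Courses")
duration_pos = df_courses.columns.get_loc("Duration")

for i in range(len(df_courses)):
    print(df_courses.iloc[i, course_pos], df_courses.iloc[i, duration_pos])
```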
Using DataFrame.items() to Iterate Over Columns
DataFrame.items() is used to iterate over the columns (column by column) of a pandas DataFrame. It yields a tuple (column name, Series) for each column: the first value is the column label and the second is the content of that column as a Series.
# Iterate over column by column
# Using DataFrame.items()
for label, content in df.items():
    print(f'label: {label}')
    print(f'content: {content}', sep='\n')
Yields below output.
# Output:
label: Courses
content: 0 Spark
1 PySpark
2 Hadoop
3 Python
4 Pandas
5 Oracle
6 Java
Name: Courses, dtype: object
label: Fee
content: 0 20000
1 25000
2 26000
3 22000
4 24000
5 21000
6 22000
Name: Fee, dtype: int64
label: Duration
content: 0 30day
1 40days
2 35days
3 40days
4 60days
5 50days
6 55days
Name: Duration, dtype: object
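Since each item is a (label, Series) pair, per-column summaries are easy to collect. As a small sketch, this counts the distinct values in every column of the example data:

```python
import pandas as pd

df_courses = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    "Duration": ["30day", "40days", "35days", "40days", "60days", "50days", "55days"],
})

# Build a dictionary of column label -> number of distinct values
distinct_counts = {label: content.nunique() for label, content in df_courses.items()}
print(distinct_counts)
```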
Performance of Iterating DataFrame
Iterating over a DataFrame is not recommended, because performance is very poor on large datasets. Use it only when you have exhausted all other options. Before using the examples in this article, check whether you can use one of these instead: 1) vectorization, 2) Cython routines, 3) list comprehensions (vanilla for loop).
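As a rough illustration of the first and third options, the sketch below applies a flat discount of 1000 to Fee in three ways: with iterrows(), with a list comprehension, and with a vectorized expression. The discount is a made-up task for illustration. All three give the same values, but the vectorized form avoids the Python-level loop.

```python
import pandas as pd

df_fees = pd.DataFrame({"Fee": [20000, 25000, 26000, 22000]})

# 1) Row-by-row loop with iterrows()
looped = []
for _, row in df_fees.iterrows():
    looped.append(row["Fee"] - 1000)

# 2) List comprehension over the column values
comprehension = [fee - 1000 for fee in df_fees["Fee"]]

# 3) Vectorized arithmetic on the whole column
vectorized = df_fees["Fee"] - 1000

print(vectorized.tolist())
```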
Complete Example of pandas Iterate over Rows
import pandas as pd
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas", "Oracle", "Java"],
    'Fee': [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '50days', '55days']
}
df = pd.DataFrame(technologies)
print(df)
# Using DataFrame.iterrows()
row = next(df.iterrows())[1]
print("Data For First Row :")
print(row)
for index, row in df.iterrows():
    print(index, row["Fee"], row["Courses"])
# Using DataFrame.itertuples()
row = next(df.itertuples(index=True, name='Tution'))
print("Data For First Row :")
print(row)
for row in df.itertuples(index=True):
    print(getattr(row, 'Index'), getattr(row, "Fee"), getattr(row, "Courses"))
# Another alternate approach by using DataFrame.apply
print(df.apply(lambda row: str(row["Fee"]) + " " + str(row["Courses"]), axis=1))
# Using DataFrame.index
for idx in df.index:
    print(df['Fee'][idx], df['Courses'][idx])
# Another alternate approach by using DataFrame.loc
for i in range(len(df)):
    print(df.loc[i, "Fee"], df.loc[i, "Courses"])
# Another alternate approach by using DataFrame.iloc
for i in range(len(df)):
    print(df.iloc[i, 0], df.iloc[i, 2])
# Using DataFrame.items
for label, content in df.items():
    print(f'label: {label}')
    print(f'content: {content}', sep='\n')
FAQ on Pandas Iterate Over Rows
To iterate over rows in a Pandas DataFrame, there are several methods available, each with its pros and cons.
itertuples() is generally faster than iterrows(). It returns rows as named tuples, which is more efficient than building a Series for each row.
If you need to access rows by index and column name in a pandas DataFrame, you can use the .iloc[] or .loc[] indexers: .loc[] is label-based and .iloc[] is position-based.
Directly modifying rows while iterating can be slow and inefficient. Instead, modify a copy or use vectorized operations for better performance.
You can iterate over rows in a multi-index DataFrame using iterrows() or itertuples() just like you would with a single-index DataFrame, but you’ll need to account for the multi-index when accessing rows.
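A minimal sketch of multi-index iteration, assuming a hypothetical two-level index of course and batch (both the level names and the values are made up): with iterrows() the index part of each pair is a tuple of level values, and with itertuples() the same tuple is available as Index.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("Spark", "A"), ("Spark", "B"), ("Python", "A")],
    names=["course", "batch"],  # hypothetical level names
)
df_multi = pd.DataFrame({"Fee": [20000, 21000, 22000]}, index=index)

# iterrows(): unpack the index tuple into its levels
for (course, batch), row in df_multi.iterrows():
    print(course, batch, row["Fee"])

# itertuples(): the index tuple is stored in the Index field
first = next(df_multi.itertuples())
print(first.Index, first.Fee)
```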
Whenever possible, it’s better to use vectorized operations, as these are much faster and more efficient than iterating over rows. For operations that can be applied to entire columns, try to use Pandas’ built-in functions instead of loops.
Conclusion
DataFrame provides several methods to iterate over rows (loop row by row) and access columns/cells. However, manually looping over rows is not recommended, as it degrades performance on large datasets. Each approach explained in this article behaves differently, so choose the one that suits your use case.
Happy Learning !!
References
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.items.html