Pandas Iterate Over Rows with Examples

  • Post category: Pandas
  • Post last modified: March 27, 2024

A pandas DataFrame can be iterated row by row to access the columns or individual elements of each row. DataFrame provides the iterrows() and itertuples() methods for this.


1. Using DataFrame.iterrows() to Iterate Over Rows

DataFrame.iterrows() iterates over the rows of a DataFrame. It yields (index, Series) pairs, where index is the row's index label and the Series holds the row's data. To read a value from the Series, use the column name, for example row["Fee"].

First, let’s create a DataFrame.


import pandas as pd
technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Oracle","Java"],
    'Fee': [20000,25000,26000,22000,24000,21000,22000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '50days', '55days']
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

This creates a DataFrame with three columns: Courses, Fee, and Duration.

The example below iterates over all rows in the DataFrame using iterrows().


# Iterate all rows using DataFrame.iterrows()
print("After iterating all rows:\n")
for index, row in df.iterrows():
    print(index, row["Fee"], row["Courses"], row["Duration"])

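Each iteration prints the row index followed by the Fee, Courses, and Duration values. The first row, for example, prints as 0 20000 Spark 30day.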

Let’s see what a row looks like by printing it.


# Row contains the column name and data
row = next(df.iterrows())[1]
print("Data For First Row :")
print(row)

Yields below output.


# Output:
Data For First Row :
Courses     Spark
Fee         20000
Duration    30day
Name: 0, dtype: object

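Note that iterrows() does not preserve dtypes across a row, because each row is returned as a single Series. In this example the columns mix strings and integers, so the row Series has dtype object. To check the type of an individual value, use type(row["Fee"]). If you need the values to keep their column data types while iterating, use DataFrame.itertuples().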
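The dtype behavior is easier to see with numeric columns. Below is a minimal sketch, assuming a small made-up DataFrame (df_types and its column names are not part of the article's data), that compares the value types returned by iterrows() and itertuples().

```python
import pandas as pd

# Hypothetical two-column frame: one integer column and one float column
df_types = pd.DataFrame({"int_col": [1, 2], "float_col": [1.5, 2.5]})

# iterrows() packs each row into one Series, so the integer is upcast to float
_, first_row = next(df_types.iterrows())
print(first_row["int_col"], type(first_row["int_col"]))

# The column itself still has its original integer dtype
print(df_types["int_col"].dtype)

# itertuples() keeps each value in the type of its own column
first_tuple = next(df_types.itertuples(index=False))
print(first_tuple.int_col, type(first_tuple.int_col))
```

The row Series has to pick one dtype that can hold every value in the row, which is why the integer arrives as a float in the iterrows() case.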

Note: The pandas documentation states that “You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.”
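To illustrate the warning, here is a small sketch, assuming a made-up two-row DataFrame (df_fees is a hypothetical name). With mixed column types, the row returned by iterrows() is a copy, so writing to it leaves the DataFrame unchanged. Assigning through the DataFrame itself is one way to make the change stick.

```python
import pandas as pd

# Hypothetical small frame mirroring two of the article's columns
df_fees = pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [20000, 25000]})

# Writing to the row Series only changes a temporary copy of the row
for index, row in df_fees.iterrows():
    row["Fee"] = 0
print(df_fees["Fee"].tolist())  # the original fees are still in the DataFrame

# Assigning to the column on the DataFrame does take effect
df_fees["Fee"] = df_fees["Fee"] - 1000
print(df_fees["Fee"].tolist())
```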

2. Using DataFrame.itertuples() to Iterate Over Rows

DataFrame.itertuples() returns an iterator that yields one named tuple per row, with the row's values as its fields. It is generally faster than iterrows() and preserves the data type of each value.

Below is the syntax of itertuples().


# Syntax DataFrame.itertuples()
DataFrame.itertuples(index=True, name='Pandas')
  • index – Defaults to True, which returns the DataFrame index as the first element of each tuple. Set it to False to leave the index out.
  • name – Defaults to ‘Pandas’. You can provide a custom name for the returned named tuples, or None to get regular tuples.

The example below loops through all rows and reads the value of each column from the tuple by using getattr().


# Iterate all rows using DataFrame.itertuples()
for row in df.itertuples(index=True):
    print(getattr(row, 'Index'), getattr(row, "Fee"), getattr(row, "Courses"))

Yields below output.


# Output:
0 20000 Spark
1 25000 PySpark
2 26000 Hadoop
3 22000 Python
4 24000 Pandas
5 21000 Oracle
6 22000 Java

Let’s provide a custom name for the tuple.


# Display one row from iterator
row = next(df.itertuples(index = True,name='Tution'))
print(row)

Yields below output.


# Output:
Tution(Index=0, Courses='Spark', Fee=20000, Duration='30day')

If you set the index parameter to False, the index is not included as the first element of the tuple.
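The index and name parameters can be combined. The following is a minimal sketch on a made-up two-row DataFrame (df_small is a hypothetical name) showing index=False with attribute access, and name=None, which yields regular tuples.

```python
import pandas as pd

# Hypothetical small frame mirroring two of the article's columns
df_small = pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [20000, 25000]})

# index=False drops the Index field; columns are still available as attributes
rows_no_index = list(df_small.itertuples(index=False))
print(rows_no_index[0].Courses, rows_no_index[0].Fee)

# name=None yields regular tuples, which can be unpacked positionally
plain_rows = list(df_small.itertuples(index=False, name=None))
for course, fee in plain_rows:
    print(course, fee)
```

Attribute access such as row.Fee is a shorter alternative to getattr(row, "Fee") when the column name is a valid Python identifier.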

3. DataFrame.apply() to Iterate

You can also use the DataFrame's apply() method to process rows one at a time. With axis=1, apply() calls the given function, here a lambda, once per row and passes the row as a Series.


# Syntax of DataFrame.apply()
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

Example:


# Another alternate approach by using DataFrame.apply()
print(df.apply(lambda row: str(row["Fee"]) + " " + str(row["Courses"]), axis = 1))

Yields below output.


# Output:
0      20000 Spark
1    25000 PySpark
2     26000 Hadoop
3     22000 Python
4     24000 Pandas
5     21000 Oracle
6       22000 Java
dtype: object
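The args parameter from the syntax above passes extra positional arguments to the function. Here is a small sketch, assuming a made-up DataFrame and a hypothetical helper named describe, that uses a named function instead of a lambda.

```python
import pandas as pd

# Hypothetical small frame mirroring two of the article's columns
df_courses = pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [20000, 25000]})

def describe(row, separator):
    # row is a Series holding one DataFrame row;
    # separator is a hypothetical extra argument supplied through args
    return str(row["Fee"]) + separator + row["Courses"]

# axis=1 passes rows to the function; args supplies the arguments after the row
labels = df_courses.apply(describe, axis=1, args=(" - ",))
print(labels.tolist())
```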

4. Iterating using for & DataFrame.index

You can also loop through rows with a for loop over DataFrame.index. For example, df['Fee'][0] returns the value of the Fee column for the row labeled 0, which is the first row in this DataFrame.


# Using DataFrame.index
for idx in df.index:
    print(df['Fee'][idx], df['Courses'][idx])

Yields below output.


# Output:
20000 Spark
25000 PySpark
26000 Hadoop
22000 Python
24000 Pandas
21000 Oracle
22000 Java

5. Using for & DataFrame.loc


# Another alternate approach by using DataFrame.loc (label-based)
for i in df.index:
    print(df.loc[i, "Fee"], df.loc[i, "Courses"])

Yields the same output as above.
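Because loc selects by index label and iloc selects by integer position, the two only line up when the index is the default 0, 1, 2 and so on. The sketch below assumes a made-up DataFrame with string labels (df_labeled is a hypothetical name) to show how each loop should be driven.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2
df_labeled = pd.DataFrame(
    {"Courses": ["Spark", "PySpark", "Hadoop"], "Fee": [20000, 25000, 26000]},
    index=["a", "b", "c"],
)

# loc expects index labels, so iterate over the index itself
by_label = [df_labeled.loc[label, "Fee"] for label in df_labeled.index]

# iloc expects integer positions, so range(len(...)) is the right fit
fee_position = df_labeled.columns.get_loc("Fee")
by_position = [df_labeled.iloc[i, fee_position] for i in range(len(df_labeled))]

print(by_label, by_position)
```

Calling df_labeled.loc[0, "Fee"] on this frame would raise a KeyError, since no row is labeled 0.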

6. Using for & DataFrame.iloc


# Another alternate approach by using DataFrame.iloc (position-based)
for i in range(len(df)):
    print(df.iloc[i, 0], df.iloc[i, 2])

Yields below output.


# Output:
Spark 30day
PySpark 40days
Hadoop 35days
Python 40days
Pandas 60days
Oracle 50days
Java 55days

7. Using DataFrame.items() to Iterate Over Columns

DataFrame.items() is used to iterate over a pandas DataFrame column by column. It yields (column name, Series) tuples: the first value is the column label and the second is the content of that column as a Series.


# Iterate over column by column using DataFrame.items()
for label, content in df.items():
    print(f'label: {label}')
    print(f'content: {content}')

Yields below output.


# Output:
label: Courses
content: 0      Spark
1    PySpark
2     Hadoop
3     Python
4     Pandas
5     Oracle
6       Java
Name: Courses, dtype: object
label: Fee
content: 0    20000
1    25000
2    26000
3    22000
4    24000
5    21000
6    22000
Name: Fee, dtype: int64
label: Duration
content: 0     30day
1    40days
2    35days
3    40days
4    60days
5    50days
6    55days
Name: Duration, dtype: object
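Since each item is a full column, items() suits per-column summaries. Here is a minimal sketch, assuming a made-up two-column DataFrame (df_cols and summary are hypothetical names), that records each column's dtype and number of distinct values.

```python
import pandas as pd

# Hypothetical small frame mirroring two of the article's columns
df_cols = pd.DataFrame({"Courses": ["Spark", "PySpark"], "Fee": [20000, 25000]})

# Collect each column's dtype name and its count of distinct values
summary = {}
for label, content in df_cols.items():
    summary[label] = (str(content.dtype), content.nunique())

print(summary)
```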

9. Performance of Iterating DataFrame

Iterating over a DataFrame row by row is generally not recommended, because performance degrades badly on large datasets. Use it only after you have exhausted the alternatives. Before using the examples in this article, check whether one of these works instead: 1) vectorization, 2) Cython routines, 3) list comprehensions (a vanilla for loop).
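As a rough illustration of the first and third alternatives, the sketch below computes the same result three ways on a made-up DataFrame. The flat discount of 1000 and the name df_perf are assumptions for the example, and no timings are claimed here.

```python
import pandas as pd

# Hypothetical small frame mirroring two of the article's columns
df_perf = pd.DataFrame(
    {"Courses": ["Spark", "PySpark", "Hadoop"], "Fee": [20000, 25000, 26000]}
)

# Row by row: one Python-level step per row
looped = []
for row in df_perf.itertuples(index=False):
    looped.append(row.Fee - 1000)

# Vectorized: a single operation on the whole column
vectorized = df_perf["Fee"] - 1000

# List comprehension over plain Python values
comprehended = [fee - 1000 for fee in df_perf["Fee"].tolist()]

print(looped, vectorized.tolist(), comprehended)
```

All three produce the same values. The vectorized form hands the loop to pandas internals, which is usually what makes it the better choice on large datasets.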


10. Complete Example of pandas Iterate over Rows


import pandas as pd
technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","Pandas","Oracle","Java"],
    'Fee': [20000,25000,26000,22000,24000,21000,22000],
    'Duration': ['30day', '40days', '35days', '40days', '60days', '50days', '55days']
}
df = pd.DataFrame(technologies)
print(df)

# Using DataFrame.iterrows()
row = next(df.iterrows())[1]
print("Data For First Row :")
print(row)

for index, row in df.iterrows():
    print(index, row["Fee"], row["Courses"])

# Using DataFrame.itertuples()
row = next(df.itertuples(index=True, name='Tution'))
print("Data For First Row :")
print(row)

for row in df.itertuples(index=True):
    print(getattr(row, 'Index'), getattr(row, "Fee"), getattr(row, "Courses"))

# Using DataFrame.apply()
print(df.apply(lambda row: str(row["Fee"]) + " " + str(row["Courses"]), axis=1))

# Using DataFrame.index
for idx in df.index:
    print(df['Fee'][idx], df['Courses'][idx])

# Using DataFrame.loc (label-based)
for i in df.index:
    print(df.loc[i, "Fee"], df.loc[i, "Courses"])

# Using DataFrame.iloc (position-based)
for i in range(len(df)):
    print(df.iloc[i, 0], df.iloc[i, 2])

# Using DataFrame.items()
for label, content in df.items():
    print(f'label: {label}')
    print(f'content: {content}')

Conclusion

DataFrame provides several ways to iterate over rows and access columns or cells. However, looping over rows manually is not recommended, because it degrades the performance of the application on large datasets. Each approach explained in this article behaves differently, so choose the one that suits your use case.

Happy Learning !!

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.
