Pandas Group Rows into List Using groupby()

You can group DataFrame rows into a list by using pandas.DataFrame.groupby() on the column of interest, selecting the column you want as a list from each group, and then using Series.apply(list) to build the list for every group. In this article, I will explain how to group rows into a list with a few examples.

1. Quick Examples

Below are quick examples of grouping rows into a list in a pandas DataFrame.


# Group Rows on 'Courses' column and get List for 'Fee' column
df2 = df.groupby('Courses')['Fee'].apply(list)

# Assign a column name to the grouped list
df2 = df.groupby('Courses')['Fee'].apply(list).reset_index(name="Course_Fee")

# Group Rows into List
df2 = df.groupby("Courses").agg({"Discount": lambda x: list(x)})

# Group Rows into List on All columns
df2 = df.groupby("Courses").agg(list)

# Another way, using pd.Series.tolist
df2 = df.groupby('Courses').agg(pd.Series.tolist)

Now, let’s create a DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, Duration, and Discount.


import pandas as pd
technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","pandas","PySpark","Python","pandas"],
    'Fee': [24000,25000,25000,24000,24000,25000,25000,24000],
    'Duration': ['30day','40days','35days','40days','60days','50days','55days','35days'],
    'Discount': [1000,2300,1500,1200,2500,2100,2000,2500]
}
df = pd.DataFrame(technologies)
print(df)

Yields below output.


   Courses    Fee Duration  Discount
0    Spark  24000    30day      1000
1  PySpark  25000   40days      2300
2   Hadoop  25000   35days      1500
3   Python  24000   40days      1200
4   pandas  24000   60days      2500
5  PySpark  25000   50days      2100
6   Python  25000   55days      2000
7   pandas  24000   35days      2500

2. Pandas DataFrame.groupby() To Group Rows into List

By using the DataFrame.groupby() function you can group rows on a column, select the column you want from the grouped result, and finally convert each group's values into a list using apply(list).


# Group Rows on 'Courses' column and get List for 'Fee' column
df2 = df.groupby('Courses')['Fee'].apply(list)
print(df2)

Yields below output. Note that df.groupby('Courses')['Fee'] returns a SeriesGroupBy object, and applying apply(list) to it collects each group's Fee values into a list, producing a Series of lists.


Courses
Hadoop            [25000]
PySpark    [25000, 25000]
Python     [24000, 25000]
Spark             [24000]
pandas     [24000, 24000]
Name: Fee, dtype: object
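
Since the result is a Series indexed by Courses, you can look up the list for any group directly or convert the whole result to a dictionary. A quick sketch using the df2 from above (fee_map is just an illustrative name):


# Access the list for a single group (df2 is a Series indexed by Courses)
print(df2['PySpark'])   # [25000, 25000]

# Or convert the whole result to a dictionary of lists
fee_map = df2.to_dict()
print(fee_map['Python'])  # [24000, 25000]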

3. Assign Column Name to Groupby List Result

On the groupby() list result, use .reset_index(name="Course_Fee") to convert the Series back into a DataFrame and assign a name to the list column.


# Assign a column name to the grouped list
df2 = df.groupby('Courses')['Fee'].apply(list).reset_index(name="Course_Fee")
print(df2)

Yields below output.


   Courses     Course_Fee
0   Hadoop         [25000]
1  PySpark  [25000, 25000]
2   Python  [24000, 25000]
3    Spark         [24000]
4   pandas  [24000, 24000]
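
If you later need to undo this grouping, DataFrame.explode() expands each list back into one row per element. A minimal sketch using the df2 from above (df3 is an illustrative name):


# Expand each list back into one row per fee value
df3 = df2.explode('Course_Fee')
print(df3)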

4. Group Rows into List Using agg() & Lambda Function

Alternatively, you can group rows into a list using df.groupby("Courses").agg({"Discount": lambda x: list(x)}). Use the groupby() method on the Courses column and the agg() method to apply the aggregation to every group of the pandas DataFrame.


# Group Rows into List
df2 = df.groupby("Courses").agg({"Discount": lambda x: list(x)})
print(df2)

Yields below output.


             Discount
Courses              
Hadoop         [1500]
PySpark  [2300, 2100]
Python   [1200, 2000]
Spark          [1000]
pandas   [2500, 2500]
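
The dictionary passed to agg() can also mix aggregation functions per column, so you can collect one column into a list while summarizing another in the same pass. A minimal sketch, assuming you want the total Fee alongside the Discount list:


# Collect 'Discount' into a list and sum 'Fee' per group
df2 = df.groupby("Courses").agg({"Discount": list, "Fee": "sum"})
print(df2)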

5. Pandas Group Rows into List on All Columns

Let’s see how to group rows into lists for all DataFrame columns. This results in a list column for each of the remaining columns, one list per group.


# Group Rows into List on All columns
df2 = df.groupby("Courses").agg(list)
print(df2)

Yields below output.


                    Fee          Duration      Discount
Courses                                                
Hadoop          [25000]          [35days]        [1500]
PySpark  [25000, 25000]  [40days, 50days]  [2300, 2100]
Python   [24000, 25000]  [40days, 55days]  [1200, 2000]
Spark           [24000]           [30day]        [1000]
pandas   [24000, 24000]  [60days, 35days]  [2500, 2500]
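
If you want control over the output column names, pandas (0.25+) also supports named aggregation. A sketch that collects two columns into lists under custom names (fee_list and duration_list are names chosen for illustration):


# Named aggregation: pick the output column names yourself
df2 = df.groupby("Courses").agg(
    fee_list=("Fee", list),
    duration_list=("Duration", list)
).reset_index()
print(df2)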

You can also get the same result by passing pd.Series.tolist as the aggregation function.


# Using .agg(pd.Series.tolist) as the argument on the DataFrame
df2 = df.groupby('Courses').agg(pd.Series.tolist)
print(df2)
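
Both aggregations produce the same result; if in doubt, you can verify the equivalence with DataFrame.equals(), which should print True here:


# Verify agg(list) and agg(pd.Series.tolist) produce identical results
print(df.groupby('Courses').agg(list).equals(df2))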

6. Complete Example For Reference


import pandas as pd
technologies = {
    'Courses': ["Spark","PySpark","Hadoop","Python","pandas","PySpark","Python","pandas"],
    'Fee': [24000,25000,25000,24000,24000,25000,25000,24000],
    'Duration': ['30day','40days','35days','40days','60days','50days','55days','35days'],
    'Discount': [1000,2300,1500,1200,2500,2100,2000,2500]
}
df = pd.DataFrame(technologies)
print(df)

# Group rows on 'Courses' and get a list for the 'Fee' column
df2 = df.groupby('Courses')['Fee'].apply(list)
print(df2)

# Assign a column name to the grouped list
df2 = df.groupby('Courses')['Fee'].apply(list).reset_index(name="Course_Fee")
print(df2)

# Using a lambda function with agg()
df2 = df.groupby("Courses").agg({"Discount": lambda x: list(x)})
print(df2)

# Group rows into lists on all columns
df2 = df.groupby("Courses").agg(list)
print(df2)

# Using pd.Series.tolist as the aggregation function
df2 = df.groupby('Courses').agg(pd.Series.tolist)
print(df2)

Conclusion

In this article, you have learned how to group DataFrame rows into lists in pandas using groupby() with Series.apply(list) and with agg(). You have also learned how to group rows into lists on all columns at once.

Happy Learning !!
