How to Split Pandas DataFrame?

You can split a Pandas DataFrame by rows or by columns using the DataFrame.iloc[] indexer, groupby().get_group(), or the sample() function. Each of these returns the portion of the DataFrame that corresponds to the rows or columns you select.

Quick Examples of Split Pandas DataFrame

Following are quick examples of how to split Pandas DataFrame.


# Quick examples of split pandas dataframe

# Example 1: Split the DataFrame 
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]

# Example 2: Split the DataFrame 
# Using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]

# Example 3: Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")

# Example 4: split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)

Let’s create a Pandas DataFrame from a Python dictionary, where the columns are 'Courses', 'Fee', 'Discount', and 'Duration'.


import pandas as pd
technologies = {
    'Courses':["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee' :[22000, 25000, 23000, 24000, 26000],
    'Discount':[1000, 2300, 1000, 1200, 2500],
    'Duration':['35days', '35days', '40days', '30days', '25days']
          }

df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

Yields below output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days
2   Hadoop  23000      1000   40days
3   Python  24000      1200   30days
4   Pandas  26000      2500   25days

Use iloc[] to Split the DataFrame in Pandas

We can use the iloc[] indexer to split the given DataFrame. iloc[] selects rows and columns by integer position, whereas loc[] selects them by row and column labels.
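
To make the position-versus-label distinction concrete, here is a minimal sketch. The small DataFrame and its row labels 'a', 'b', 'c' are assumptions made up for illustration; they are not the article's data.

```python
import pandas as pd

# Illustrative frame with non-default row labels (hypothetical data)
df = pd.DataFrame(
    {"Courses": ["Spark", "PySpark", "Hadoop"], "Fee": [22000, 25000, 23000]},
    index=["a", "b", "c"],
)

# iloc[] slices by integer position; the stop position is excluded
by_position = df.iloc[0:2]    # rows at positions 0 and 1

# loc[] slices by label; the stop label is included
by_label = df.loc["a":"b"]    # rows labelled 'a' and 'b'

print(by_position)
print(by_label)
```

Both slices select the same two rows here, but only because positions 0 and 1 happen to carry the labels 'a' and 'b'.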

Split DataFrame by Row

Using this indexer, we can select the required rows from the DataFrame. Here, I will use iloc[] to split the given DataFrame into two smaller DataFrames: the first two rows, and all remaining rows.


# Split the DataFrame 
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
print(df1)
print("---------------------------")
print(df2)

Yields below output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days
---------------------------
  Courses    Fee  Discount Duration
2  Hadoop  23000      1000   40days
3  Python  24000      1200   30days
4  Pandas  26000      2500   25days
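
The same row slicing can be repeated to cut a DataFrame into more than two pieces. The following is a sketch only: split_rows is a hypothetical helper written for this example, not a Pandas function, and the chunk size of 2 is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

def split_rows(frame, chunk_size):
    """Hypothetical helper: consecutive row slices of at most chunk_size rows."""
    return [
        frame.iloc[start:start + chunk_size]
        for start in range(0, len(frame), chunk_size)
    ]

chunks = split_rows(df, 2)
for chunk in chunks:
    print(chunk)
    print("---------------------------")
```

Because iloc[] slicing tolerates a stop position past the end of the frame, the last chunk simply holds whatever rows remain.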

Split DataFrame by Columns

In the above section, you learned how to split a DataFrame by rows using iloc[]. Now, we will split the DataFrame by columns. The approach is the same, but the slice goes in the second position of iloc[] (the column axis), while the first position selects all rows with a bare colon.


# Split the DataFrame 
# Using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]
print(df1)
print("---------------------------")
print(df2)

Yields below output.


# Output:
   Courses    Fee
0    Spark  22000
1  PySpark  25000
2   Hadoop  23000
3   Python  24000
4   Pandas  26000
---------------------------
   Discount Duration
0      1000   35days
1      2300   35days
2      1000   40days
3      1200   30days
4      2500   25days
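
As a quick sanity check on a column split, the two pieces can be joined back side by side. This sketch uses pd.concat() with axis=1, which the article does not cover, and a shortened three-row copy of the data.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [22000, 25000, 23000],
    "Discount": [1000, 2300, 1000],
    "Duration": ["35days", "35days", "40days"],
})

# Split by column position: first two columns, then the rest
left = df.iloc[:, :2]
right = df.iloc[:, 2:]

# Joining the pieces along the column axis restores the original layout
restored = pd.concat([left, right], axis=1)
print(restored.equals(df))
```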

Split Pandas DataFrame using groupby() Function

The DataFrame.groupby() function splits the DataFrame into groups based on the values in one or more columns. First, we group the DataFrame using groupby(), and then we select a specific group using get_group(). This approach works well when we want to split a DataFrame by the distinct values of a column, producing one sub-DataFrame per value.


# Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")
print(df1)

Yields below output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days

The above example returns a new DataFrame containing only the rows where 'Duration' is '35days'.
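
get_group() fetches one group at a time. To obtain every group in one pass, you can iterate over the groupby object, which yields (key, sub-DataFrame) pairs. This is a minimal sketch; the dictionary name parts is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Build a dict mapping each distinct Duration value to its rows
parts = {key: group for key, group in df.groupby("Duration")}

for key, group in parts.items():
    print(key)
    print(group)
```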

Split the DataFrame using Pandas Shuffle Rows

By using the pandas.DataFrame.sample() function, we can split off a random subset of rows from the DataFrame. df.sample(frac=1) returns all rows in random order, that is, a shuffled DataFrame. The frac keyword argument specifies the fraction of rows to return in the random sample. If neither frac nor n is given, sample() returns a single random row. frac=.5 returns a random 50% of the rows.

Let’s see how to use the sample() function to split off random rows from our DataFrame.


# Split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)
# reset_index() moves the original row labels into an 'index' column
print(df1.reset_index())

Yields below output.


# Output:
  index Courses    Fee  Discount Duration
0      3  Python  24000      1200   30days
1      4  Pandas  26000      2500   25days
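
sample() returns only one part of the split. One way to get the remaining rows, sketched here on the assumption that the row index is unique, is to drop the sampled index labels from the original DataFrame. The names part1 and part2 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Random part: roughly half of the rows, reproducible via random_state
part1 = df.sample(frac=0.5, random_state=200)

# Complement: every row whose index label was not sampled
# (assumes the index has no duplicate labels)
part2 = df.drop(part1.index)

print(part1)
print("---------------------------")
print(part2)
```

Together the two parts cover every row of the original DataFrame exactly once.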

Conclusion

In conclusion, splitting a Pandas DataFrame is a common operation in data analysis that lets us segment data based on specific criteria. In this article, we explored several ways to split DataFrames: df.iloc[] for splitting rows and columns by position, df.groupby() for splitting by column values, and df.sample() for random sampling, with an example for each method.

