  • Post category:Pandas
  • Post last modified:March 27, 2024
How to Split Pandas DataFrame?

We can split a Pandas DataFrame by rows or columns using the DataFrame.iloc[] indexer, the groupby().get_group() method, or the sample() method. Each of these returns a new DataFrame containing the selected portion of rows or columns.

In this article, I will explain how to split a Pandas DataFrame by rows or columns using df.iloc[], and how to split it using df.groupby() and df.sample(), with examples.

1. Quick Examples of Split Pandas DataFrame

Following are quick examples of how to split Pandas DataFrame.


# Below are the quick examples
# Example 1: Split the DataFrame using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]

# Example 2: Split the DataFrame using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]

# Example 3: Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")

# Example 4: split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)

Let’s create a Pandas DataFrame from a Python dictionary, where the columns are 'Courses', 'Fee', 'Discount', and 'Duration'.


import pandas as pd

# Sample data: one list per column
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1000, 2300, 1000, 1200, 2500],
    'Duration': ['35days', '35days', '40days', '30days', '25days']
}

df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days
2   Hadoop  23000      1000   40days
3   Python  24000      1200   30days
4   Pandas  26000      2500   25days

2. Use iloc[] to Split the DataFrame in Pandas

We can use the iloc[] indexer to split the given DataFrame. iloc[] selects rows and columns by integer position, whereas loc[] selects them by row and column labels.
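To make the difference between position-based and label-based selection concrete, here is a minimal sketch. The small DataFrame and its string index labels ("a", "b", "c") are assumptions chosen only for illustration; they are not part of the article's data.

```python
import pandas as pd

# Illustrative data; the string index labels are an assumption for this sketch
df = pd.DataFrame(
    {
        "Courses": ["Spark", "PySpark", "Hadoop"],
        "Fee": [22000, 25000, 23000],
    },
    index=["a", "b", "c"],
)

# iloc[] slices by integer position; the end position is excluded
by_position = df.iloc[0:2]

# loc[] slices by label; the end label is included
by_label = df.loc["a":"b"]

print(by_position)
print(by_label)
```

Both selections return the first two rows, but note that the iloc[] slice stops before position 2 while the loc[] slice includes the label "b".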

2.1 Split DataFrame by Row

Using iloc[], we can select a portion of the DataFrame by row position. Here, df.iloc[:2, :] takes the first two rows and df.iloc[2:, :] takes the remaining rows, which splits the given DataFrame into two smaller DataFrames.


# Split the DataFrame using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
print(df1)
print("---------------------------")
print(df2)   

Yields below output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days
---------------------------
  Courses    Fee  Discount Duration
2  Hadoop  23000      1000   40days
3  Python  24000      1200   30days
4  Pandas  26000      2500   25days
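The same row slicing can be repeated to cut a DataFrame into more than two pieces. The following is a rough sketch, not a built-in Pandas feature: the helper name split_rows and the chunk_size parameter are hypothetical and introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

def split_rows(frame, chunk_size):
    """Hypothetical helper: return consecutive row chunks of at most chunk_size rows."""
    return [
        frame.iloc[start:start + chunk_size]
        for start in range(0, len(frame), chunk_size)
    ]

chunks = split_rows(df, 2)
for chunk in chunks:
    print(chunk)
    print("---------------------------")
```

With five rows and a chunk size of 2, the last chunk holds the single leftover row. Each chunk keeps its original index values.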

2.2 Split DataFrame by Columns

In the above section, you learned how to split a DataFrame by rows using iloc[]. Splitting by columns works the same way, except that the slice goes in the second position of iloc[], after the comma, while a bare colon in the first position selects all rows.


# Split the DataFrame using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]
print(df1)
print("---------------------------")
print(df2)  

Yields below output.


# Output:
   Courses    Fee
0    Spark  22000
1  PySpark  25000
2   Hadoop  23000
3   Python  24000
4   Pandas  26000
---------------------------
   Discount Duration
0      1000   35days
1      2300   35days
2      1000   40days
3      1200   30days
4      2500   25days
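When column positions are not known in advance, a column split can also be written with column names. This is an alternative to the iloc[] approach above, sketched under the assumption that the names in left_cols (a variable introduced here for illustration) exist in the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Columns that go into the first part (illustrative choice)
left_cols = ["Courses", "Fee"]

# First part: only the named columns
df_left = df[left_cols]

# Second part: everything except the named columns
df_right = df.drop(columns=left_cols)

print(df_left.columns.tolist())
print(df_right.columns.tolist())
```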

3. Split Pandas DataFrame using groupby() function

The DataFrame.groupby() function splits the DataFrame into groups based on the values of a column. First, we group the DataFrame using groupby(); after that, we select a specific group using the get_group() function. This approach works well when we want to split a DataFrame by a column whose values identify categories, so that every row sharing a value ends up in the same group.


# Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")
print(df1)

Yields below output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days

The above example returns a new DataFrame containing only the rows where 'Duration' is '35days'.
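If every group is needed rather than a single one, iterating over the groupby object yields each key together with its sub-DataFrame. The sketch below collects them in a dictionary; the variable name parts is an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Iterating a groupby object yields (group key, sub-DataFrame) pairs
parts = {duration: group for duration, group in df.groupby("Duration")}

for duration, group in parts.items():
    print(duration)
    print(group)
```

Here parts["35days"] holds the same two rows that get_group("35days") returns.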

4. Split the DataFrame using Pandas Shuffle Rows

By using the pandas.DataFrame.sample() function, we can split the DataFrame by taking a random sample of its rows. Calling df.sample(frac=1) returns all rows in random order, which shuffles the DataFrame. The frac keyword argument specifies the fraction of rows to return in the random sample. If neither frac nor n is given, sample() returns one random row. frac=.5 returns a random 50% of the rows. Passing random_state makes the sample reproducible.

Let’s see how to use the sample() function to split our DataFrame with random rows.


# Split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)
print(df1)
print(df1.reset_index())

Yields below output.


# Output:
  index Courses    Fee  Discount Duration
0      3  Python  24000      1200   30days
1      4  Pandas  26000      2500   25days
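sample() returns only the sampled rows. To obtain the remaining rows as a second DataFrame, one option is to drop the sampled index labels from the original. This is a minimal sketch that assumes the index labels are unique; the names part1 and part2 are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# First part: a reproducible random sample of about half the rows
part1 = df.sample(frac=0.5, random_state=200)

# Second part: every row whose index label was not sampled
# (assumes the index labels of df are unique)
part2 = df.drop(part1.index)

print(part1)
print("---------------------------")
print(part2)
```

Together the two parts contain every row of the original DataFrame exactly once.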

5. Conclusion

In this article, I have explained how to split a Pandas DataFrame by rows and by columns using df.iloc[], and how to split it using the df.groupby() and df.sample() functions, with examples.


Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.