  • Post category:Pandas
  • Post last modified:December 6, 2024
How to Split Pandas DataFrame?

You can split a Pandas DataFrame by rows or by columns using the DataFrame.iloc[] indexer, groupby().get_group(), or the sample() function. Each of these returns the portion of the DataFrame that matches the rows or columns you select.

In this article, I will explain how to split a Pandas DataFrame by row or column using df.iloc[], df.groupby(), and df.sample().

Key Points –

  • Using iloc[]: Split DataFrame by selecting specific rows or columns based on their index position.
  • Using loc[]: Split DataFrame by selecting rows or columns based on labels.
  • Using iloc[] for Column-based Splitting: Columns can be split by specifying position ranges with iloc[].
  • Using groupby(): Split DataFrame into groups based on a column or multiple columns for aggregation or analysis.
  • Using sample() for Random Splitting: The sample() method can be used to randomly select rows for splitting.

Quick Examples of Split Pandas DataFrame

The following quick examples show how to split a Pandas DataFrame. They assume a DataFrame named df with a 'Duration' column.


# Quick examples of split pandas dataframe

# Example 1: Split the DataFrame 
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]

# Example 2: Split the DataFrame 
# Using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]

# Example 3: Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")

# Example 4: split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)

Let’s create Pandas DataFrame using data from a Python dictionary, where the columns are 'Courses', 'Fee', 'Discount', and 'Duration'.


import pandas as pd

# Build the sample DataFrame from a dictionary of column lists
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1000, 2300, 1000, 1200, 2500],
    'Duration': ['35days', '35days', '40days', '30days', '25days']
}

df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)

This yields the following output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days
2   Hadoop  23000      1000   40days
3   Python  24000      1200   30days
4   Pandas  26000      2500   25days

Use iloc[] to Split the DataFrame in Pandas

We can use the iloc[] indexer to split the given DataFrame. iloc[] selects rows and columns by integer position, and the stop position of a slice is excluded. Pandas loc[] is the companion indexer that selects by row and column labels instead.
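To make the difference between the two indexers concrete, here is a minimal sketch. It assumes a reduced version of the article's DataFrame, and using the course name as the row label is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Position-based: the stop position 2 is excluded, so this keeps rows 0 and 1.
by_position = df.iloc[:2]

# Label-based: use the course name as the row label (illustrative assumption).
labeled = df.set_index("Courses")

# With loc[], both ends of a label slice are included.
by_label = labeled.loc["Spark":"Hadoop"]

print(by_position.shape)        # (2, 2)
print(by_label.index.tolist())  # ['Spark', 'PySpark', 'Hadoop']
```

Note that the iloc[] slice stops before position 2, while the loc[] slice includes the 'Hadoop' row.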

Split DataFrame by Row

Using this indexer, we can select the required rows from the DataFrame. Here, I will use iloc[] to split the given DataFrame into two smaller DataFrames: the first two rows and the remaining rows.


# Split the DataFrame 
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
print(df1)
print("---------------------------")
print(df2)   

This yields the following output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days
---------------------------
  Courses    Fee  Discount Duration
2  Hadoop  23000      1000   40days
3  Python  24000      1200   30days
4  Pandas  26000      2500   25days

Split DataFrame by Columns

In the above section, you learned how to split a DataFrame by rows using iloc[]. Now, we will split the DataFrame by columns. The approach is the same, but the slice goes in the second position of iloc[] (after the comma), while a bare : in the first position keeps all rows.


# Split the DataFrame 
# Using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]
print(df1)
print("---------------------------")
print(df2)  

This yields the following output.


# Output:
   Courses    Fee
0    Spark  22000
1  PySpark  25000
2   Hadoop  23000
3   Python  24000
4   Pandas  26000
---------------------------
   Discount Duration
0      1000   35days
1      2300   35days
2      1000   40days
3      1200   30days
4      2500   25days

Split Pandas DataFrame using groupby() Function

The DataFrame.groupby() function splits the DataFrame into groups of rows that share the same value in a column. First, we group the DataFrame using groupby(); after that, we select a specific group using get_group(). This approach works well when we want one sub-DataFrame per distinct value of a column.


# Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")
print(df1)

This yields the following output.


# Output:
   Courses    Fee  Discount Duration
0    Spark  22000      1000   35days
1  PySpark  25000      2300   35days

The above example returns a new DataFrame containing only the rows where 'Duration' is '35days'.
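If you need every group rather than a single one, one possible approach is to iterate over the GroupBy object and collect the pieces in a dictionary. The sketch below rebuilds the article's DataFrame so it runs on its own; the name parts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs,
# so a dict comprehension gives one DataFrame per distinct Duration.
parts = {duration: group for duration, group in df.groupby("Duration")}

print(sorted(parts))                        # ['25days', '30days', '35days', '40days']
print(parts["35days"]["Courses"].tolist())  # ['Spark', 'PySpark']
```

Every row of df ends up in exactly one of the sub-DataFrames, so the pieces together cover the original data.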

Split the DataFrame by Randomly Sampling Rows

By using the pandas.DataFrame.sample() function, we can split the DataFrame by selecting rows at random. The frac keyword argument specifies the fraction of rows to return in the random sample: df.sample(frac=1) returns all rows in shuffled order, and frac=.5 returns a random 50% of the rows. If neither frac nor n is given, sample() returns a single random row. Passing random_state makes the selection reproducible.

Let’s see how to use the sample() function to split our DataFrame with random rows.


# Split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)
print(df1)
print(df1.reset_index())

The output of the second print is shown below; reset_index() moves the original row labels into a column named index.


# Output:
  index Courses    Fee  Discount Duration
0      3  Python  24000      1200   30days
1      4  Pandas  26000      2500   25days
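The sample() call above produces only one part of the split. A common way to obtain the remaining rows, sketched here under the assumption that the row labels are unique, is to drop the sampled labels from the original DataFrame. The names part1 and part2 are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Draw about half of the rows at random; random_state makes the draw repeatable.
part1 = df.sample(frac=0.5, random_state=200)

# Everything that was not sampled forms the second part.
# This relies on the row labels being unique.
part2 = df.drop(part1.index)

print(part1)
print(part2)
```

Together the two parts contain every row of df exactly once, which is the usual requirement for something like a train/test split.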

FAQ on How to Split Pandas DataFrame?

How do I split a DataFrame into two parts based on a condition?

To split a Pandas DataFrame into two parts based on a condition, you can use Boolean indexing. This allows you to filter rows that satisfy or do not satisfy the given condition.
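As a minimal sketch of this idea, the example below splits a reduced version of the article's DataFrame on the Fee column. The 24000 threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Boolean Series: True where the condition holds.
mask = df["Fee"] >= 24000

high_fee = df[mask]   # rows that satisfy the condition
low_fee = df[~mask]   # rows that do not (~ inverts the mask)

print(high_fee["Courses"].tolist())  # ['PySpark', 'Python', 'Pandas']
print(low_fee["Courses"].tolist())   # ['Spark', 'Hadoop']
```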

Can I split a DataFrame by rows?

You can split a DataFrame by rows in Pandas using slicing or the iloc method. This is useful when you want to divide the DataFrame into smaller parts, such as for training and testing datasets or other analysis tasks.
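For instance, one way to cut a DataFrame into consecutive chunks of a fixed number of rows is to step through the row positions with iloc[]. This is only a sketch; chunk_size is a hypothetical parameter introduced for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

chunk_size = 2  # hypothetical number of rows per chunk

# Slice consecutive position ranges; the last chunk may be shorter.
chunks = [
    df.iloc[start:start + chunk_size]
    for start in range(0, len(df), chunk_size)
]

print([len(chunk) for chunk in chunks])  # [2, 2, 1]
```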

How do I split a DataFrame by columns?

You can split a Pandas DataFrame by columns by selecting specific columns or dividing the columns into subsets. This is useful for tasks like separating feature columns from target variables in machine learning or for analysis of specific parts of a dataset.
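A short sketch of a name-based column split follows. Treating Fee as the target variable is purely an assumption made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Hypothetical choice: 'Fee' is the target, the other columns are features.
target = df["Fee"]                    # a Series
features = df.drop(columns=["Fee"])   # a DataFrame without the target column

print(features.columns.tolist())  # ['Courses', 'Discount', 'Duration']
print(target.tolist())            # [22000, 25000, 23000, 24000, 26000]
```

Selecting by name keeps working if the column order changes, which is one reason to prefer it over position-based iloc[] for this kind of split.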

How can I split a DataFrame into groups based on a column?

To split a Pandas DataFrame into groups based on a column, you can use the groupby() method. This is particularly useful when you want to perform operations or analyze data within groups defined by a specific column’s values.

How do I split a DataFrame based on index ranges?

To split a Pandas DataFrame based on index ranges, you can slice with iloc[] for integer positions or with loc[] for index labels. This is useful when working with time-series data or when you need to process subsets of rows based on their index.

Conclusion

In conclusion, splitting a Pandas DataFrame is a common operation in data analysis that lets us segment data based on specific criteria. In this article, we explored several ways to split DataFrames: df.iloc[] for row and column splitting by position, df.groupby() for grouping based on column values, and df.sample() for random sampling, each with an example.
