We can split a Pandas DataFrame by rows or columns using the pandas.DataFrame.iloc[] indexer, or by its contents using the groupby().get_group() and sample() functions. Each approach returns the portion of the DataFrame that we select.
In this article, I will explain how to split a Pandas DataFrame by row or column using df.iloc[], and how to split it using df.groupby() and df.sample(), with examples.
1. Quick Examples of Split Pandas DataFrame
Following are quick examples of how to split Pandas DataFrame.
# Below are the quick examples
# Example 1: Split the DataFrame using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
# Example 2: Split the DataFrame using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]
# Example 3: Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")
# Example 4: split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)
Let’s create a Pandas DataFrame from a Python dictionary, where the columns are 'Courses', 'Fee', 'Discount', and 'Duration'.
import pandas as pd
# Sample data: one list of values per column
technologies = {
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Discount': [1000, 2300, 1000, 1200, 2500],
    'Duration': ['35days', '35days', '40days', '30days', '25days']
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Discount Duration
0 Spark 22000 1000 35days
1 PySpark 25000 2300 35days
2 Hadoop 23000 1000 40days
3 Python 24000 1200 30days
4 Pandas 26000 2500 25days
2. Use iloc[] to Split the DataFrame in Pandas
We can use the iloc[] indexer to split the given DataFrame. iloc[] selects rows and columns by integer position, whereas loc[] selects them by row and column labels.
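To make the position versus label distinction concrete, here is a minimal sketch on a small stand-in DataFrame (not the article's full data). It relies on the fact that iloc[] slices exclude the end position while loc[] slices include the end label, so with the default 0, 1, 2 index, iloc[:2] and loc[:1] select the same rows.

```python
import pandas as pd

# Small stand-in DataFrame with the default integer index (0, 1, 2)
df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [22000, 25000, 23000],
})

by_position = df.iloc[:2]   # positions 0 and 1 (end position excluded)
by_label = df.loc[:1]       # labels 0 through 1 (end label included)

# Both selections contain the same two rows
print(by_position.equals(by_label))
```

If the index were not the default integer range (for example, after filtering or sorting), the two selections would generally differ, which is why iloc[] is the safer choice for purely positional splits.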
2.1 Split DataFrame by Row
Using this indexer, we can select a portion of the DataFrame by rows. Here, I use iloc[] to split the given DataFrame into two smaller DataFrames: the first two rows, and all remaining rows.
# Split the DataFrame using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
print(df1)
print("---------------------------")
print(df2)
Yields below output.
# Output:
Courses Fee Discount Duration
0 Spark 22000 1000 35days
1 PySpark 25000 2300 35days
---------------------------
Courses Fee Discount Duration
2 Hadoop 23000 1000 40days
3 Python 24000 1200 30days
4 Pandas 26000 2500 25days
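The same row slicing extends to more than two parts. The sketch below is one possible way to cut a DataFrame into consecutive chunks of a fixed size; split_rows is a hypothetical helper written for this illustration, not a Pandas function, and the DataFrame is a shortened copy of the one above.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
})

def split_rows(frame, chunk_size):
    """Hypothetical helper: cut a DataFrame into consecutive row chunks."""
    return [frame.iloc[start:start + chunk_size]
            for start in range(0, len(frame), chunk_size)]

# Chunks of 2 rows; the last chunk holds whatever rows are left over
chunks = split_rows(df, 2)
for chunk in chunks:
    print(chunk)
    print("---------------------------")
```

Because iloc[] slices past the end of the DataFrame are simply truncated, the helper needs no special handling for a final chunk that is shorter than chunk_size.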
2.2 Split DataFrame by Columns
The previous section split the DataFrame by rows using iloc[]. Now, let's split it by columns. The approach is the same, but the slice moves to the second position: a bare : in the first position keeps all rows, and the slice after the comma selects the column positions.
# Split the DataFrame using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]
print(df1)
print("---------------------------")
print(df2)
Yields below output.
# Output:
Courses Fee
0 Spark 22000
1 PySpark 25000
2 Hadoop 23000
3 Python 24000
4 Pandas 26000
---------------------------
Discount Duration
0 1000 35days
1 2300 35days
2 1000 40days
3 1200 30days
4 2500 25days
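When column positions may change, the same split can be written by column name instead. This is a minimal sketch, assuming a shortened copy of the DataFrame above; the variable left_cols is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop"],
    'Fee': [22000, 25000, 23000],
    'Discount': [1000, 2300, 1000],
    'Duration': ['35days', '35days', '40days'],
})

left_cols = ['Courses', 'Fee']        # columns for the first part
df1 = df[left_cols]
df2 = df.drop(columns=left_cols)      # every remaining column

# Joining the two parts side by side restores the original DataFrame
restored = pd.concat([df1, df2], axis=1)
print(restored.equals(df))
```

Checking that the parts recombine into the original is a quick way to confirm that a split neither dropped nor duplicated any column.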
3. Split Pandas DataFrame using groupby() Function
The DataFrame.groupby() function splits a DataFrame into groups based on the values in a column. First, we group the DataFrame using groupby(); after that, we select a specific group using the get_group() function. This approach fits best when we want to split a DataFrame on the distinct values of a column, so that each value gets its own group of rows.
# Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")
print(df1)
Yields below output.
# Output:
Courses Fee Discount Duration
0 Spark 22000 1000 35days
1 PySpark 25000 2300 35days
The above example returns a new DataFrame containing only the rows where 'Duration' is '35days'.
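To split the DataFrame into one piece per group rather than pulling out a single group, we can iterate over the GroupBy object. The following is a minimal sketch on a shortened copy of the DataFrame above; the dictionary name parts is chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Duration': ['35days', '35days', '40days', '30days', '25days'],
})

# Iterating a GroupBy yields (group key, sub-DataFrame) pairs
parts = {key: group for key, group in df.groupby('Duration')}

for key, part in parts.items():
    print(key)
    print(part)
    print("---------------------------")
```

Each value in parts is a DataFrame holding the rows for one 'Duration' value, so parts['35days'] contains the same rows that get_group("35days") would return for this data.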
4. Split the DataFrame using sample() to Shuffle Rows
By using the pandas.DataFrame.sample() function, we can split the DataFrame by selecting rows at random. df.sample(frac=1) returns all rows in a randomly shuffled order. The frac keyword argument specifies the fraction of rows to return in the random sample. When neither frac nor n is given, sample() returns 1 random row. frac=.5 returns a random 50% of the rows. Passing random_state makes the selection reproducible.
Let’s see how to use the sample() function to split our DataFrame with random rows.
# Split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)
# reset_index() moves the original row labels into an 'index' column
print(df1.reset_index())
Yields below output.
# Output:
index Courses Fee Discount Duration
0 3 Python 24000 1200 30days
1 4 Pandas 26000 2500 25days
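sample() returns only the randomly chosen part. To get the other part of the split, one common approach is to drop the sampled row labels from the original DataFrame. This is a minimal sketch, assuming the row labels are unique (as with the default index) and using a shortened copy of the DataFrame above.

```python
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
})

# First part: a reproducible random sample of about half the rows
df1 = df.sample(frac=0.5, random_state=200)

# Second part: every row whose label was not sampled
# (assumes the index labels are unique)
df2 = df.drop(df1.index)

print(len(df1), len(df2))
```

Together, df1 and df2 cover every row of df exactly once, which is the usual requirement for something like a train/test split.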
5. Conclusion
In this article, I have explained how to split a Pandas DataFrame by rows and by columns using df.iloc[], and how to split it using the df.groupby() and df.sample() functions, with examples.