You can split a Pandas DataFrame by rows or columns using the pandas.DataFrame.iloc[] attribute, groupby().get_group(), or the sample() function. Each of these returns the portion of the DataFrame that you select by rows or columns.
In this article, I will explain how to split a Pandas DataFrame based on a column or row using df.iloc[], df.groupby(), and df.sample().
Key Points –
- Using iloc[]: Split DataFrame by selecting specific rows or columns based on their index position.
- Using loc[]: Split DataFrame by selecting rows or columns based on labels.
- Using iloc[] for Column-based Splitting: Columns can be split by specifying index ranges with iloc[].
- Using groupby(): Split DataFrame into groups based on a column or multiple columns for aggregation or analysis.
- Using sample() for Random Splitting: The sample() method can be used to randomly select rows for splitting.
Quick Examples of Split Pandas DataFrame
Following are quick examples of how to split Pandas DataFrame.
# Quick examples of split pandas dataframe
# Example 1: Split the DataFrame
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
# Example 2: Split the DataFrame
# Using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]
# Example 3: Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")
# Example 4: split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)
Let’s create a Pandas DataFrame from a Python dictionary, where the columns are 'Courses', 'Fee', 'Discount', and 'Duration'.
import pandas as pd
import numpy as np
technologies= {
'Courses':["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
'Fee' :[22000, 25000, 23000, 24000, 26000],
'Discount':[1000, 2300, 1000, 1200, 2500],
'Duration':['35days', '35days', '40days', '30days', '25days']
}
df = pd.DataFrame(technologies)
print("Create DataFrame:\n", df)
Yields below output.
# Output:
Courses Fee Discount Duration
0 Spark 22000 1000 35days
1 PySpark 25000 2300 35days
2 Hadoop 23000 1000 40days
3 Python 24000 1200 30days
4 Pandas 26000 2500 25days
Use iloc[] to Split the DataFrame in Pandas
We can use the iloc[] attribute to split the given DataFrame. The iloc[] property selects rows and columns by integer position. Pandas loc[] is a related property that selects rows and columns by label.
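To make the position-versus-label distinction concrete, here is a minimal sketch. The small frame and its string index labels ('a', 'b', 'c') are made up for illustration and are not the DataFrame used in the rest of this article.

```python
import pandas as pd

# Hypothetical frame with string labels so positions and labels differ
demo = pd.DataFrame(
    {"Courses": ["Spark", "PySpark", "Hadoop"], "Fee": [22000, 25000, 23000]},
    index=["a", "b", "c"],
)

# iloc[] slices by integer position; the end position is excluded
by_position = demo.iloc[0:2]

# loc[] slices by label; the end label is included
by_label = demo.loc["a":"b"]

print(by_position)
print(by_label)
```

Both slices select the same two rows here, but only because the label range "a" to "b" happens to cover positions 0 and 1.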
Split DataFrame by Row
Using this property, we can select the required rows from the DataFrame. Here, I will use the iloc[] property to split the given DataFrame into two smaller DataFrames: the first two rows and the remaining rows. Let’s split the DataFrame.
# Split the DataFrame
# Using iloc[] by rows
df1 = df.iloc[:2,:]
df2 = df.iloc[2:,:]
print(df1)
print("---------------------------")
print(df2)
Yields below output.
# Output:
Courses Fee Discount Duration
0 Spark 22000 1000 35days
1 PySpark 25000 2300 35days
---------------------------
Courses Fee Discount Duration
2 Hadoop 23000 1000 40days
3 Python 24000 1200 30days
4 Pandas 26000 2500 25days
Split DataFrame by Columns
In the above section, you learned how to split a DataFrame by rows using the iloc[] property. Now, we will split the DataFrame by columns using the same property. The approach is the same, but the slice goes in the second (column) position of iloc[] instead of the first (row) position.
# Split the DataFrame
# Using iloc[] by columns
df1 = df.iloc[:,:2]
df2 = df.iloc[:,2:]
print(df1)
print("---------------------------")
print(df2)
Yields below output.
# Output:
Courses Fee
0 Spark 22000
1 PySpark 25000
2 Hadoop 23000
3 Python 24000
4 Pandas 26000
---------------------------
Discount Duration
0 1000 35days
1 2300 35days
2 1000 40days
3 1200 30days
4 2500 25days
Split Pandas DataFrame using groupby() Function
The DataFrame.groupby() function splits the DataFrame into groups based on the values of a column. First, we group the DataFrame using the groupby() function; after that, we can select a specific group using the get_group() function. This approach works well when we want to split a DataFrame on a column whose values identify categories, so that rows sharing the same value end up in the same group.
# Split Dataframe using groupby() &
# Grouping by particular dataframe column
grouped = df.groupby(df.Duration)
df1 = grouped.get_group("35days")
print(df1)
Yields below output.
# Output:
Courses Fee Discount Duration
0 Spark 22000 1000 35days
1 PySpark 25000 2300 35days
The above example returns a new DataFrame consisting of the rows where 'Duration' is '35days'.
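If you need every group rather than a single one, a possible approach is to iterate over the GroupBy object and collect the pieces in a dictionary. This is a minimal sketch using the same data as above; the name parts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs
parts = {key: group for key, group in df.groupby("Duration")}

for key, part in parts.items():
    print(key)
    print(part)
```

Each value in parts is a DataFrame holding the rows for one 'Duration' value, so parts["35days"] holds the same rows as the get_group("35days") call.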
Split the DataFrame using Pandas Shuffle Rows
By using the pandas.DataFrame.sample() function, we can split the DataFrame by randomly selecting rows. Calling df.sample(frac=1) returns all rows in random order, which shuffles the DataFrame. The frac keyword argument specifies the fraction of rows to return in the random sample. With frac=None (the default) and n not set, sample() returns just 1 random record; frac=.5 returns a random 50% of the rows.
Let’s see how to use the sample() function to split our DataFrame with random rows.
# Split DataFrame using sample()
df1 = df.sample(frac = 0.5, random_state = 200)
print(df1)
print(df1.reset_index())
Yields below output.
# Output:
index Courses Fee Discount Duration
0 3 Python 24000 1200 30days
1 4 Pandas 26000 2500 25days
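The sample() call on its own gives only one part of the split. One way to obtain the remaining rows, sketched here under the assumption that the index labels are unique, is to drop the sampled index labels from the original DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Randomly pick half of the rows; random_state makes the pick repeatable
df1 = df.sample(frac=0.5, random_state=200)

# The second part is every row whose index label was not sampled
# (this relies on the index labels being unique)
df2 = df.drop(df1.index)

print(df1)
print(df2)
```

Together, df1 and df2 contain every row of df exactly once, which is the usual shape of a random train/test split.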
FAQ on How to Split Pandas DataFrame?
To split a Pandas DataFrame into two parts based on a condition, you can use Boolean indexing. This allows you to filter rows that satisfy or do not satisfy the given condition.
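As a minimal sketch of Boolean indexing, the example below splits the article's data on a fee threshold. The threshold of 24000 and the names high and low are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Boolean Series: True where the condition holds (threshold is illustrative)
mask = df["Fee"] >= 24000

high = df[mask]   # rows that satisfy the condition
low = df[~mask]   # rows that do not satisfy it

print(high)
print(low)
```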
You can split a DataFrame by rows in Pandas using slicing or the iloc[] property. This is useful when you want to divide the DataFrame into smaller parts, such as for training and testing datasets or other analysis tasks.
You can split a Pandas DataFrame by columns by selecting specific columns or dividing the columns into subsets. This is useful for tasks like separating feature columns from target variables in machine learning or for analysis of specific parts of a dataset.
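For instance, a feature/target split might look like the sketch below. Treating 'Fee' as the target column is purely an assumption for illustration; the article does not define a target.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

# Assumed target column; every other column is kept as a feature
X = df.drop(columns=["Fee"])
y = df["Fee"]

print(X)
print(y)
```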
To split a Pandas DataFrame into groups based on a column, you can use the groupby() method. This is particularly useful when you want to perform operations or analyze data within groups defined by a specific column’s values.
To split a Pandas DataFrame based on index ranges, you can slice with iloc[] (by position) or loc[] (by label), or define specific index intervals. This is useful when working with time-series data or when you need to process subsets of rows based on their index positions.
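One way to split by position ranges is to step through the DataFrame in fixed-size chunks with iloc[]. This is a minimal sketch; the chunk size of 2 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Discount": [1000, 2300, 1000, 1200, 2500],
    "Duration": ["35days", "35days", "40days", "30days", "25days"],
})

chunk_size = 2  # illustrative value

# Slice consecutive position ranges; the last chunk may be shorter
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

for chunk in chunks:
    print(chunk)
```

With five rows and a chunk size of 2, this produces three DataFrames holding 2, 2, and 1 rows.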
Conclusion
In conclusion, splitting a Pandas DataFrame is a common operation in data analysis that lets us segment data based on specific criteria. In this article, we explored several methods to split DataFrames, including df.iloc[] for row and column splitting by position, df.groupby() for grouping based on column values, and df.sample() for random sampling, with an example for each method.