You can shuffle the rows of a pandas DataFrame randomly with the pandas.DataFrame.sample() method. If you are using NumPy, numpy.random.permutation() can generate a shuffled row order, and the scikit-learn (sklearn) package provides a shuffle() utility that also reorders DataFrame rows.
Key Points –
- Shuffling DataFrame rows helps in randomizing the order of data, which can be crucial for certain statistical analyses and machine learning tasks.
- The DataFrame.sample() method provides a convenient way to shuffle rows; it returns a new DataFrame and does not modify the original.
- Shuffling before splitting data into subsets helps each subset be representative of the whole, which can improve how well machine learning models generalize.
- DataFrame.sample() takes frac to specify the fraction of rows or n to specify the exact number of rows to sample.
- sample() has no in-place option. To shuffle a DataFrame "in place", assign the result back to the same variable, for example df = df.sample(frac=1).
- For large datasets, shuffling can be memory-intensive because sample() builds a new DataFrame, so consider the available computational resources, especially in distributed computing environments.
Create a DataFrame with a Dictionary of Lists
Let’s create a pandas DataFrame from a dictionary of lists, with the column names Courses, Fee, Duration, and Discount.
# Create a DataFrame with a Dictionary of Lists
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30days','40days','35days','40days','60days','50days','55days'],
'Discount':[1000,2300,1500,1200,2500,2100,2000]
}
df = pd.DataFrame(technologies)
print(df)
Yields below output.
# Output:
Courses Fee Duration Discount
0 Spark 20000 30days 1000
1 PySpark 25000 40days 2300
2 Hadoop 26000 35days 1500
3 Python 22000 40days 1200
4 pandas 24000 60days 2500
5 Oracle 21000 50days 2100
6 Java 22000 55days 2000
Pandas Shuffle DataFrame Rows
Use the pandas.DataFrame.sample(frac=1) method to shuffle the order of rows. The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 returns all rows in random order and frac=.5 returns a random 50% of the rows. If neither frac nor n is given, sample() returns 1 random row.
Note that the sample() method by default returns a new DataFrame after shuffling.
# Shuffle the DataFrame rows & return all rows
df1 = df.sample(frac = 1)
print(df1)
Yields below output.
# Output:
Courses Fee Duration Discount
0 Spark 20000 30days 1000
6 Java 22000 55days 2000
1 PySpark 25000 40days 2300
5 Oracle 21000 50days 2100
2 Hadoop 26000 35days 1500
3 Python 22000 40days 1200
4 pandas 24000 60days 2500
If you want n random rows instead, pass n, for example df.sample(n=2).
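To make the difference between frac and n concrete, here is a minimal sketch on a small frame. The four-row frame and the variable names are illustrative only and are not part of the example above.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python"],
                   "Fee": [20000, 25000, 26000, 22000]})

all_rows = df.sample(frac=1)     # every row, in random order
half_rows = df.sample(frac=0.5)  # a random half of the rows
two_rows = df.sample(n=2)        # exactly two random rows
one_row = df.sample()            # neither n nor frac given: one random row

print(len(all_rows), len(half_rows), len(two_rows), len(one_row))
```

Note that n and frac cannot be passed together; pandas raises an error if both are given.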
Pandas Shuffle Rows by Setting New Index
As you can see above, the index is shuffled along with the rows. If you want a new index starting from 0 while keeping the shuffled index as a column, use reset_index().
# Create a new Index starting from zero
df1 = df.sample(frac = 1).reset_index()
print(df1)
Yields below output.
# Output:
index Courses Fee Duration Discount
0 6 Java 22000 55days 2000
1 2 Hadoop 26000 35days 1500
2 4 pandas 24000 60days 2500
3 3 Python 22000 40days 1200
4 5 Oracle 21000 50days 2100
5 0 Spark 20000 30days 1000
6 1 PySpark 25000 40days 2300
If you do not want to keep the shuffled index, use .reset_index(drop=True).
# Drop shuffle Index
df1 = df.sample(frac = 1).reset_index(drop=True)
print(df1)
In this DataFrame df1, the rows are shuffled and the index has been reset to start from zero. The previous index has been dropped, as specified by drop=True in the reset_index() call. This is a common technique to shuffle rows while maintaining a clean, sequential index.
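Assuming pandas 1.3 or later, sample() also accepts an ignore_index=True argument, which should give the same result as chaining reset_index(drop=True). A minimal sketch, using an illustrative frame and a fixed random_state so that both calls produce the same order:

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python"],
                   "Fee": [20000, 25000, 26000, 22000]})

# Same seed, so both calls shuffle the rows into the same order
a = df.sample(frac=1, random_state=1).reset_index(drop=True)
b = df.sample(frac=1, random_state=1, ignore_index=True)

print(a.equals(b))     # both approaches give the same frame
print(list(b.index))   # the index is a clean 0..n-1 range
```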
Using numpy.random.shuffle to Change Order of Rows
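You can use numpy.random.shuffle() to change the order of the DataFrame rows. It shuffles an array in place and returns None. Calling it directly on df.values is unreliable, because for a DataFrame with mixed column types .values is a newly built array and the DataFrame itself stays unchanged. A safer approach is to shuffle a copy of the index labels and select the rows in that order. Make sure you import NumPy before using this method.
# Using NumPy
import numpy as np
idx = df.index.to_numpy().copy()  # writable copy of the row labels
np.random.shuffle(idx)            # shuffles the array in place, returns None
df1 = df.loc[idx]                 # select rows in the shuffled order
print(df1)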
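np.random.shuffle() reorders a NumPy array in place along its first axis and returns None. A minimal sketch, assuming a small illustrative frame, of shuffling an explicit copy of the values and rebuilding a DataFrame from it. The astype() step is one possible way to restore the column dtypes that are lost when mixed columns are packed into a single array.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python"],
                   "Fee": [20000, 25000, 26000, 22000]})

arr = df.to_numpy().copy()   # explicit, writable copy of the values
np.random.shuffle(arr)       # reorders the rows of arr in place, returns None

# Rebuild a DataFrame; the mixed-type array is object dtype,
# so cast the columns back to their original dtypes
df1 = pd.DataFrame(arr, columns=df.columns).astype(df.dtypes.to_dict())

print(df)    # original order is untouched
print(df1)   # shuffled copy with a fresh 0..n-1 index
```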
Using permutation() From numpy to Get Random Sample
You can also use the numpy.random.permutation() method to shuffle pandas DataFrame rows. Passing the number of rows returns a shuffled array of row positions, which is then used to select rows with .iloc[]. For instance, df.iloc[np.random.permutation(len(df))].reset_index(drop=True). Passing df.index instead of len(df) only works when the index is the default 0 to n-1 range, because .iloc[] expects positions, not labels.
# Using numpy permutation() method to shuffle DataFrame rows
df1 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
print(df1)
Yields below output.
# Output:
Courses Fee Duration Discount
0 pandas 24000 60days 2500
1 Spark 20000 30days 1000
2 Java 22000 55days 2000
3 Oracle 21000 50days 2100
4 Python 22000 40days 1200
5 PySpark 25000 40days 2300
6 Hadoop 26000 35days 1500
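If you need the same shuffled order on every run, one option is NumPy's seeded Generator API. A minimal sketch, assuming an illustrative frame and an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python"],
                   "Fee": [20000, 25000, 26000, 22000]})

rng = np.random.default_rng(42)        # seeded generator for a repeatable order
positions = rng.permutation(len(df))   # shuffled row positions 0..n-1
df1 = df.iloc[positions].reset_index(drop=True)

print(df1)
```

Because the positions come from len(df), this works regardless of what labels the DataFrame index holds.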
Using sklearn shuffle() to Reorder DataFrame Rows
You can also use the sklearn.utils.shuffle() function to shuffle pandas DataFrame rows. To use sklearn, install it with pip (the package name is scikit-learn) and import it in your program.
# Using sklearn to shuffle rows
from sklearn.utils import shuffle
df = shuffle(df)
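sklearn.utils.shuffle() can also take several objects at once and applies the same permutation to all of them, which is handy when features and labels are stored separately. A minimal sketch, assuming scikit-learn is installed; the X and y names and the data are illustrative only:

```python
import pandas as pd
from sklearn.utils import shuffle

# Illustrative features and labels (hypothetical data)
X = pd.DataFrame({"Fee": [20000, 25000, 26000, 22000],
                  "Discount": [1000, 2300, 1500, 1200]})
y = pd.Series(["Spark", "PySpark", "Hadoop", "Python"], name="Courses")

# Both objects are shuffled with the same permutation, so rows stay paired;
# random_state makes the order repeatable
X_shuffled, y_shuffled = shuffle(X, y, random_state=0)

print(X_shuffled.index.equals(y_shuffled.index))
```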
Using DataFrame.apply() & numpy.random.permutation() to Shuffle
You can also use df.apply(np.random.permutation, axis=1). Note that this does not reorder the rows; it shuffles the values within each row and returns a Series with dtype: object, as the output below shows.
# Using apply() method to shuffle the DataFrame rows
import numpy as np
df1 = df.apply(np.random.permutation, axis=1)
print(df1)
Yields below output.
# Output:
0 [30days, Spark, 1000, 20000]
1 [40days, PySpark, 25000, 2300]
2 [1500, Hadoop, 26000, 35days]
3 [40days, 1200, Python, 22000]
4 [60days, pandas, 2500, 24000]
5 [2100, 21000, 50days, Oracle]
6 [2000, Java, 22000, 55days]
dtype: object
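To check what apply() with axis=1 actually shuffles, here is a minimal sketch on a numeric-only frame (chosen so that each row can be sorted for comparison; the data is illustrative). It suggests that the row order stays the same and only the values inside each row move.

```python
import numpy as np
import pandas as pd

# Numeric-only illustrative frame so each row can be sorted for comparison
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# axis=1 passes each row to the function, so values move within a row only
out = df.apply(lambda row: np.random.permutation(row.to_numpy()), axis=1)

print(list(out.index))  # row labels are in their original order
for label in df.index:
    # each row still holds the same values, possibly in a different order
    print(sorted(out.loc[label].tolist()), sorted(df.loc[label].tolist()))
```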
Pandas DataFrame Shuffle/Permutating Rows Using Lambda Function
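Use df.apply(lambda x: x.sample(frac=1).values) to sample each column independently. apply() iterates over each column, and .values returns the shuffled column as a NumPy array. frac=1 means all rows are returned. Because every column is shuffled separately, the values in a row no longer belong to the same original record, as the output below shows.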
# Using lambda method to Shuffle/permutating DataFrame rows
df2 = df.apply(lambda x: x.sample(frac=1).values)
print(df2)
Yields below output.
# Output:
Courses Fee Duration Discount
0 Oracle 20000 40days 1000
1 Hadoop 21000 60days 2300
2 pandas 26000 40days 1500
3 PySpark 24000 30days 1200
4 Spark 22000 35days 2000
5 Java 22000 50days 2500
6 Python 25000 55days 2100
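The difference between shuffling whole rows and shuffling each column independently can be checked with a frame whose columns are related. A minimal sketch, using a hypothetical n and n_squared pair:

```python
import pandas as pd

# Illustrative frame where every row satisfies n_squared == n * n
df = pd.DataFrame({"n": [1, 2, 3, 4, 5, 6, 7]})
df["n_squared"] = df["n"] ** 2

by_column = df.apply(lambda col: col.sample(frac=1).values)  # columns shuffled independently
by_row = df.sample(frac=1)                                    # whole rows shuffled together

# Each column keeps the same set of values either way
print(sorted(by_column["n"].tolist()) == sorted(df["n"].tolist()))
# Whole-row shuffling preserves the relationship between columns
print((by_row["n_squared"] == by_row["n"] ** 2).all())
# Column-wise shuffling usually breaks it (it survives only by chance)
print((by_column["n_squared"] == by_column["n"] ** 2).all())
```

If rows must stay intact, df.sample(frac=1) is the safer choice; the column-wise variant is closer to a permutation test than to a row shuffle.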
Shuffle DataFrame Randomly by Rows and Columns
You can use df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True) to shuffle both rows and columns randomly. The resulting DataFrame looks completely randomized. I really don’t know the use case for this, but I would like to cover it as it is possible with the sample() method.
# Using sample() method to shuffle DataFrame rows and columns
df2 = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
print(df2)
Yields below output.
# Output:
Duration Fee Discount Courses
0 60days 24000 2500 pandas
1 55days 22000 2000 Java
2 40days 25000 2300 PySpark
3 40days 22000 1200 Python
4 35days 26000 1500 Hadoop
5 50days 21000 2100 Oracle
6 30days 20000 1000 Spark
Complete Example For Shuffle DataFrame Rows
import pandas as pd
technologies = {
'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
'Fee' :[20000,25000,26000,22000,24000,21000,22000],
'Duration':['30days','40days','35days', '40days','60days','50days','55days'],
'Discount':[1000,2300,1500,1200,2500,2100,2000]
}
df = pd.DataFrame(technologies)
print(df)
# Shuffle the DataFrame rows & return all rows
df1 = df.sample(frac = 1)
print(df1)
# Create a new Index starting from zero
df1 = df.sample(frac = 1).reset_index()
print(df1)
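# Using NumPy
import numpy as np
idx = df.index.to_numpy().copy()  # writable copy of the row labels
np.random.shuffle(idx)            # shuffles the array in place
df1 = df.loc[idx]
print(df1)
# Using numpy permutation() method to shuffle DataFrame rows
df1 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)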
print(df1)
# Using sklearn to shuffle rows
from sklearn.utils import shuffle
df = shuffle(df)
# Using apply() method to shuffle the DataFrame rows
import numpy as np
df1 = df.apply(np.random.permutation, axis=1)
print(df1)
# Using lambda method to Shuffle/permutating DataFrame rows
df2 = df.apply(lambda x: x.sample(frac=1).values)
print(df2)
# Using sample() method to shuffle DataFrame rows and columns
df2 = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
print(df2)
Frequently Asked Questions on Pandas Shuffle DataFrame Rows
Why should you shuffle DataFrame rows?
Shuffling DataFrame rows helps eliminate biases that might arise from the inherent order of the data, ensuring fairness in analyses and model training.
How can you shuffle DataFrame rows in pandas?
You can use the DataFrame.sample() method, specifying parameters such as frac to indicate the fraction of rows to sample or n to specify the exact number of rows.
Is shuffling rows important for machine learning?
Shuffling rows is crucial, especially for tasks like cross-validation, where the order of data can influence model performance. It aids in producing more reliable and generalizable models.
How can you control the randomness of row shuffling?
You can control the randomness of row shuffling in pandas by specifying a random seed. This ensures that the shuffling process produces the same results when the same seed is used. You can achieve this by providing a value to the random_state parameter in the DataFrame.sample() method.
Does shuffling affect the original DataFrame?
Shuffling DataFrame rows using the DataFrame.sample() method does not affect the original DataFrame. By default, DataFrame.sample() returns a new DataFrame with the rows shuffled according to the specified parameters, leaving the original DataFrame unchanged.
Are there performance considerations for large DataFrames?
Shuffling large datasets can be memory-intensive, particularly in distributed computing environments. It’s essential to manage computational resources efficiently.
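The points above about random_state and about sample() leaving the original DataFrame unchanged can be sketched as follows. The frame and the seed value 42 are illustrative assumptions.

```python
import pandas as pd

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({"Courses": ["Spark", "PySpark", "Hadoop", "Python"],
                   "Fee": [20000, 25000, 26000, 22000]})
original = df.copy()

first = df.sample(frac=1, random_state=42)
second = df.sample(frac=1, random_state=42)

print(first.equals(second))   # same seed, same shuffled order
print(df.equals(original))    # sample() left df untouched

# Reassign if the shuffled order should replace the original
df = df.sample(frac=1, random_state=42)
```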
Conclusion
In this article, you have learned how to shuffle pandas DataFrame rows using different approaches: DataFrame.sample(), DataFrame.apply(), DataFrame.iloc[], and a lambda function. You have also learned to shuffle pandas DataFrame rows using numpy.random.permutation() and sklearn.utils.shuffle().
Happy Learning !!