Pandas Shuffle DataFrame Rows Examples

  • Post category:Pandas
  • Post last modified:February 12, 2024

By using the pandas.DataFrame.sample() method you can shuffle the rows of a DataFrame randomly. If you are using NumPy, you can use numpy.random.permutation() to reorder the rows, and scikit-learn (sklearn) provides sklearn.utils.shuffle() for the same purpose.

Key Points –

  • Shuffling DataFrame rows helps in randomizing the order of data, which can be crucial for certain statistical analyses and machine learning tasks.
  • The DataFrame.sample() method in Pandas provides a convenient way to shuffle DataFrame rows efficiently without modifying the original DataFrame.
  • Shuffling DataFrame rows can help in enhancing the diversity of data subsets, thereby improving the generalization ability of machine learning models.
  • The DataFrame.sample() method facilitates row shuffling with parameters such as frac to specify the fraction of rows or n to define the exact number of rows to sample.
  • To shuffle every row, call DataFrame.sample() with frac=1. The method returns a new DataFrame, so assign the result back to the same variable if you want to replace the original.
  • For large datasets, shuffling can be memory-intensive, necessitating careful consideration of computational resources, especially in distributed computing environments.

Create a DataFrame with a Dictionary of Lists

Let’s create a pandas DataFrame from a dictionary of lists, with the column names Courses, Fee, Duration, and Discount.


# Create a DataFrame with a Dictionary of Lists
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30days','40days','35days','40days','60days','50days','55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
1  PySpark  25000   40days      2300
2   Hadoop  26000   35days      1500
3   Python  22000   40days      1200
4   pandas  24000   60days      2500
5   Oracle  21000   50days      2100
6     Java  22000   55days      2000

Pandas Shuffle DataFrame Rows

Use the pandas.DataFrame.sample(frac=1) method to shuffle the order of rows. The frac keyword argument specifies the fraction of rows to return in the random sample. For example, frac=.5 returns a random 50% of the rows. If neither frac nor n is given, sample() returns a single random row.

Note that the sample() method by default returns a new DataFrame after shuffling.


# Shuffle the DataFrame rows & return all rows
df1 = df.sample(frac = 1)
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000   30days      1000
6     Java  22000   55days      2000
1  PySpark  25000   40days      2300
5   Oracle  21000   50days      2100
2   Hadoop  26000   35days      1500
3   Python  22000   40days      1200
4   pandas  24000   60days      2500

If you want a fixed number of random rows instead, pass n. For example, df.sample(n=2) returns two random rows.
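
As a quick check of how n and frac size the result, here is a minimal sketch. It rebuilds three of the article's columns so it runs on its own; the variable names two_rows, half, and one_row are just illustrative.

```python
import pandas as pd

# Stand-in frame using three of the article's columns
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    "Discount": [1000, 2300, 1500, 1200, 2500, 2100, 2000],
})

two_rows = df.sample(n=2)      # exactly 2 random rows
half = df.sample(frac=0.5)     # roughly half of the 7 rows
one_row = df.sample()          # neither n nor frac given: a single row

print(len(two_rows), len(half), len(one_row))
```

Note that n and frac cannot be passed together; pandas raises an error if both are set.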

Pandas Shuffle Rows by Setting New Index

As you can see above, the index labels move together with the rows. If you want a new index starting from 0 while keeping the shuffled index, use reset_index(); the old index is stored in a new column named index.


# Create a new Index starting from zero
df1 = df.sample(frac = 1).reset_index()
print(df1)

Yields below output.


# Output:
   index  Courses    Fee Duration  Discount
0      6     Java  22000   55days      2000
1      2   Hadoop  26000   35days      1500
2      4   pandas  24000   60days      2500
3      3   Python  22000   40days      1200
4      5   Oracle  21000   50days      2100
5      0    Spark  20000   30days      1000
6      1  PySpark  25000   40days      2300

If you do not want to keep the shuffled index, use .reset_index(drop=True).


# Drop shuffle Index
df1 = df.sample(frac = 1).reset_index(drop=True)
print(df1)

In this DataFrame df1, the rows are shuffled, and the index has been reset to start from zero. The previous index has been dropped, as specified by drop=True in the reset_index() function. This is a common technique to shuffle rows while maintaining a clean, sequential index.
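
The sketch below compares reset_index(drop=True) with the ignore_index=True argument of sample(), which should give the same clean index in a single call. It assumes pandas 1.3 or later, where ignore_index was added to sample(); the names a and b are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
})

# Two ways to get shuffled rows with a fresh 0..n-1 index
a = df.sample(frac=1).reset_index(drop=True)
b = df.sample(frac=1, ignore_index=True)   # assumes pandas >= 1.3

print(a.index.tolist())
print(b.index.tolist())
```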

Using numpy.random.shuffle to Change Order of Rows

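You can use numpy.random.shuffle() to change the order of the rows. It shuffles a NumPy array in place along its first axis, so shuffle the array of values and build a new DataFrame from it. For a DataFrame with mixed column types, such as the one above, the array returned by df.to_numpy() is a copy, so the original DataFrame is not modified and the rebuilt columns have object dtype. Make sure you import NumPy before using this method.


# Using NumPy
import numpy as np
values = df.to_numpy()       # copy for mixed dtypes; df is unchanged
np.random.shuffle(values)    # shuffle the array rows in place
df1 = pd.DataFrame(values, columns=df.columns)
print(df1)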

Using permutation() From numpy to Get Random Sample

We can also use the numpy.random.permutation() method to shuffle pandas DataFrame rows. It returns a shuffled array of row positions, which are then used to select rows with .iloc[]. Passing len(df) rather than df.index keeps this correct even when the index is not the default 0 to n-1 range. For instance, df.iloc[np.random.permutation(len(df))].reset_index(drop=True).


# Using numpy permutation() method to shuffle DataFrame rows
df1 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0   pandas  24000   60days      2500
1    Spark  20000   30days      1000
2     Java  22000   55days      2000
3   Oracle  21000   50days      2100
4   Python  22000   40days      1200
5  PySpark  25000   40days      2300
6   Hadoop  26000   35days      1500
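
If you need the same shuffled order on every run, one option is a seeded NumPy Generator. This is a minimal sketch, not the only way to do it: the seed 42 is an arbitrary choice, and rng and positions are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
})

rng = np.random.default_rng(42)         # arbitrary seed for repeatable results
positions = rng.permutation(len(df))    # shuffled row positions 0..n-1
df1 = df.iloc[positions].reset_index(drop=True)

print(positions)
print(df1)
```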

Using sklearn shuffle() to Reorder DataFrame Rows

You can also use the sklearn.utils.shuffle() function to shuffle pandas DataFrame rows. To use sklearn, install it with pip (pip install scikit-learn) and import the function in your program.


# Using sklearn to shuffle rows
from sklearn.utils import shuffle
df = shuffle(df)
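
sklearn.utils.shuffle() also accepts a random_state argument. The sketch below assumes scikit-learn is installed and uses an arbitrary seed of 0 to show that the result is repeatable and that the input DataFrame keeps its order; shuffled and again are illustrative names.

```python
import pandas as pd
from sklearn.utils import shuffle

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
})

# random_state=0 is an arbitrary seed chosen for this sketch
shuffled = shuffle(df, random_state=0)
again = shuffle(df, random_state=0)

print(shuffled.equals(again))    # same seed, same order
print(df.index.tolist())         # the input DataFrame is left as it was
```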

Using DataFrame.apply() & numpy.random.permutation() to Shuffle

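You can also use df.apply(np.random.permutation, axis=1). Note that with axis=1 the function is applied to each row, so this permutes the values within every row rather than changing the order of the rows. The result is a Series of dtype object.


# Using apply() method to shuffle the values within each row
import numpy as np
df1 = df.apply(np.random.permutation, axis=1)
print(df1)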

Yields below output.


# Output:
0      [30days, Spark, 1000, 20000]
1    [40days, PySpark, 25000, 2300]
2     [1500, Hadoop, 26000, 35days]
3     [40days, 1200, Python, 22000]
4     [60days, pandas, 2500, 24000]
5     [2100, 21000, 50days, Oracle]
6       [2000, Java, 22000, 55days]
dtype: object
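
To see what axis=1 does to the data, the following sketch checks that each output row holds the same values as the matching input row, only in a different order. It uses a three-row stand-in frame, and the lambda copies every row into an object array before permuting it; that copy is a defensive choice of this sketch, not something the call above requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
    "Discount": [1000, 2300, 1500],
})

# Copy each row to an object array, then permute the values inside that row
per_row = df.apply(
    lambda row: np.random.permutation(np.array(row, dtype=object)), axis=1
)

print(per_row.index.tolist())   # row labels are still in their original order
for original, permuted in zip(df.itertuples(index=False, name=None), per_row):
    # Compare as strings so text and numbers can be sorted together
    print(sorted(map(str, original)) == sorted(map(str, permuted)))
```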

Pandas DataFrame Shuffle/Permutating Rows Using Lambda Function

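Use df.apply(lambda x: x.sample(frac=1).values) to sample each column independently. apply() iterates over the columns, and .values returns a NumPy array so that the shuffled values are not realigned by index. frac=1 means all rows are returned. Because every column is shuffled separately, values that were in the same row of the original DataFrame no longer stay together.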


# Using lambda method to Shuffle/permutating DataFrame rows
df2 = df.apply(lambda x: x.sample(frac=1).values)
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0   Oracle  20000   40days      1000
1   Hadoop  21000   60days      2300
2   pandas  26000   40days      1500
3  PySpark  24000   30days      1200
4    Spark  22000   35days      2000
5     Java  22000   50days      2500
6   Python  25000   55days      2100
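
The sketch below looks at what this does to the data: each column still contains the same values, but the original row combinations are usually broken up. The names shuffled_cols, same_values, and surviving are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    "Discount": [1000, 2300, 1500, 1200, 2500, 2100, 2000],
})

# Shuffle every column on its own
shuffled_cols = df.apply(lambda col: col.sample(frac=1).values)

# Each column keeps exactly the same set of values
same_values = all(sorted(shuffled_cols[c]) == sorted(df[c]) for c in df.columns)

# Count how many original (Courses, Fee, Discount) rows are still intact
original_rows = set(df.itertuples(index=False, name=None))
surviving = sum(
    row in original_rows
    for row in shuffled_cols.itertuples(index=False, name=None)
)

print(same_values, surviving)
```

If you need whole records to stay together, shuffle with df.sample(frac=1) instead.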

Shuffle DataFrame Randomly by Rows and Columns

You can use df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True) to shuffle both rows and columns randomly: the first sample() call shuffles the column order and the second one shuffles the rows. I really don’t know the use case for this, but I would like to cover it since it is possible with the sample() method.


# Using sample() method to shuffle DataFrame rows and columns
df2 = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
print(df2)

Yields below output.


# Output:
  Duration    Fee  Discount  Courses
0   60days  24000      2500   pandas
1   55days  22000      2000     Java
2   40days  25000      2300  PySpark
3   40days  22000      1200   Python
4   35days  26000      1500   Hadoop
5   50days  21000      2100   Oracle
6   30days  20000      1000    Spark
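
Even though the result looks scrambled, every record is still intact. As a rough check, this sketch puts the columns back in their original order, sorts both frames by Courses, and compares them; restored and expected are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
    "Duration": ["30days", "40days", "35days", "40days", "60days", "50days", "55days"],
    "Discount": [1000, 2300, 1500, 1200, 2500, 2100, 2000],
})

# Shuffle the column order, then the row order
df2 = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)

# Undo both shuffles: original column order, rows sorted by Courses
restored = df2[df.columns].sort_values("Courses").reset_index(drop=True)
expected = df.sort_values("Courses").reset_index(drop=True)

print(restored.equals(expected))
```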

Complete Example For Shuffle DataFrame Rows


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30days','40days','35days', '40days','60days','50days','55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
df = pd.DataFrame(technologies)
print(df)

# Shuffle the DataFrame rows & return all rows
df1 = df.sample(frac = 1)
print(df1)

# Create a new Index starting from zero
df1 = df.sample(frac = 1).reset_index()
print(df1)

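# Using NumPy
import numpy as np
values = df.to_numpy()       # copy for mixed dtypes; df is unchanged
np.random.shuffle(values)    # shuffle the array rows in place
df1 = pd.DataFrame(values, columns=df.columns)
print(df1)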

# Using numpy permutation() method to shuffle DataFrame rows
df1 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
print(df1)

# Using sklearn to shuffle rows
from sklearn.utils import shuffle
df = shuffle(df)

# Using apply() method to shuffle the values within each row
df1 = df.apply(np.random.permutation, axis=1)
print(df1)

# Using lambda method to Shuffle/permutating DataFrame rows
df2 = df.apply(lambda x: x.sample(frac=1).values)
print(df2)

# Using sample() method to shuffle DataFrame rows and columns
df2 = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
print(df2)

Frequently Asked Questions on Pandas Shuffle DataFrame Rows

Why should I shuffle DataFrame rows?

Shuffling DataFrame rows helps in eliminating biases that might arise from the inherent order of data, ensuring fairness in analyses and model training.

How do I shuffle DataFrame rows in Pandas?

You can use the DataFrame.sample() method in Pandas, specifying parameters such as frac to indicate the fraction of rows to sample or n to specify the exact number of rows.

Is shuffling rows necessary for machine learning tasks?

Shuffling rows is crucial, especially for tasks like cross-validation, where the order of data can influence model performance. It aids in producing more reliable and generalizable models.
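
As a small illustration of shuffling before a split, the sketch below shuffles the rows and then cuts them into two parts with plain pandas. The 80/20 ratio, the seed 0, and the names train and test are assumptions made for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
})

# Shuffle first so the split does not depend on the original row order
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

split = int(len(shuffled) * 0.8)    # 80/20 is an arbitrary illustrative ratio
train = shuffled.iloc[:split]
test = shuffled.iloc[split:]

print(len(train), len(test))
```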

Can I control the randomness of row shuffling?

You can control the randomness of row shuffling in Pandas by specifying a random seed. This ensures that the shuffling process produces the same results when the same seed is used. You can achieve this by providing a value to the random_state parameter in the DataFrame.sample() method.
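
For example, two calls with the same random_state should return the rows in the same order. The seed value 42 below is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
})

a = df.sample(frac=1, random_state=42)
b = df.sample(frac=1, random_state=42)

print(a.index.tolist() == b.index.tolist())   # same seed, same order
```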

Does shuffling DataFrame rows affect the original DataFrame?

Shuffling DataFrame rows using the DataFrame.sample() method does not affect the original DataFrame. By default, DataFrame.sample() returns a new DataFrame with the rows shuffled according to the specified parameters, leaving the original DataFrame unchanged.
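
Here is a minimal sketch of that behavior. Since sample() has no in-place option, the last line simply rebinds the name df to the shuffled result; original_order and shuffled are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "pandas", "Oracle", "Java"],
    "Fee": [20000, 25000, 26000, 22000, 24000, 21000, 22000],
})
original_order = df["Courses"].tolist()

shuffled = df.sample(frac=1)                       # a new DataFrame
print(df["Courses"].tolist() == original_order)    # df itself keeps its order

# To replace the original, assign the shuffled result back to the same name
df = df.sample(frac=1).reset_index(drop=True)
```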

Are there any considerations for shuffling large datasets?

Shuffling large datasets can be memory-intensive, particularly in distributed computing environments. It’s essential to manage computational resources efficiently.

Conclusion

In this article, you have learned how to shuffle pandas DataFrame rows using different approaches: DataFrame.sample(), DataFrame.apply(), DataFrame.iloc[], and a lambda function. You have also learned to shuffle pandas DataFrame rows using numpy.random.permutation() and sklearn.utils.shuffle().

Happy Learning !!

Naveen (NNK)
