Pandas Shuffle DataFrame Rows Examples

You can shuffle the rows of a pandas DataFrame randomly with the pandas.DataFrame.sample() method. If you are using NumPy, you can use numpy.random.permutation() to reorder (shuffle) the rows, and the scikit-learn package (sklearn) provides a shuffle() function that does the same for a DataFrame.

1. Create a DataFrame with a Dictionary of Lists

Let’s create a pandas DataFrame from a dictionary of lists, with the column names Courses, Fee, Duration, and Discount.


# Create a DataFrame with a Dictionary of Lists
import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day','40days','35days','40days','60days','50days','55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000    30day      1000
1  PySpark  25000   40days      2300
2   Hadoop  26000   35days      1500
3   Python  22000   40days      1200
4   pandas  24000   60days      2500
5   Oracle  21000   50days      2100
6     Java  22000   55days      2000

2. Pandas Shuffle DataFrame Rows

Use pandas.DataFrame.sample(frac=1) to shuffle the order of the rows. The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 returns all rows in random order and frac=.5 returns a random 50% of the rows. If neither frac nor n is given, sample() returns a single random row.

Note that sample() returns a new DataFrame; the original DataFrame is left unchanged.


# Shuffle the DataFrame rows & return all rows
df1 = df.sample(frac = 1)
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0    Spark  20000    30day      1000
6     Java  22000   55days      2000
1  PySpark  25000   40days      2300
5   Oracle  21000   50days      2100
2   Hadoop  26000   35days      1500
3   Python  22000   40days      1200
4   pandas  24000   60days      2500

If you want to get n random rows instead, pass the n argument, for example df.sample(n=2).
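
To make a shuffle repeatable, sample() also accepts a random_state argument. The following is a minimal sketch on a smaller, illustrative DataFrame (the seed value 42 is an arbitrary choice, not something from the examples above); it shows frac=1 and n=2 side by side.

```python
import pandas as pd

# Small illustrative DataFrame (a subset of the article's data)
df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Fee": [20000, 25000, 26000, 22000],
})

# frac=1 returns every row in random order; random_state fixes the seed
shuffled = df.sample(frac=1, random_state=42)

# The same seed produces the same order on every call
shuffled_again = df.sample(frac=1, random_state=42)

# n=2 returns exactly two randomly chosen rows
two_rows = df.sample(n=2, random_state=42)

print(shuffled)
print(two_rows)
```

Sorting the shuffled result by its index (shuffled.sort_index()) gives back the original row order, which is a quick way to confirm that no rows were lost or duplicated.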

3. Pandas Shuffle Rows by Setting New Index

As you can see above, the index is shuffled along with the rows. If you want a new index starting from 0 while keeping the shuffled index, use reset_index(), which stores the old index in a new column named index.


# Create a new Index starting from zero
df1 = df.sample(frac = 1).reset_index()
print(df1)

Yields below output.


# Output:
   index  Courses    Fee Duration  Discount
0      6     Java  22000   55days      2000
1      2   Hadoop  26000   35days      1500
2      4   pandas  24000   60days      2500
3      3   Python  22000   40days      1200
4      5   Oracle  21000   50days      2100
5      0    Spark  20000    30day      1000
6      1  PySpark  25000   40days      2300

If you do not want to keep the shuffled index, use reset_index(drop=True).


# Drop the shuffled Index
df1 = df.sample(frac = 1).reset_index(drop=True)
print(df1)
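
As an alternative sketch, newer pandas releases let sample() renumber the index itself through the ignore_index argument. This assumes pandas 1.3 or later; the small DataFrame and the seed value are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Fee": [20000, 25000, 26000],
})

# Assumes pandas 1.3+, where sample() accepts ignore_index
a = df.sample(frac=1, random_state=1, ignore_index=True)

# Equivalent two-step form used in this section
b = df.sample(frac=1, random_state=1).reset_index(drop=True)

print(a)
```

With the same seed, both forms should give the same rows in the same order, each with a fresh 0-based index.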

4. Using numpy.random.shuffle to Change Order of Rows

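You can use numpy.random.shuffle() to change the order of the DataFrame rows. Make sure you import NumPy before using this method. Note that shuffle() works in place on a NumPy array, and for a DataFrame with mixed column types such as this one, the array returned by df.values or df.to_numpy() is a copy, so shuffling it does not change df. Shuffle the array and build a new DataFrame from it instead.


# Using NumPy
import numpy as np
values = df.to_numpy()       # for mixed dtypes this is a copy of the data
np.random.shuffle(values)    # shuffle the rows of the array in place
df1 = pd.DataFrame(values, columns=df.columns).infer_objects()
print(df1)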

5. Using permutation() From numpy to Get Random Sample

You can also use numpy.random.permutation() to shuffle pandas DataFrame rows. It returns the row positions in random order, and these shuffled positions are used to select rows with .iloc[]. For instance, df.iloc[np.random.permutation(len(df))].reset_index(drop=True). Passing len(df) rather than df.index keeps this correct even when the index is not the default 0 to n-1 range.


# Using numpy permutation() method to shuffle DataFrame rows
df1 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
print(df1)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0   pandas  24000   60days      2500
1    Spark  20000    30day      1000
2     Java  22000   55days      2000
3   Oracle  21000   50days      2100
4   Python  22000   40days      1200
5  PySpark  25000   40days      2300
6   Hadoop  26000   35days      1500
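
For a reproducible variant, the same idea can be sketched with NumPy's Generator API (numpy.random.default_rng). The seed value and the small DataFrame below are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Fee": [20000, 25000, 26000, 22000],
})

# Seeded generator; the seed 0 is an arbitrary choice
rng = np.random.default_rng(seed=0)

# Shuffled row positions 0 .. len(df)-1
positions = rng.permutation(len(df))

# Select rows by position, then renumber the index from zero
df1 = df.iloc[positions].reset_index(drop=True)
print(df1)
```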

6. Using sklearn shuffle() to Reorder DataFrame Rows

You can also use the sklearn.utils.shuffle() function to shuffle pandas DataFrame rows. To use sklearn, first install it with pip (for example, pip install scikit-learn), and make sure you import the function in your program.


# Using sklearn to shuffle rows
from sklearn.utils import shuffle
df = shuffle(df)
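
Like sample(), sklearn's shuffle() takes a random_state argument. The sketch below assumes scikit-learn is installed and uses an illustrative DataFrame and an arbitrary seed; it keeps the original df intact by assigning the result to a new name.

```python
import pandas as pd
from sklearn.utils import shuffle

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Fee": [20000, 25000, 26000, 22000],
})

# random_state makes the shuffle repeatable; 0 is an arbitrary seed
shuffled = shuffle(df, random_state=0)
shuffled_again = shuffle(df, random_state=0)

# The original index labels travel with their rows
print(shuffled)
```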

7. Using DataFrame.apply() & numpy.random.permutation() to Shuffle

You can also use df.apply(np.random.permutation, axis=1). Note that this does not change the order of the rows; instead, it shuffles the values within each row and returns a Series of dtype object.


# Using apply() to shuffle the values within each row
import numpy as np
df1 = df.apply(np.random.permutation, axis=1)
print(df1)

Yields below output.


# Output:
0       [30day, Spark, 1000, 20000]
1    [40days, PySpark, 25000, 2300]
2     [1500, Hadoop, 26000, 35days]
3     [40days, 1200, Python, 22000]
4     [60days, pandas, 2500, 24000]
5     [2100, 21000, 50days, Oracle]
6       [2000, Java, 22000, 55days]
dtype: object

8. Pandas DataFrame Shuffle/Permute Rows Using Lambda Function

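Use df.apply(lambda x: x.sample(frac=1).values) to sample each column independently. apply() iterates over each column, and .values returns the shuffled values as an array without the index; frac=1 means all rows of the DataFrame. Because each column is shuffled on its own, values from different columns no longer stay together in the same row.


# Using a lambda function to shuffle each column independently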
df2 = df.apply(lambda x: x.sample(frac=1).values)
print(df2)

Yields below output.


# Output:
   Courses    Fee Duration  Discount
0   Oracle  20000   40days      1000
1   Hadoop  21000   60days      2300
2   pandas  26000   40days      1500
3  PySpark  24000    30day      1200
4    Spark  22000   35days      2000
5     Java  22000   50days      2500
6   Python  25000   55days      2100
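
The difference between shuffling whole rows and shuffling each column on its own can be checked with a short sketch. The small DataFrame and the seed are illustrative assumptions; the comparison uses (course, fee) pairs to see whether rows stayed intact.

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python"],
    "Fee": [20000, 25000, 26000, 22000],
})
original_pairs = set(zip(df["Courses"], df["Fee"]))

# Whole-row shuffle: every course keeps its own fee
rows = df.sample(frac=1, random_state=0)
row_pairs = set(zip(rows["Courses"], rows["Fee"]))

# Per-column shuffle: each column is permuted independently
cols = df.apply(lambda x: x.sample(frac=1).values)

# Row shuffle preserves the pairs
print(row_pairs == original_pairs)

# Each column still holds the same values, but pairing is not guaranteed
print(sorted(cols["Fee"].tolist()) == sorted(df["Fee"].tolist()))
```

If you need each record to stay intact, prefer the row-level approaches from the earlier sections.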

9. Shuffle DataFrame Randomly by Rows and Columns

You can use df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True) to shuffle both rows and columns randomly. The first sample() call shuffles the column order (axis=1) and the second shuffles the rows, so the resulting DataFrame looks completely randomized. I don’t know of a real use case for this, but I am covering it because it is possible with the sample() method.


# Using sample() method to shuffle DataFrame rows and columns
df2 = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
print(df2)

Yields below output.


# Output:
  Duration    Fee  Discount  Courses
0   60days  24000      2500   pandas
1   55days  22000      2000     Java
2   40days  25000      2300  PySpark
3   40days  22000      1200   Python
4   35days  26000      1500   Hadoop
5   50days  21000      2100   Oracle
6    30day  20000      1000    Spark

10. Complete Example For Shuffle DataFrame Rows


import pandas as pd
technologies = {
    'Courses':["Spark","PySpark","Hadoop","Python","pandas","Oracle","Java"],
    'Fee' :[20000,25000,26000,22000,24000,21000,22000],
    'Duration':['30day','40days','35days', '40days','60days','50days','55days'],
    'Discount':[1000,2300,1500,1200,2500,2100,2000]
               }
df = pd.DataFrame(technologies)
print(df)

# Shuffle the DataFrame rows & return all rows
df1 = df.sample(frac = 1)
print(df1)

# Create a new Index starting from zero
df1 = df.sample(frac = 1).reset_index()
print(df1)

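# Using NumPy
import numpy as np
values = df.to_numpy()       # for mixed dtypes this is a copy of the data
np.random.shuffle(values)    # shuffle the rows of the array in place
df1 = pd.DataFrame(values, columns=df.columns).infer_objects()
print(df1)

# Using numpy permutation() method to shuffle DataFrame rows
df1 = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)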
print(df1)

# Using sklearn to shuffle rows
from sklearn.utils import shuffle
df = shuffle(df)

# Using apply() to shuffle the values within each row
df1 = df.apply(np.random.permutation, axis=1)
print(df1)

# Using a lambda function to shuffle each column independently
df2 = df.apply(lambda x: x.sample(frac=1).values)
print(df2)

# Using sample() method to shuffle DataFrame rows and columns
df2 = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
print(df2)

11. Conclusion

In this article, you have learned how to shuffle pandas DataFrame rows using different approaches: DataFrame.sample(), DataFrame.apply(), DataFrame.iloc[], and a lambda function. You have also learned to shuffle DataFrame rows using numpy.random.permutation() and sklearn.utils.shuffle().

Happy Learning !!

Naveen (NNK)

I am Naveen (NNK), working as a Principal Engineer. I am a seasoned Apache Spark Engineer with a passion for harnessing the power of big data and distributed computing to drive innovation and deliver data-driven insights. I love to design, optimize, and manage Apache Spark-based solutions that transform raw data into actionable intelligence. I am also passionate about sharing my knowledge of Apache Spark, Hive, PySpark, R, etc.