Pandas Create Test and Train Samples from DataFrame

  • Post category:Pandas / Python
  • Post last modified:January 17, 2023

You can create separate test and train samples from a pandas DataFrame by using DataFrame.sample(), by applying the train_test_split() function from sklearn’s model_selection module, or by building a random boolean mask with numpy.random.rand(). In this article, I will explain how to create test and train DataFrames by splitting the rows of a DataFrame.

The examples explained here will help you split a pandas DataFrame into two random samples (80% and 20%) for training and testing. Such a split is most meaningful when you have a large dataset. To keep this article simple, I am using a very small DataFrame, but you can use the same approaches in your real projects.

1. Quick Examples to Create Test and Train Samples

If you are in a hurry, below are some quick examples of creating test and train samples from a pandas DataFrame.



# Below are some quick examples
import numpy as np

# Using DataFrame.sample()
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

# Using train_test_split()
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

# Using train_test_split() with features (X) and target (y)
y = df['Courses']
X = df.drop(columns='Courses')  # drop() leaves df intact, unlike pop()
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
print(X.iloc[X_train])

# Using numpy.random.rand()
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]

Now, let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, and Duration.


import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
3   Python  24000     None
4  PySpark  26000      NaN

2. Using DataFrame.sample() Method To get Test & Train Samples

DataFrame.sample() returns a random sample of rows from the DataFrame. You can use it to select the train sample, and then use DataFrame.drop() with the sampled index to keep the remaining rows as the test sample.

The frac parameter sets the fraction of rows to return. The random_state parameter seeds the random number generator, so by defining it we can reproduce the same split of the data across multiple calls.

sample() returns the selected rows in random order, so the train sample is already shuffled and no separate shuffle step is needed.


# Using DataFrame.sample() with the random_state argument
train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)
print(train)

Yields below output.


   Courses    Fee Duration
3   Python  24000     None
4  PySpark  26000      NaN
0    Spark  22000   30days
1  PySpark  25000   50days
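
As a small check of the reproducibility described above, the following sketch calls sample() twice with the same random_state and confirms that both calls pick the same rows and that the train and test samples do not overlap. The variable names train_a, train_b, and test_a are only illustrative, and the approach assumes the DataFrame index has unique labels, because drop() removes rows by label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Duration": ["30days", "50days", "30days", None, np.nan],
})

# Same random_state, so both calls select the same rows in the same order.
train_a = df.sample(frac=0.8, random_state=200)
train_b = df.sample(frac=0.8, random_state=200)
same_split = list(train_a.index) == list(train_b.index)

# Rows that were not drawn into the train sample form the test sample.
test_a = df.drop(train_a.index)
no_overlap = train_a.index.intersection(test_a.index).empty

print(same_split)                 # True
print(len(train_a), len(test_a))  # 4 1
print(no_overlap)                 # True
```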

3. Use sklearn to Create Test and Train Samples

The train_test_split() function of the sklearn library is able to handle Pandas DataFrames as well as arrays. Therefore, we can simply call the corresponding function by providing the dataset and other parameters.

test_size: This parameter represents the proportion of the dataset that should be included in the test split. If neither test_size nor train_size is specified, test_size defaults to 0.25, so the resulting split consists of 75% train and 25% test data.
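
To make the effect of test_size concrete, here is a minimal sketch on a hypothetical 100-row DataFrame (not the article’s data), so that the percentages map directly to row counts. It also shows that test_size accepts an integer, which is then read as an absolute number of rows. The random_state=0 value is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame, used only so fractions map cleanly to counts.
df = pd.DataFrame({"Fee": range(100)})

# No test_size or train_size given: 25% of the rows go to the test split.
train_default, test_default = train_test_split(df, random_state=0)

# A float is a fraction of the rows; an int is an absolute number of rows.
train_frac, test_frac = train_test_split(df, test_size=0.2, random_state=0)
train_abs, test_abs = train_test_split(df, test_size=10, random_state=0)

print(len(train_default), len(test_default))  # 75 25
print(len(train_frac), len(test_frac))        # 80 20
print(len(train_abs), len(test_abs))          # 90 10
```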


# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)

Yields below output.


   Courses    Fee Duration
4  PySpark  26000      NaN
2    Spark  23000   30days
3   Python  24000     None
1  PySpark  25000   50days
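
The output above changes from run to run because no random_state is passed. The sketch below, which uses a reduced version of the article’s DataFrame, shows two options that train_test_split() offers for a repeatable result: fixing random_state, or turning shuffling off with shuffle=False so that the first rows become the train sample and the last rows the test sample. The value 42 is an arbitrary seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# A fixed random_state gives the same shuffled split on every call.
train_1, test_1 = train_test_split(df, test_size=0.2, random_state=42)
train_2, test_2 = train_test_split(df, test_size=0.2, random_state=42)
reproducible = train_1.index.equals(train_2.index)

# shuffle=False keeps the original row order: no randomness at all.
train_o, test_o = train_test_split(df, test_size=0.2, shuffle=False)

print(reproducible)         # True
print(list(train_o.index))  # [0, 1, 2, 3]
print(list(test_o.index))   # [4]
```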

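4. Using train_test_split() with Features and Target

model_selection is the sklearn module that provides train_test_split(); it is not a function itself. To build a model, you train it on one dataset and then test it against another. When a DataFrame holds both the features and the target column, you can separate them into X and y and pass both to train_test_split(), which returns the train and test parts of each. The example below splits the row index rather than X itself and then selects the training rows with X.iloc.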


# Using train_test_split() with features (X) and target (y)
from sklearn.model_selection import train_test_split
y = df['Courses']
X = df.drop(columns='Courses')  # drop() leaves df intact, unlike pop()
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
print(X.iloc[X_train])

Yields below output.


     Fee Duration
4  26000      NaN
0  22000   30days
2  23000   30days
3  24000     None
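
An alternative to splitting X.index is to pass X itself, which avoids the extra iloc lookup and does not rely on the index labels matching row positions. The following is a sketch of that variant, not the only way to do it; it assumes 'Courses' is the target column and uses an arbitrary random_state=0. It also checks that the feature and target parts stay aligned row by row.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
    "Duration": ["30days", "50days", "30days", None, np.nan],
})

# drop() returns a new frame, so df keeps its 'Courses' column.
X = df.drop(columns="Courses")
y = df["Courses"]

# Passing X and y together splits both with the same row selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Features and target must refer to the same rows in each part.
aligned = X_train.index.equals(y_train.index) and X_test.index.equals(y_test.index)

print(X_train.shape, X_test.shape)  # (4, 2) (1, 2)
print(aligned)                      # True
print("Courses" in df.columns)      # True
```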

5. Using numpy.random.rand() Method

np.random.rand() generates random numbers from the standard uniform distribution (i.e., the uniform distribution from 0 to 1) and returns them as a NumPy array, a data structure used for storing and manipulating numeric data.

np.random.rand(len(df)) is an array of size len(df) with uniformly distributed float values in the range [0, 1). The < 0.8 comparison is applied element-wise and returns a new boolean array: values below 0.8 become True and values of 0.8 or more become False. Because each row is assigned independently, roughly 80% of the rows end up in the train sample, but the exact proportion varies from run to run, especially on a small DataFrame.


# Using numpy.random.rand() Method.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test=df[~msk]
print(test)

Yields below output.


  Courses    Fee Duration
0   Spark  22000   30days
2   Spark  23000   30days
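
With only five rows, the mask can give a split that is far from 80/20, as the two-row test sample above shows. The sketch below uses a hypothetical 1,000-row DataFrame to illustrate that the proportion is close to, but not exactly, 0.8. It swaps np.random.rand() for a seeded generator from np.random.default_rng() so that the mask is repeatable; the seed 200 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical 1,000-row frame, large enough to see the ~80/20 proportion.
df = pd.DataFrame({"Fee": np.arange(1000)})

# Seeded generator (a swap-in for np.random.rand) for a repeatable mask.
rng = np.random.default_rng(seed=200)
msk = rng.random(len(df)) < 0.8  # element-wise comparison -> boolean array

train = df[msk]
test = df[~msk]

print(msk.dtype)                          # bool
print(len(train) + len(test) == len(df))  # True: every row lands in one sample
print(len(train) / len(df))               # close to 0.8, not exactly 0.8
```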

NOTE: Alternatively, as long as msk is of dtype bool, df[msk], df.iloc[msk], and df.loc[msk] always return the same result.
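
The note can be checked with a short sketch. It uses a fixed, hand-written boolean NumPy array instead of a random one (an assumption made only so the result is repeatable) and compares the three selections.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# Hand-written boolean mask (hypothetical), one entry per row.
msk = np.array([True, False, True, True, False])

by_brackets = df[msk]
by_loc = df.loc[msk]
by_iloc = df.iloc[msk]

all_equal = by_brackets.equals(by_loc) and by_brackets.equals(by_iloc)
print(all_equal)                # True
print(list(by_brackets.index))  # [0, 2, 3]
```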

6. Complete Example of Creating Test and Train Samples from a DataFrame

Below is a complete example of creating test and train samples from a pandas DataFrame.


# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)

# Using numpy.random.rand() Method.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test=df[~msk]
print(test)

# Using DataFrame.sample() 
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
print(train)

# Using train_test_split() with features (X) and target (y)
from sklearn.model_selection import train_test_split
y = df['Courses']
X = df.drop(columns='Courses')  # drop() leaves df intact, unlike pop()
X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
print(X.iloc[X_train])

Conclusion

In this article, you have learned how to create test and train samples from a pandas DataFrame by using DataFrame.sample() together with DataFrame.drop(), by applying the train_test_split() function from sklearn’s model_selection module, and by building a random boolean mask with numpy.random.rand().
