Pandas Create Test and Train Samples from DataFrame

Post last modified: March 27, 2024

You can create separate test and train samples from a pandas DataFrame by using DataFrame.sample(), by using sklearn’s train_test_split() function from the sklearn.model_selection module, or by building a random mask with NumPy. In this article, I will explain how to create test and train sample DataFrames by splitting the rows of a DataFrame.

The examples explained here will help you split a pandas DataFrame into two random samples (80% and 20%) for training and testing. Splitting like this makes the most sense when you have a large dataset. To keep things simple, this article uses a very small DataFrame, but you can use the same approaches in your real-world projects.

Key Points –

  • Utilize the sample method in Pandas to randomly select rows from the DataFrame for creating test and train sets.
  • When dealing with imbalanced classes, use the stratify parameter of train_test_split() to maintain the class distribution in both the train and test sets.
  • Use the train_test_split function from the sklearn.model_selection module to split the DataFrame into training and testing sets.
  • Define the test size (proportion of the dataset to include in the test split) and optionally set random state for reproducibility.
  • After splitting, you’ll typically have separate DataFrames for training and testing data, which can then be used for model training and evaluation.
  • Specify the features and target variable for the split operation.
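
The stratification point above can be sketched as follows. This is a minimal illustration, not part of the original examples: the demo DataFrame, its column names (feature, label), and the random_state value are all assumptions made up for the sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 40 rows of class "A", 10 rows of class "B"
demo = pd.DataFrame({
    "feature": range(50),
    "label": ["A"] * 40 + ["B"] * 10,
})

X = demo[["feature"]]   # feature columns
y = demo["label"]       # target column

# stratify=y keeps the A/B ratio (4:1) the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Without stratify, a small test set could easily end up with very few rows of the rarer class, or none at all.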

1. Quick Examples to Create Test and Train Samples

If you are in a hurry, below are some quick examples of how to create test and train samples from a pandas DataFrame.


# Quick examples to create test and train samples

# Using DataFrame.sample() 
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

# Split features (X) and target (y) with train_test_split()
from sklearn.model_selection import train_test_split
y = df['Courses']                # target column
X = df.drop(columns='Courses')   # feature columns; df itself is left unchanged
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train)

# Using numpy.random.rand()
import numpy as np
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]

Now, let’s create a pandas DataFrame with a few rows and columns, execute these examples, and validate the results. Our DataFrame contains the columns Courses, Fee, and Duration.


# Create DataFrame
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

Yields below output.


# Output:
   Courses    Fee Duration
0    Spark  22000   30days
1  PySpark  25000   50days
2    Spark  23000   30days
3   Python  24000     None
4  PySpark  26000      NaN

2. Using DataFrame.sample() Method To get Test & Train Samples

DataFrame.sample() returns a random sample of rows from the DataFrame. You can use this to select the train and test samples.

The random_state parameter seeds the random number generator used for sampling. By setting random_state, you can reproduce the same split of the data across multiple calls.

Note that sample() has no shuffle parameter; the sampled rows already come back in random order. With frac=0.8, the train sample gets 80% of the rows, and dropping those index labels from the original DataFrame leaves the remaining 20% as the test sample.


# Using DataFrame.sample() Method by random_state arg.
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
print(train)

Yields below output.


# Output:
   Courses    Fee Duration
3   Python  24000     None
4  PySpark  26000      NaN
0    Spark  22000   30days
1  PySpark  25000   50days
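
As a quick sanity check on the split above, the following sketch verifies that the two samples do not overlap, that together they cover the whole DataFrame, and that the same random_state reproduces the same rows. The variable names no_overlap, full_cover, and same_again are made up for this illustration. The sample()/drop() pattern assumes the index labels are unique; if they are not, consider calling reset_index(drop=True) first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
    'Duration': ['30days', '50days', '30days', None, np.nan],
})

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

# No row label appears in both samples
no_overlap = train.index.intersection(test.index).empty
# Together the samples account for every row of the original DataFrame
full_cover = len(train) + len(test) == len(df)
# Sampling again with the same random_state selects the same rows
same_again = train.index.equals(df.sample(frac=0.8, random_state=200).index)

print(no_overlap, full_cover, same_again)
```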

3. Use sklearn to Create Test and Train Samples

The train_test_split() function of the sklearn library is able to handle Pandas DataFrames as well as arrays. Therefore, we can simply call the corresponding function by providing the dataset and other parameters.

test_size: This parameter represents the proportion of the dataset that should be included in the test split. If neither test_size nor train_size is specified, test_size defaults to 0.25, so the resulting split consists of 75% train and 25% test data. Because the rows are shuffled randomly, the output below changes from run to run unless you also set random_state.


# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)

Yields below output.


# Output:
   Courses    Fee Duration
4  PySpark  26000      NaN
2    Spark  23000   30days
3   Python  24000     None
1  PySpark  25000   50days
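
The sketch below shows two optional parameters of train_test_split() that affect which rows end up where: random_state for reproducible splits and shuffle=False for an ordered split. The random_state value 42 and the variable names are arbitrary choices for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
})

# Same random_state -> same rows in every call
train_a, test_a = train_test_split(df, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=42)
print(train_a.index.equals(train_b.index))

# shuffle=False keeps the original order: the last 20% becomes the test set
train_c, test_c = train_test_split(df, test_size=0.2, shuffle=False)
print(list(train_c.index), list(test_c.index))
```

An unshuffled split like the second one is mainly useful when the row order carries meaning, for example in time-ordered data.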

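4. Split Features and Target Using train_test_split()

sklearn.model_selection is a module, not a function. It provides train_test_split() along with other utilities for splitting data and evaluating models. When you train a model, you typically fit it on one dataset and then test it against another. To do this with a features/target layout, separate the target column (y) from the feature columns (X) and pass both to train_test_split() so that the rows of X and y stay aligned in each split.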


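# Split features (X) and target (y) with train_test_split()
from sklearn.model_selection import train_test_split
y = df['Courses']                # target column
X = df.drop(columns='Courses')   # feature columns; df itself is left unchanged
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train)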

Yields below output.


# Output:
     Fee Duration
4  26000      NaN
0  22000   30days
2  23000   30days
3  24000     None

5. Using numpy.random.rand() Method

np.random.rand() generates random numbers from the standard uniform distribution (i.e., the uniform distribution over [0, 1)) and returns them as a NumPy array.

np.random.rand(len(df)) is an array of length len(df) with uniformly distributed float values in the range [0, 1). The < 0.8 comparison is applied element-wise and returns a new boolean array: values < 0.8 become True and values >= 0.8 become False. Because each row is assigned independently, the split is only approximately 80/20; in the output below, 2 of the 5 rows land in the test sample.


# Using numpy.random.rand() Method.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test=df[~msk]
print(test)

Yields below output.


# Output:
  Courses    Fee Duration
0   Spark  22000   30days
2   Spark  23000   30days

NOTE: As long as msk is of dtype bool, df[msk], df.iloc[msk], and df.loc[msk] return the same result.
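
If you need the mask to be reproducible, or the split to be exactly 80/20, one possible variation is sketched below. It uses NumPy's Generator API (np.random.default_rng) instead of np.random.rand(); the seed value 0 and the names positions, cut, train_exact, and test_exact are assumptions chosen for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
})

rng = np.random.default_rng(seed=0)   # seed value is arbitrary

# Reproducible mask: each row lands in train with probability 0.8
msk = rng.random(len(df)) < 0.8
train, test = df[msk], df[~msk]

# Exact 80/20 split: shuffle the row positions, then cut at 80%
positions = rng.permutation(len(df))
cut = int(0.8 * len(df))
train_exact = df.iloc[positions[:cut]]
test_exact = df.iloc[positions[cut:]]

print(len(train_exact), len(test_exact))
```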

6. Complete Example of Creating Test and Train Samples from a DataFrame

Below is a complete example of creating test and train samples from a pandas DataFrame.


# Create a Pandas DataFrame.
import pandas as pd
import numpy as np
technologies= {
    'Courses':["Spark","PySpark","Spark","Python","PySpark"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan]
          }
df = pd.DataFrame(technologies)
print(df)

# Use train_test_split() Method.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
print(train)

# Using numpy.random.rand() Method.
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test=df[~msk]
print(test)

# Using DataFrame.sample() 
train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)
print(train)

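# Split features (X) and target (y) with train_test_split()
from sklearn.model_selection import train_test_split
y = df['Courses']                # target column
X = df.drop(columns='Courses')   # feature columns; df itself is left unchanged
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train)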

Frequently Asked Questions on Creating Test and Train Samples from DataFrame

How do I split a DataFrame into train and test sets using Pandas?

You can use various methods, but a common approach is to use the train_test_split function from the sklearn.model_selection module. This function allows you to specify the size of the test set and can handle the splitting process efficiently.

Can I specify the features and target variable separately when creating train and test sets?

You can specify the columns representing your features and the target variable independently when splitting the DataFrame. This allows you to ensure that the train and test sets contain the appropriate data for modeling and evaluation.

What should I consider when splitting data for machine learning tasks?

It’s crucial to consider factors such as data distribution, class imbalance, and the size of your dataset. Techniques like stratified sampling can help maintain the distribution of classes in both train and test sets, ensuring that your model learns effectively and generalizes well to unseen data.

How can I validate the effectiveness of my train-test split?

You can evaluate the performance of your model using metrics such as accuracy, precision, recall, or F1-score on the test set. Additionally, techniques like cross-validation can provide more robust estimates of model performance by averaging results over multiple train-test splits.
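
As a rough sketch of the cross-validation idea, the example below uses KFold from sklearn.model_selection to produce several train-test splits of the same DataFrame. No model is fitted here; in practice you would train and score a model inside the loop and average the scores. The random_state value and the name fold_sizes are arbitrary choices for this illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'Courses': ["Spark", "PySpark", "Spark", "Python", "PySpark"],
    'Fee': [22000, 25000, 23000, 24000, 26000],
})

# 5 folds over 5 rows: each row is used as the test set exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_pos, test_pos in kf.split(df):
    # split() yields positional indices, so select rows with iloc
    train_fold = df.iloc[train_pos]
    test_fold = df.iloc[test_pos]
    fold_sizes.append((len(train_fold), len(test_fold)))

print(fold_sizes)
```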

Should I shuffle the data before splitting into train and test sets?

Shuffling the data before splitting helps prevent any bias that may arise from the ordering of the dataset. This ensures that both the train and test sets are representative of the overall data distribution and can improve the generalization ability of your model.

Conclusion

In this article, you have learned how to create test and train samples from a pandas DataFrame by using DataFrame.sample() together with DataFrame.drop(), by applying sklearn’s train_test_split() function, and by building a random mask with numpy.random.rand(), with examples.

Naveen Nelamali

Naveen Nelamali (NNK) is a Data Engineer with 20+ years of experience in transforming data into actionable insights. Over the years, he has honed his expertise in designing, implementing, and maintaining data pipelines with frameworks like Apache Spark, PySpark, Pandas, R, Hive, and Machine Learning. Naveen's journey in the field of data engineering has been one of continuous learning, innovation, and a strong commitment to data integrity. In this blog, he shares his experiences with data as he comes across them.